Voice Streaming API

This API allows you to programmatically initiate an outgoing voice stream call, connecting a user's mobile number to a media WebSocket for real-time interaction.

We offer two kinds of voice streaming services that can be integrated with bots:

Outgoing Voice Streaming

The service consists of two main components:

  1. A REST API for initiating calls and checking service health

  2. A WebSocket interface for real-time bidirectional audio streaming

Important: In this architecture, your application hosts the WebSocket server, and Alohaa's Voice Stream service acts as the WebSocket client.

API Reference

POST https://voice-stream.alohaa.ai/v1/voice-stream/call

Parameters

Request Headers

| Field | Value | Description | Mandatory |
|---|---|---|---|
| Content-type | application/json | Specifies the content type of the request | Yes |
| x-metro-api-key | ************************* | Your API key for authentication purposes | Yes |

Request Body

| Field | Data Type | Description | Mandatory |
|---|---|---|---|
| mobile_number | String | Phone number of the user to be called. Must be a valid 10-digit mobile number. | Yes |
| did | String | Direct Inward Dialing (DID) number used for placing the call. Must be a valid 10-digit DID number. | Yes |
| ws_url | String | WebSocket URL where the call's audio will be streamed in real time. | Yes |
| webhook_details | Object | Webhook configuration for receiving call lifecycle events. Must be a stringified JSON object; url and request_type are mandatory, api_key and api_value are optional. | No |
| did_shuffle | Boolean | Defaults to false. When set to true, leave the did field empty; the system automatically places calls using random numbers from your existing DID pool. | No |

Sample Request
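An illustrative request body; the mobile number, DID, WebSocket URL, and webhook values below are placeholders rather than real values, and webhook_details is passed as a stringified JSON object:

```json
{
  "mobile_number": "9876543210",
  "did": "8041234567",
  "ws_url": "wss://bot.example.com/voice-stream",
  "webhook_details": "{\"url\": \"https://bot.example.com/call-events\", \"request_type\": \"POST\"}",
  "did_shuffle": false
}
```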

Responses

Success Response

Failure Response

Please go through the diagram to understand the overall flow:

Status Codes

| Code | Description |
|---|---|
| 200 OK | Request successful. Call initiation in progress. |
| 400 Bad Request | Invalid parameters |
| 500 Internal Server Error | Server encountered an error |


WebSocket Protocol

Connection Architecture

Important: In this architecture, your application hosts the WebSocket server, and our Voice Stream service connects to it as a client.

  1. You host a WebSocket server at a publicly accessible URL

  2. You provide this URL in the ws_url parameter when initiating a call

  3. Our Voice Stream service connects to your WebSocket server as a client

  4. Real-time bidirectional audio streaming occurs through this WebSocket connection

WebSocket Server Requirements:

  • Must be accessible via the public internet

  • Must support secure WebSocket connections (WSS)

  • Must implement the protocol defined in this document

Message Format: All messages sent and received through the WebSocket connection use JSON format, except for binary audio data which is encoded according to the audio specification.

Protocol Flow

  1. You host a WebSocket server at a publicly accessible URL.

  2. You initiate a call via our REST API, providing your WebSocket server URL.

  3. Our service connects to your WebSocket server.

  4. Our service registers with your server by sending a "connected" event.

  5. Your server confirms registration by responding with a "connected" event.

  6. Greeting event (optional): the service plays the greeting audio file at the start of the call.

  7. Media is exchanged bidirectionally.

  8. The connection is terminated when either party ends the call.

Events

1. Connected [WebSocket Client (Alohaa application) → WebSocket Server (Customer application)]

Registers a new voice session with a unique callId and the mobile number to be dialed.

2. Connected [WebSocket Server (Customer application) → WebSocket Client (Alohaa application)]

Acknowledges the successful registration of the session.

3. Greeting event (Optional)

A greeting event indicates that the customer application (WebSocket server) needs to send the WebSocket client (Alohaa application) a greeting audio file to be played when the call is answered.

4. Media (Bidirectional)

Streams raw audio data in real-time, including a timestamp for synchronization or logging.

From our Service to your server (Voice from the phone call):

This stream is continuous, even when the customer is not speaking. Silent packets must still be transmitted to ensure uninterrupted data flow and maintain the temporal integrity of the session.

From your server to our service (Voice to be transmitted to the phone call):

This stream is not continuous. Your server accumulates the audio data and sends it as a complete voice chunk after detecting speech boundaries (e.g., via silence detection); a sketch of this accumulation appears after this events list. The chunk is sent in a batched format for downstream STT → LLM → TTS processing.

5. Interrupt [WebSocket Server (Customer application) → WebSocket Client (Alohaa application)]

Signals the client to stop sending audio; typically triggered when the system detects the end of user speech.

6. Close WebSocket Connection [WebSocket Client (Alohaa application) → WebSocket Server (Customer application)]

Closes the active session. Typically called once the conversation is complete.
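One way to implement the speech-boundary detection mentioned above is a simple energy-based accumulator. This is a minimal sketch only: it assumes μ-law frames with the RTP header already stripped, the threshold and frame counts are illustrative, and it relies on the standard-library audioop module (deprecated and removed in Python 3.13, so substitute an equivalent decoder on newer interpreters). A production bot would normally use a proper VAD instead:

```python
import audioop  # stdlib μ-law codec; removed in Python 3.13

SILENCE_RMS = 500     # illustrative energy threshold on 16-bit PCM samples
SILENCE_FRAMES = 25   # ~500 ms of trailing silence (with 20 ms frames) ends a chunk

class ChunkAccumulator:
    """Accumulates μ-law frames and flushes a chunk once silence follows speech."""

    def __init__(self):
        self.buffer = bytearray()
        self.heard_speech = False
        self.silent_run = 0

    def feed(self, ulaw_frame: bytes):
        """Returns a complete voice chunk when a speech boundary is detected, else None."""
        pcm = audioop.ulaw2lin(ulaw_frame, 2)           # decode μ-law to 16-bit PCM
        is_silent = audioop.rms(pcm, 2) < SILENCE_RMS   # crude energy-based check

        self.buffer.extend(ulaw_frame)
        if is_silent:
            self.silent_run += 1
        else:
            self.heard_speech = True
            self.silent_run = 0

        if self.heard_speech and self.silent_run >= SILENCE_FRAMES:
            chunk = bytes(self.buffer)
            self.buffer.clear()
            self.heard_speech = False
            self.silent_run = 0
            return chunk  # ready for STT → LLM → TTS processing
        return None
```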

Audio Configuration:

The audio data sent through the WebSocket must meet these requirements for compatibility:

  • Audio data should match the format expected by the system

  • The underlying protocol uses G.711 μ-law (PCMU) codec

  • Sample rate: 8kHz

The client's TTS output should use the same configuration (G.711 μ-law at 8 kHz).

WebSocket Audio Data Guidelines

  • When receiving audio from our service, each packet includes a 12-byte RTP header, which helps with ordering the packets.

  • When sending audio to our service, send the audio data in μ-law format with .wav headers (see the sketch below).
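A minimal sketch of both guidelines, assuming the 8 kHz μ-law audio described above; the helper names are illustrative, and the WAV header written here is the minimal μ-law variant (format code 7), which most decoders accept:

```python
import struct

RTP_HEADER_BYTES = 12  # fixed-size RTP header on each packet received from the service

def strip_rtp_header(packet: bytes) -> tuple[int, bytes]:
    """Returns (sequence_number, ulaw_payload) for a packet received from the service."""
    seq = int.from_bytes(packet[2:4], "big")  # RTP sequence number, useful for ordering
    return seq, packet[RTP_HEADER_BYTES:]

def wrap_ulaw_in_wav(ulaw_audio: bytes, sample_rate: int = 8000) -> bytes:
    """Prepends a minimal .wav header (mono G.711 μ-law) to raw μ-law audio."""
    data_size = len(ulaw_audio)
    header = b"RIFF" + struct.pack("<I", 36 + data_size) + b"WAVE"
    header += b"fmt " + struct.pack(
        "<IHHIIHH",
        16,           # fmt chunk size
        7,            # audio format 7 = G.711 μ-law
        1,            # mono
        sample_rate,  # 8 kHz
        sample_rate,  # byte rate (1 byte per sample)
        1,            # block align
        8,            # bits per sample
    )
    header += b"data" + struct.pack("<I", data_size)
    # Strict decoders may also expect an extended fmt/fact chunk for non-PCM audio;
    # this minimal form is accepted by most tools.
    return header + ulaw_audio
```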

Connection Errors

The WebSocket connection may close with specific close codes:

| Code | Interpretation |
|---|---|
| 1000 | Normal closure (call ended) |
| 1001 | Server going down or client navigating away |
| 1002 | Protocol error |
| 1008 | Policy violation (e.g., authentication failure) |
| 1011 | Server error |

Limitations and Constraints

  • Maximum WebSocket message size: 1MB

  • Inactive connections (no messages for 5 minutes) are automatically terminated

  • All WebSocket connections must use secure WebSockets (WSS)

  • WebSocket connections without registration confirmation within 10 seconds are automatically closed


Integration Guide

Prerequisites

  • API credentials (contact support to obtain these)

  • A publicly accessible WebSocket server

  • Basic understanding of REST APIs and WebSockets

Integration Steps

  1. Set up your WebSocket server to handle the protocol described above.

  2. Initiate a call using our REST API, providing your WebSocket server URL.

  3. Handle the registration when our service connects to your WebSocket server.

  4. Process incoming audio from the phone call.

  5. Send outgoing audio to be transmitted to the phone call.

  6. Send interrupt signals when needed.

  7. Handle connection closure when the call ends.

Code Snippet

1. Initiate API Call
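A minimal sketch using Python and the requests library; the API key, numbers, and URLs below are placeholders, and a JSON response body is assumed:

```python
import json

import requests

API_URL = "https://voice-stream.alohaa.ai/v1/voice-stream/call"

def initiate_call(api_key: str, mobile_number: str, did: str, ws_url: str) -> dict:
    """Initiates an outgoing voice stream call via the REST API."""
    headers = {
        "Content-type": "application/json",
        "x-metro-api-key": api_key,
    }
    payload = {
        "mobile_number": mobile_number,  # 10-digit mobile number of the user
        "did": did,                      # 10-digit DID used for placing the call
        "ws_url": ws_url,                # your publicly accessible WSS endpoint
        # webhook_details is optional and must be a stringified JSON object.
        "webhook_details": json.dumps({
            "url": "https://bot.example.com/call-events",  # placeholder webhook URL
            "request_type": "POST",
        }),
    }
    response = requests.post(API_URL, headers=headers, json=payload, timeout=10)
    response.raise_for_status()  # surfaces 400 / 500 responses as exceptions
    return response.json()

if __name__ == "__main__":
    print(initiate_call("your-api-key", "9876543210", "8041234567",
                        "wss://bot.example.com/voice-stream"))
```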

2. WebSocket Server Implementation
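A minimal sketch of such a server using the websockets library (version 10.1 or later, which supports single-argument connection handlers). The JSON field names used for the "connected" handshake (event, callId) are assumptions for illustration; verify them against the actual messages the service sends:

```python
import asyncio
import json

import websockets

async def handle_connection(websocket):
    """Handles one voice session initiated by the Alohaa WebSocket client."""
    async for message in websocket:
        if isinstance(message, bytes):
            # Binary frame: raw audio from the phone call (continuous, silence included).
            process_incoming_audio(message)
            continue

        event = json.loads(message)
        # NOTE: the "event" / "callId" field names are assumptions for illustration.
        if event.get("event") == "connected":
            # Confirm registration by responding with a "connected" event of our own.
            await websocket.send(json.dumps({
                "event": "connected",
                "callId": event.get("callId"),
            }))

def process_incoming_audio(packet: bytes) -> None:
    """Strip the 12-byte RTP header, buffer the audio, run STT/LLM/TTS, etc."""

async def main():
    # The server must be publicly reachable over WSS; terminate TLS here or at a proxy.
    async with websockets.serve(handle_connection, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```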

Error Handling

Implement robust error handling in your WebSocket server:
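A sketch of what this can look like, wrapping the handler from the previous snippet and using the close codes listed under Connection Errors:

```python
import json
import logging

import websockets

logger = logging.getLogger("voice-stream")

def with_error_handling(handler):
    """Wraps a session handler so failures close the socket with a meaningful code."""
    async def safe_handler(websocket):
        try:
            await handler(websocket)
        except json.JSONDecodeError:
            logger.exception("Malformed JSON event")
            await websocket.close(code=1002, reason="Protocol error")
        except websockets.ConnectionClosed:
            logger.info("Peer closed the connection (call ended)")
        except Exception:
            logger.exception("Unexpected failure in session handler")
            await websocket.close(code=1011, reason="Server error")
    return safe_handler

# Usage: websockets.serve(with_error_handling(handle_connection), "0.0.0.0", 8765)
```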

Incoming Voice Streaming

An Incoming Voicebot allows your system to automatically handle incoming calls through a WebSocket connection. When an incoming call hits your DID, it is routed to the configured bot, which connects to the specified WebSocket URL to handle call media and logic.

Step 1: Create a Voicebot

  1. Navigate to Building Blocks → Voicebot in the left-hand menu.

  2. Click Add Voicebot and enter the following details:

     Bot Name: A unique name to identify your bot (e.g., Sales Assistant Bot, Support IVR Bot).

     WebSocket URL: The WebSocket endpoint where call media and events will be sent.

  3. Click Save to finalize the bot.

Step 2: Assign Voicebot to a Number (DID)

After creating your bot, you need to link it to a specific DID (phone number) so that incoming calls can be routed correctly.

  1. Navigate to Building Blocks → Numbers.

  2. Click Assign next to the DID you want to use.

  3. A pop-up will show a list of available voicebots. Select your Voicebot from the list.

  4. Click Assign to complete the assignment.
