Advanced

15 min

Updated Extension Guide: Working with the Stream and Voicebot Applet

Overview

This guide provides a comprehensive overview of how to use Exotel’s Stream applet (unidirectional) and Voicebot applet (bidirectional) for media streaming applications. These applets enable real-time, audio-only streaming between Exotel and external bot platforms using WebSocket-based integration.

Note: Exotel does not provide inbuilt STT (speech-to-text) or TTS (text-to-speech) capabilities. You must use your own bot platform to handle media decoding and AI processing.

This article is an extension of Working with the Stream and Voicebot Applet. For stream metadata, logging, and observability, refer to the companion article.

Why choose WebSocket-based integration?

WebSockets are ideal for low-latency, bidirectional audio streaming. Compared to SIP or polling mechanisms, WebSocket integration offers:

- Real-time bidirectional media transfer
- Persistent low-latency connection
- Lightweight and scalable protocol for AI integrations
- Simplified firewall/NAT traversal compared to SIP RTP

This makes it the preferred method for modern bot platforms that support real-time WebSocket-based audio streaming, such as Dialogflow, Azure Bot Framework, Gupshup, Yellow.ai, Haptik, Google CCAI, Amazon Lex, and in-house NLU/LLM-based bots.

Key technical concepts from the core Stream/Voicebot applet article

To ensure compatibility with the base setup, this section re-emphasises the following from the core guide:

- WebSocket endpoint: the URL must be publicly reachable and must support the ws:// or wss:// protocol with base64-encoded audio frames.
- Maximum connection time: streaming can last up to 60 minutes per session (check limits based on plan).
- Timeout handling: if the bot server does not respond within 10 seconds, the session will fail.
- Connection retry: Exotel will attempt one automatic retry if the WebSocket handshake fails. Ensure redundancy in bot endpoints.
- Security: for wss:// endpoints, ensure valid TLS certificates are installed. Self-signed certificates may be rejected.
- Payload format: base64-decode the payloads to obtain linear PCM (raw/slin, 16-bit, 8 kHz, mono) and feed that audio to your STT engine.

These constraints and protocols apply to both the Stream and Voicebot applets and should be adhered to during bot deployment.

Applet differences

| Applet | Streaming direction | Primary use case |
| Stream applet | Unidirectional | Transcribe user audio, e.g., STT |
| Voicebot applet | Bidirectional | Full voice AI interaction (bot speaks + listens) |

Note: Exotel only handles the media relay. The bot must provide transcription, NLU, TTS, etc.

Events sent over WebSocket

Each applet emits events during the call session. Communication over the WebSocket is bidirectional, with distinct roles for Exotel and the bot:

- Exotel to bot: transmits session events (connected, start, media, dtmf, stop, clear) and the audio stream in base64-encoded chunks.
- Bot to Exotel: returns media (Voicebot applet only) and may trigger control markers (e.g., mark).

The bot must gracefully handle session events and terminate the WebSocket connection when the bot conversation ends. Ending the WSS connection triggers Exotel to move to the next applet. There is no explicit stop event sent from the bot to Exotel.

Each event type Exotel emits over the WebSocket serves a distinct purpose in orchestrating the media stream and enabling seamless bot interaction. Understanding these events allows for optimised bot design, real-time feedback loops, and intelligent call flow transitions.

Best practices and use cases per event

- connected: confirms the WebSocket handshake. Use this to initialise your bot pipeline (e.g., STT/TTS service initialization, session state allocation).
  - Best practice: log and correlate with the call SID for session tracing.
  - Use case: trigger bot intro prompt preparation.
- start: indicates that audio streaming is beginning.
  - Best practice: start buffering/streaming audio to the STT engine.
  - Use case: sync with a user-facing prompt like "How can I help you today?"
- media: continuous chunks of base64-encoded PCM audio.
  - Best practice: ensure your STT engine handles 100 ms PCM blocks efficiently.
  - Use case: real-time speech transcription or voice intent detection.
- dtmf: keypress detection on the caller's side.
  - Best practice: use it to branch hybrid IVR+bot logic without STT latency.
  - Use case: press 1 to speak to an agent → Exotel hears DTMF → triggers escalation.
- mark: bot-sent event to indicate a logical milestone.
  - Best practice: use it to sync analytics or inject debugging hooks.
  - Use case: mark when the bot completes a sub-flow (e.g., "address collected").
- stop: triggered by Exotel when the customer's leg is disconnected.
  - Use case: identify when the customer's leg has disconnected.
- clear: resets session context mid-call. Sent by Voicebot to indicate that the current conversational context should be flushed and re-initialized, for seamless reset and session realignment during mid-call context switches. This is useful when the bot needs to start a fresh session mid-call, such as when the caller says “start over” or the previous context is corrupted.
  - Best practice: ensure your bot wipes session memory and re-establishes a clean state when a clear event is received. Use it to reinitialise bot memory (e.g., after context drops).
  - Use case: caller says "start again" → clear received → bot resets all entities and replays the welcome prompt.
  - Use case: caller says "start over" → bot requests clear → the conversation is reset.

Supported chunk size: ensure the bot handles audio chunks of at least 100 ms (approx. 3.2 KB base64 payload per frame).

These events should be monitored in real time and mapped to your backend decision logic and conversation orchestration flow.

| Event | Meaning | Typical bot action |
| connected | The WebSocket connection is successfully established | Initialize bot session; allocate STT/TTS/LLM resources |
| start | Voice media stream is starting | Start decoding audio for transcription or detection |
| media | Base64-encoded PCM audio payload | Feed to STT or analytics pipeline |
| dtmf | DTMF tones detected | Capture digit input in IVR scenarios |
| mark | Developer-defined markers | Sync checkpoints in bot logic |
| stop | Media streaming session ends | Escalate, save transcript, or trigger routing |
| clear | Reset session context | Re-initiate bot logic mid-call |

Streaming audio format

- Codec: 16-bit linear PCM (s16le)
- Sample rate: 8000/16000/24000 Hz
- Channels: mono
- Encoding: base64 in WebSocket frames

Dynamic URL and custom parameters

You can configure a dynamic WebSocket URL using placeholders and custom parameters.

Custom parameter rules:

- Maximum of 3 parameters.
- Total parameter length (everything after the ?) must not exceed 256 characters.
- Sample: ws://127.0.0.1:5001/media?param1=value1&param2=value2&param3=value3
- Dynamic resolution from HTTP(S) must also return a valid ws(s) URL in this format.

Recording (optional)

Recordings, if enabled, are returned via Passthru as a recordingurl.

Use cases:

- Train STT/LLM models
- Compliance and QA reviews
- Escalation audits
- Voice sentiment labeling datasets

Routing to agent or contact center after the Voicebot applet

When the WebSocket connection is closed, either because the bot disconnects after completing its interaction or because of a network-level termination, Exotel automatically moves to the next applet in the flow. There is no explicit stop event sent by the bot to Exotel. Instead, the bot must close the WebSocket session once the conversation ends. Upon disconnection, Exotel internally emits a stop event and transitions to the next applet, typically a Passthru applet. This Passthru makes a GET request to your endpoint, and based on your response (e.g., escalate=true), Exotel decides how to route the call.

Scenarios:

- Connect applet → route to an Exotel agent
- SIP Connect via vSIP trunk → route to an enterprise contact center
- Hangup applet → gracefully end the call

Example: user says "talk to human" → bot finishes → WSS disconnects → Exotel emits stop → Passthru GET call → response escalate=true → SwitchCase → vSIP trunk over Connect applet.

Passthru integration

Always place a Passthru applet immediately after the Voicebot/Stream applet to:

- Fetch session metadata
- Log streaming stats (SID, duration, recording URL)
- Detect disconnects
- Read escalation flags from the response

Best practices

- Place Passthru immediately after the streaming applet.
- Use clear/mark events for context handling and observability.
- Use the Active Streams API for concurrency limits.
- Design Passthru logic to interpret escalation/disconnect.
- Follow WebSocket timeout, reconnect, and handshake guidelines strictly.
- Keep custom parameters concise and secure.
- Ensure the bot closes the WebSocket connection to end the stream gracefully; the bot does not send an explicit stop event.

Advanced use cases

- STT pipeline using OpenAI Whisper or Google STT
- Bot+LLM conversations via voice
- IVR replacement with NLP flows
- DTMF + speech combo journeys
- Hybrid fallback to live agents via SIP trunk

Summary

Exotel’s Voicebot and Stream applets power modern voice automation by offering developer-first, WebSocket-based audio streaming infrastructure. These applets act as programmable media bridges, streaming audio in real time between Exotel’s telephony platform and any compliant bot platform capable of understanding linear PCM data. By adopting this architecture, enterprises gain:

- Low-latency streaming that supports responsive AI-driven conversations
- Vendor-neutral integration with STT, TTS, LLMs, and custom NLP systems
- Dynamic routing and escalation via Passthru applets and intelligent fallback logic
- Secure, observable, and resilient media sessions with active stream monitoring
- Enterprise-grade call flow design that integrates Exotel CC, SIP trunks, and external agents

This document is a production-ready extension to the Stream/Voicebot applet basic guide and complements the Passthru streaming metadata guide. For deployment, ensure compliance with session lifecycle best practices, chunk size handling (100 ms), parameter limits, recording strategies, and escalation routing logic through WebSocket termination events.
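
The per-event handling described in this guide can be sketched as a small dispatcher on the bot side. This is a minimal illustration, not Exotel's reference implementation: the JSON message shape (an "event" field, audio under media.payload, the digit under dtmf.digit) is an assumption made for the example, and BotSession and handle_message are hypothetical names. Check the actual message schema in the core Stream/Voicebot applet guide before relying on any field name.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class BotSession:
    """Per-call state kept by the bot (hypothetical structure)."""
    connected: bool = False
    streaming: bool = False
    pcm: bytearray = field(default_factory=bytearray)  # decoded linear PCM
    digits: list = field(default_factory=list)         # DTMF keypresses
    context: dict = field(default_factory=dict)        # conversation memory
    finished: bool = False


def handle_message(raw: str, session: BotSession) -> str:
    """Apply one WebSocket text frame to the session; return the event handled.

    Field names below are ASSUMED for illustration only.
    """
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "connected":
        session.connected = True       # initialise STT/TTS pipeline here
    elif event == "start":
        session.streaming = True       # begin buffering audio for STT
    elif event == "media":
        # Base64-decode the chunk to recover raw 16-bit linear PCM bytes.
        session.pcm.extend(base64.b64decode(msg["media"]["payload"]))
    elif event == "dtmf":
        session.digits.append(msg["dtmf"]["digit"])
    elif event == "clear":
        session.context.clear()        # wipe conversational memory
        session.pcm.clear()            # drop audio buffered so far
    elif event == "stop":
        session.streaming = False      # customer leg disconnected
        session.finished = True
    else:
        return "ignored"               # unknown or missing event type
    return event
```

In a real server, handle_message would be called for every text frame received on the WebSocket, and the bot would close the connection itself once its conversation ends, since closing the socket is what moves Exotel to the next applet.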

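The custom parameter rules in the dynamic URL section (at most 3 parameters, at most 256 characters after the ?, and a ws or wss scheme) can be checked before a URL is configured. The helper below is a hypothetical pre-flight check written for this guide, not part of any Exotel SDK, and it assumes the 256-character limit applies to the raw query string exactly as written in the URL.

```python
from urllib.parse import parse_qsl, urlsplit

MAX_PARAMS = 3          # limit stated in this guide
MAX_QUERY_CHARS = 256   # limit stated in this guide (text after the "?")


def validate_stream_url(url: str) -> list[str]:
    """Return a list of rule violations; an empty list means the URL passes."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("ws", "wss"):
        problems.append(f"scheme must be ws or wss, got {parts.scheme!r}")
    if not parts.netloc:
        problems.append("URL has no host")
    # keep_blank_values so that "a=" still counts as a parameter
    params = parse_qsl(parts.query, keep_blank_values=True)
    if len(params) > MAX_PARAMS:
        problems.append(f"{len(params)} parameters, maximum is {MAX_PARAMS}")
    if len(parts.query) > MAX_QUERY_CHARS:
        problems.append(
            f"query is {len(parts.query)} characters, maximum is {MAX_QUERY_CHARS}"
        )
    return problems


if __name__ == "__main__":
    sample = "ws://127.0.0.1:5001/media?param1=value1&param2=value2&param3=value3"
    print(validate_stream_url(sample))
```

Returning a list of problems rather than raising on the first one lets a deployment script report every violation in a single pass.
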