One Agent, Two Voice Paths: Telephony Bridge vs. Browser-Direct
The same AI voice agent, reachable two ways: a server-held telephony bridge or a browser-direct WebSocket. Here's the architecture trade-off.
You build one agent: its persona, its tools, its memory, its knowledge. Then a contact reaches it by voice. The interesting question is how the audio gets there, because there are two completely different answers, and the one you pick shapes the rest of your AI voice agent architecture.
In Matrix, the same agent is reachable two ways:
- Telephony bridge: someone dials a phone number. The server holds a WebSocket to the model and bridges it to the carrier.
- Browser-direct: someone opens a /voice page. The browser holds the WebSocket straight to the model, and the server is nowhere in the audio path.
Both run on the consumer Gemini Live API. Both produce the identical persona, because the prompt is composed the same way regardless of channel. What differs is who holds the socket — and that one decision cascades into latency, security requirements, where tools run, and how you operate the thing. This post is the trade-off, laid out for engineers choosing between them.
Path one: the telephony bridge
A phone call can't speak WebSocket-to-Gemini on its own. The carrier — Exotel, in our case — speaks its own streaming protocol over its own socket. So the server has to sit in the middle and bridge two sockets: Exotel on one side, Gemini Live on the other.
That bridge is a single object per call: CallSession. One CallSession owns one Exotel ⇄ Gemini Live connection for the lifetime of one call. Audio frames flow carrier → server → model and model → server → carrier, with the server transcoding and relaying in between.
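As a rough illustration of that relay, here is a minimal sketch that models each socket as a pair of in-memory queues. The names `pump`, `bridge`, and `demo`, and the stand-in transcoders, are hypothetical and not taken from Matrix's CallSession; the real bridge speaks Exotel's and Gemini Live's wire protocols.

```python
import asyncio


async def pump(source, sink, transcode):
    """Relay frames from source to sink, transcoding each, until end-of-stream."""
    while True:
        frame = await source.get()
        if frame is None:            # None marks end-of-stream in this sketch
            await sink.put(None)     # propagate the close to the other leg
            return
        await sink.put(transcode(frame))


async def bridge(carrier_in, model_out, model_in, carrier_out, to_model, to_carrier):
    """One call: run both directions concurrently until both legs have closed."""
    await asyncio.gather(
        pump(carrier_in, model_out, to_model),     # caller audio toward the model
        pump(model_in, carrier_out, to_carrier),   # model audio toward the caller
    )


def drain(queue):
    """Collect everything currently sitting in a queue."""
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


def demo():
    async def run():
        carrier_in, model_out = asyncio.Queue(), asyncio.Queue()
        model_in, carrier_out = asyncio.Queue(), asyncio.Queue()
        for frame in (b"hello", None):
            carrier_in.put_nowait(frame)
        for frame in (b"REPLY", None):
            model_in.put_nowait(frame)
        # bytes.upper / bytes.lower stand in for real audio transcoding.
        await bridge(carrier_in, model_out, model_in, carrier_out,
                     bytes.upper, bytes.lower)
        return drain(model_out), drain(carrier_out)

    return asyncio.run(run())
```

The point of the shape is that one object owns both directions, so when either leg ends, the close can be propagated to the other side instead of leaving a half-open call.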
The subtle part is how the call learns who it is. An inbound call arrives as bytes on a socket; it doesn't carry a persona. So when Exotel's stream connects, onExotelStart adopts the existing Session row by its callSid and derives everything from it:
- direction — inbound vs. outbound
- agent — which persona answers
- campaign — if this call belongs to an outbound campaign
- objective — the per-call instructions for this contact
That Session row is the unification point. A Session is the platform's single interaction entity (TEXT_CHAT / VOICE_REALTIME / future WHATSAPP) — a chat and a phone call are the same concept, one row. So by the time the model's setup message goes out, the bridge has assembled the full system instruction: persona + objective + contact context + recalled memory.
caller ──dial──▶ Exotel ──WS──▶ CallSession ──WS──▶ Gemini Live
                                     │
                  onExotelStart: adopt Session by callSid
                  → direction · agent · campaign · objective
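The adoption step can be sketched as a lookup keyed by callSid. This assumes the Session row already exists when the stream connects, as the text describes; the field names, the in-memory `SessionStore`, and the error handling are illustrative guesses, not Matrix's schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Session:
    call_sid: str
    direction: str                       # "INBOUND" or "OUTBOUND"
    agent_id: str                        # which persona answers
    campaign_id: Optional[str] = None    # set only for campaign calls
    objective: Optional[str] = None      # per-call instructions, if any


class SessionStore:
    """In-memory stand-in for the table that holds Session rows."""

    def __init__(self) -> None:
        self._by_sid: Dict[str, Session] = {}

    def save(self, session: Session) -> None:
        self._by_sid[session.call_sid] = session

    def find_by_call_sid(self, call_sid: str) -> Optional[Session]:
        return self._by_sid.get(call_sid)


def on_exotel_start(store: SessionStore, call_sid: str) -> Session:
    """Adopt the existing row; the socket itself carries no persona."""
    session = store.find_by_call_sid(call_sid)
    if session is None:
        # Assumption: an unknown callSid is an error rather than a new session.
        raise LookupError(f"no Session row for callSid {call_sid!r}")
    return session
```

Everything the bridge needs (direction, agent, campaign, objective) then comes from the returned row rather than from anything on the wire.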
Because the server is in the path, it can do things the browser path structurally cannot: run tools server-side with full tenant context, call external HTTP tools without browser CORS in the way, and capture both audio legs for recording.
Path two: browser-direct
Now the opposite design. A contact opens the agent's /voice page. Their browser opens a WebSocket directly to Gemini Live. The server's only job is to mint an ephemeral token so the browser can authenticate that socket. After that, the server is out of the loop entirely.
The browser holds the WS direct to Gemini; Spring only mints ephemeral tokens. Zero server in the audio path.
This is the lowest-latency option available, because the audio never makes a round trip through your backend. The trade-off is that the browser is now a first-class participant, which brings two hard requirements.
First, microphone capture requires a secure context. getUserMedia only works over HTTPS (or localhost). LAN IPs over plain HTTP won't grant mic access — which is why local development runs everything behind Caddy with tls internal so any hostname or IP gets a cert.
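To make the secure-context rule concrete, here is a simplified approximation of how a browser decides whether a page may ask for the microphone. It is a sketch only: real browsers follow the Secure Contexts specification, which covers more cases than this, and the function name is made up for illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def is_probably_secure_context(page_url: str) -> bool:
    """Approximate the browser's check: HTTPS, localhost, or a loopback IP."""
    parts = urlsplit(page_url)
    host = (parts.hostname or "").lower()
    if parts.scheme == "https":
        return True
    if parts.scheme != "http":
        return False                      # this sketch only models web pages
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback   # 127.0.0.0/8 or ::1
    except ValueError:
        return False                      # a plain hostname over HTTP
```

Under these rules `http://192.168.1.20/voice` fails while `https://192.168.1.20/voice` passes, which is the gap a locally trusted certificate closes.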
Second, the browser can't reach every tool the server can. Server-side, an agent's external HTTP tools just work. In the browser-direct path, those same calls become cross-origin requests governed by the browser's same-origin policy: unless the external endpoint sends permissive CORS headers, the browser either fails the preflight or refuses to expose the response to the page. So the browser-direct voice flow advertises the agent's own tools and proxies the ones that need server execution (like knowledge search) back through the backend, rather than letting the browser call arbitrary external HTTP endpoints directly.
contact's browser ──WS──▶ Gemini Live
        ▲
        └── one POST to backend: mint ephemeral token, then step aside
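The tool split on this path can be sketched as a small dispatcher. This is one plausible reading of the routing described above, not Matrix's code: the `needs_server` flag, the tool names, and treating an external HTTP tool as server-executed are all assumptions, and a real browser client would be written in JavaScript or TypeScript rather than Python.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Args = Dict[str, Any]
Runner = Callable[[str, Args], Any]


@dataclass(frozen=True)
class Tool:
    name: str
    needs_server: bool   # hypothetical flag: must this tool run on the backend?


def dispatch_tool_call(tool: Tool, args: Args,
                       run_locally: Runner, proxy_to_backend: Runner) -> Any:
    """Route one tool call issued by the model during a browser-direct session."""
    if tool.needs_server:
        # e.g. knowledge search: the browser forwards the call to the backend
        # instead of contacting a third-party endpoint across origins.
        return proxy_to_backend(tool.name, args)
    return run_locally(tool.name, args)


# Illustrative catalogue; names and flags are invented for this sketch.
KNOWLEDGE_SEARCH = Tool("knowledge_search", needs_server=True)
EXTERNAL_CRM = Tool("crm_lookup", needs_server=True)     # assumed to be proxied
END_CALL = Tool("end_call", needs_server=False)
```

Keeping the decision in one place means the model sees the same tool list on both paths, and only the execution site changes.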
The trade-off table
| Dimension | Telephony bridge | Browser-direct |
|---|---|---|
| Reached by | Dialing a phone number (Exotel) | Opening a /voice page |
| Who holds the WS | The server (CallSession) | The browser |
| Server in audio path? | Yes — bridges both legs | No — only mints tokens |
| Latency | One extra hop through your backend | Lowest — browser ↔ model directly |
| Secure-context need | N/A (phone audio) | Required — getUserMedia needs HTTPS |
| Where tools execute | Server-side, full tenant context | Agent's own tools; server-proxied for the rest |
| External HTTP tools | Work directly server-side | Subject to browser CORS |
| Call recording | Captures both legs server-side | Browser captures + uploads |
| Best for | Real phone calls, outbound campaigns | In-app / web voice, lowest latency |
The headline: browser-direct is faster and cheaper to run; the telephony bridge is more capable and is the only path that touches the actual phone network. Neither is "better" — they answer different questions.
Where call recording fits
Recording is available on both channels but ships dark behind a flag (MATRIX_RECORDING_ENABLED, off by default — turn it on deliberately, per your compliance posture). The two paths capture differently because of who holds the socket:
- Telephony bridge: the server already sees both audio legs, so it mixes them into one mono WAV and uploads it to object storage.
- Browser-direct: the server isn't in the path, so the browser captures the audio and uploads it.
Either way the recording lands in the same place and shows up in the same dashboard — the path it came from is an implementation detail by the time you go to play it back.
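For the telephony side, mixing two legs into one mono WAV can be sketched with the Python standard library. This assumes both legs are already 16-bit little-endian mono PCM at the same sample rate; the 8000 Hz default and the function name are assumptions for illustration, and upload to object storage is left out.

```python
import io
import struct
import wave


def mix_to_mono_wav(caller_pcm: bytes, agent_pcm: bytes, sample_rate: int = 8000) -> bytes:
    """Sum two PCM16 mono legs sample by sample and wrap the result in a WAV."""
    total = max(len(caller_pcm), len(agent_pcm)) // 2

    def samples(pcm: bytes):
        count = len(pcm) // 2
        values = struct.unpack(f"<{count}h", pcm[:count * 2])
        return values + (0,) * (total - count)     # pad the shorter leg with silence

    mixed = [
        max(-32768, min(32767, a + b))             # clamp to the int16 range
        for a, b in zip(samples(caller_pcm), samples(agent_pcm))
    ]

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(mixed)}h", *mixed))
    return buffer.getvalue()
```

Clamping is the simplest way to avoid integer wrap-around when both parties speak at once; a production mixer might attenuate each leg instead.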
Operating the telephony stream: isolation matters
A server-held bridge has an operational property a stateless API doesn't: it's a long-lived WebSocket. Every time you redeploy the service holding it, that socket is torn down and the live call drops. Worse, a carrier may free a streaming slot only when the socket closes cleanly, so an ungraceful kill can strand the slot.
That matters more than it sounds, because a basic carrier account may permit only one concurrent streaming connection. With a single slot, one stranded connection can block every subsequent inbound call until it clears. (This is also why outbound campaigns on such an account pace at maxConcurrent=1 — you can't run more concurrent calls than you have stream slots, regardless of what your dialer would like to do.)
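The pacing constraint amounts to a semaphore sized to the number of stream slots. The sketch below assumes an asyncio-style dialer; `run_campaign`, `place_call`, and `demo` are hypothetical names, and only the `maxConcurrent=1` idea comes from the text.

```python
import asyncio


async def run_campaign(contacts, place_call, max_concurrent=1):
    """Dial every contact, never holding more live calls than stream slots."""
    slots = asyncio.Semaphore(max_concurrent)

    async def dial(contact):
        async with slots:                 # wait here until a slot is free
            return await place_call(contact)

    return await asyncio.gather(*(dial(c) for c in contacts))


def demo(max_concurrent=1):
    """Run a fake campaign and report the peak number of simultaneous calls."""
    active = 0
    peak = 0

    async def fake_call(contact):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0)            # yield, as a real call would while streaming
        active -= 1
        return contact

    results = asyncio.run(run_campaign(["a", "b", "c"], fake_call, max_concurrent))
    return results, peak
```

With one slot the calls run strictly one after another, however many contacts are queued.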
Matrix addresses this with two layers of defense. The first is a graceful-shutdown handler that, on SIGTERM, cleanly closes active sessions so the carrier releases the slot even when you do redeploy. The second, in place as groundwork, is the option to run telephony as its own rarely redeployed service (deploy-voicebot.sh), so daily backend deploys never touch live calls at all. Both services run the same image; the split is deployment topology, not a code fork. The campaign scheduler must run in exactly one service to avoid double-dispatching the same contact, so it moves with telephony.
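The first layer can be sketched as a registry of live calls plus a SIGTERM hook. This is an illustration in Python, not Matrix's handler; `ActiveCalls`, `install_sigterm_handler`, and the session `close(reason)` method are assumed names.

```python
import signal


class ActiveCalls:
    """Tracks live call sessions so they can be closed cleanly on shutdown."""

    def __init__(self):
        self._sessions = set()

    def register(self, session):
        self._sessions.add(session)

    def unregister(self, session):
        self._sessions.discard(session)

    def close_all(self, reason):
        """Close every session; return how many closed without error."""
        closed = 0
        for session in list(self._sessions):
            try:
                session.close(reason)     # assumed to send a clean WebSocket close
                closed += 1
            except Exception:
                pass                      # one bad socket must not block the rest
            self._sessions.discard(session)
        return closed


def install_sigterm_handler(calls):
    """Close active calls when the process is asked to stop (main thread only)."""
    def handle(signum, frame):
        calls.close_all("server shutting down")

    signal.signal(signal.SIGTERM, handle)
    return handle
```

A service would call `install_sigterm_handler(calls)` once at startup. The hook only helps if the deploy tooling sends SIGTERM and waits before escalating to a hard kill.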
The browser-direct path has none of this operational weight — there's no server socket to drop. That's the flip side of "the server is in the audio path": the bridge buys capability, and it costs you a thing you have to operate carefully.
One agent, composed once
The reason both paths produce the same agent is that voice is a transport, not a brain. The persona, tools, knowledge, and memory are composed the same way no matter how the contact arrives — that's a deliberate parity invariant across chat, voice, and autonomous tasks. The telephony bridge and the browser-direct page differ in who carries the bytes, not in who the agent is. Pick the carrier based on the trade-off table; the agent doesn't change.
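One way to picture that invariant is a composer that has no channel parameter at all, so neither transport can influence the prompt. The section titles, field names, and `compose_system_instruction` below are illustrative assumptions; only the ingredients (persona, objective, contact context, recalled memory) come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PromptInputs:
    persona: str
    objective: Optional[str] = None
    contact_context: Optional[str] = None
    recalled_memory: Tuple[str, ...] = ()


def compose_system_instruction(inputs: PromptInputs) -> str:
    """Build the system instruction; note the absence of any channel argument."""
    sections = [
        ("Persona", inputs.persona),
        ("Objective", inputs.objective),
        ("Contact", inputs.contact_context),
        ("Memory", "\n".join(f"- {item}" for item in inputs.recalled_memory)),
    ]
    # Empty sections are dropped so a call without an objective stays clean.
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections if body)
```

Because the function cannot see the channel, parity holds by construction rather than by convention.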
Takeaway
If you're choosing a voice architecture: use browser-direct when the contact is already in a browser and you want the lowest possible latency with the least operational overhead — the server only mints a token. Use the telephony bridge when you need a real phone number, server-side tool execution, or outbound campaigns — and budget for the operational care a long-lived carrier socket demands. The same agent answers either way.
For the war stories of getting Gemini Live to work over a phone line at all, read We Put Gemini Live on a Phone Line. For turning the telephony bridge into a paced dialer, read Outbound AI Calling Campaigns That Don't Sound Like Robocalls.
Build a voice agent both ways. Spin up a workspace, create one agent, then reach it from the /voice page and over the phone — same persona, two transports. Start at the agents dashboard or POST /api/orgs/{slug}/agents, and see docs/ARCHITECTURE.md for the full voice layer.