Voice AI Agents 2026: Building Real-Time Speech Agents
Posted on: 6/5/2026 1:14:29 AM
Table of contents
- 1. Why voice agents exploded in 2026
- 2. Two architectures: cascading vs speech-to-speech
- 3. The latency budget — where the time actually goes
- 4. Turn-taking and barge-in: teaching a machine to converse
- 5. Inside a production voice agent
- 6. Function calling during a live call
- 7. The 2026 voice tooling landscape
- 8. The hard problems that aren't the LLM
- 9. Measuring a voice agent: the metrics that matter
- 10. The role shift: from script writers to conversation designers
- 11. How voice agents evolved
- 12. Common mistakes to avoid
- 13. Conclusion
You call a support line. You finish your sentence, and for one full second there is silence — then the agent starts talking, right as you begin to add "...oh, and one more thing." It talks over you. You both stop. It keeps going anyway, finishing a thought you already abandoned. The conversation feels broken, and you can tell instantly that you are talking to a machine. Now imagine the same call where the reply lands in under 300 milliseconds, the agent stops the moment you cut in, and it picks up exactly where you redirected it. The difference between those two calls is not the language model — it is voice engineering, and in 2026 it has become its own discipline.
Text agents have the luxury of time. A voice agent does not. Human conversation runs on a turn-taking rhythm measured in milliseconds, and the brain notices delay long before it can name it. This article dissects how real-time voice agents actually work in 2026: the two competing architectures, where the latency really goes, how machines learn to take turns and handle being interrupted, the production tooling stack, and why the hardest problems in voice AI have almost nothing to do with the LLM.
1. Why voice agents exploded in 2026
Voice interfaces are not new: IVR phone trees have existed for decades, and voice assistants since the 2010s. What changed is that the three things voice always lacked finally arrived together. First, LLMs made open-ended conversation possible: a voice agent can now handle "actually, can you check if my other order shipped too?" instead of "press 2 for billing." Second, latency reached the perceptual threshold: streaming models, faster inference, and speech-native models pushed round-trip time down to roughly 300ms, the point where a conversation stops feeling robotic. Third, orchestration frameworks matured: Pipecat reached v1.0 in April 2026, and LiveKit Agents shipped adaptive interruption handling, so the plumbing that used to take a team months is now a library.
The result is that voice agents moved from gimmick to infrastructure: appointment scheduling, outbound sales qualification, healthcare intake, drive-through ordering, technical support. Anywhere a phone call or a microphone sits between a human and a system, a voice agent can now stand in.
The fundamental constraint: conversation is real-time
A text agent can think for three seconds and nobody minds. A voice agent that pauses three seconds before answering feels broken — humans interpret silence as confusion, disconnection, or rudeness. Every architectural decision in voice AI is downstream of one brutal fact: you are racing a 300ms clock on every single turn, and the clock starts the instant the user stops talking.
2. Two architectures: cascading vs speech-to-speech
There are exactly two ways to build a voice agent in 2026, and choosing between them is the most consequential decision you will make.
The cascading pipeline (also called turn-based) chains three separate models: speech-to-text (STT/ASR) transcribes what the user said, an LLM reasons over the transcript and produces a text reply, and text-to-speech (TTS) speaks it back. The speech-to-speech (S2S) approach uses a single multimodal model that ingests audio and emits audio directly, with no intermediate text — preserving tone, emphasis, and prosody that text throws away.
flowchart LR
subgraph C[Cascading Pipeline]
direction LR
U1[User audio] --> VAD1[VAD +<br/>endpointing]
VAD1 --> STT[STT / ASR]
STT --> LLM[LLM<br/>reasoning]
LLM --> TTS[TTS]
TTS --> O1[Agent audio]
end
subgraph S[Speech-to-Speech]
direction LR
U2[User audio] --> M[Single multimodal<br/>S2S model]
M --> O2[Agent audio]
end
style C fill:#16213e,stroke:#fff,color:#fff
style S fill:#0f3460,stroke:#fff,color:#fff
style LLM fill:#e94560,stroke:#fff,color:#fff
style M fill:#e94560,stroke:#fff,color:#fff
Cascading chains three swappable models with text in the middle; speech-to-speech collapses everything into one audio-native model.
The trade-off is real and it does not have a universal winner. Speech-to-speech wins on naturalness and latency — it hears laughter, hesitation, and sarcasm, and it can respond in ~320ms because there is no pipeline to traverse. Cascading wins on control, observability, and cost — you choose exactly which LLM reasons, you can read and log the transcript, you can inject business logic between transcription and response, and you can swap any component without re-architecting.
| Dimension | Cascading (STT→LLM→TTS) | Speech-to-Speech (S2S) |
|---|---|---|
| Latency | Higher — sum of three models (600ms–1.7s naive, ~500ms tuned) | Lower — single model (~320ms best in class) |
| Naturalness | Loses prosody/emotion at the text bottleneck | Preserves tone, emphasis, laughter, hesitation |
| Control over reasoning | Full — pick any LLM, inject logic mid-pipeline | Limited — reasoning is baked into the model |
| Observability | High — text transcript at every stage | Low — no intermediate text to log/audit |
| Vendor lock-in | Low — mix and match providers | High — tied to one vendor's S2S model |
| Best for | Telephony, compliance, complex tool use, cost control | Consumer conversation, naturalness-first UX |
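The "control" and "mix and match" rows of the table are easier to see in code. The following is a minimal sketch, not any framework's API: `CascadingPipeline`, the `fake_*` stages, and `compliance_policy` are hypothetical names, and the stages are toy stand-ins for real STT, LLM, and TTS providers.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Each stage is a plain callable, so any provider can be swapped in.
Transcriber = Callable[[bytes], str]      # audio in   -> transcript
Reasoner = Callable[[str], str]           # transcript -> reply text
Synthesizer = Callable[[str], bytes]      # reply text -> audio out
Policy = Callable[[str], Optional[str]]   # transcript -> canned reply, or None


@dataclass
class CascadingPipeline:
    stt: Transcriber
    llm: Reasoner
    tts: Synthesizer
    policy: Optional[Policy] = None
    log: List[str] = field(default_factory=list)

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        self.log.append(f"user: {transcript}")   # the text stage is observable
        # Business logic runs between transcription and response.
        reply = self.policy(transcript) if self.policy else None
        if reply is None:                        # no rule fired: ask the LLM
            reply = self.llm(transcript)
        self.log.append(f"agent: {reply}")
        return self.tts(reply)


# Toy stand-ins (assumptions for illustration only).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def compliance_policy(transcript: str) -> Optional[str]:
    if "card number" in transcript.lower():
        return "I can't take card details on this line."
    return None
```

Swapping a provider means passing a different callable; the transcript log and the policy hook are exactly the pieces a speech-to-speech model does not expose.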
What most production teams actually choose
In 2026, the majority of production deployments still run cascading pipelines, for one reason: control. Teams need to decide which LLM handles reasoning, which voice the user hears, and what compliance/business logic runs between transcription and response — especially in regulated domains like healthcare and finance. Speech-to-speech is winning consumer-facing, naturalness-first products, but "I need to read the transcript and route this to a tool" still pushes most enterprises toward the cascade.
3. The latency budget — where the time actually goes
The single most useful mental model in voice AI is the latency budget: a fixed amount of time — roughly 300ms for a snappy experience — that every component must share. The counterintuitive truth is that STT and TTS are not where the time goes. The two real culprits are turn-taking (deciding the user actually finished) and LLM time-to-first-token.
| Stage | Typical cost | Notes |
|---|---|---|
| Network round-trip | 30–80ms | WebRTC; PSTN telephony eats far more |
| Speech-to-text | 100–300ms | Streaming STT is fast; runs while the user speaks |
| Endpoint / turn detection | 500–1000ms+ | The silent killer — waiting to be sure the user stopped |
| LLM time-to-first-token | 350–1000ms | The biggest controllable variable |
| Text-to-speech (first audio) | 90–200ms | Streaming TTS emits the first chunk fast |
Notice that endpoint detection can cost more than every other stage combined. If you wait a full second of silence to be certain the user is done, you have already blown the budget before the LLM even starts. This is why turn detection is the hardest problem in voice AI — and the next section is dedicated to it.
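To make the arithmetic concrete, here is a small sketch that adds up the ranges from the table. The figures are copied from the table; the open-ended "1000ms+" for endpointing is capped at 1000 here, and the naive sum ignores overlap (streaming STT runs while the user is still speaking), so treat it as a worst-case serial view rather than a measurement.

```python
# Stage costs in milliseconds as (low, high), copied from the table above.
STAGES = {
    "network_round_trip": (30, 80),
    "speech_to_text": (100, 300),
    "endpoint_detection": (500, 1000),   # table says "1000ms+"; capped here
    "llm_first_token": (350, 1000),
    "tts_first_audio": (90, 200),
}
TARGET_MS = 300


def total(stages):
    """Sum the low and high ends of every stage, assuming no overlap."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = total(STAGES)
others = {name: cost for name, cost in STAGES.items() if name != "endpoint_detection"}
others_low, others_high = total(others)

print(f"serial total: {low}-{high} ms against a {TARGET_MS} ms target")
print(f"everything except endpointing: {others_low}-{others_high} ms")
```

The serial total comes to 1070-2580ms, and a 1000ms endpointing wait exceeds the 570ms low end of all other stages combined, which is the point the paragraph above makes.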
The transport layer quietly decides your budget
With WebRTC you have roughly 240–270ms left for STT + LLM + TTS after transport overhead. On a PSTN phone call, transport can eat the entire budget, leaving 0–100ms — making the 300ms target physically impossible. WebRTC saves 150–700ms versus a traditional phone call. If you are building voice over the telephone network, you are not playing the same game; you must relax your latency target (≈800ms is tolerable for healthcare, <600ms for outbound sales) and design around it.
4. Turn-taking and barge-in: teaching a machine to converse
Humans are astonishingly good at knowing when it is their turn to speak. We use silence, intonation, grammar, and breathing as cues, and we overlap and interrupt gracefully. Machines have none of this for free. Two problems define conversational voice: knowing when the user is done (endpointing) and handling being cut off (barge-in).
4.1. End-of-turn detection
The naive approach is a silence timer: wait 800–1200ms of silence after the last word, then assume the user is done. But pure silence is a terrible signal — people pause mid-sentence to think ("I'd like to book... a table for four"), and a dumb timer will interrupt them. The 2026 answer is semantic endpointing: combine a Voice Activity Detection (VAD) silence threshold with a model that judges whether the sentence is semantically complete. "I'd like to book a table for" is grammatically unfinished — wait. "I'd like to book a table for four" is complete — respond. Good endpointing is the difference between an agent that feels patient and one that constantly cuts you off.
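A rough sketch of that decision follows. The completeness check is a deliberately crude word-list heuristic standing in for a real semantic model, and the names (`EndpointConfig`, `looks_complete`, `should_respond`) and the 300ms minimum are assumptions for illustration; only the 1200ms ceiling comes from the silence-timer range above.

```python
from dataclasses import dataclass

# Toy stand-in for a semantic completeness model: an utterance ending in one
# of these words is treated as unfinished. Hypothetical list, illustration only.
DANGLING_WORDS = {
    "a", "an", "the", "for", "to", "and", "or", "but",
    "with", "of", "at", "on", "in", "my",
}


@dataclass
class EndpointConfig:
    min_silence_ms: int = 300    # assumed: never respond faster than this
    max_silence_ms: int = 1200   # give up waiting after this much silence


def looks_complete(transcript: str) -> bool:
    words = transcript.lower().rstrip(".?!, ").split()
    return bool(words) and words[-1] not in DANGLING_WORDS


def should_respond(transcript: str, silence_ms: int,
                   cfg: EndpointConfig = EndpointConfig()) -> bool:
    if silence_ms >= cfg.max_silence_ms:
        return True                    # long silence: respond regardless
    if silence_ms < cfg.min_silence_ms:
        return False                   # too soon to judge
    return looks_complete(transcript)  # in between: let semantics decide
```

With 500ms of silence, `should_respond("I'd like to book a table for", 500)` waits and `should_respond("I'd like to book a table for four", 500)` responds. The word list would misjudge "I'd like to book", which is why production systems use a trained model rather than a heuristic like this one.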
4.2. Barge-in: handling interruptions
When the agent is speaking and the user starts talking, the agent must stop instantly — just like a polite human. This sounds trivial and is brutally hard, because four things must happen near-simultaneously the moment barge-in is detected.
flowchart TD
A[Agent is speaking] --> B{VAD detects user<br/>speech above threshold<br/>for min window?}
B -- No --> A
B -- Yes: BARGE-IN --> C[1. Stop TTS playback<br/>immediately]
C --> D[2. Cancel in-flight<br/>TTS generation]
D --> E[3. Cancel LLM<br/>generation in progress]
E --> F[4. Reset stream state<br/>discard the old turn]
F --> G[Listen to the<br/>user's new input]
G --> H[Start a fresh turn]
style A fill:#e94560,stroke:#fff,color:#fff
style B fill:#fff3e0,stroke:#ff9800,color:#2c3e50
style C fill:#16213e,stroke:#fff,color:#fff
style D fill:#16213e,stroke:#fff,color:#fff
style E fill:#16213e,stroke:#fff,color:#fff
style F fill:#16213e,stroke:#fff,color:#fff
style H fill:#4CAF50,stroke:#fff,color:#fff
Barge-in is a four-step teardown. Miss any step and the agent talks over the user or "finishes the old thought" after being interrupted.
If any of those four steps is missing, the failure is immediately audible: the agent keeps talking over the user, or it goes silent then suddenly resumes a sentence the user already moved past. Modern frameworks now ship tuned barge-in — LiveKit Agents reports adaptive interruption handling at roughly 86% precision and 100% recall — but the tuning matters: too sensitive and a cough cancels the agent mid-sentence; too lax and it ignores genuine interruptions.
Why barge-in needs cancellation, not just muting
A naive implementation just mutes the speaker. But the LLM is still generating, the TTS is still synthesizing, and tokens are still being buffered. If you only mute, the moment the user finishes their interruption the agent dumps the entire pre-interruption response — confusing and robotic. True barge-in cancels the in-flight LLM and TTS work and discards the buffered audio, so the agent genuinely abandons its old turn and responds to what the user actually just said.
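The difference between muting and cancelling can be sketched with plain `asyncio`. This is an illustration under assumptions, not how any particular framework implements it: `AgentTurn`, `fake_generation`, and `demo` are hypothetical names, and the generation loop is a stand-in for real LLM and TTS streams.

```python
import asyncio
from typing import List, Optional, Tuple


class AgentTurn:
    """Holds the in-flight work for one agent turn so it can be torn down."""

    def __init__(self) -> None:
        self.llm_task: Optional[asyncio.Task] = None
        self.tts_task: Optional[asyncio.Task] = None
        self.audio_buffer: List[bytes] = []   # synthesized but not yet played
        self.playing = False

    async def barge_in(self) -> None:
        # 1. Stop playback immediately.
        self.playing = False
        # 2 and 3. Cancel in-flight TTS and LLM generation.
        tasks = [t for t in (self.tts_task, self.llm_task) if t is not None]
        for task in tasks:
            task.cancel()
        # Wait until the cancellations have landed; collect, don't raise.
        await asyncio.gather(*tasks, return_exceptions=True)
        # 4. Reset stream state: discard audio from the abandoned turn.
        self.audio_buffer.clear()
        self.llm_task = self.tts_task = None


async def fake_generation(turn: AgentTurn) -> None:
    # Stand-in for LLM/TTS work that keeps producing audio chunks.
    while True:
        await asyncio.sleep(0.01)
        turn.audio_buffer.append(b"chunk")


async def demo() -> Tuple[AgentTurn, List[asyncio.Task]]:
    turn = AgentTurn()
    turn.playing = True
    turn.llm_task = asyncio.create_task(fake_generation(turn))
    turn.tts_task = asyncio.create_task(fake_generation(turn))
    tasks = [turn.llm_task, turn.tts_task]
    await asyncio.sleep(0.05)   # the agent is mid-sentence
    await turn.barge_in()       # the user starts talking
    return turn, tasks
```

A mute-only version would set `playing = False` and stop there, leaving both tasks running and the buffer filling, which is exactly the "dumps the old response" failure described above.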
5. Inside a production voice agent
Assemble the pieces and a real voice agent has a clear anatomy. The orchestrator sits at the center, coordinating a tight real-time loop between the transport layer and the model stack.
flowchart TB
subgraph T[Transport Layer]
WRTC[WebRTC / WebSocket<br/>streaming audio in-out]
end
subgraph ORCH[Orchestrator - the real-time loop]
VAD[VAD + semantic<br/>endpointing]
INT[Interruption /<br/>barge-in handler]
CTX[Conversation state<br/>+ context]
end
subgraph MODELS[Model Stack]
STT[STT streaming]
LLM[LLM + tool calling]
TTS[TTS streaming]
end
subgraph TOOLS[Tools / Backend]
API[CRM / DB / booking<br/>function calls]
end
WRTC --> VAD
VAD --> STT
STT --> CTX
CTX --> LLM
LLM --> TTS
TTS --> WRTC
LLM <--> API
INT -. cancels .-> LLM
INT -. cancels .-> TTS
WRTC -. user speech .-> INT
style T fill:#2c3e50,stroke:#fff,color:#fff
style ORCH fill:#16213e,stroke:#fff,color:#fff
style MODELS fill:#0f3460,stroke:#fff,color:#fff
style TOOLS fill:#e94560,stroke:#fff,color:#fff
The orchestrator owns timing and interruption; the model stack does the heavy lifting; tools connect the call to real business systems.
- Transport layer: streams audio bidirectionally over WebRTC (browser/app) or a WebSocket bridge (telephony). This is where your latency floor is set.
- Orchestrator: the brain of timing — runs VAD and semantic endpointing, detects barge-in, holds conversation state, and decides when to hand off between listening and speaking. This is the part frameworks like Pipecat and LiveKit Agents give you.
- Model stack: streaming STT, an LLM with tool-calling, and streaming TTS — or a single S2S model replacing all three.
- Tools / backend: the LLM calls real functions mid-conversation — look up an order, book a slot, check inventory — which is what separates a useful agent from a chatbot that just talks.
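One way to picture the orchestrator's hand-off between listening and speaking is as a small state machine. The sketch below is an assumption about how such a loop could be modeled; the state and event names are hypothetical and do not come from Pipecat or LiveKit Agents.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()   # streaming user audio into VAD and STT
    THINKING = auto()    # end of turn detected, LLM is generating
    SPEAKING = auto()    # TTS audio is being played to the user


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.LISTENING, "end_of_turn"): State.THINKING,
    (State.THINKING, "first_audio_ready"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.LISTENING,
    (State.THINKING, "user_speech"): State.LISTENING,   # barge-in before the reply starts
    (State.SPEAKING, "user_speech"): State.LISTENING,   # barge-in mid-reply
}


def step(state: State, event: str) -> State:
    """Return the next state; events that do not apply leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Both barge-in transitions land back in `LISTENING`, which is where the cancellation of in-flight LLM and TTS work would be triggered.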
6. Function calling during a live call
A voice agent that only talks is a podcast. The value comes when it acts, and tool calls during a live voice conversation introduce a latency problem text agents never face. When the user asks "has my order shipped?", the agent must call your backend, wait for the response, and reply, but it cannot just go silent for two seconds while the API responds.
The standard 2026 pattern is the filler-while-fetching technique: the moment a tool call starts, the agent emits a short natural acknowledgment ("Let me check that for you...") to fill the dead air, runs the function call in parallel, and stitches the real answer in when it returns. This mirrors exactly what a human agent does when they say "one second while I pull that up." Combined with streaming the LLM's response token-by-token into TTS, it keeps the conversation feeling alive even when real work is happening underneath.
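In `asyncio` terms the pattern is: start the tool call as a task, speak the filler while it runs, then await the result. The sketch below assumes hypothetical `lookup_order` and `speak` coroutines with sleeps standing in for a backend call and for TTS playback; the `events` list exists only to show the ordering.

```python
import asyncio
from typing import Any, Callable, Coroutine, List

events: List[str] = []   # records what happened, in order


async def answer_with_filler(
    tool_call: Callable[[], Coroutine[Any, Any, str]],
    speak: Callable[[str], Coroutine[Any, Any, None]],
    filler: str = "Let me check that for you...",
) -> None:
    task = asyncio.create_task(tool_call())   # backend call starts right away
    await speak(filler)                       # filler plays while the call runs
    result = await task                       # usually (nearly) done by now
    await speak(result)


async def lookup_order() -> str:
    # Hypothetical backend call; the sleep stands in for API latency.
    events.append("tool:start")
    await asyncio.sleep(0.2)
    events.append("tool:done")
    return "Your order shipped yesterday."


async def speak(text: str) -> None:
    # Hypothetical TTS playback; the sleep stands in for audio duration.
    await asyncio.sleep(0.05)
    events.append(f"said:{text}")


if __name__ == "__main__":
    asyncio.run(answer_with_filler(lookup_order, speak))
    print(events)
```

The recorded order shows the filler finishing while the tool call is still in flight, so the caller never hears dead air during the lookup.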
Stream everything, buffer nothing you don't have to
The golden rule of voice latency: never wait for a complete output when you can start emitting a partial one. STT streams partial transcripts as the user speaks. The LLM streams tokens as it reasons. TTS starts speaking the first sentence while the LLM is still generating the second. Each stream overlaps the next, so the user hears the first words of the reply long before the full response exists. A pipeline that processes each stage to completion before starting the next will always feel sluggish, no matter how fast each individual model is.
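A minimal sketch of that overlap, using async generators: tokens are grouped into sentences, and each sentence is handed off as soon as it is complete instead of waiting for the whole reply. The token list, `llm_tokens`, and `speak_streaming` are illustrative stand-ins; a real agent would pass each sentence to a streaming TTS running in its own task.

```python
import asyncio
from typing import AsyncIterator, List

SENTENCE_END = (".", "?", "!")
events: List[str] = []   # interleaved record of LLM tokens and TTS hand-offs


async def llm_tokens() -> AsyncIterator[str]:
    # Stand-in for a streaming LLM: yields one token at a time.
    for token in ["Your", " order", " shipped", ".", " It", " arrives", " Friday", "."]:
        await asyncio.sleep(0)
        events.append(f"llm:{token.strip()}")
        yield token


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group a token stream into sentences, emitting each as soon as it ends."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()   # flush a trailing partial sentence


async def speak_streaming() -> List[str]:
    spoken = []
    async for sentence in sentences(llm_tokens()):
        events.append(f"tts:{sentence}")   # hand-off point to streaming TTS
        spoken.append(sentence)
    return spoken


if __name__ == "__main__":
    print(asyncio.run(speak_streaming()))
    print(events)
```

In the event log, the first sentence reaches the TTS hand-off before the LLM has produced the tokens of the second one, which is the overlap that hides latency.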
7. The 2026 voice tooling landscape
The ecosystem split into clear layers: orchestration frameworks, realtime transport platforms, and model providers (STT, TTS, S2S).
| Layer | Examples | What it gives you |
|---|---|---|
| Orchestration framework | Pipecat (v1.0, Apr 2026), LiveKit Agents | The real-time loop: VAD, endpointing, barge-in, pipeline wiring, provider plug-ins |
| Realtime transport | LiveKit, WebRTC infra, telephony bridges | Low-latency audio streaming, phone-number integration, scaling concurrent calls |
| Speech-to-speech APIs | Vendor realtime/S2S models | Single-model audio-in/audio-out at ~320ms, prosody-preserving |
| STT / TTS providers | Streaming ASR + neural TTS vendors | The swappable components of a cascading pipeline |
| Voice agent platforms | Hosted end-to-end products | Build/deploy/monitor with less plumbing; trade control for speed-to-market |
Build or buy?
Buy a hosted platform if you need a working voice agent in days and your use case is standard (scheduling, FAQ, qualification). Build on an orchestration framework like Pipecat or LiveKit Agents when you need fine control over the model stack, custom tool integrations, on-prem/compliance constraints, or per-call cost optimization at scale. The framework layer is the sweet spot for most engineering teams in 2026: it hands you the brutal real-time plumbing (barge-in, endpointing) while leaving the model and business-logic choices in your hands.
8. The hard problems that aren't the LLM
The counterintuitive lesson of building voice agents is that the language model is rarely the bottleneck. The failures that wreck production are almost all in the audio and timing layers.
1. Transcription errors poison everything downstream
If STT mishears "I want to cancel" as "I want to council," the LLM reasons over garbage and confidently does the wrong thing. Accents, background noise, crosstalk, and domain jargon (drug names, product SKUs) all degrade ASR. A voice agent is only as good as its worst transcription.
2. Hallucinated transcripts on silence
Some ASR models hallucinate words during silence or background noise — inventing a phrase the user never said, which the LLM then dutifully responds to. Guarding against phantom input is a real production concern.
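One common-sense guard is to cross-check each transcript against what the VAD actually heard before passing it to the LLM. The sketch below assumes the ASR exposes a confidence score and that the VAD reports how much of the segment was speech; the `Segment` fields and both thresholds are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    confidence: float   # 0..1 score from the ASR (assumed to be available)
    speech_ms: int      # how much of the segment the VAD marked as speech


def accept_transcript(seg: Segment,
                      min_confidence: float = 0.6,
                      min_speech_ms: int = 200) -> bool:
    """Return True only if the transcript is plausibly real user speech."""
    if not seg.text.strip():
        return False                      # nothing was said
    if seg.speech_ms < min_speech_ms:
        return False                      # words without speech: likely phantom
    return seg.confidence >= min_confidence
```

A confident transcript with no detected speech behind it is rejected, which is the case the paragraph above describes.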
3. Endpointing is never "solved"
Every domain has different pause patterns. An elderly caller speaks with long pauses; a frustrated one talks fast and interrupts. A single endpointing threshold cannot serve both, and getting it wrong means either cutting people off or feeling sluggish.
4. Cost compounds per minute
Unlike a one-shot text query, a voice call burns STT + LLM + TTS continuously for its entire duration. A ten-minute call is ten minutes of three models running. At scale, per-minute economics — not per-query — decide whether the product is viable, which is a major reason teams stay on controllable cascading pipelines.
9. Measuring a voice agent: the metrics that matter
Five metrics define voice agent quality, and none of them is "LLM accuracy" in isolation.
| Metric | What it measures | Why it matters |
|---|---|---|
| TTFT (time to first token/audio) | Latency from end-of-user-speech to start-of-agent-speech | The single most felt number — this is "responsiveness" |
| WER (word error rate) | Transcription accuracy of the STT stage | Sets the ceiling on everything downstream |
| RTF (real-time factor) | Processing time relative to audio duration | Determines whether you can keep up with the stream live |
| Interruption precision/recall | How accurately barge-in fires (and doesn't) | Decides whether the agent feels polite or pushy |
| Task completion rate | % of calls that achieve the user's goal | The real business KPI — latency means nothing if the task fails |
Don't optimize latency in a vacuum
A 200ms agent that mishears half its inputs is worse than a 400ms agent that gets them right. Latency and accuracy trade off — pushing endpointing to be ultra-fast means cutting people off more often. Always read TTFT alongside WER and task completion, never alone.
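Three of these metrics are simple to compute once you have the raw data. The sketch below uses the standard definitions: WER as word-level edit distance divided by reference length, RTF as processing time over audio duration, and precision/recall from counts of true positives, false positives, and false negatives. The function names and the example counts are illustrative.

```python
from typing import Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Below 1.0 means the stage keeps up with live audio."""
    return processing_s / audio_s


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For the mishearing from earlier, `word_error_rate("I want to cancel", "I want to council")` is 0.25: one wrong word in four, and it happens to be the one that matters, which is why WER should be read next to task completion rather than on its own.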
10. The role shift: from script writers to conversation designers
Voice AI changes what the team around it actually does. The old IVR world was built by writing rigid call flows — "if the user presses 1, go to node B." The agentic voice world replaces that with conversation design and policy: defining the agent's persona and tone, the tools it may call, the escalation rules for handing off to a human, and the compliance guardrails for what it must and must not say.
For the project manager or product owner, voice introduces governance work that text never required: which calls are recorded and how transcripts are retained, how the agent identifies itself as an AI (a legal requirement in many jurisdictions), what happens when it cannot understand the caller, and how to measure not just deflection rate but customer trust. The voice agent becomes a product with an SLA on latency, an owner for the conversation design, and an audit trail, not a script someone wrote once and forgot.
The new central artifact: the conversation spec
Just as runbooks became shared property between SRE and agent, the conversation spec — persona, allowed tools, escalation triggers, compliance phrases, fallback behavior — becomes the artifact that product, engineering, and compliance co-own. Engineers wire the pipeline; product writes the persona and flows; compliance reviews the guardrails. Voice is where these three disciplines finally have to share one document.
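A conversation spec can be as simple as a typed, versioned config object that all three groups review. The sketch below is one possible shape, built from the fields named above; the class, the method names, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ConversationSpec:
    persona: str
    allowed_tools: Tuple[str, ...]
    escalation_triggers: Tuple[str, ...]   # phrases that hand off to a human
    compliance_phrases: Tuple[str, ...]    # things the agent must say
    fallback: str                          # what to say when it cannot understand

    def may_call(self, tool: str) -> bool:
        return tool in self.allowed_tools

    def should_escalate(self, transcript: str) -> bool:
        text = transcript.lower()
        return any(trigger in text for trigger in self.escalation_triggers)


# Example values are invented for illustration.
SPEC = ConversationSpec(
    persona="Friendly, concise scheduling assistant",
    allowed_tools=("lookup_order", "book_slot"),
    escalation_triggers=("speak to a human", "complaint"),
    compliance_phrases=("You are speaking with an AI assistant.",),
    fallback="Sorry, I didn't catch that. Could you say it again?",
)
```

Keeping the spec as data rather than prose lets engineering enforce it (for example, refusing any tool call where `SPEC.may_call(name)` is false) while product and compliance review the same file.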
11. How voice agents evolved
12. Common mistakes to avoid
1. Optimizing the LLM while ignoring transport
Teams obsess over model choice while running on a transport that has already eaten the budget. Settle the WebRTC-versus-telephony decision and tune endpointing first; that is where the seconds hide.
2. Processing each stage to completion
Waiting for the full transcript, then the full LLM response, then full TTS guarantees a sluggish agent. Stream and overlap every stage, or you will never hit the budget.
3. Treating barge-in as "mute the speaker"
Without canceling in-flight LLM and TTS work, the agent resumes its abandoned sentence after the interruption. Real barge-in tears down and discards the old turn.
4. One endpointing threshold for all users
A single silence timeout cannot serve both a slow, thoughtful caller and a fast, impatient one. Use semantic endpointing and tune per use case.
13. Conclusion
The hardest thing about voice AI in 2026 is not making a machine talk — it is making it converse. Conversation is a real-time dance of turn-taking, interruption, and timing that humans do effortlessly and machines must be engineered to imitate, millisecond by millisecond. The LLM is the easy part; the orchestration of silence, the four-step teardown of barge-in, the streaming overlap that hides latency, and the transcription quality that everything else depends on — that is the real discipline.
If you are building one, start with the architecture decision (cascading for control, S2S for naturalness), nail your transport layer before touching the model, and instrument TTFT and task completion from day one. A voice agent that answers in 300ms, stops the instant you cut in, and actually completes your task does not feel like talking to a robot — and that feeling, not the model behind it, is the entire product. The race is against a 300ms clock, and every architectural choice is about how you spend those milliseconds.