Voice AI Agents 2026: Building Real-Time Speech Agents

Posted on: 6/5/2026 1:14:29 AM

Table of contents

1. Why voice agents exploded in 2026
   - The fundamental constraint: conversation is real-time
2. Two architectures: cascading vs speech-to-speech
   - What most production teams actually choose
3. The latency budget: where the time actually goes
   - The transport layer quietly decides your budget
4. Turn-taking and barge-in: teaching a machine to converse
   - 4.1. End-of-turn detection
   - 4.2. Barge-in: handling interruptions
     - Why barge-in needs cancellation, not just muting
5. Inside a production voice agent
6. Function calling during a live call
   - Stream everything, buffer nothing you don't have to
7. The 2026 voice tooling landscape
   - Build or buy?
8. The hard problems that aren't the LLM
9. Measuring a voice agent: the metrics that matter
   - Don't optimize latency in a vacuum
10. The role shift: from script writers to conversation designers
    - The new central artifact: the conversation spec
11. How voice agents evolved
12. Common mistakes to avoid
13. Conclusion
    - References

You call a support line. You finish your sentence, and for one full second there is silence — then the agent starts talking, right as you begin to add "...oh, and one more thing." It talks over you. You both stop. It keeps going anyway, finishing a thought you already abandoned. The conversation feels broken, and you can tell instantly that you are talking to a machine. Now imagine the same call where the reply lands in under 300 milliseconds, the agent stops the moment you cut in, and it picks up exactly where you redirected it. The difference between those two calls is not the language model — it is voice engineering, and in 2026 it has become its own discipline.

Text agents have the luxury of time. A voice agent does not. Human conversation runs on a turn-taking rhythm measured in milliseconds, and the brain notices delay long before it can name it. This article dissects how real-time voice agents actually work in 2026: the two competing architectures, where the latency really goes, how machines learn to take turns and handle being interrupted, the production tooling stack, and why the hardest problems in voice AI have almost nothing to do with the LLM.

~300ms: Target response latency before a reply feels "laggy" to humans

~320ms: End-to-end latency of the fastest 2026 speech-to-speech models

150–700ms: Latency WebRTC saves versus a PSTN phone call

600ms–1.7s: Typical end-to-end latency of a naive cascaded STT→LLM→TTS pipeline

1. Why voice agents exploded in 2026

Voice interfaces are not new — IVR phone trees have existed for decades, and voice assistants since the 2010s. What changed is that the three things voice always lacked finally arrived together. First, LLMs made open-ended conversation possible: a voice agent can now handle "actually, can you check if my other order shipped too?" instead of "press 2 for billing." Second, latency dropped below the perceptual threshold: streaming models, faster inference, and speech-native models pushed round-trip time under the ~300ms where a conversation stops feeling robotic. Third, orchestration frameworks matured: Pipecat reached v1.0 in April 2026, and LiveKit Agents shipped adaptive interruption handling — the plumbing that used to take a team months is now a library.

The result is that voice agents moved from gimmick to infrastructure: appointment scheduling, outbound sales qualification, healthcare intake, drive-through ordering, technical support. Anywhere a phone call or a microphone sits between a human and a system, a voice agent can now stand in.

The fundamental constraint: conversation is real-time

A text agent can think for three seconds and nobody minds. A voice agent that pauses three seconds before answering feels broken — humans interpret silence as confusion, disconnection, or rudeness. Every architectural decision in voice AI is downstream of one brutal fact: you are racing a 300ms clock on every single turn, and the clock starts the instant the user stops talking.

2. Two architectures: cascading vs speech-to-speech

There are exactly two ways to build a voice agent in 2026, and choosing between them is the most consequential decision you will make.

The cascading pipeline (also called turn-based) chains three separate models: speech-to-text (STT/ASR) transcribes what the user said, an LLM reasons over the transcript and produces a text reply, and text-to-speech (TTS) speaks it back. The speech-to-speech (S2S) approach uses a single multimodal model that ingests audio and emits audio directly, with no intermediate text — preserving tone, emphasis, and prosody that text throws away.

flowchart LR
    subgraph C[Cascading Pipeline]
      direction LR
      U1[User audio] --> VAD1[VAD +<br/>endpointing]
      VAD1 --> STT[STT / ASR]
      STT --> LLM[LLM<br/>reasoning]
      LLM --> TTS[TTS]
      TTS --> O1[Agent audio]
    end
    subgraph S[Speech-to-Speech]
      direction LR
      U2[User audio] --> M[Single multimodal<br/>S2S model]
      M --> O2[Agent audio]
    end

    style C fill:#16213e,stroke:#fff,color:#fff
    style S fill:#0f3460,stroke:#fff,color:#fff
    style LLM fill:#e94560,stroke:#fff,color:#fff
    style M fill:#e94560,stroke:#fff,color:#fff

Cascading chains three swappable models with text in the middle; speech-to-speech collapses everything into one audio-native model.

The trade-off is real and it does not have a universal winner. Speech-to-speech wins on naturalness and latency — it hears laughter, hesitation, and sarcasm, and it can respond in ~320ms because there is no pipeline to traverse. Cascading wins on control, observability, and cost — you choose exactly which LLM reasons, you can read and log the transcript, you can inject business logic between transcription and response, and you can swap any component without re-architecting.

| Dimension | Cascading (STT→LLM→TTS) | Speech-to-Speech (S2S) |
| --- | --- | --- |
| Latency | Higher: sum of three models (600ms–1.7s naive, ~500ms tuned) | Lower: single model (~320ms best in class) |
| Naturalness | Loses prosody/emotion at the text bottleneck | Preserves tone, emphasis, laughter, hesitation |
| Control over reasoning | Full: pick any LLM, inject logic mid-pipeline | Limited: reasoning is baked into the model |
| Observability | High: text transcript at every stage | Low: no intermediate text to log/audit |
| Vendor lock-in | Low: mix and match providers | High: tied to one vendor's S2S model |
| Best for | Telephony, compliance, complex tool use, cost control | Consumer conversation, naturalness-first UX |

What most production teams actually choose

In 2026, the majority of production deployments still run cascading pipelines, for one reason: control. Teams need to decide which LLM handles reasoning, which voice the user hears, and what compliance/business logic runs between transcription and response — especially in regulated domains like healthcare and finance. Speech-to-speech is winning consumer-facing, naturalness-first products, but "I need to read the transcript and route this to a tool" still pushes most enterprises toward the cascade.

3. The latency budget — where the time actually goes

The single most useful mental model in voice AI is the latency budget: a fixed amount of time — roughly 300ms for a snappy experience — that every component must share. The counterintuitive truth is that STT and TTS are not where the time goes. The two real culprits are turn-taking (deciding the user actually finished) and LLM time-to-first-token.

| Stage | Typical cost | Notes |
| --- | --- | --- |
| Network round-trip | 30–80ms | WebRTC; PSTN telephony eats far more |
| Speech-to-text | 100–300ms | Streaming STT is fast; runs while the user speaks |
| Endpoint / turn detection | 500–1000ms+ | The silent killer: waiting to be sure the user stopped |
| LLM time-to-first-token | 350–1000ms | The biggest controllable variable |
| Text-to-speech (first audio) | 90–200ms | Streaming TTS emits the first chunk fast |

Notice that endpoint detection can cost more than every other stage combined. If you wait a full second of silence to be certain the user is done, you have already blown the budget before the LLM even starts. This is why turn detection is the hardest problem in voice AI — and the next section is dedicated to it.
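To make the budget concrete, here is a minimal sketch that adds up the stage ranges from the table. The stage keys and helper names are illustrative. Treating every stage as strictly serial is a simplifying assumption: streaming STT overlaps with the user's speech, so real pipelines usually land below this sum.

```python
# Stage latencies in milliseconds as (low, high), copied from the table above.
# The dictionary keys are illustrative names, not identifiers from any library.
STAGES_MS = {
    "network_round_trip": (30, 80),
    "speech_to_text": (100, 300),
    "endpoint_detection": (500, 1000),
    "llm_first_token": (350, 1000),
    "tts_first_audio": (90, 200),
}


def total_range(stages):
    """Sum the low and high ends of every stage, assuming no overlap."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def best_case_share(stages, name):
    """Fraction of the best-case serial total consumed by one stage."""
    low_total, _ = total_range(stages)
    return stages[name][0] / low_total


low, high = total_range(STAGES_MS)
print(f"serial total: {low}-{high} ms")
print(f"endpointing share: {best_case_share(STAGES_MS, 'endpoint_detection'):.0%}")
```

Even in the best case, endpointing alone accounts for nearly half of the serial total, which is why shaving it tends to pay off more than swapping models.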

The transport layer quietly decides your budget

With WebRTC you have roughly 240–270ms left for STT + LLM + TTS after transport overhead. On a PSTN phone call, transport can eat the entire budget, leaving 0–100ms — making the 300ms target physically impossible. WebRTC saves 150–700ms versus a traditional phone call. If you are building voice over the telephone network, you are not playing the same game; you must relax your latency target (≈800ms is tolerable for healthcare, <600ms for outbound sales) and design around it.

4. Turn-taking and barge-in: teaching a machine to converse

Humans are astonishingly good at knowing when it is their turn to speak. We use silence, intonation, grammar, and breathing as cues, and we overlap and interrupt gracefully. Machines have none of this for free. Two problems define conversational voice: knowing when the user is done (endpointing) and handling being cut off (barge-in).

4.1. End-of-turn detection

The naive approach is a silence timer: wait 800–1200ms of silence after the last word, then assume the user is done. But pure silence is a terrible signal — people pause mid-sentence to think ("I'd like to book... a table for four"), and a dumb timer will interrupt them. The 2026 answer is semantic endpointing: combine a Voice Activity Detection (VAD) silence threshold with a model that judges whether the sentence is semantically complete. "I'd like to book a table for" is grammatically unfinished — wait. "I'd like to book a table for four" is complete — respond. Good endpointing is the difference between an agent that feels patient and one that constantly cuts you off.
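The combination of a silence threshold and a completeness check can be sketched in a few lines. This is a toy under stated assumptions: production systems use a trained model to judge completeness, whereas the `DANGLING_WORDS` list, the `EndpointConfig` thresholds, and both function names here are invented for illustration.

```python
from dataclasses import dataclass

# Words that rarely end a finished English sentence (toy heuristic, an assumption).
DANGLING_WORDS = {"for", "to", "and", "or", "but", "the", "a", "an", "of", "with", "at", "in", "on"}


@dataclass
class EndpointConfig:
    short_silence_ms: int = 300    # respond quickly when the utterance looks complete
    long_silence_ms: int = 1200    # otherwise fall back to a plain silence timer


def looks_complete(transcript: str) -> bool:
    """Crude stand-in for a semantic-completeness model."""
    words = transcript.lower().rstrip(".?! ").split()
    if not words:
        return False
    return words[-1].strip(",.") not in DANGLING_WORDS


def should_respond(transcript: str, silence_ms: int, cfg: EndpointConfig = EndpointConfig()) -> bool:
    if not transcript.strip():
        return False                       # nothing was said, nothing to answer
    if silence_ms >= cfg.long_silence_ms:
        return True                        # long silence wins regardless of grammar
    return silence_ms >= cfg.short_silence_ms and looks_complete(transcript)


print(should_respond("I'd like to book a table for", 400))
print(should_respond("I'd like to book a table for four", 400))
```

The point of the two thresholds is that a complete-sounding sentence earns a fast reply, while an unfinished one buys the caller more time to think.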

4.2. Barge-in: handling interruptions

When the agent is speaking and the user starts talking, the agent must stop instantly — just like a polite human. This sounds trivial and is brutally hard, because four things must happen near-simultaneously the moment barge-in is detected.

flowchart TD
    A[Agent is speaking] --> B{VAD detects user<br/>speech above threshold<br/>for min window?}
    B -- No --> A
    B -- Yes: BARGE-IN --> C[1. Stop TTS playback<br/>immediately]
    C --> D[2. Cancel in-flight<br/>TTS generation]
    D --> E[3. Cancel LLM<br/>generation in progress]
    E --> F[4. Reset stream state<br/>discard the old turn]
    F --> G[Listen to the<br/>user's new input]
    G --> H[Start a fresh turn]

    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#fff3e0,stroke:#ff9800,color:#2c3e50
    style C fill:#16213e,stroke:#fff,color:#fff
    style D fill:#16213e,stroke:#fff,color:#fff
    style E fill:#16213e,stroke:#fff,color:#fff
    style F fill:#16213e,stroke:#fff,color:#fff
    style H fill:#4CAF50,stroke:#fff,color:#fff

Barge-in is a four-step teardown. Miss any step and the agent talks over the user or "finishes the old thought" after being interrupted.

If any of those four steps is missing, the failure is immediately audible: the agent keeps talking over the user, or it goes silent then suddenly resumes a sentence the user already moved past. Modern frameworks now ship tuned barge-in — LiveKit Agents reports adaptive interruption handling at roughly 86% precision and 100% recall — but the tuning matters: too sensitive and a cough cancels the agent mid-sentence; too lax and it ignores genuine interruptions.

Why barge-in needs cancellation, not just muting

A naive implementation just mutes the speaker. But the LLM is still generating, the TTS is still synthesizing, and tokens are still being buffered. If you only mute, the moment the user finishes their interruption the agent dumps the entire pre-interruption response — confusing and robotic. True barge-in cancels the in-flight LLM and TTS work and discards the buffered audio, so the agent genuinely abandons its old turn and responds to what the user actually just said.
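The difference between muting and cancelling is easy to see in code. The sketch below uses Python's `asyncio` to model the four-step teardown; the `Turn` class and `fake_generation` are hypothetical stand-ins for a real LLM and TTS stream, not the API of any framework.

```python
import asyncio
from typing import Optional


class Turn:
    """Holds the in-flight work for one agent turn (hypothetical structure)."""

    def __init__(self) -> None:
        self.audio_queue: asyncio.Queue = asyncio.Queue()
        self.llm_task: Optional[asyncio.Task] = None
        self.tts_task: Optional[asyncio.Task] = None
        self.playing = False

    async def barge_in(self) -> None:
        # 1. Stop playback immediately.
        self.playing = False
        # 2 and 3. Cancel in-flight TTS and LLM generation, then wait for both to unwind.
        tasks = [t for t in (self.tts_task, self.llm_task) if t is not None]
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        # 4. Discard buffered audio so nothing from the old turn is ever replayed.
        while not self.audio_queue.empty():
            self.audio_queue.get_nowait()


async def fake_generation(queue: asyncio.Queue) -> None:
    # Stand-in for LLM or TTS streaming: keeps producing chunks until cancelled.
    while True:
        await queue.put(b"chunk")
        await asyncio.sleep(0.01)


async def demo() -> Turn:
    turn = Turn()
    turn.playing = True
    turn.llm_task = asyncio.create_task(fake_generation(turn.audio_queue))
    turn.tts_task = asyncio.create_task(fake_generation(turn.audio_queue))
    await asyncio.sleep(0.05)    # the agent is mid-sentence
    await turn.barge_in()        # the user starts talking
    return turn


turn = asyncio.run(demo())
print(turn.playing, turn.llm_task.cancelled(), turn.audio_queue.qsize())
```

A mute-only implementation would stop at step 1; the queue would keep filling, and the stale audio would play as soon as the speaker was unmuted.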

5. Inside a production voice agent

Assemble the pieces and a real voice agent has a clear anatomy. The orchestrator sits at the center, coordinating a tight real-time loop between the transport layer and the model stack.

flowchart TB
    subgraph T[Transport Layer]
      WRTC[WebRTC / WebSocket<br/>streaming audio in-out]
    end
    subgraph ORCH[Orchestrator - the real-time loop]
      VAD[VAD + semantic<br/>endpointing]
      INT[Interruption /<br/>barge-in handler]
      CTX[Conversation state<br/>+ context]
    end
    subgraph MODELS[Model Stack]
      STT[STT streaming]
      LLM[LLM + tool calling]
      TTS[TTS streaming]
    end
    subgraph TOOLS[Tools / Backend]
      API[CRM / DB / booking<br/>function calls]
    end
    WRTC --> VAD
    VAD --> STT
    STT --> CTX
    CTX --> LLM
    LLM --> TTS
    TTS --> WRTC
    LLM <--> API
    INT -. cancels .-> LLM
    INT -. cancels .-> TTS
    WRTC -. user speech .-> INT

    style T fill:#2c3e50,stroke:#fff,color:#fff
    style ORCH fill:#16213e,stroke:#fff,color:#fff
    style MODELS fill:#0f3460,stroke:#fff,color:#fff
    style TOOLS fill:#e94560,stroke:#fff,color:#fff

The orchestrator owns timing and interruption; the model stack does the heavy lifting; tools connect the call to real business systems.

Transport layer: streams audio bidirectionally over WebRTC (browser/app) or a WebSocket bridge (telephony). This is where your latency floor is set.
Orchestrator: the brain of timing — runs VAD and semantic endpointing, detects barge-in, holds conversation state, and decides when to hand off between listening and speaking. This is the part frameworks like Pipecat and LiveKit Agents give you.
Model stack: streaming STT, an LLM with tool-calling, and streaming TTS — or a single S2S model replacing all three.
Tools / backend: the LLM calls real functions mid-conversation — look up an order, book a slot, check inventory — which is what separates a useful agent from a chatbot that just talks.
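The orchestrator's hand-off between listening and speaking can be pictured as a small state machine. The sketch below is an assumption about how such a loop might be modeled, not how Pipecat or LiveKit Agents implement it; the state and event names are invented for illustration.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()   # streaming user audio into VAD and STT
    THINKING = auto()    # end of turn detected, waiting on the LLM
    SPEAKING = auto()    # TTS audio is being played to the user


# (current state, event) -> next state. Unknown pairs leave the state unchanged.
TRANSITIONS = {
    (State.LISTENING, "end_of_turn"): State.THINKING,
    (State.THINKING, "first_audio"): State.SPEAKING,
    (State.THINKING, "barge_in"): State.LISTENING,
    (State.SPEAKING, "barge_in"): State.LISTENING,
    (State.SPEAKING, "playback_done"): State.LISTENING,
}


def next_state(state: State, event: str) -> State:
    return TRANSITIONS.get((state, event), state)


state = State.LISTENING
for event in ["end_of_turn", "first_audio", "barge_in"]:
    state = next_state(state, event)
    print(event, "->", state.name)
```

Note that barge-in is a valid event in both `THINKING` and `SPEAKING`: the user can cut in before the agent has produced any audio at all.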

6. Function calling during a live call

A voice agent that only talks is a podcast. The value comes when it acts — and tool calls during a live voice conversation introduce a latency problem text agents never face. When the user says "is my order shipped?", the agent must call your backend, wait for the response, and reply — but it cannot just go silent for two seconds while the API responds.

The standard 2026 pattern is the filler-while-fetching technique: the moment a tool call starts, the agent emits a short natural acknowledgment ("Let me check that for you...") to fill the dead air, runs the function call in parallel, and stitches the real answer in when it returns. This mirrors exactly what a human agent does when they say "one second while I pull that up." Combined with streaming the LLM's response token-by-token into TTS, it keeps the conversation feeling alive even when real work is happening underneath.
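A minimal sketch of filler-while-fetching with `asyncio` follows. One deliberate variation, which is an assumption rather than something the pattern requires: the filler is only spoken if the lookup takes longer than `filler_after_s`, so fast lookups are answered directly. Setting it to `0` gives the speak-immediately behavior described above. The function names and the `speak` callback are hypothetical.

```python
import asyncio

FILLER = "Let me check that for you..."


async def answer_with_filler(tool_call, speak, filler_after_s: float = 0.3):
    """Run tool_call in the background; speak a filler line if it is slow."""
    task = asyncio.create_task(tool_call())
    done, _ = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await speak(FILLER)      # fill the dead air while the lookup keeps running
    result = await task
    await speak(result)          # stitch in the real answer
    return result


async def demo(lookup_delay_s: float, filler_after_s: float):
    spoken = []

    async def speak(text):       # stand-in for streaming text into TTS
        spoken.append(text)

    async def lookup():          # stand-in for a backend API call
        await asyncio.sleep(lookup_delay_s)
        return "Your order shipped yesterday."

    await answer_with_filler(lookup, speak, filler_after_s)
    return spoken


print(asyncio.run(demo(lookup_delay_s=0.2, filler_after_s=0.05)))
```

The key property is that the tool call starts before the filler is spoken, so the acknowledgment costs no extra time.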

Stream everything, buffer nothing you don't have to

The golden rule of voice latency: never wait for a complete output when you can start emitting a partial one. STT streams partial transcripts as the user speaks. The LLM streams tokens as it reasons. TTS starts speaking the first sentence while the LLM is still generating the second. Each stream overlaps the next, so the user hears the first words of the reply long before the full response exists. A pipeline that processes each stage to completion before starting the next will always feel sluggish, no matter how fast each individual model is.
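One common way to overlap the LLM and TTS stages is to cut the token stream at sentence boundaries and hand each sentence to TTS as soon as it closes. The sketch below shows that chunking step only; `sentences_from_tokens` is an illustrative name, and splitting on `.`, `?`, and `!` is a simplification that will break early on abbreviations such as "Dr.".

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()     # flush whatever is left when the stream ends


# Simulated LLM token stream.
tokens = ["Your", " order", " shipped", ".", " It", " arrives", " Friday", "."]
for sentence in sentences_from_tokens(tokens):
    print("to TTS:", sentence)
```

Because the function is a generator, the first sentence reaches TTS while later tokens are still being produced.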

7. The 2026 voice tooling landscape

The ecosystem split into clear layers: orchestration frameworks, realtime transport platforms, and model providers (STT, TTS, S2S).

| Layer | Examples | What it gives you |
| --- | --- | --- |
| Orchestration framework | Pipecat (v1.0, Apr 2026), LiveKit Agents | The real-time loop: VAD, endpointing, barge-in, pipeline wiring, provider plug-ins |
| Realtime transport | LiveKit, WebRTC infra, telephony bridges | Low-latency audio streaming, phone-number integration, scaling concurrent calls |
| Speech-to-speech APIs | Vendor realtime/S2S models | Single-model audio-in/audio-out at ~320ms, prosody-preserving |
| STT / TTS providers | Streaming ASR + neural TTS vendors | The swappable components of a cascading pipeline |
| Voice agent platforms | Hosted end-to-end products | Build/deploy/monitor with less plumbing; trade control for speed-to-market |

Build or buy?

Buy a hosted platform if you need a working voice agent in days and your use case is standard (scheduling, FAQ, qualification). Build on an orchestration framework like Pipecat or LiveKit Agents when you need fine control over the model stack, custom tool integrations, on-prem/compliance constraints, or per-call cost optimization at scale. The framework layer is the sweet spot for most engineering teams in 2026: it hands you the brutal real-time plumbing (barge-in, endpointing) while leaving the model and business-logic choices in your hands.

8. The hard problems that aren't the LLM

The counterintuitive lesson of building voice agents is that the language model is rarely the bottleneck. The failures that wreck production are almost all in the audio and timing layers.

1. Transcription errors poison everything downstream

If STT mishears "I want to cancel" as "I want to council," the LLM reasons over garbage and confidently does the wrong thing. Accents, background noise, crosstalk, and domain jargon (drug names, product SKUs) all degrade ASR. A voice agent is only as good as its worst transcription.

2. Hallucinated transcripts on silence

Some ASR models hallucinate words during silence or background noise — inventing a phrase the user never said, which the LLM then dutifully responds to. Guarding against phantom input is a real production concern.

3. Endpointing is never "solved"

Every domain has different pause patterns. An elderly caller speaks with long pauses; a frustrated one talks fast and interrupts. A single endpointing threshold cannot serve both, and getting it wrong means either cutting people off or feeling sluggish.

4. Cost compounds per minute

Unlike a one-shot text query, a voice call burns STT + LLM + TTS continuously for its entire duration. A ten-minute call is ten minutes of three models running. At scale, per-minute economics — not per-query — decide whether the product is viable, which is a major reason teams stay on controllable cascading pipelines.
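One way to guard against the phantom input described in problem 2 is to cross-check each transcript against what the VAD actually heard. This is a sketch under assumptions: the `Transcript` fields, the thresholds, and the idea that your ASR exposes a confidence score are all hypothetical and vary by provider.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    confidence: float    # 0.0 to 1.0, as reported by the ASR (assumed field)
    speech_ms: int       # voiced audio the VAD detected during this segment


def accept_transcript(t: Transcript, min_confidence: float = 0.6, min_speech_ms: int = 200) -> bool:
    """Reject transcripts that the audio evidence does not support."""
    if not t.text.strip():
        return False
    if t.speech_ms < min_speech_ms:
        return False     # words appeared, but the VAD heard almost no speech
    return t.confidence >= min_confidence


phantom = Transcript(text="thank you for watching", confidence=0.9, speech_ms=0)
real = Transcript(text="I want to cancel", confidence=0.85, speech_ms=1400)
print(accept_transcript(phantom), accept_transcript(real))
```

The phantom case is rejected even though the ASR is confident, because the check relies on an independent signal rather than the model's own score.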

9. Measuring a voice agent: the metrics that matter

Five metrics define voice agent quality, and none of them is "LLM accuracy" in isolation.

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| TTFT (time to first token/audio) | Latency from end-of-user-speech to start-of-agent-speech | The single most felt number: this is "responsiveness" |
| WER (word error rate) | Transcription accuracy of the STT stage | Sets the ceiling on everything downstream |
| RTF (real-time factor) | Processing time relative to audio duration | Determines whether you can keep up with the stream live |
| Interruption precision/recall | How accurately barge-in fires (and doesn't) | Decides whether the agent feels polite or pushy |
| Task completion rate | % of calls that achieve the user's goal | The real business KPI: latency means nothing if the task fails |

Don't optimize latency in a vacuum

A 200ms agent that mishears half its inputs is worse than a 400ms agent that gets them right. Latency and accuracy trade off — pushing endpointing to be ultra-fast means cutting people off more often. Always read TTFT alongside WER and task completion, never alone.
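Three of these metrics have standard definitions that fit in a few lines. The sketch below computes WER as word-level edit distance divided by reference length, RTF as processing time over audio duration, and precision/recall from raw counts. The function names are illustrative and the sample counts are made up for the demonstration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length (non-empty reference assumed)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))          # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Below 1.0 means the stage keeps up with live audio."""
    return processing_s / audio_s


def precision_recall(tp: int, fp: int, fn: int):
    """tp: real interruptions caught, fp: false triggers, fn: interruptions missed."""
    return tp / (tp + fp), tp / (tp + fn)


print(word_error_rate("I want to cancel", "I want to council"))
print(real_time_factor(processing_s=0.5, audio_s=2.0))
print(precision_recall(tp=43, fp=7, fn=0))
```

In the first example a single substituted word out of four gives a WER of 0.25, yet that one word flips the meaning of the request, which is why WER should be read next to task completion.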

10. The role shift: from script writers to conversation designers

Voice AI changes what the team around it actually does. The old IVR world was built by writing rigid call flows — "if the user presses 1, go to node B." The agentic voice world replaces that with conversation design and policy: defining the agent's persona and tone, the tools it may call, the escalation rules for handing off to a human, and the compliance guardrails for what it must and must not say.

For the project manager or product owner, voice introduces governance work that text never required: which calls are recorded and how transcripts are retained, how the agent identifies itself as an AI (a legal requirement in many jurisdictions), what happens when it cannot understand the caller, and how to measure not just deflection rate but customer trust. The voice agent becomes a product with an SLA on latency, an owner for the conversation design, and an audit trail, not a script someone wrote once and forgot.

The new central artifact: the conversation spec

Just as runbooks became shared property between SRE and agent, the conversation spec — persona, allowed tools, escalation triggers, compliance phrases, fallback behavior — becomes the artifact that product, engineering, and compliance co-own. Engineers wire the pipeline; product writes the persona and flows; compliance reviews the guardrails. Voice is where these three disciplines finally have to share one document.
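Because the spec is shared across three disciplines, it helps to keep it as structured data that code can enforce. The sketch below is one possible shape, assuming the five fields named above; the class, its methods, and every sample value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversationSpec:
    persona: str
    allowed_tools: frozenset
    escalation_triggers: tuple
    compliance_phrases: tuple
    fallback_behavior: str

    def may_call(self, tool_name: str) -> bool:
        """Engineering enforces the tool allow-list at call time."""
        return tool_name in self.allowed_tools

    def should_escalate(self, user_text: str) -> bool:
        """Hand off to a human when any trigger phrase appears."""
        text = user_text.lower()
        return any(trigger in text for trigger in self.escalation_triggers)


# Sample values are placeholders, not recommendations.
spec = ConversationSpec(
    persona="Calm, concise scheduling assistant",
    allowed_tools=frozenset({"lookup_order", "book_slot"}),
    escalation_triggers=("speak to a human", "complaint"),
    compliance_phrases=("You are speaking with an AI assistant.",),
    fallback_behavior="Apologize once, then offer a human handoff.",
)

print(spec.may_call("issue_refund"))
print(spec.should_escalate("I'd like to speak to a human, please"))
```

Freezing the dataclass is a design choice: the spec is reviewed and versioned as a document, so the running agent should not be able to mutate it.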

11. How voice agents evolved

2010s — IVR & first voice assistants

Rigid call trees and keyword-spotting assistants. Useful for narrow commands, helpless with open-ended conversation. "Press 1 for billing."

2023–2024 — LLM cascades arrive

STT→LLM→TTS pipelines make open conversation possible, but latency sits at 1–2 seconds and interruptions break the experience. Impressive demos, fragile production.

2025 — Speech-to-speech & sub-second latency

Audio-native models and streaming-everything pipelines push round-trip below the perceptual threshold. Barge-in and semantic endpointing become standard expectations.

2026 — Voice as infrastructure

Pipecat v1.0, LiveKit Agents with tuned interruption handling, ~320ms S2S models. Voice agents move from novelty to production infrastructure across support, sales, and healthcare.

2027+ outlook

Emotionally aware agents that read frustration and adjust, seamless human handoff mid-call, and voice as a first-class channel alongside text and screen.

12. Common mistakes to avoid

1. Optimizing the LLM while ignoring transport

Teams obsess over model choice while running on a transport that already ate the budget. Fix WebRTC vs telephony and endpointing first — that is where the seconds hide.

2. Processing each stage to completion

Waiting for the full transcript, then the full LLM response, then full TTS guarantees a sluggish agent. Stream and overlap every stage, or you will never hit the budget.

3. Treating barge-in as "mute the speaker"

Without canceling in-flight LLM and TTS work, the agent resumes its abandoned sentence after the interruption. Real barge-in tears down and discards the old turn.

4. One endpointing threshold for all users

A single silence timeout cannot serve both a slow, thoughtful caller and a fast, impatient one. Use semantic endpointing and tune per use case.

13. Conclusion

The hardest thing about voice AI in 2026 is not making a machine talk — it is making it converse. Conversation is a real-time dance of turn-taking, interruption, and timing that humans do effortlessly and machines must be engineered to imitate, millisecond by millisecond. The LLM is the easy part; the orchestration of silence, the four-step teardown of barge-in, the streaming overlap that hides latency, and the transcription quality that everything else depends on — that is the real discipline.

If you are building one, start with the architecture decision (cascading for control, S2S for naturalness), nail your transport layer before touching the model, and instrument TTFT and task completion from day one. A voice agent that answers in 300ms, stops the instant you cut in, and actually completes your task does not feel like talking to a robot — and that feeling, not the model behind it, is the entire product. The race is against a 300ms clock, and every architectural choice is about how you spend those milliseconds.

References

#Voice AI #AI Agent #Speech-to-Speech #WebRTC #Real-Time #LLM

# Voice AI Agents 2026: Building Real-Time Speech Agents

You call a support line. You finish your sentence, and for one full second there is silence — then the agent starts talking, right as you begin to add "...oh, and one more thing." It talks over you. You both stop. It keeps going anyway, finishing a thought you already abandoned. The conversation feels broken, and you can tell instantly that you are talking to a machine. Now imagine the same call where the reply lands in under 300 milliseconds, the agent stops the moment you cut in, and it picks up exactly where you redirected it. The difference between those two calls is not the language model — it is **voice engineering**, and in 2026 it has become its own discipline.

~300msTarget response latency before a reply feels "laggy" to humans

~320msEnd-to-end latency of the fastest 2026 speech-to-speech models

150–700msLatency WebRTC saves versus a PSTN phone call

600ms–1.7sTypical end-to-end latency of a naive cascaded STT→LLM→TTS pipeline

## 1. Why voice agents exploded in 2026

Voice interfaces are not new — IVR phone trees have existed for decades, and voice assistants since the 2010s. What changed is that the three things voice always lacked finally arrived together. First, **LLMs made open-ended conversation possible**: a voice agent can now handle "actually, can you check if my other order shipped too?" instead of "press 2 for billing." Second, **latency dropped below the perceptual threshold**: streaming models, faster inference, and speech-native models pushed round-trip time under the ~300ms where a conversation stops feeling robotic. Third, **orchestration frameworks matured**: Pipecat reached v1.0 in April 2026, and LiveKit Agents shipped adaptive interruption handling — the plumbing that used to take a team months is now a library.

#### The fundamental constraint: conversation is real-time

A text agent can think for three seconds and nobody minds. A voice agent that pauses three seconds before answering feels broken — humans interpret silence as confusion, disconnection, or rudeness. Every architectural decision in voice AI is downstream of one brutal fact: **you are racing a 300ms clock on every single turn**, and the clock starts the instant the user stops talking.

## 2. Two architectures: cascading vs speech-to-speech

There are exactly two ways to build a voice agent in 2026, and choosing between them is the most consequential decision you will make.

The **cascading pipeline** (also called turn-based) chains three separate models: speech-to-text (STT/ASR) transcribes what the user said, an LLM reasons over the transcript and produces a text reply, and text-to-speech (TTS) speaks it back. The **speech-to-speech (S2S)** approach uses a single multimodal model that ingests audio and emits audio directly, with no intermediate text — preserving tone, emphasis, and prosody that text throws away.

```
flowchart LR
    subgraph C[Cascading Pipeline]
      direction LR
      U1[User audio] --> VAD1[VAD +  
endpointing]
      VAD1 --> STT[STT / ASR]
      STT --> LLM[LLM  
reasoning]
      LLM --> TTS[TTS]
      TTS --> O1[Agent audio]
    end
    subgraph S[Speech-to-Speech]
      direction LR
      U2[User audio] --> M[Single multimodal  
S2S model]
      M --> O2[Agent audio]
    end

style C fill:#16213e,stroke:#fff,color:#fff
    style S fill:#0f3460,stroke:#fff,color:#fff
    style LLM fill:#e94560,stroke:#fff,color:#fff
    style M fill:#e94560,stroke:#fff,color:#fff

```
Cascading chains three swappable models with text in the middle; speech-to-speech collapses everything into one audio-native model.

The trade-off is real and it does not have a universal winner. Speech-to-speech wins on *naturalness and latency* — it hears laughter, hesitation, and sarcasm, and it can respond in ~320ms because there is no pipeline to traverse. Cascading wins on *control, observability, and cost* — you choose exactly which LLM reasons, you can read and log the transcript, you can inject business logic between transcription and response, and you can swap any component without re-architecting.

| Dimension | Cascading (STT→LLM→TTS) | Speech-to-Speech (S2S) |
| --- | --- | --- |
| **Latency** | Higher — sum of three models (600ms–1.7s naive, ~500ms tuned) | Lower — single model (~320ms best in class) |
| **Naturalness** | Loses prosody/emotion at the text bottleneck | Preserves tone, emphasis, laughter, hesitation |
| **Control over reasoning** | Full — pick any LLM, inject logic mid-pipeline | Limited — reasoning is baked into the model |
| **Observability** | High — text transcript at every stage | Low — no intermediate text to log/audit |
| **Vendor lock-in** | Low — mix and match providers | High — tied to one vendor's S2S model |
| **Best for** | Telephony, compliance, complex tool use, cost control | Consumer conversation, naturalness-first UX |

#### What most production teams actually choose

In 2026, the majority of production deployments still run **cascading pipelines**, for one reason: control. Teams need to decide which LLM handles reasoning, which voice the user hears, and what compliance/business logic runs between transcription and response — especially in regulated domains like healthcare and finance. Speech-to-speech is winning consumer-facing, naturalness-first products, but "I need to read the transcript and route this to a tool" still pushes most enterprises toward the cascade.

## 3. The latency budget — where the time actually goes

The single most useful mental model in voice AI is the **latency budget**: a fixed amount of time — roughly 300ms for a snappy experience — that every component must share. The counterintuitive truth is that STT and TTS are *not* where the time goes. The two real culprits are **turn-taking** (deciding the user actually finished) and **LLM time-to-first-token**.

| Stage | Typical cost | Notes |
| --- | --- | --- |
| **Network round-trip** | 30–80ms | WebRTC; PSTN telephony eats far more |
| **Speech-to-text** | 100–300ms | Streaming STT is fast; runs while the user speaks |
| **Endpoint / turn detection** | 500–1000ms+ | The silent killer — waiting to be sure the user stopped |
| **LLM time-to-first-token** | 350–1000ms | The biggest controllable variable |
| **Text-to-speech (first audio)** | 90–200ms | Streaming TTS emits the first chunk fast |

Notice that endpoint detection can cost more than every other stage combined. If you wait a full second of silence to be *certain* the user is done, you have already blown the budget before the LLM even starts. This is why turn detection is the hardest problem in voice AI — and the next section is dedicated to it.

#### The transport layer quietly decides your budget

With **WebRTC** you have roughly 240–270ms left for STT + LLM + TTS after transport overhead. On a **PSTN phone call**, transport can eat the entire budget, leaving 0–100ms — making the 300ms target physically impossible. WebRTC saves 150–700ms versus a traditional phone call. If you are building voice over the telephone network, you are not playing the same game; you must relax your latency target (≈800ms is tolerable for healthcare, <600ms for outbound sales) and design around it.

## 4. Turn-taking and barge-in: teaching a machine to converse

Humans are astonishingly good at knowing when it is their turn to speak. We use silence, intonation, grammar, and breathing as cues, and we overlap and interrupt gracefully. Machines have none of this for free. Two problems define conversational voice: **knowing when the user is done** (endpointing) and **handling being cut off** (barge-in).

### 4.1. End-of-turn detection

The naive approach is a silence timer: wait 800–1200ms of silence after the last word, then assume the user is done. But pure silence is a terrible signal — people pause mid-sentence to think ("I'd like to book... a table for four"), and a dumb timer will interrupt them. The 2026 answer is **semantic endpointing**: combine a Voice Activity Detection (VAD) silence threshold *with* a model that judges whether the sentence is semantically complete. "I'd like to book a table for" is grammatically unfinished — wait. "I'd like to book a table for four" is complete — respond. Good endpointing is the difference between an agent that feels patient and one that constantly cuts you off.

### 4.2. Barge-in: handling interruptions

When the agent is speaking and the user starts talking, the agent must **stop instantly** — just like a polite human. This sounds trivial and is brutally hard, because four things must happen near-simultaneously the moment barge-in is detected.

```
flowchart TD
    A[Agent is speaking] --> B{VAD detects user<br/>speech above threshold<br/>for min window?}
    B -- No --> A
    B -- Yes: BARGE-IN --> C[1. Stop TTS playback<br/>immediately]
    C --> D[2. Cancel in-flight<br/>TTS generation]
    D --> E[3. Cancel LLM<br/>generation in progress]
    E --> F[4. Reset stream state<br/>discard the old turn]
    F --> G[Listen to the<br/>user's new input]
    G --> H[Start a fresh turn]

    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#fff3e0,stroke:#ff9800,color:#2c3e50
    style C fill:#16213e,stroke:#fff,color:#fff
    style D fill:#16213e,stroke:#fff,color:#fff
    style E fill:#16213e,stroke:#fff,color:#fff
    style F fill:#16213e,stroke:#fff,color:#fff
    style H fill:#4CAF50,stroke:#fff,color:#fff

```
Barge-in is a four-step teardown. Miss any step and the agent talks over the user or "finishes the old thought" after being interrupted.

#### Why barge-in needs cancellation, not just muting

A naive implementation just mutes the speaker. But the LLM is still generating, the TTS is still synthesizing, and tokens are still being buffered. If you only mute, the moment the user finishes their interruption the agent dumps the *entire* pre-interruption response — confusing and robotic. True barge-in **cancels the in-flight LLM and TTS work and discards the buffered audio**, so the agent genuinely abandons its old turn and responds to what the user actually just said.
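The difference between muting and cancelling can be sketched with plain `asyncio`. This is a minimal illustration under stated assumptions, not how any particular framework implements it: the `Turn` class, its attributes, and the `asyncio.sleep` calls standing in for LLM and TTS work are all hypothetical.

```python
import asyncio


class Turn:
    """One agent turn: in-flight LLM and TTS tasks plus buffered audio."""

    def __init__(self):
        self.audio_out = asyncio.Queue()  # synthesized audio awaiting playback
        self.llm_task = None
        self.tts_task = None
        self.playing = False

    async def barge_in(self):
        # 1. Stop playback immediately.
        self.playing = False
        # 2 and 3. Cancel in-flight TTS and LLM work and wait for it to unwind.
        tasks = [t for t in (self.tts_task, self.llm_task) if t is not None]
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        # 4. Reset stream state: discard audio that was already buffered.
        while not self.audio_out.empty():
            self.audio_out.get_nowait()
        self.llm_task = self.tts_task = None


async def demo():
    turn = Turn()
    turn.playing = True
    llm = turn.llm_task = asyncio.create_task(asyncio.sleep(10))  # fake LLM
    tts = turn.tts_task = asyncio.create_task(asyncio.sleep(10))  # fake TTS
    turn.audio_out.put_nowait(b"old audio")
    await turn.barge_in()
    return turn.audio_out.qsize(), llm.cancelled(), tts.cancelled(), turn.playing


print(asyncio.run(demo()))
```

A mute-only implementation would stop at step 1, leaving both tasks running and the queue full, which is what produces the "dumps the old response" behavior described above.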

## 5. Inside a production voice agent

Assemble the pieces and a real voice agent has a clear anatomy. The orchestrator sits at the center, coordinating a tight real-time loop between the transport layer and the model stack.

```
flowchart TB
    subgraph T[Transport Layer]
      WRTC[WebRTC / WebSocket<br/>streaming audio in-out]
    end
    subgraph ORCH[Orchestrator - the real-time loop]
      VAD[VAD + semantic<br/>endpointing]
      INT[Interruption /<br/>barge-in handler]
      CTX[Conversation state<br/>+ context]
    end
    subgraph MODELS[Model Stack]
      STT[STT streaming]
      LLM[LLM + tool calling]
      TTS[TTS streaming]
    end
    subgraph TOOLS[Tools / Backend]
      API[CRM / DB / booking<br/>function calls]
    end
    WRTC --> VAD
    VAD --> STT
    STT --> CTX
    CTX --> LLM
    LLM --> TTS
    TTS --> WRTC
    LLM <--> API
    INT -. cancels .-> LLM
    INT -. cancels .-> TTS
    WRTC -. user speech .-> INT

    style T fill:#2c3e50,stroke:#fff,color:#fff
    style ORCH fill:#16213e,stroke:#fff,color:#fff
    style MODELS fill:#0f3460,stroke:#fff,color:#fff
    style TOOLS fill:#e94560,stroke:#fff,color:#fff

```
The orchestrator owns timing and interruption; the model stack does the heavy lifting; tools connect the call to real business systems.

- **Transport layer:** streams audio bidirectionally over WebRTC (browser/app) or a WebSocket bridge (telephony). This is where your latency floor is set.
- **Orchestrator:** the brain of timing — runs VAD and semantic endpointing, detects barge-in, holds conversation state, and decides when to hand off between listening and speaking. This is the part frameworks like Pipecat and LiveKit Agents give you.
- **Model stack:** streaming STT, an LLM with tool-calling, and streaming TTS — or a single S2S model replacing all three.
- **Tools / backend:** the LLM calls real functions mid-conversation — look up an order, book a slot, check inventory — which is what separates a useful agent from a chatbot that just talks.

## 6. Function calling during a live call

A voice agent that only talks is a podcast. The value comes when it *acts* — and tool calls during a live voice conversation introduce a latency problem text agents never face. When the user says "is my order shipped?", the agent must call your backend, wait for the response, and reply — but it cannot just go silent for two seconds while the API responds.

The standard 2026 pattern is the **filler-while-fetching** technique: the moment a tool call starts, the agent emits a short natural acknowledgment ("Let me check that for you...") to fill the dead air, runs the function call in parallel, and stitches the real answer in when it returns. This mirrors exactly what a human agent does when they say "one second while I pull that up." Combined with streaming the LLM's response token-by-token into TTS, it keeps the conversation feeling alive even when real work is happening underneath.
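A minimal sketch of filler-while-fetching, assuming an `asyncio`-based agent loop. The `lookup_order` backend call, the order ID, and the `speak` callback are hypothetical placeholders; the short sleep stands in for real API latency.

```python
import asyncio


async def lookup_order(order_id: str) -> str:
    """Hypothetical backend call; the sleep stands in for API latency."""
    await asyncio.sleep(0.05)
    return f"Order {order_id} shipped yesterday."


async def answer_with_filler(order_id: str, speak) -> None:
    # Start the tool call first so it runs while the filler is being spoken.
    fetch = asyncio.create_task(lookup_order(order_id))
    await speak("Let me check that for you...")
    # Stitch in the real answer as soon as the tool call returns.
    await speak(await fetch)


async def demo() -> list:
    spoken = []

    async def speak(text: str) -> None:  # stand-in for streaming TTS
        spoken.append(text)

    await answer_with_filler("A123", speak)
    return spoken


print(asyncio.run(demo()))
```

The ordering matters: creating the task before speaking the filler is what makes the two overlap. A common refinement, not shown here, is to skip the filler when the tool call returns almost immediately.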

#### Stream everything, buffer nothing you don't have to

The golden rule of voice latency: **never wait for a complete output when you can start emitting a partial one.** STT streams partial transcripts as the user speaks. The LLM streams tokens as it reasons. TTS starts speaking the first sentence while the LLM is still generating the second. Each stream overlaps the next, so the user hears the first words of the reply long before the full response exists. A pipeline that processes each stage to completion before starting the next will always feel sluggish, no matter how fast each individual model is.
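One concrete instance of this overlap is feeding TTS sentence by sentence while the LLM is still producing tokens. The sketch below uses plain generators; `fake_llm` is a hypothetical token stream, and splitting on terminal punctuation is an assumed simplification that would mis-split text such as decimals or abbreviations.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment with no punctuation


def fake_llm() -> Iterator[str]:
    # Hypothetical token stream standing in for a streaming LLM response.
    yield from ["Your", " order", " shipped", ".",
                " It", " arrives", " Friday", "."]


for sentence in sentences_from_tokens(fake_llm()):
    # A real pipeline would hand each sentence to streaming TTS here,
    # while the LLM keeps generating the next one.
    print("TTS <-", sentence)
```

Because the generator yields after the first full stop, the first sentence can reach TTS before the second sentence has been generated at all.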

## 7. The 2026 voice tooling landscape

The ecosystem split into clear layers: orchestration frameworks, realtime transport platforms, and model providers (STT, TTS, S2S).

| Layer | Examples | What it gives you |
| --- | --- | --- |
| **Orchestration framework** | Pipecat (v1.0, Apr 2026), LiveKit Agents | The real-time loop: VAD, endpointing, barge-in, pipeline wiring, provider plug-ins |
| **Realtime transport** | LiveKit, WebRTC infra, telephony bridges | Low-latency audio streaming, phone-number integration, scaling concurrent calls |
| **Speech-to-speech APIs** | Vendor realtime/S2S models | Single-model audio-in/audio-out at ~320ms, prosody-preserving |
| **STT / TTS providers** | Streaming ASR + neural TTS vendors | The swappable components of a cascading pipeline |
| **Voice agent platforms** | Hosted end-to-end products | Build/deploy/monitor with less plumbing; trade control for speed-to-market |

#### Build or buy?

**Buy** a hosted platform if you need a working voice agent in days and your use case is standard (scheduling, FAQ, qualification). **Build** on an orchestration framework like Pipecat or LiveKit Agents when you need fine control over the model stack, custom tool integrations, on-prem/compliance constraints, or per-call cost optimization at scale. The framework layer is the sweet spot for most engineering teams in 2026: it hands you the brutal real-time plumbing (barge-in, endpointing) while leaving the model and business-logic choices in your hands.

## 8. The hard problems that aren't the LLM

The counterintuitive lesson of building voice agents is that the language model is rarely the bottleneck. The failures that wreck production are almost all in the audio and timing layers.

#### 1. Transcription errors poison everything downstream

#### 2. Hallucinated transcripts on silence

#### 3. Endpointing is never "solved"

#### 4. Cost compounds per minute

## 9. Measuring a voice agent: the metrics that matter

Five numbers define voice agent quality, and none of them is "LLM accuracy" in isolation.

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| **TTFT (time to first token/audio)** | Latency from end-of-user-speech to start-of-agent-speech | The single most felt number — this is "responsiveness" |
| **WER (word error rate)** | Transcription accuracy of the STT stage | Sets the ceiling on everything downstream |
| **RTF (real-time factor)** | Processing time relative to audio duration | Determines whether you can keep up with the stream live |
| **Interruption precision/recall** | How accurately barge-in fires (and doesn't) | Decides whether the agent feels polite or pushy |
| **Task completion rate** | % of calls that achieve the user's goal | The real business KPI — latency means nothing if the task fails |
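Two of these metrics have simple standard definitions that are easy to compute offline. The sketch below shows WER as word-level edit distance divided by the number of reference words, and RTF as processing time divided by audio duration; the function names are illustrative, and the lowercase-and-split normalization is an assumption (real evaluations usually normalize punctuation and numerals as well).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means the stage keeps up with live audio."""
    return processing_seconds / audio_seconds


print(word_error_rate("book a table for four", "book a table for for"))
print(real_time_factor(processing_seconds=0.5, audio_seconds=2.0))
```

One wrong word out of five gives a WER of 0.2, and in a booking flow that single word is the party size, which is why WER sets the ceiling on everything downstream.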

#### Don't optimize latency in a vacuum

## 10. The role shift: from script writers to conversation designers

Voice AI changes what the team around it actually does. The old IVR world was built by writing rigid call flows — "if the user presses 1, go to node B." The agentic voice world replaces that with **conversation design and policy**: defining the agent's persona and tone, the tools it may call, the escalation rules for handing off to a human, and the compliance guardrails for what it must and must not say.

For the **Project Manager or product owner**, voice introduces governance work that text never required: which calls are recorded and how transcripts are retained, how the agent identifies itself as an AI (a legal requirement in many jurisdictions), what happens when it cannot understand the caller, and how to measure not just deflection rate but customer trust. The voice agent becomes a product with an SLA on latency, an owner for the conversation design, and an audit trail — not a script someone wrote once and forgot.

#### The new central artifact: the conversation spec

Just as runbooks became shared property between SRE and agent, the **conversation spec** — persona, allowed tools, escalation triggers, compliance phrases, fallback behavior — becomes the artifact that product, engineering, and compliance co-own. Engineers wire the pipeline; product writes the persona and flows; compliance reviews the guardrails. Voice is where these three disciplines finally have to share one document.

## 11. How voice agents evolved

- **2010s: IVR & first voice assistants.** Rigid call trees and keyword-spotting assistants. Useful for narrow commands, helpless with open-ended conversation. "Press 1 for billing."
- **2023–2024: LLM cascades arrive.** STT→LLM→TTS pipelines make open conversation possible, but latency sits at 1–2 seconds and interruptions break the experience. Impressive demos, fragile production.
- **2025: Speech-to-speech & sub-second latency.** Audio-native models and streaming-everything pipelines push round-trip below the perceptual threshold. Barge-in and semantic endpointing become standard expectations.
- **2026: Voice as infrastructure.** Pipecat v1.0, LiveKit Agents with tuned interruption handling, ~320ms S2S models. Voice agents move from novelty to production infrastructure across support, sales, and healthcare.
- **2027+ outlook.** Emotionally aware agents that read frustration and adjust, seamless human handoff mid-call, and voice as a first-class channel alongside text and screen.

## 12. Common mistakes to avoid

#### 1. Optimizing the LLM while ignoring transport

Teams obsess over model choice while running on a transport that already ate the budget. Settle the transport choice (WebRTC vs telephony) and fix endpointing first; that is where the seconds hide.

#### 2. Processing each stage to completion

Waiting for the full transcript, then the full LLM response, then full TTS guarantees a sluggish agent. Stream and overlap every stage, or you will never hit the budget.

#### 3. Treating barge-in as "mute the speaker"

Without canceling in-flight LLM and TTS work, the agent resumes its abandoned sentence after the interruption. Real barge-in tears down and discards the old turn.

#### 4. One endpointing threshold for all users

A single silence timeout cannot serve both a slow, thoughtful caller and a fast, impatient one. Use semantic endpointing and tune per use case.

## 13. Conclusion

The hardest thing about voice AI in 2026 is not making a machine talk — it is making it *converse*. Conversation is a real-time dance of turn-taking, interruption, and timing that humans do effortlessly and machines must be engineered to imitate, millisecond by millisecond. The LLM is the easy part; the orchestration of silence, the four-step teardown of barge-in, the streaming overlap that hides latency, and the transcription quality that everything else depends on — that is the real discipline.

### References

- [Softcery — Real-Time (Speech-to-Speech) vs Turn-Based (Cascading STT/TTS) Voice Agent Architecture](https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture)
- [Digital Applied — Voice Agent Infrastructure Stack 2026: Full Reference](https://www.digitalapplied.com/blog/voice-agent-infrastructure-stack-2026-reference)
- [Future AGI — Voice AI Barge-In and Turn-Taking: A 2026 Implementation Guide](https://futureagi.com/blog/voice-ai-barge-in-turn-taking-2026/)
- [Famulor — Realtime vs. Pipeline Voice Agent: Architecture Guide 2026](https://www.famulor.io/blog/realtime-vs-pipeline-voice-agent-architecture-guide-2026)
- [Chanl — Voice AI Pipeline: STT, LLM, TTS and the 300ms Budget](https://www.channel.tel/blog/voice-ai-pipeline-stt-tts-latency-budget)
- [Telnyx — Voice AI Agents Compared on Latency in 2026](https://telnyx.com/resources/voice-ai-agents-compared-latency)
- [Ultravox — Speech-to-Speech Voice Agents: Architecture, Benefits, and How They Work](https://www.ultravox.ai/voice-ai/speech-to-speech-voice-agents-architecture-benefits-and-how-they-work)
- [Retell AI — How Real-Time Voice AI Actually Works (STT → LLM → TTS)](https://www.retellai.com/blog/how-real-time-voice-ai-works-stt-llm-tts)

