Speculative Decoding 2026: How LLMs Generate Text 2–3× Faster
Posted on: 6/15/2026 1:14:48 AM
Table of contents
- Speculative Decoding in one sentence
- The real bottleneck: LLMs aren't short on compute, they're short on bandwidth
- The core idea: guessing is cheap, verifying is parallel
- The math behind "free": why quality stays identical
- The four families dominating 2026
- Token tree: verifying many guess branches at once
- When it speeds up, and when it backfires
- From idea to production standard
- Turning it on in production: real config
- Operational impact: why this is a cost-optimization problem
- Conclusion
When you type a question to Claude or ChatGPT, text appears almost instantly and then streams out smoothly. That fluidity hides an uncomfortable truth: under the hood, a large language model (LLM) generates text one token at a time, and each token requires streaming all of its tens of billions of parameters out of GPU memory. This is the fundamental bottleneck of autoregressive decoding. So how do providers still serve responses this fast, to millions of users at once?
The answer, in large part, is a counterintuitive trick called Speculative Decoding: let a small, fast model guess the next several tokens ahead, then let the big model verify the whole batch in a single forward pass. The most elegant part: the output follows exactly the same distribution as running the big model alone (and under greedy decoding, the same tokens), so there is no quality trade-off. This article dissects why the trick works, the math that makes it "free," the four families dominating 2026, and how to turn it on in production.
Speculative Decoding in one sentence
A technique that accelerates LLM inference by using a cheap draft to propose several next tokens, then letting the target model verify the whole batch in a single forward pass — accepting the correct guesses and fixing the first wrong token — so the output distribution is unchanged versus running the target model alone.
The real bottleneck: LLMs aren't short on compute, they're short on bandwidth
The common intuition is "to go faster, you need more FLOPs." For LLM decoding in interactive mode (small batch), that is wrong. At each generation step, the GPU must pull the model's entire weights from high-bandwidth memory (HBM) into the compute cores — only to multiply them by a single state vector. The ratio of "compute per byte loaded" (arithmetic intensity) is extremely low, so the GPU spends most of its time waiting on memory while the matrix cores sit idle. This is the memory-bandwidth-bound regime.
The paradoxical consequence: loading the weights to predict 1 token, or to predict 5 tokens at once, takes almost the same time, because the cost is dominated by moving the weights, not by the number of multiplications. In other words, a single forward pass of the big model has a "hidden compute budget" that one-token-at-a-time generation throws away. Speculative Decoding exists precisely to reclaim that wasted budget.
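To put rough numbers on that "hidden budget", here is a back-of-envelope sketch. The model size, weight precision and HBM bandwidth are illustrative assumptions, not measurements, and the helper names are made up for this example. It ignores KV-cache traffic, multi-GPU sharding and kernel overheads, so treat the result as a floor rather than a prediction.

```python
# Back-of-envelope floor on decode time in the memory-bound regime.
# All numbers are assumptions chosen for illustration only.

def pass_time_s(n_params: float, bytes_per_param: float, hbm_bytes_per_s: float) -> float:
    """Time to stream every weight out of HBM once (one forward pass)."""
    return n_params * bytes_per_param / hbm_bytes_per_s


def tokens_per_second(tokens_per_pass: float, pass_s: float) -> float:
    """Throughput ceiling if each pass yields `tokens_per_pass` tokens."""
    return tokens_per_pass / pass_s


# Assumed: 70B parameters, 2 bytes per weight, 3.35 TB/s of HBM bandwidth.
t = pass_time_s(n_params=70e9, bytes_per_param=2, hbm_bytes_per_s=3.35e12)
print(f"weight-streaming floor per pass: {t * 1e3:.1f} ms")

for yielded in (1, 2, 4):
    # The pass costs about the same whether it yields 1 token or several.
    print(f"{yielded} token(s) per pass -> at most {tokens_per_second(yielded, t):.0f} tok/s")
```

Under these assumptions the per-pass floor is about 42 ms regardless of how many positions the pass scores, which is why yielding several tokens per pass raises the ceiling almost linearly.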
The key idea to remember
Generating text is sequential (each token depends on the previous one), so it can't be parallelized. But verifying a given sequence of tokens can be fully parallelized: one forward pass scores the probabilities at every position simultaneously. All of Speculative Decoding is about exploiting this asymmetry.
The core idea: guessing is cheap, verifying is parallel
One Speculative Decoding cycle has three steps:
- Draft. A small, fast helper model autoregressively generates K candidate tokens (e.g. K = 4–8). Because it's small, these K steps are cheap.
- Verify. The target model runs one forward pass over (prompt + the K drafted tokens), returning a probability distribution at every position: the big model's "opinion" on each token the draft proposed.
- Accept / fix. A clever sampling step decides the longest prefix of the draft to keep, and fixes the first wrong token. The rest of the draft is discarded. Repeat.
If each cycle accepts γ tokens on average, then every forward pass of the big model yields γ+1 tokens instead of 1. Since the big model's forward pass is the dominant cost, generation speed rises roughly γ+1× (minus the cost of running the draft).
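The "minus the cost of running the draft" part can be made concrete with the estimate from Leviathan et al., under the simplifying assumption that each drafted token is accepted independently with the same probability `alpha`. The function names and the `draft_cost_ratio` parameter (one draft step's cost divided by one target pass's cost) are illustrative choices for this sketch.

```python
# Rough speedup model, assuming i.i.d. acceptance with probability alpha.

def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass: accepted prefix + 1 fix/bonus token."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per cycle divided by cycle cost, in units of one target pass."""
    return expected_tokens_per_cycle(alpha, k) / (1.0 + draft_cost_ratio * k)


for k in (2, 4, 8):
    tokens = expected_tokens_per_cycle(alpha=0.8, k=k)
    speedup = estimated_speedup(alpha=0.8, k=k, draft_cost_ratio=0.05)
    print(f"K={k}: {tokens:.2f} tokens/cycle, ~{speedup:.2f}x")
```

The model shows diminishing returns: each extra draft token adds a fixed drafting cost but a geometrically shrinking chance of being accepted, so K has a workload-dependent sweet spot.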
flowchart TB
S["Current context"] --> D["DRAFT model (small)<br/>generate K candidate tokens<br/>(autoregressive, cheap)"]
D --> V["TARGET model (large)<br/>1 forward pass verifies all K<br/>(parallel)"]
V --> A{"Acceptance sampling<br/>which tokens are right?"}
A -- "accept prefix + 1 bonus token" --> G["Emit gamma+1 tokens<br/>per large forward pass"]
A -- "first wrong token" --> F["Fix from residual dist<br/>discard remaining draft"]
F --> G
G --> C{"Done?"}
C -- "no" --> S
C -- "yes" --> E["Return output"]
style S fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style D fill:#16213e,stroke:#fff,color:#fff
style V fill:#e94560,stroke:#fff,color:#fff
style A fill:#ff9800,stroke:#fff,color:#fff
style G fill:#2c3e50,stroke:#fff,color:#fff
The math behind "free": why quality stays identical
What sets Speculative Decoding apart from other acceleration tricks (like quantization or pruning) is that it does not approximate. The output is provably distributed exactly as the target model's. The secret is a variant of rejection sampling, published independently in 2023 by Leviathan et al. (Google) and Chen et al. (DeepMind).
Let q(x) be the probability the draft model assigns to token x, and p(x) the probability the target model assigns to it. For each drafted token, in order:
- Accept the token with probability min(1, p(x) / q(x)). If the big model "believes" in this token at least as much as the small model does, always keep it.
- If rejected at some position, sample a replacement token from the normalized residual distribution norm(max(0, p(x) − q(x))), then stop and discard all later drafted tokens.
- If all K tokens are accepted, sample one extra "bonus" token from the target model's distribution at position K+1, which is essentially free since the big forward pass already computed it.
It can be proven that this procedure produces tokens distributed exactly as if sampled directly from the target model. That's why Speculative Decoding is called a lossless technique: it changes the order of computation, not the distribution of results.
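For a single position, the argument is short enough to write out. This is a sketch for one drafted token, using the same p and q as above; β is introduced here only as shorthand for the overall acceptance probability.

```latex
\begin{aligned}
\beta &= \sum_{y} \min\bigl(p(y),\, q(y)\bigr)
  && \text{(probability that the drafted token is accepted)} \\
\sum_{y} \max\bigl(0,\, p(y) - q(y)\bigr)
  &= \sum_{y} \bigl(p(y) - \min(p(y),\, q(y))\bigr) = 1 - \beta
  && \text{(normalizer of the residual)} \\
\Pr[\text{emit } x]
  &= \underbrace{q(x)\,\min\!\Bigl(1,\, \tfrac{p(x)}{q(x)}\Bigr)}_{\text{drafted and accepted}}
   + \underbrace{(1-\beta)\,\frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{1-\beta}}_{\text{rejected, then resampled}} \\
  &= \min\bigl(p(x),\, q(x)\bigr) + \max\bigl(0,\, p(x) - q(x)\bigr) = p(x)
\end{aligned}
```

The rejection probability 1 − β and the residual's normalizer are the same quantity, so they cancel and the emitted token is distributed exactly as p. Applying the same argument position by position extends the result to the whole drafted sequence.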
The greedy case (low temperature)
Under greedy decoding the rule collapses to something intuitive: accept the longest prefix where the draft's argmax token matches the target's argmax token; at the first divergence, take the target's token. This is why low temperature (more deterministic output) yields a higher acceptance rate — the draft finds it easier to "guess right."
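The greedy rule fits in a few lines. This is a minimal sketch, assuming the target's argmax at every drafted position (plus one position after the last draft token) has already been read off a single forward pass; `greedy_accept` is a name invented for this example.

```python
from typing import List


def greedy_accept(draft_tokens: List[int], target_argmax: List[int]) -> List[int]:
    """Return the tokens emitted by one cycle under greedy decoding.

    target_argmax[i] is the target's top token given the context plus
    draft_tokens[:i], so it has len(draft_tokens) + 1 entries.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("need one target prediction per draft token, plus one")

    emitted: List[int] = []
    for drafted, wanted in zip(draft_tokens, target_argmax):
        if drafted != wanted:
            emitted.append(wanted)  # first divergence: keep the target's token, stop
            return emitted
        emitted.append(drafted)     # draft matched the target: accept it

    emitted.append(target_argmax[-1])  # every draft token matched: bonus token
    return emitted


# Draft diverges at the third position, so two tokens are kept and one is fixed.
print(greedy_accept([5, 7, 9, 2], [5, 7, 1, 4, 8]))
```

Predictions after the first mismatch are ignored because they were conditioned on a draft token that was not kept.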
sequenceDiagram
participant U as Loop
participant Dr as Draft model (small)
participant Tg as Target model (large)
U->>Dr: Current context
Dr-->>U: K draft tokens: [t1 t2 t3 t4]
U->>Tg: prompt + [t1 t2 t3 t4] (1 forward pass)
Tg-->>U: distribution p at each position
Note over U: Accept t1,t2,t3 (each with prob. min(1, p/q))<br/>t4 rejected -> resample from norm(max(0, p-q))
U-->>U: Emit t1 t2 t3 + t4' = 4 tokens / 1 large pass
Note over U,Tg: Repeat with new context
The four families dominating 2026
The "draft" can come from many sources, and this is where methods diverge. The core trade-off is always: the closer the draft is to the target, the higher the acceptance rate — but the more expensive the draft is to produce, the more the benefit erodes.
| Family | Draft source | Extra training? | Strength | Best when |
|---|---|---|---|---|
| Two-model (draft model) | A small LLM from the same family (e.g. 1B drafts for 70B) | No, if a small model already exists | Simple, easy to reason about, solid gains | You already have a small–large pair with the same tokenizer |
| Medusa | Extra prediction "heads" bolted onto the target model itself | Yes — train the extra heads | No separate helper model, low memory overhead | You want self-speculation, dislike managing two models |
| EAGLE / EAGLE-2 / EAGLE-3 | Autoregression at the feature level + candidate tree | Yes — train a lightweight draft layer | SOTA acceptance rate, highest speedup in 2026 | You need maximum speed and can afford one training run |
| N-gram / Prompt Lookup | Copy spans straight from the prompt/context | No — zero-cost, no model needed | Completely free draft | Input-repetitive workloads: RAG, code edits, summarization |
Two details worth remembering. First, EAGLE doesn't guess raw tokens: it predicts at the hidden feature level and only then projects to tokens, which keeps the draft more "in phase" with the big model and pushes the acceptance rate up. EAGLE-3 (2025) drops feature regression for direct token-level prediction with multi-layer feature fusion, with reported speedups of roughly 2–6× and an average of ~4–5 accepted tokens per cycle. Second, Prompt Lookup is beautiful because for tasks where the output reuses much of the input (citing sources in RAG, editing one code file, rewriting a paragraph), the best draft is simply the spans already present in the context, produced without running any draft model at all.
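The prompt-lookup idea is simple enough to sketch directly. The version below is an illustrative simplification, not vLLM's implementation: it looks for the most recent earlier occurrence of the context's last few tokens and proposes whatever followed it. `prompt_lookup_draft` and its parameters are names chosen for this example.

```python
from typing import List, Sequence


def prompt_lookup_draft(context: Sequence, max_ngram: int = 3, num_draft: int = 5) -> List:
    """Propose draft tokens by copying the continuation of a repeated n-gram.

    Tries the longest suffix first (up to max_ngram tokens) and falls back to
    shorter ones; returns [] when nothing in the context repeats.
    """
    tokens = list(context)
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan right to left so the most recent earlier match wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_draft]
    return []


context = "the cat sat on the mat . the cat".split()
print(prompt_lookup_draft(context, max_ngram=3, num_draft=3))
```

The proposal costs only a string search, so a wrong guess wastes almost nothing on the draft side. The target's verification pass still decides what is kept.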
Token tree: verifying many guess branches at once
Modern methods (EAGLE, Medusa, SpecInfer) don't guess just one token sequence but a whole tree of possibilities. Instead of betting everything on one linear prediction, the draft proposes multiple branches, and the target model verifies the entire tree in a single forward pass via tree attention — a special attention mask that lets each branch "see" only its ancestors. The deepest accepted branch becomes that cycle's output.
flowchart TB
R(("root token")) --> A1["the"]
R --> A2["a"]
A1 --> B1["cat"]
A1 --> B2["dog"]
A2 --> B3["big"]
B1 --> C1["sat"]
B2 --> C2["ran"]
style R fill:#e94560,stroke:#fff,color:#fff
style A1 fill:#2c3e50,stroke:#fff,color:#fff
style A2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style B1 fill:#2c3e50,stroke:#fff,color:#fff
style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style C1 fill:#2c3e50,stroke:#fff,color:#fff
style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
The benefit of the tree is a higher expected number of accepted tokens per cycle: if the first straight branch goes wrong early, another branch in the tree may still be correct for longer. The price is a bit of extra compute for branches that get discarded — but since we're in the memory-bound regime, that extra compute is mostly "free."
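The attention mask for the tree in the diagram can be sketched as follows. It is a minimal illustration covering only the drafted nodes; in a real engine every node would also attend to the full preceding context. The flat `parents` encoding and the function name are assumptions made for this example.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when node i may attend to node j.

    parents[i] is the index of node i's parent, or -1 for the root.
    Each node sees itself and its ancestors, never sibling branches.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up from the node to the root
            mask[i][j] = True
            j = parents[j]
    return mask


# The tree from the diagram above, flattened into one sequence.
tokens = ["root", "the", "a", "cat", "dog", "big", "sat", "ran"]
parents = [-1, 0, 0, 1, 1, 2, 3, 4]

mask = tree_attention_mask(parents)
for tok, row in zip(tokens, mask):
    print(f"{tok:>4}: " + " ".join("1" if v else "." for v in row))
```

With this mask, "sat" attends to root, "the" and "cat" but not to "dog" or "a", so all branches can share one forward pass without contaminating each other.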
When it speeds up, and when it backfires
Speculative Decoding is not a magic always-win button. Four factors decide how much it helps, or whether it slows things down.
| Factor | Good for Spec Decoding | Adverse / needs care |
|---|---|---|
| Temperature | Low / greedy → draft guesses easier, high acceptance | High → more random output, acceptance drops |
| Batch size | Small, interactive → memory-bound, big win | Large → already compute-bound, extra verify FLOPs can erode throughput |
| Draft quality | Draft resembles target → high γ | Draft too expensive (large helper) → benefit eaten back |
| Speculation length K | Moderate K, matched to workload predictability | K too large → wasteful when a wrong token appears early |
The common trap
In high-throughput, large-batch serving, the GPU has shifted to compute-bound. There, rejected draft tokens become wasted FLOPs, and Speculative Decoding can reduce total throughput even while improving per-token latency for some requests. Measure on your own workload: optimizing for interactive latency (TPOT) is not the same as optimizing for batch throughput. Also note: this technique speeds up the time between tokens (TPOT), not the time to first token (TTFT), which is determined by the prefill step.
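A crude roofline check gives a feel for where that shift happens. This sketch compares the arithmetic intensity of the weight matmuls against the hardware's FLOPs-per-byte balance point. The peak-FLOPs and bandwidth figures are assumed round numbers, and the model ignores attention and KV-cache traffic, so it is a rough indicator rather than a substitute for benchmarking.

```python
# Crude roofline test: is a verify pass still memory-bound?
# Hardware numbers are assumptions for illustration only.

def machine_balance(peak_flops: float, hbm_bytes_per_s: float) -> float:
    """FLOPs the chip can perform per byte it can load from HBM."""
    return peak_flops / hbm_bytes_per_s


def verify_intensity(batch_size: int, k: int, bytes_per_param: float) -> float:
    """FLOPs per weight byte when each request verifies k drafts + 1 position."""
    tokens_in_pass = batch_size * (k + 1)
    return 2.0 * tokens_in_pass / bytes_per_param  # ~2 FLOPs per weight per token


def is_memory_bound(batch_size, k, bytes_per_param, peak_flops, hbm_bytes_per_s) -> bool:
    return verify_intensity(batch_size, k, bytes_per_param) < machine_balance(
        peak_flops, hbm_bytes_per_s
    )


PEAK_FLOPS = 1e15   # assumed ~1 PFLOP/s of dense 16-bit compute
HBM_BW = 3.35e12    # assumed 3.35 TB/s of HBM bandwidth

for batch in (1, 8, 32, 128):
    bound = "memory" if is_memory_bound(batch, 5, 2, PEAK_FLOPS, HBM_BW) else "compute"
    print(f"batch={batch:>3}, K=5 -> {bound}-bound")
```

Under these assumptions small batches sit far below the balance point, so the extra verify positions are nearly free. At large batch sizes they start competing for the same matrix cores.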
From idea to production standard
Turning it on in production: real config
Good news for engineers: you almost never need to implement the algorithm yourself. Popular serving engines support it out of the box. On vLLM (2026), Speculative Decoding is declared via the speculative_config field, supporting many methods: ngram, eagle, eagle3, medusa, draft_model, and MTP variants.
# vLLM -- n-gram (zero-cost draft), great for RAG / code edits / summarization
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,  # K
        "prompt_lookup_max": 4,       # max n-gram window
    },
)

# vLLM -- EAGLE-3 (highest speedup, needs a trained draft layer)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 6,
    },
)
Quick method-picking tips
- No training, input-repetitive workload (RAG, code edit) → start with ngram / prompt-lookup. Instant benefit, zero risk.
- Maximum speedup, can afford one training run → EAGLE-3, currently the best acceptance rate.
- Already have a small–large pair with the same tokenizer → the two-model draft is the simplest, most predictable choice.
- Always benchmark on real traffic with the temperature and batch distribution of production before fixing K and the method.
Operational impact: why this is a cost-optimization problem
From an operations lens, Speculative Decoding is one of the rare levers that improves both latency and cost without sacrificing quality. On the same GPU, you emit more tokens per second for each interactive request → smoother streaming, less waiting. For the same target experience, you need fewer GPU-hours to serve the same load → a lower inference bill.
This is the often-overlooked piece in the "AI for operational efficiency" story: most of an LLM product's cost lives in serving-time inference, not training. A single serving-layer config change — enabling Speculative Decoding and tuning K to the workload — can deliver savings that the "switch to a smaller model" route would only buy by paying in quality. Here you trade nothing: the output distribution is preserved exactly.
Do
- Trial-enable Speculative Decoding by default on every interactive endpoint, and measure before concluding.
- Start with ngram if the workload has heavy input repetition: free benefit, no training.
- Tune K to how "guessable" the workload is; track average accepted tokens as a health metric.
- Separate your goals: optimize for latency (small batch) or throughput (large batch); the optimal config differs.
- Keep greedy/low temperature on accuracy-critical endpoints — better quality and a higher acceptance rate at once.
Don't
- Blindly enable it in very large batch mode and be surprised when throughput drops — the GPU is already compute-bound.
- Pick a draft model that's too large: drafting cost eats the entire speedup.
- Expect TTFT improvements — this technique speeds up between tokens, not the prefill step.
- Assume there must be a quality trade-off and "hesitate to enable it" — with proper rejection sampling, the output is lossless.
- Hard-code one K for every workload without re-measuring.
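The "average accepted tokens" health metric from the Do list can be tracked with a small counter. This is a hypothetical sketch: `SpecDecodeStats` is not part of any serving engine, and in practice the per-cycle counts would come from whatever metrics your engine exposes.

```python
from dataclasses import dataclass


@dataclass
class SpecDecodeStats:
    """Running totals over speculative decoding cycles (illustrative helper)."""

    cycles: int = 0
    drafted: int = 0
    accepted: int = 0

    def record(self, num_drafted: int, num_accepted: int) -> None:
        """Log one cycle: how many tokens were drafted and how many were kept."""
        if not 0 <= num_accepted <= num_drafted:
            raise ValueError("accepted must be between 0 and drafted")
        self.cycles += 1
        self.drafted += num_drafted
        self.accepted += num_accepted

    @property
    def acceptance_rate(self) -> float:
        """Fraction of drafted tokens that survived verification."""
        return self.accepted / self.drafted if self.drafted else 0.0

    @property
    def mean_tokens_per_cycle(self) -> float:
        """Accepted draft tokens plus the one fix/bonus token each cycle emits."""
        return (self.accepted + self.cycles) / self.cycles if self.cycles else 0.0


stats = SpecDecodeStats()
for drafted, accepted in [(5, 3), (5, 5), (5, 0)]:
    stats.record(drafted, accepted)
print(f"acceptance rate: {stats.acceptance_rate:.2f}")
print(f"tokens per target pass: {stats.mean_tokens_per_cycle:.2f}")
```

A sustained drop in tokens per target pass after a traffic or temperature change is a signal to revisit K or the drafting method.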
Conclusion
Speculative Decoding is one of the most elegant ideas in modern AI infrastructure: it doesn't make the model smarter and doesn't change a single parameter; it merely reorders the computation to reclaim the GPU compute that one-token-at-a-time decoding leaves idle. Drafting is cheap but sequential; verifying is parallel and nearly free in the memory-bound regime. Thanks to a rejection-sampling trick, the output distribution is exactly that of the original model. From the classic two-model setup to EAGLE-3 and token trees in 2026, the direction of travel is unchanged: make the draft ever more "in phase" with the target so more tokens are accepted per cycle. For anyone building LLM products, this is a rare lever that cuts both latency and cost without sacrificing quality: a config worth trialing on every interactive endpoint and worth measuring for your exact workload.
References
- Leviathan, Kalman, Matias (Google) — Fast Inference from Transformers via Speculative Decoding (ICML 2023)
- Chen et al. (DeepMind) — Accelerating Large Language Model Decoding with Speculative Sampling
- Cai et al. — Medusa: Simple LLM Inference Acceleration with Multiple Decoding Heads
- Li et al. — EAGLE-3: Scaling Inference Acceleration via Training-Time Test
- vLLM — Speculative Decoding documentation (ngram / EAGLE / Medusa)
- AWS ML Blog — P-EAGLE: Parallel Speculative Decoding in vLLM
- Hugging Face — Speculative Decoding in Practice: How EAGLE-3 Makes LLMs Faster