Speculative Decoding 2026: How LLMs Generate Text 2–3× Faster

Posted on: 6/15/2026 1:14:48 AM

When you type a question to Claude or ChatGPT, text appears almost instantly and then streams out smoothly. That fluidity hides an uncomfortable truth: under the hood, a large language model (LLM) generates text one token at a time, and each token requires streaming all of its tens of billions of parameters out of GPU memory. This is the fundamental bottleneck of autoregressive decoding. So how do providers still serve responses this fast, to millions of users at once?

The answer, in large part, is a counterintuitive trick called Speculative Decoding: let a small, fast model guess the next several tokens ahead, then let the big model verify the whole batch in a single forward pass. The most elegant part: the final output follows exactly the same distribution as running the big model alone (and under greedy decoding it is the same token sequence, up to floating-point nondeterminism), so there is no quality trade-off. This article dissects why the trick works, the math that makes it "free," the four families dominating 2026, and how to turn it on in production.

  • 2–3×: typical production speedup for interactive workloads
  • ~6.5×: peak EAGLE-3 speedup on some tasks (code, chat)
  • 4–5: average tokens accepted per verify cycle with EAGLE-3
  • 0%: quality loss, since the output matches the original model's distribution

Speculative Decoding in one sentence

A technique that accelerates LLM inference by using a cheap draft to propose several next tokens, then letting the target model verify the whole batch in a single forward pass — accepting the correct guesses and fixing the first wrong token — so the output distribution is unchanged versus running the target model alone.

The real bottleneck: LLMs aren't short on compute, they're short on bandwidth

The common intuition is "to go faster, you need more FLOPs." For LLM decoding in interactive mode (small batch), that is wrong. At each generation step, the GPU must pull all of the model's weights from high-bandwidth memory (HBM) into the compute cores, only to multiply them by a single state vector. The ratio of "compute per byte loaded" (arithmetic intensity) is extremely low, so the GPU spends most of its time waiting on memory while the matrix cores sit idle. This is the memory-bandwidth-bound regime.

The paradoxical consequence: loading the weights to predict 1 token, or to predict 5 tokens at once, takes almost the same time, because the cost is dominated by moving the weights, not by the number of multiplications. In other words, a single forward pass of the big model has a "hidden compute budget" that one-token-at-a-time generation throws away. Speculative Decoding exists precisely to reclaim that wasted budget.
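To put rough numbers on that claim, here is a back-of-the-envelope sketch. The model size, precision, and bandwidth figure are illustrative assumptions rather than measurements, and the helper names are made up for this example. It ignores KV-cache and activation traffic and treats the hardware as a single device with the stated aggregate bandwidth.

```python
def decode_step_lower_bound_ms(n_params: float, bytes_per_param: float,
                               hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on one decode step: time to stream every weight from HBM once."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (hbm_bandwidth_gb_s * 1e9) * 1e3


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per weight byte loaded (about 2 FLOPs per parameter per token)."""
    return 2.0 * tokens_per_pass / bytes_per_param


# Assumed, illustrative values: 70e9 parameters, 2 bytes each (FP16/BF16),
# 3350 GB/s of aggregate memory bandwidth.
step_ms = decode_step_lower_bound_ms(70e9, 2.0, 3350.0)
print(f"{step_ms:.1f} ms per decode step")  # 41.8 ms per decode step

# Scoring 5 positions in one pass reuses the same weight traffic,
# so the intensity rises 5x while the bytes moved stay the same.
print(arithmetic_intensity(1, 2.0), arithmetic_intensity(5, 2.0))  # 1.0 5.0
```

Under these assumptions a step cannot finish faster than about 42 ms no matter how many FLOPs the GPU has, which is why getting several tokens out of the same weight load is so attractive.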

The key idea to remember

Generating text is sequential (each token depends on the previous one), so it can't be parallelized. But verifying a given sequence of tokens can be fully parallelized: one forward pass scores the probabilities at every position simultaneously. All of Speculative Decoding is about exploiting this asymmetry.

The core idea: guessing is cheap, verifying is parallel

One Speculative Decoding cycle has three steps:

  1. Draft. A small, fast helper model autoregressively generates K candidate tokens (e.g. K = 4–8). Because it's small, these K steps are cheap.
  2. Verify. The target model runs one forward pass over (prompt + the K drafted tokens), returning a probability distribution at every position — the big model's "opinion" on each token the draft proposed.
  3. Accept / fix. A clever sampling step decides the longest prefix of the draft to keep, and fixes the first wrong token. The rest of the draft is discarded. Repeat.

If each cycle accepts γ tokens on average, then every forward pass of the big model yields γ+1 tokens instead of 1. Since the big model's forward pass is the dominant cost, generation speed rises roughly γ+1× (minus the cost of running the draft).
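A quick way to reason about this is the standard estimate from the speculative decoding analysis, under the simplifying assumption that each drafted token is accepted independently with the same probability alpha. The expected number of tokens per cycle (the γ+1 above) is then (1 − alpha^(K+1)) / (1 − alpha). The sketch below also assumes that verifying K+1 positions costs the same as one normal target pass, and that one draft step costs a fixed fraction `draft_cost_ratio` of a target pass. All function and parameter names are hypothetical.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass (accepted draft tokens + 1)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per cycle divided by cycle cost, measured in target passes."""
    cycle_cost = 1.0 + draft_cost_ratio * k  # one verify pass + k draft steps
    return expected_tokens_per_cycle(alpha, k) / cycle_cost


def best_k(alpha: float, draft_cost_ratio: float, max_k: int = 16) -> int:
    """Sweep K and return the value with the highest estimated speedup."""
    return max(range(1, max_k + 1),
               key=lambda k: estimated_speedup(alpha, k, draft_cost_ratio))


# Assumed example: 80% per-token acceptance, K = 5, draft step at 5% of a target pass.
print(round(expected_tokens_per_cycle(0.8, 5), 2))  # 3.69
print(round(estimated_speedup(0.8, 5, 0.05), 2))    # 2.95
```

Real acceptance is not independent across positions, so treat this as a planning estimate and confirm it against measured acceptance on your own traffic.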

flowchart TB
    S["Current context"] --> D["DRAFT model (small)<br/>generate K candidate tokens<br/>(autoregressive, cheap)"]
    D --> V["TARGET model (large)<br/>1 forward pass verifies all K<br/>(parallel)"]
    V --> A{"Acceptance sampling<br/>which tokens are right?"}
    A -- "accept prefix + 1 bonus token" --> G["Emit gamma+1 tokens<br/>per large forward pass"]
    A -- "first wrong token" --> F["Fix from residual dist<br/>discard remaining draft"]
    F --> G
    G --> C{"Done?"}
    C -- "no" --> S
    C -- "yes" --> E["Return output"]

    style S fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#16213e,stroke:#fff,color:#fff
    style V fill:#e94560,stroke:#fff,color:#fff
    style A fill:#ff9800,stroke:#fff,color:#fff
    style G fill:#2c3e50,stroke:#fff,color:#fff
The draft–verify loop: the small model guesses, the big model checks the whole batch in one pass, keeping the correct prefix.

The math behind "free": why quality stays identical

What sets Speculative Decoding apart from other acceleration tricks (like quantization or pruning) is that it does not approximate. The output is provably distributed exactly as the target model's. The secret is a variant of rejection sampling, published independently in 2023 by Leviathan et al. (Google) and Chen et al. (DeepMind).

Let q(x) be the probability the draft model assigns to token x, and p(x) the probability the target model assigns to it. For each drafted token, in order:

  • Accept the token with probability min(1, p(x) / q(x)). If the big model "believes" in this token at least as much as the small model does, always keep it.
  • If rejected at some position, sample a replacement token from the normalized residual distribution norm(max(0, p(x) − q(x))), then stop — discard all later drafted tokens.
  • If all K tokens are accepted, sample one extra "bonus" token from the target model's distribution at position K+1 — essentially free, since the big forward pass already computed it.

It can be proven that this procedure produces tokens distributed exactly as if sampled directly from the target model. That's why Speculative Decoding is called a lossless technique: it changes the order of computation, not the distribution of results.
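The three rules above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any engine's actual implementation: the distributions are passed in as plain arrays, `speculative_step` is a hypothetical name, and real systems work on batched logits on the GPU.

```python
import numpy as np


def speculative_step(draft_tokens, q_dists, p_dists, rng):
    """Apply the accept / fix / bonus rules to one cycle.

    draft_tokens: K token ids proposed by the draft.
    q_dists:      K draft distributions; draft_tokens[i] was sampled from q_dists[i].
    p_dists:      K+1 target distributions (the K drafted positions + the bonus one).
    Returns between 1 and K+1 emitted token ids.
    """
    emitted = []
    for i, x in enumerate(draft_tokens):
        p = np.asarray(p_dists[i], dtype=float)
        q = np.asarray(q_dists[i], dtype=float)
        # Accept with probability min(1, p(x) / q(x)); q[x] > 0 because x came from q.
        if rng.random() < min(1.0, p[x] / q[x]):
            emitted.append(int(x))
            continue
        # Rejected: resample from the normalized residual max(0, p - q), then stop.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        emitted.append(int(rng.choice(len(p), p=residual)))
        return emitted
    # Every drafted token was accepted: draw the bonus token from the target.
    bonus = np.asarray(p_dists[-1], dtype=float)
    emitted.append(int(rng.choice(len(bonus), p=bonus)))
    return emitted


# Toy usage with a 3-token vocabulary and K = 1 (all values are made up).
rng = np.random.default_rng(0)
q = np.array([0.6, 0.3, 0.1])
p = np.array([0.2, 0.5, 0.3])
x = rng.choice(3, p=q)
print(speculative_step([x], [q], [p, p], rng))
```

Running the toy case many times and counting the first emitted token reproduces p rather than q, which is the losslessness property in miniature. For example, token 0 is emitted with probability 0.6 × (0.2 / 0.6) = 0.2, exactly p(0).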

The greedy case (low temperature)

Under greedy decoding the rule collapses to something intuitive: accept the longest prefix where the draft's argmax token matches the target's argmax token; at the first divergence, take the target's token. This is why low temperature (more deterministic output) yields a higher acceptance rate — the draft finds it easier to "guess right."
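The greedy rule is simple enough to write down directly. The sketch below assumes the target's argmax token at every drafted position, plus the bonus position, has already been read off the verify pass. `greedy_verify` is a hypothetical helper name.

```python
def greedy_verify(draft_tokens, target_argmax):
    """Keep the longest matching prefix, then take the target's own token.

    target_argmax must have len(draft_tokens) + 1 entries: the target's argmax
    at each drafted position, followed by the bonus position.
    """
    emitted = []
    for drafted, target in zip(draft_tokens, target_argmax):
        # An accepted token equals the target's choice, so emitting `target`
        # covers both the accept case and the fix-at-first-divergence case.
        emitted.append(target)
        if drafted != target:
            return emitted  # later positions were conditioned on a wrong token
    emitted.append(target_argmax[len(draft_tokens)])  # bonus token
    return emitted


print(greedy_verify([5, 9, 2, 7], [5, 9, 4, 1, 8]))  # [5, 9, 4]
print(greedy_verify([5, 9], [5, 9, 3]))              # [5, 9, 3]
```

Either way the emitted tokens are the ones the target would have produced on its own under greedy decoding. The draft only determines how many of them arrive per pass.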

sequenceDiagram
    participant U as Loop
    participant Dr as Draft model (small)
    participant Tg as Target model (large)
    U->>Dr: Current context
    Dr-->>U: K draft tokens: [t1 t2 t3 t4]
    U->>Tg: prompt + [t1 t2 t3 t4] (1 forward pass)
    Tg-->>U: distribution p at each position
    Note over U: Accept t1,t2,t3 (p>=q)<br/>t4 rejected -> resample from (p-q)+
    U-->>U: Emit t1 t2 t3 + t4' = 4 tokens / 1 large pass
    Note over U,Tg: Repeat with new context
A typical cycle: three drafted tokens accepted, the fourth rejected and corrected — four correctly-distributed tokens from a single large forward pass.

The four families dominating 2026

The "draft" can come from many sources, and this is where methods diverge. The core trade-off is always: the closer the draft is to the target, the higher the acceptance rate — but the more expensive the draft is to produce, the more the benefit erodes.

| Family | Draft source | Extra training? | Strength | Best when |
| --- | --- | --- | --- | --- |
| Two-model (draft model) | A small LLM from the same family (e.g. 1B drafts for 70B) | No, if a small model already exists | Simple, easy to reason about, solid gains | You already have a small–large pair with the same tokenizer |
| Medusa | Extra prediction "heads" bolted onto the target model itself | Yes: train the extra heads | No separate helper model, low memory overhead | You want self-speculation, dislike managing two models |
| EAGLE / EAGLE-2 / EAGLE-3 | Autoregression at the feature level + candidate tree | Yes: train a lightweight draft layer | SOTA acceptance rate, highest speedup in 2026 | You need maximum speed and can afford one training run |
| N-gram / Prompt Lookup | Copy spans straight from the prompt/context | No: zero-cost, no model needed | Completely free draft | Input-repetitive workloads: RAG, code edits, summarization |

Two details worth remembering. First, the original EAGLE doesn't guess raw tokens: it drafts at the hidden-feature level and only then projects to tokens, which keeps the draft more "in phase" with the big model and pushes the acceptance rate up. EAGLE-3 (2025) drops feature regression in favor of direct token-level prediction with multi-layer feature fusion, reaching 2–6× with an average of ~4–5 accepted tokens per cycle. Second, Prompt Lookup is beautiful because for tasks where the output reuses much of the input (citing sources in RAG, editing one code file, rewriting a paragraph), the best draft is simply the spans already present in the context, found by plain string matching without running any draft model.
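To make the Prompt Lookup idea concrete, here is a minimal sketch of n-gram drafting over token ids. It illustrates the idea only and is not the implementation used by any particular engine. The function name and parameters are hypothetical.

```python
def prompt_lookup_draft(tokens, max_ngram=3, num_draft=5):
    """Propose draft tokens by copying what followed an earlier match.

    Looks for the most recent earlier occurrence of the trailing n-gram
    (longest n first) and returns up to num_draft tokens that followed it.
    Returns an empty list when nothing matches (no speculation this cycle).
    """
    for n in range(max_ngram, 0, -1):
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # Scan right to left so the most recent match wins; the start range
        # excludes the trailing n-gram itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_draft]
    return []


context = [1, 2, 3, 4, 5, 6, 2, 3]  # the sequence "2 3" appeared earlier
print(prompt_lookup_draft(context, max_ngram=3, num_draft=3))  # [4, 5, 6]
print(prompt_lookup_draft([1, 2, 3]))                          # []
```

The draft costs a list scan instead of a forward pass. When the guess is wrong, the verify step rejects it as usual, so a bad match only wastes some verify compute.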

Token tree: verifying many guess branches at once

Modern methods (EAGLE, Medusa, SpecInfer) don't guess just one token sequence but a whole tree of possibilities. Instead of betting everything on one linear prediction, the draft proposes multiple branches, and the target model verifies the entire tree in a single forward pass via tree attention — a special attention mask that lets each branch "see" only its ancestors. The deepest accepted branch becomes that cycle's output.

flowchart TB
    R(("root token")) --> A1["the"]
    R --> A2["a"]
    A1 --> B1["cat"]
    A1 --> B2["dog"]
    A2 --> B3["big"]
    B1 --> C1["sat"]
    B2 --> C2["ran"]

    style R fill:#e94560,stroke:#fff,color:#fff
    style A1 fill:#2c3e50,stroke:#fff,color:#fff
    style A2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B1 fill:#2c3e50,stroke:#fff,color:#fff
    style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C1 fill:#2c3e50,stroke:#fff,color:#fff
    style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
Tree-based guessing: instead of one chain, the draft proposes several branches; the target verifies the whole tree in one pass and picks the deepest correct branch (highlighted).

The benefit of the tree is a higher expected number of accepted tokens per cycle: if the first straight branch goes wrong early, another branch in the tree may still be correct for longer. The price is a bit of extra compute for branches that get discarded — but since we're in the memory-bound regime, that extra compute is mostly "free."
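The tree attention mask itself is easy to build once the tree is stored as a parent array. The sketch below encodes the example tree from the figure (root, the, a, cat, dog, big, sat, ran) and marks, for each node, which nodes it may attend to. It is a simplified illustration: in a real engine every node also attends to the already-committed context, and the node ordering and function name here are assumptions.

```python
import numpy as np


def tree_attention_mask(parents):
    """mask[i, j] is True when node i may attend to node j (itself or an ancestor)."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for node in range(n):
        cur = node
        while cur != -1:          # walk up to the root
            mask[node, cur] = True
            cur = parents[cur]
    return mask


# Node ids: 0=root, 1=the, 2=a, 3=cat, 4=dog, 5=big, 6=sat, 7=ran
parents = [-1, 0, 0, 1, 1, 2, 3, 4]
mask = tree_attention_mask(parents)

print(mask[6].nonzero()[0].tolist())  # [0, 1, 3, 6]  -> root, the, cat, sat
print(bool(mask[3, 4]))               # False -> "cat" cannot see its sibling "dog"
```

Because sibling branches are masked from each other, all eight nodes can be scored in one forward pass while each branch behaves as if it were the only continuation.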

When it speeds up, and when it backfires

Speculative Decoding is not a magic always-win button. Four factors decide how much it helps, or whether it slows things down.

| Factor | Good for Spec Decoding | Adverse / needs care |
| --- | --- | --- |
| Temperature | Low / greedy → draft guesses easier, high acceptance | High → more random output, acceptance drops |
| Batch size | Small, interactive → memory-bound, big win | Large → already compute-bound, extra verify FLOPs can erode throughput |
| Draft quality | Draft resembles target → high γ | Draft too expensive (large helper) → benefit eaten back |
| Speculation length K | Moderate K, matched to workload predictability | K too large → wasteful when a wrong token appears early |

The common trap

In high-throughput, large-batch serving, the GPU has shifted to compute-bound. There, rejected draft tokens become wasted FLOPs, and Speculative Decoding can reduce total throughput even while improving per-token latency for some requests. Measure on your own workload: optimizing for interactive latency (TPOT) is not the same as optimizing for batch throughput. Also note: this technique speeds up the time between tokens (TPOT), not the time to first token (TTFT), which is determined by the prefill step.
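When measuring, it helps to compute the two latencies separately from the arrival timestamps of streamed tokens. The helper below is a hypothetical sketch, not part of any serving framework. It assumes you record one timestamp per emitted token, in seconds.

```python
def latency_metrics(request_start, token_times):
    """Return (TTFT, TPOT) in seconds from per-token arrival timestamps.

    TTFT: time from request start to the first token (dominated by prefill).
    TPOT: average gap between subsequent tokens (what speculation improves).
    """
    if not token_times:
        raise ValueError("no tokens were produced")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Made-up timestamps: first token at 0.50 s, then one token every 30 ms.
ttft, tpot = latency_metrics(0.0, [0.50, 0.53, 0.56, 0.59, 0.62])
print(f"TTFT={ttft:.2f}s TPOT={tpot * 1000:.0f}ms")  # TTFT=0.50s TPOT=30ms
```

With speculation enabled, tokens tend to arrive in bursts of several per verify pass, so an average over the whole response is more informative than any single inter-token gap.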

From idea to production standard

2018 — Blockwise Parallel Decoding
The seed. The idea of predicting several future tokens at once and re-checking them appears, but without a rigorous mechanism to preserve the distribution.
2023 — Speculative Sampling
The turning point. Leviathan (Google) and Chen (DeepMind) independently prove a rejection-sampling mechanism that keeps the target distribution intact — 2–3× speedup with no quality loss. Speculative Decoding becomes officially "lossless."
2024 — Self-speculation explodes
Dropping the helper model. Medusa (extra heads), EAGLE (feature regression + tree), and Lookahead Decoding (Jacobi iteration, no draft model) arrive almost simultaneously, moving speculation inside the target model itself.
2025 — EAGLE-2/3 & framework integration
SOTA + ready to use. EAGLE-3 pushes speedups to 2–6×. vLLM, TensorRT-LLM, and SGLang make Speculative Decoding a native feature; NVIDIA demonstrates ~3.6× throughput on H200.
2026 — The default standard
From research to infrastructure. Parallel variants (such as P-EAGLE) remove the last sequential bottleneck in the drafting step, alongside dedicated frameworks for training drafts. Speculative Decoding becomes the default configuration for serving LLMs at scale.

Turning it on in production: real config

Good news for engineers: you almost never need to implement the algorithm yourself. Popular serving engines support it out of the box. On vLLM (2026), Speculative Decoding is declared via the speculative_config field, supporting many methods: ngram, eagle, eagle3, medusa, draft_model, and MTP variants.

# vLLM -- n-gram (zero-cost draft), great for RAG / code edits / summarization
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,   # K
        "prompt_lookup_max": 4,        # max n-gram window
    },
)

# vLLM -- EAGLE-3 (highest speedup, needs a trained draft layer)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 6,
    },
)

Quick method-picking tips

  • No training, input-repetitive workload (RAG, code edit) → start with ngram / prompt-lookup. Instant benefit, zero risk.
  • Maximum speedup, can afford one training run → EAGLE-3, currently the best acceptance rate.
  • Already have a small–large pair with the same tokenizer → the two-model draft is the simplest, most predictable choice.
  • Always benchmark on real traffic with the temperature and batch distribution of production before fixing K and the method.

Operational impact: why this is a cost-optimization problem

From an operations lens, Speculative Decoding is one of the rare levers that improves both latency and cost without sacrificing quality. On the same GPU, you emit more tokens per second for each interactive request → smoother streaming, less waiting. For the same target experience, you need fewer GPU-hours to serve the same load → a lower inference bill.

This is the often-overlooked piece in the "AI for operational efficiency" story: most of an LLM product's cost lives in serving-time inference, not training. A single serving-layer config change — enabling Speculative Decoding and tuning K to the workload — can deliver savings that the "switch to a smaller model" route would only buy by paying in quality. Here you trade nothing: the output distribution is preserved exactly.

Do

  • Trial-enable Speculative Decoding by default on every interactive endpoint, and measure before concluding.
  • Start with ngram if the workload has heavy input repetition — free benefit, no training.
  • Tune K to how "guessable" the workload is; track average accepted tokens as a health metric.
  • Separate your goals: optimize for latency (small batch) or throughput (large batch) — the optimal config differs.
  • Keep greedy/low temperature on accuracy-critical endpoints: more deterministic output and a higher acceptance rate at once.
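For the "average accepted tokens" health metric in the list above, a minimal sketch might look like the following. It assumes you can log how many drafted tokens were accepted in each verify cycle. The function name and the shape of the returned dictionary are hypothetical.

```python
def acceptance_stats(accepted_per_cycle, k):
    """Summarize speculation health from per-cycle accepted-token counts.

    accepted_per_cycle: number of drafted tokens accepted in each cycle (0..k).
    k:                  number of tokens drafted per cycle.
    """
    cycles = len(accepted_per_cycle)
    if cycles == 0:
        return {"mean_accepted": 0.0, "acceptance_rate": 0.0,
                "tokens_per_target_pass": 0.0}
    total = sum(accepted_per_cycle)
    mean_accepted = total / cycles
    return {
        "mean_accepted": mean_accepted,
        "acceptance_rate": total / (cycles * k),
        # Each cycle also emits one token from the target (fix or bonus).
        "tokens_per_target_pass": mean_accepted + 1.0,
    }


# Made-up log of five cycles with K = 5.
stats = acceptance_stats([4, 2, 5, 0, 3], k=5)
print(stats)
```

A falling mean on a given endpoint is a hint that the draft no longer fits the traffic, or that K is set higher than the workload's predictability supports.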

Don't

  • Blindly enable it in very large batch mode and be surprised when throughput drops — the GPU is already compute-bound.
  • Pick a draft model that's too large: drafting cost eats the entire speedup.
  • Expect TTFT improvements — this technique speeds up between tokens, not the prefill step.
  • Assume there must be a quality trade-off and "hesitate to enable it" — with proper rejection sampling, the output is lossless.
  • Hard-code one K for every workload without re-measuring.

Conclusion

Speculative Decoding is one of the most elegant ideas in modern AI infrastructure: it doesn't make the model smarter and doesn't change a single parameter; it merely reorders the computation to reclaim the GPU compute that one-token-at-a-time decoding leaves idle while waiting on memory. Guessing is cheap but sequential; verifying is parallel and nearly free in the memory-bound regime. Thanks to a rejection-sampling trick, the final output follows exactly the same distribution as the original model's. From the classic two-model setup to EAGLE-3 and token trees in 2026, the direction of travel is unchanged: make the draft ever more "in phase" with the target so more tokens are accepted per cycle. For anyone building LLM products, this is a rare lever that cuts both latency and cost without sacrificing quality, a config worth enabling by default and worth measuring for your exact workload.

