# Speculative Decoding 2026: How LLMs Generate Text 2–3× Faster

Posted on: 6/15/2026 1:14:48 AM

When you type a question to Claude or ChatGPT, text appears almost instantly and then streams out smoothly. That fluidity hides an uncomfortable truth: under the hood, a large language model (LLM) generates text **one token at a time**, and each token requires streaming all of its tens of billions of parameters out of GPU memory. This is the fundamental bottleneck of autoregressive decoding. So how do providers still serve responses this fast, to millions of users at once?

The answer, in large part, is a counterintuitive trick called **Speculative Decoding**: let a small, fast model *guess ahead* the next several tokens, then let the big model *verify the whole batch at once* in a single forward pass. The most elegant part: the final output follows **exactly the same distribution** as running the big model alone, so there is no quality trade-off (under greedy decoding it is the same token sequence, up to floating-point effects). This article dissects why the trick works, the math that makes it "free," the four families dominating 2026, and how to turn it on in production.

- **2–3×**: typical production speedup for interactive workloads
- **~6.5×**: peak EAGLE-3 speedup on some tasks (code, chat)
- **4–5**: average tokens accepted per verify cycle with EAGLE-3
- **0%**: quality loss; output matches the original model's distribution

#### Speculative Decoding in one sentence

A technique that accelerates LLM inference by using a cheap **draft** to propose several next tokens, then letting the **target** model verify the whole batch in a single forward pass — accepting the correct guesses and fixing the first wrong token — so the output distribution is *unchanged* versus running the target model alone.

## The real bottleneck: LLMs aren't short on compute, they're short on bandwidth

The common intuition is "to go faster, you need more FLOPs." For LLM decoding in interactive mode (small batch), that is **wrong**. At each generation step, the GPU must pull the model's entire weights from high-bandwidth memory (HBM) into the compute cores — only to multiply them by a *single* state vector. The ratio of "compute per byte loaded" (arithmetic intensity) is extremely low, so the GPU spends most of its time **waiting on memory** while the matrix cores sit idle. This is the *memory-bandwidth-bound* regime.

The paradoxical consequence: loading the weights to predict 1 token, or to predict 5 tokens at once, takes almost the **same time**, because the cost is dominated by moving the weights, not by the number of multiplications. In other words, a single forward pass of the big model has a "hidden compute budget" that one-token-at-a-time generation throws away. Speculative Decoding exists precisely to **reclaim that wasted budget**.
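
A rough back-of-the-envelope estimate makes the point concrete. The sketch below is a simple roofline calculation, not a benchmark: the bandwidth and FLOP/s figures are illustrative assumptions in the range of a current data-center GPU, the 70B model is treated as if its weights sat behind a single memory system, and the KV cache, attention cost, and kernel overheads are ignored. The function name `decode_step_times` is hypothetical.

```python
# Back-of-the-envelope roofline estimate for one decode step.
# All hardware numbers are illustrative assumptions, not measurements.

def decode_step_times(n_params, bytes_per_param, n_tokens, hbm_bytes_per_s, flops_per_s):
    """Return (memory_time_s, compute_time_s) for one forward pass over n_tokens positions."""
    weight_bytes = n_params * bytes_per_param
    memory_time = weight_bytes / hbm_bytes_per_s          # weights are read once per pass
    compute_time = 2 * n_params * n_tokens / flops_per_s  # ~2 FLOPs per parameter per token
    return memory_time, compute_time

N_PARAMS = 70e9          # 70B-parameter model
BYTES_PER_PARAM = 2      # FP16/BF16 weights
HBM_BANDWIDTH = 3.35e12  # assumed bytes/s
PEAK_FLOPS = 990e12      # assumed dense FP16 FLOP/s

for n_tokens in (1, 5):
    mem_t, comp_t = decode_step_times(
        N_PARAMS, BYTES_PER_PARAM, n_tokens, HBM_BANDWIDTH, PEAK_FLOPS
    )
    step_t = max(mem_t, comp_t)  # simple roofline: the slower resource sets the pace
    print(f"{n_tokens} token(s): memory {mem_t * 1e3:.1f} ms, "
          f"compute {comp_t * 1e3:.2f} ms, step ~{step_t * 1e3:.1f} ms")
```

Under these assumptions the memory time is identical for 1 and 5 positions, and the compute time stays far below it in both cases, which is the slack that verification uses.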

#### The key idea to remember

Generating text is **sequential** (each token depends on the previous one), so it can't be parallelized. But **verifying** a given sequence of tokens **can be fully parallelized**: one forward pass scores the probabilities at every position simultaneously. All of Speculative Decoding is about exploiting this asymmetry.

## The core idea: guessing is cheap, verifying is parallel

One Speculative Decoding cycle has three steps:

1. **Draft.** A small, fast helper model autoregressively generates `K` candidate tokens (e.g. K = 4–8). Because it's small, these K steps are cheap.
2. **Verify.** The target model runs **one** forward pass over (prompt + the K drafted tokens), returning a probability distribution at every position — the big model's "opinion" on each token the draft proposed.
3. **Accept / fix.** A clever sampling step decides the *longest prefix* of the draft to keep, and fixes the first wrong token. The rest of the draft is discarded. Repeat.

If each cycle accepts γ tokens on average, then every forward pass of the big model yields γ+1 tokens instead of 1. Since the big model's forward pass is the dominant cost, generation speed rises roughly (γ+1)×, minus the cost of running the draft.

```mermaid
flowchart TB
    S["Current context"] --> D["DRAFT model (small)<br/>generate K candidate tokens<br/>(autoregressive, cheap)"]
    D --> V["TARGET model (large)<br/>1 forward pass verifies all K<br/>(parallel)"]
    V --> A{"Acceptance sampling<br/>which tokens are right?"}
    A -- "accept prefix + 1 bonus token" --> G["Emit gamma+1 tokens<br/>per large forward pass"]
    A -- "first wrong token" --> F["Fix from residual dist<br/>discard remaining draft"]
    F --> G
    G --> C{"Done?"}
    C -- "no" --> S
    C -- "yes" --> E["Return output"]

    style S fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#16213e,stroke:#fff,color:#fff
    style V fill:#e94560,stroke:#fff,color:#fff
    style A fill:#ff9800,stroke:#fff,color:#fff
    style G fill:#2c3e50,stroke:#fff,color:#fff
```

The draft–verify loop: the small model guesses, the big model checks the whole batch in one pass, keeping the correct prefix.
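
The loop can be sketched end to end with toy stand-ins for the two models. Everything here is hypothetical: `target_next` and `draft_next` are plain functions rather than neural networks, the verification step uses exact token matching (the greedy variant), and the single target forward pass is simulated by a list comprehension that is counted as one pass.

```python
# Toy greedy speculative decoding loop. The "models" are plain functions that
# map a context (list of tokens) to the next token; both are hypothetical.

TEXT = "the quick brown fox jumps over the lazy dog".split()

def target_next(context):
    """Stand-in for the large model: always continues TEXT correctly."""
    return TEXT[len(context)]

def draft_next(context):
    """Stand-in for the small model: wrong at every position where pos % 4 == 3."""
    pos = len(context)
    return "???" if pos % 4 == 3 else TEXT[pos]

def speculative_generate(prompt, total_len, k):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) < total_len:
        # 1. Draft: up to k cheap sequential steps with the small model.
        draft = []
        for _ in range(min(k, total_len - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: one target pass gives the target's choice at every drafted
        #    position plus one extra position (simulated here, counted once).
        target_passes += 1
        n_positions = min(len(draft) + 1, total_len - len(tokens))
        target_choice = [target_next(tokens + draft[:i]) for i in range(n_positions)]
        # 3. Accept the longest matching prefix, then take the target's own token.
        n_ok = 0
        while n_ok < len(draft) and draft[n_ok] == target_choice[n_ok]:
            n_ok += 1
        tokens += draft[:n_ok]
        if n_ok < len(target_choice):
            tokens.append(target_choice[n_ok])  # correction or bonus token
    return tokens, target_passes

out, passes = speculative_generate(["the"], total_len=len(TEXT), k=4)
print(" ".join(out))
print(f"target passes: {passes} (plain decoding would need {len(TEXT) - 1})")
```

The output sequence is exactly what the target alone would produce; only the number of target passes changes.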

## The math behind "free": why quality stays identical

What sets Speculative Decoding apart from other acceleration tricks (like quantization or pruning) is that it does **not approximate**. The output is provably distributed *exactly* as the target model's. The secret is a variant of **rejection sampling**, published independently in 2023 by Leviathan et al. (Google) and Chen et al. (DeepMind).

Let `q(x)` be the probability the draft model assigns to token `x`, and `p(x)` the probability the target model assigns to it. For each drafted token, in order:

- **Accept** the token with probability `min(1, p(x) / q(x))`. If the big model "believes" in this token at least as much as the small model does, always keep it.
- **If rejected** at some position, sample a replacement token from the normalized *residual distribution* `norm(max(0, p(x) − q(x)))`, then **stop** — discard all later drafted tokens.
- **If all K tokens are accepted**, sample one extra "bonus" token from the target model's distribution at position K+1 — essentially free, since the big forward pass already computed it.

It can be proven that this procedure produces tokens distributed *exactly* as if sampled directly from the target model. That's why Speculative Decoding is called a **lossless** technique: it changes the *order of computation*, not the *distribution of results*.
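
The three rules above can be checked empirically for a single position. This is a minimal sketch using only the standard library, with made-up toy distributions `p` and `q` over a three-token vocabulary; the function name `speculative_sample_one` is hypothetical. If the procedure is lossless, the token frequencies should approach `p`, and the fraction of accepted drafts should approach the sum of `min(p(x), q(x))` over the vocabulary.

```python
import random

def speculative_sample_one(p, q, rng):
    """Draw one token via draft-then-verify; p and q are lists of probabilities.

    Returns (token, accepted). The returned token is distributed according to p.
    """
    vocab = list(range(len(p)))
    x = rng.choices(vocab, weights=q)[0]          # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with probability min(1, p/q)
        return x, True
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    # rng.choices normalizes the weights, so this samples from norm(max(0, p - q)).
    return rng.choices(vocab, weights=residual)[0], False

p = [0.6, 0.3, 0.1]   # target distribution (toy values)
q = [0.2, 0.5, 0.3]   # draft distribution (toy values)
rng = random.Random(0)
n = 200_000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    tok, ok = speculative_sample_one(p, q, rng)
    counts[tok] += 1
    accepted += ok

print("empirical:", [round(c / n, 3) for c in counts], "target:", p)
print("acceptance rate:", round(accepted / n, 3),
      "expected:", round(sum(min(a, b) for a, b in zip(p, q)), 3))
```

Note how poor the draft is here (it prefers token 1 while the target prefers token 0), yet the emitted tokens still follow `p`; a bad draft costs speed, not correctness.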

#### The greedy case (low temperature)

Under greedy decoding the rule collapses to something intuitive: accept the longest prefix where the draft's argmax token matches the target's argmax token; at the first divergence, take the target's token. This is why low temperature (more deterministic output) yields a **higher acceptance rate** — the draft finds it easier to "guess right."
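
A minimal sketch of that greedy rule, assuming the target's forward pass has already produced one probability list per drafted position plus one extra position. The function names and the toy probabilities are hypothetical.

```python
def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)

def greedy_verify(draft_tokens, target_probs):
    """draft_tokens: K drafted token ids.
    target_probs: K+1 target distributions, one per drafted position plus one extra.
    Returns the tokens to emit for this cycle.
    """
    out = []
    for tok, probs in zip(draft_tokens, target_probs):
        best = argmax(probs)
        if tok != best:
            out.append(best)   # first divergence: take the target's token and stop
            return out
        out.append(tok)        # draft matched the target's argmax
    out.append(argmax(target_probs[len(draft_tokens)]))  # all accepted: bonus token
    return out

target_probs = [
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
print(greedy_verify([2, 0, 1], target_probs))  # third draft token diverges
print(greedy_verify([2, 0, 0], target_probs))  # all three accepted, plus bonus
```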

```mermaid
sequenceDiagram
    participant U as Loop
    participant Dr as Draft model (small)
    participant Tg as Target model (large)
    U->>Dr: Current context
    Dr-->>U: K draft tokens: [t1 t2 t3 t4]
    U->>Tg: prompt + [t1 t2 t3 t4] (1 forward pass)
    Tg-->>U: distribution p at each position
    Note over U: Accept t1,t2,t3 (p>=q)<br/>t4 rejected -> resample from (p-q)+
    U-->>U: Emit t1 t2 t3 + t4' = 4 tokens / 1 large pass
    Note over U,Tg: Repeat with new context
```

A typical cycle: three drafted tokens accepted, the fourth rejected and corrected — four correctly-distributed tokens from a single large forward pass.

## The four families dominating 2026

The "draft" can come from many sources, and this is where methods diverge. The core trade-off is always: **the closer the draft is to the target, the higher the acceptance rate — but the more expensive the draft is to produce, the more the benefit erodes.**

| Family | Draft source | Extra training? | Strength | Best when |
| --- | --- | --- | --- | --- |
| **Two-model** (draft model) | A small LLM from the same family (e.g. 1B drafts for 70B) | No, if a small model already exists | Simple, easy to reason about, solid gains | You already have a small–large pair with the same tokenizer |
| **Medusa** | Extra prediction "heads" bolted onto the target model itself | Yes — train the extra heads | No separate helper model, low memory overhead | You want self-speculation, dislike managing two models |
| **EAGLE / EAGLE-2 / EAGLE-3** | Autoregression at the *feature* level + candidate tree | Yes — train a lightweight draft layer | SOTA acceptance rate, highest speedup in 2026 | You need maximum speed and can afford one training run |
| **N-gram / Prompt Lookup** | Copy spans straight from the prompt/context | No — zero-cost, no model needed | Completely free draft | Input-repetitive workloads: RAG, code edits, summarization |

Two details worth remembering. First, the original **EAGLE** doesn't guess raw tokens directly: it predicts at the *hidden feature* level and only then projects to tokens, which keeps the draft more "in phase" with the big model and pushes the acceptance rate up. EAGLE-3 (2025) drops feature regression for direct token-level prediction with multi-layer fusion, reaching 2–6× with an average of ~4–5 accepted tokens per cycle. Second, **Prompt Lookup** is beautiful because for tasks where the output reuses much of the input (citing sources in RAG, editing one code file, rewriting a paragraph), the best draft is simply *the spans already present in the context*, found by string matching with no model forward pass at all.
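
The prompt-lookup idea fits in a few lines. The sketch below illustrates the general technique and is not any particular engine's implementation; the function name and the parameters `max_ngram` and `num_draft` are hypothetical. It looks for the most recent earlier occurrence of the current suffix and proposes whatever followed it.

```python
def prompt_lookup_draft(tokens, max_ngram=3, num_draft=5):
    """Propose draft tokens by copying what followed the most recent earlier
    occurrence of the current suffix. Returns [] when nothing matches."""
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):   # prefer longer matches
        suffix = tokens[-n:]
        # Scan earlier start positions right to left, excluding the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_draft]
    return []

context = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b",
           "def", "add", "("]
print(prompt_lookup_draft(context))          # copies the earlier signature
print(prompt_lookup_draft(["x", "y", "z"]))  # no repetition, so no draft
```

When no match exists the function returns an empty draft and the cycle degrades to ordinary one-token decoding, which is why this method suits input-repetitive workloads in particular.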

## Token tree: verifying many guess branches at once

Modern methods (EAGLE, Medusa, SpecInfer) don't guess just *one* token sequence but a whole **tree** of possibilities. Instead of betting everything on one linear prediction, the draft proposes multiple branches, and the target model verifies *the entire tree* in a single forward pass via **tree attention** — a special attention mask that lets each branch "see" only its ancestors. The deepest accepted branch becomes that cycle's output.

```mermaid
flowchart TB
    R(("root token")) --> A1["the"]
    R --> A2["a"]
    A1 --> B1["cat"]
    A1 --> B2["dog"]
    A2 --> B3["big"]
    B1 --> C1["sat"]
    B2 --> C2["ran"]

    style R fill:#e94560,stroke:#fff,color:#fff
    style A1 fill:#2c3e50,stroke:#fff,color:#fff
    style A2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B1 fill:#2c3e50,stroke:#fff,color:#fff
    style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C1 fill:#2c3e50,stroke:#fff,color:#fff
    style C2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
```

Tree-based guessing: instead of one chain, the draft proposes several branches; the target verifies the whole tree in one pass and picks the deepest correct branch (highlighted).

The benefit of the tree is a **higher expected number of accepted tokens per cycle**: if the first straight branch goes wrong early, another branch in the tree may still be correct for longer. The price is a bit of extra compute for branches that get discarded — but since we're in the memory-bound regime, that extra compute is mostly "free."
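
The "each branch sees only its ancestors" rule is just a boolean mask over the flattened tree. Here is a minimal sketch for the example tree in the diagram, assuming nodes are stored in an order where parents precede children; the function name and the `parents` encoding are hypothetical. In a real engine every node would additionally attend to the full preceding context.

```python
def tree_attention_mask(parents):
    """parents[i] is the index of node i's parent, or -1 for the root.
    Parents must appear before their children.
    mask[i][j] is True when node i may attend to node j (itself or an ancestor)."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        mask[i][i] = True
        if parents[i] >= 0:
            for j in range(n):
                if mask[parents[i]][j]:
                    mask[i][j] = True   # inherit everything the parent can see
    return mask

labels  = ["root", "the", "a", "cat", "dog", "big", "sat", "ran"]
parents = [-1, 0, 0, 1, 1, 2, 3, 4]
mask = tree_attention_mask(parents)
for label, row in zip(labels, mask):
    visible = [labels[j] for j, ok in enumerate(row) if ok]
    print(f"{label:>4} attends to {visible}")
```

With this mask, all eight nodes are scored in one forward pass even though they belong to three different candidate continuations.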

## When it speeds up, and when it backfires

Speculative Decoding is not a magic always-win button. Four factors decide how much it helps, or whether it slows things down.

| Factor | Good for Spec Decoding | Adverse / needs care |
| --- | --- | --- |
| Temperature | Low / greedy → draft guesses easier, high acceptance | High → more random output, acceptance drops |
| Batch size | Small, interactive → memory-bound, big win | Large → already compute-bound, extra verify FLOPs can erode throughput |
| Draft quality | Draft resembles target → high γ (average accepted tokens per cycle) | Draft too expensive (large helper) → benefit eaten back |
| Speculation length K | Moderate K, matched to workload predictability | K too large → wasteful when a wrong token appears early |
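
The interaction between draft quality, draft cost, and K in the table can be estimated with a simple model. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, that verifying K+1 positions costs the same as one ordinary target pass (the memory-bound regime), and that one draft step costs `draft_cost` target passes. All three are simplifying assumptions, and the function names are hypothetical.

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected tokens emitted per target pass when each drafted token is accepted
    independently with probability alpha (includes the correction/bonus token)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost):
    """draft_cost: assumed cost of one draft step relative to one target pass."""
    return expected_tokens_per_cycle(alpha, k) / (k * draft_cost + 1)

for alpha in (0.6, 0.8, 0.9):
    best_k = max(range(1, 17), key=lambda k: estimated_speedup(alpha, k, draft_cost=0.05))
    print(f"alpha={alpha}: best K={best_k}, "
          f"speedup ~{estimated_speedup(alpha, best_k, 0.05):.2f}x")

# An expensive draft with mediocre acceptance can make things slower:
print(f"{estimated_speedup(0.5, 8, draft_cost=0.5):.2f}x")
```

The model ignores batch effects entirely, so it is only a starting point for choosing K before measuring on real traffic.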

#### The common trap

In **high-throughput, large-batch** serving, the GPU has shifted to compute-bound. There, rejected draft tokens become wasted FLOPs, and Speculative Decoding can *reduce* total throughput even while improving per-token latency for some requests. Measure on **your own workload**: optimizing for interactive latency (TPOT) is not the same as optimizing for batch throughput. Also note: this technique speeds up the *time between tokens* (TPOT), not the *time to first token* (TTFT), which is determined by the prefill step.
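
To keep the two metrics apart when measuring, it helps to compute them from the same token timestamps. A minimal sketch, assuming you can record the wall-clock arrival time of each streamed token; the function name is hypothetical. Because speculative decoding delivers tokens in bursts, the mean over the whole response is more meaningful than any single gap.

```python
def latency_metrics(request_start, token_times):
    """token_times: wall-clock arrival time of each output token, in seconds.
    Returns (ttft, tpot): time to first token, and mean time per output token
    after the first."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

# Same prefill time, but the second trace delivers tokens in accepted bursts.
plain = [0.50, 0.54, 0.58, 0.62, 0.66]
bursty = [0.50, 0.50, 0.50, 0.55, 0.55]
for name, trace in (("plain", plain), ("speculative", bursty)):
    ttft, tpot = latency_metrics(0.0, trace)
    print(f"{name}: TTFT {ttft * 1e3:.0f} ms, TPOT {tpot * 1e3:.1f} ms")
```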

## From idea to production standard

2018 — Blockwise Parallel Decoding

**The seed.** The idea of predicting several future tokens at once and re-checking them appears, but without a rigorous mechanism to preserve the distribution.

2023 — Speculative Sampling

**The turning point.** Leviathan (Google) and Chen (DeepMind) independently prove a rejection-sampling mechanism that keeps the target distribution *intact* — 2–3× speedup with no quality loss. Speculative Decoding becomes officially "lossless."

2024 — Self-speculation explodes

**Dropping the helper model.** Medusa (extra heads), EAGLE (feature regression + tree), and Lookahead Decoding (Jacobi iteration, no draft model) arrive almost simultaneously, moving speculation inside the target model itself.

2025 — EAGLE-2/3 & framework integration

**SOTA + ready to use.** EAGLE-3 pushes speedups to 2–6×. vLLM, TensorRT-LLM, and SGLang make Speculative Decoding a native feature; NVIDIA demonstrates ~3.6× throughput on H200.

2026 — The default standard

**From research to infrastructure.** Parallel variants (such as P-EAGLE) remove the last sequential bottleneck in the drafting step, alongside dedicated frameworks for training drafts. Speculative Decoding becomes the default configuration for serving LLMs at scale.

## Turning it on in production: real config

Good news for engineers: you almost never need to implement the algorithm yourself. Popular serving engines support it out of the box. On vLLM (2026), Speculative Decoding is declared via the `speculative_config` field, supporting many methods: `ngram`, `eagle`, `eagle3`, `medusa`, `draft_model`, and MTP variants.

```python
# vLLM: two alternative configurations. Use one per process,
# since each LLM(...) call loads a full model onto the GPU.
from vllm import LLM

# Option A: n-gram (zero-cost draft), great for RAG / code edits / summarization
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,   # K: draft tokens proposed per verify step
        "prompt_lookup_max": 4,        # max n-gram window matched against the context
    },
)

# Option B: EAGLE-3 (highest speedup, needs a trained draft layer)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",  # draft head trained for this target
        "num_speculative_tokens": 6,
    },
)
```
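To make the `ngram` method less of a black box, here is a minimal pure-Python sketch of the prompt-lookup idea: take the last few tokens of the context, find where the same n-gram occurred earlier, and propose whatever followed it as the draft. The function name `propose_ngram_draft` and its parameters are hypothetical and only loosely mirror `prompt_lookup_max` and `num_speculative_tokens`; this is an illustration of the technique, not vLLM's implementation.

```python
from typing import List, Sequence


def propose_ngram_draft(
    tokens: Sequence[int],
    max_ngram: int = 4,
    num_speculative_tokens: int = 5,
) -> List[int]:
    """Propose draft tokens by matching the context's trailing n-gram earlier in the context.

    Hypothetical illustration of prompt lookup, not vLLM's code.
    """
    # Try the longest n-gram first: a longer match is a stronger hint.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = list(tokens[-n:])
        # Scan right to left so the most recent earlier match wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                # Copy the tokens that followed the earlier occurrence.
                end = start + n + num_speculative_tokens
                return list(tokens[start + n:end])
    return []  # No match: fall back to ordinary one-token decoding.


# The context ends with (10, 11, 12), which also appears at the start,
# so the draft copies the tokens that followed it there.
context = [10, 11, 12, 13, 14, 99, 10, 11, 12]
print(propose_ngram_draft(context, max_ngram=4, num_speculative_tokens=2))
```

The draft costs no model call at all, which is why this method pays off when the output tends to copy spans of the input (RAG, code edits, summarization) and does little when it doesn't.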

#### Quick method-picking tips

- **No training, input-repetitive workload** (RAG, code edit) → start with `ngram` / prompt-lookup. Instant benefit, zero risk.
- **Maximum speedup, can afford one training run** → EAGLE-3, currently the best acceptance rate.
- **Already have a small–large pair with the same tokenizer** → the two-model draft is the simplest, most predictable choice.
- **Always benchmark on real traffic** with the temperature and batch distribution of production before fixing K and the method.

## Operational impact: why this is a cost-optimization problem

From an operations lens, Speculative Decoding is one of the rare levers that improves **both latency and cost** without sacrificing quality. On the same GPU, you emit more tokens per second for each interactive request → smoother streaming, less waiting. For the same target experience, you need fewer GPU-hours to serve the same load → a lower inference bill.

This is the often-overlooked piece in the "AI for operational efficiency" story: most of an LLM product's cost lives in *serving-time inference*, not training. A single serving-layer config change — enabling Speculative Decoding and tuning K to the workload — can deliver savings that the "switch to a smaller model" route would only buy by paying in quality. Here you trade nothing: the output distribution is preserved exactly.
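As a rough illustration of the cost argument, the arithmetic below converts a measured throughput gain into cost per million output tokens. Every number is a hypothetical placeholder (GPU price, baseline throughput, speedup), and the sketch assumes the speedup you measured holds for aggregate throughput at your batch size, which is not guaranteed.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD to generate one million output tokens on one GPU at a steady rate."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs: replace them with your own measurements.
GPU_HOURLY_USD = 4.0
BASELINE_TOKENS_PER_SECOND = 1000.0
MEASURED_SPEEDUP = 2.0  # e.g. from an A/B benchmark on real traffic

baseline_cost = cost_per_million_tokens(GPU_HOURLY_USD, BASELINE_TOKENS_PER_SECOND)
speculative_cost = cost_per_million_tokens(
    GPU_HOURLY_USD, BASELINE_TOKENS_PER_SECOND * MEASURED_SPEEDUP
)
savings_fraction = 1 - speculative_cost / baseline_cost

print(f"baseline:    ${baseline_cost:.2f} per 1M tokens")
print(f"speculative: ${speculative_cost:.2f} per 1M tokens")
print(f"savings:     {savings_fraction:.0%}")
```

In general a throughput speedup of `s` cuts the per-token cost by `1 - 1/s`, so the step from 1× to 2× is worth more than the step from 2× to 3×.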

#### Do

- **Trial-enable** Speculative Decoding by default on every interactive endpoint, and measure before concluding.
- Start with `ngram` if the workload has heavy input repetition — free benefit, no training.
- Tune `K` to how "guessable" the workload is; track **average accepted tokens** as a health metric.
- Separate your goals: optimize for **latency** (small batch) or **throughput** (large batch) — the optimal config differs.
- Keep greedy/low temperature on accuracy-critical endpoints: the output is more deterministic, and the draft's guesses are accepted more often.
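To see why `K` has to follow the workload, the sketch below uses the simplified model from Leviathan et al.: if each draft token is accepted independently with probability `alpha`, one cycle with `K` draft tokens emits `(1 - alpha^(K+1)) / (1 - alpha)` tokens on average, and each cycle costs roughly `K * c + 1` target-model passes, where `c` is the cost of one draft step relative to one target step. The function names, `alpha`, and `c` are assumptions for illustration; real acceptance is not independent per token, so treat the result as a starting point for measurement rather than a prediction.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Average tokens emitted per draft-then-verify cycle (i.i.d. acceptance model)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per cycle divided by cycle cost, in units of one target forward pass."""
    return expected_tokens_per_cycle(alpha, k) / (k * draft_cost_ratio + 1)


def best_k(alpha: float, draft_cost_ratio: float, max_k: int = 16) -> int:
    """K in 1..max_k that maximizes the estimated speedup."""
    return max(
        range(1, max_k + 1),
        key=lambda k: estimated_speedup(alpha, k, draft_cost_ratio),
    )


# Hypothetical acceptance rates: a hard-to-guess and an easy-to-guess workload.
for alpha in (0.5, 0.9):
    k = best_k(alpha, draft_cost_ratio=0.05)
    print(alpha, k, round(estimated_speedup(alpha, k, 0.05), 2))
```

Under this model a workload with a low acceptance rate wants a small `K`, while a highly predictable one keeps benefiting from longer drafts, which is the reason a single hard-coded `K` rarely fits every endpoint.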

#### Don't

- Blindly enable it in **very large batch** mode and be surprised when throughput drops — the GPU is already compute-bound.
- Pick a draft model that's too large: drafting cost eats the entire speedup.
- Expect **TTFT** improvements: this technique shortens the time between output tokens during decoding, not the prefill step that determines time to first token.
- Assume there must be a quality trade-off and "hesitate to enable it" — with proper rejection sampling, the output is *lossless*.
- Hard-code one `K` for every workload without re-measuring.
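The lossless claim can be checked numerically for a single draft-then-verify step. The sketch below computes, exactly and without sampling, the distribution of the emitted token when the draft proposes from `q`, the target accepts with probability `min(1, p/q)`, and a rejection is resampled from the normalized residual `max(0, p - q)`. The function name and the toy distributions are hypothetical; it covers one token position, not a full tree or multi-token cycle.

```python
from typing import List


def speculative_step_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one draft-then-verify step.

    p: target-model probabilities, q: draft-model probabilities (same vocabulary).
    """
    # P(draft proposes x and it is accepted) = q[x] * min(1, p[x] / q[x]) = min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    residual_total = sum(residual)
    if residual_total == 0.0:
        return accepted  # p == q: the draft is never rejected

    return [
        a + reject_prob * r / residual_total
        for a, r in zip(accepted, residual)
    ]


# Toy 3-token vocabulary with a deliberately poor draft.
target = [0.6, 0.3, 0.1]
draft = [0.2, 0.5, 0.3]
result = speculative_step_distribution(target, draft)
print(result)  # equals `target` up to floating-point rounding
```

A poor draft only raises the rejection probability, which costs speed; the emitted distribution still equals the target's.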

## Conclusion

Speculative Decoding is one of the most elegant ideas in modern AI infrastructure: it doesn't make the model smarter and doesn't change a single parameter. It merely **reorders the computation** so that each expensive pass over the weights yields several tokens instead of one, putting to work the compute that one-token-at-a-time decoding leaves idle. Guessing is cheap and sequential; verifying is parallel and nearly free in the memory-bound regime. Thanks to a rejection-sampling trick, the output follows exactly the same distribution as the original model's, and under greedy decoding it matches token for token, up to floating-point effects. From the classic two-model setup to EAGLE-3 and token trees in 2026, the direction of travel is unchanged: make the draft ever more "in phase" with the target so more tokens are accepted per cycle. For anyone building LLM products, this is a rare lever that cuts both latency and cost without sacrificing quality: a config worth enabling by default and worth measuring for your exact workload.

---

### References

- [Leviathan, Kalman, Matias (Google). Fast Inference from Transformers via Speculative Decoding (ICML 2023)](https://arxiv.org/abs/2211.17192)
- [Chen et al. (DeepMind). Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318)
- [Cai et al. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads](https://arxiv.org/abs/2401.10774)
- [Li et al. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test](https://arxiv.org/abs/2503.01840)
- [vLLM. Speculative Decoding documentation (ngram / EAGLE / Medusa)](https://docs.vllm.ai/en/latest/features/speculative_decoding/)
- [AWS ML Blog. P-EAGLE: Parallel Speculative Decoding in vLLM](https://aws.amazon.com/blogs/machine-learning/p-eagle-faster-llm-inference-with-parallel-speculative-decoding-in-vllm/)
- [Hugging Face. Speculative Decoding in Practice: How EAGLE-3 Makes LLMs Faster](https://huggingface.co/blog/lujangusface/tw-eagle3-gpu)
