vllm - 💡(How to fix) Fix [Feature]: Speculative Prefill — Draft-Assisted Sparse Prefill for TTFT Reduction [3 comments, 3 participants]

vllm-project/vllm#39060, fetched 2026-04-08 02:52:43

Add speculative prefill (aka sparse prefill) support to vLLM: use a lightweight draft model to score prompt token importance via attention patterns, then prefill only the top-k% most important tokens into the target model with position-preserving RoPE. This directly reduces TTFT, which is the primary latency bottleneck for long-context serving.


🚀 The feature, motivation and pitch

Summary

Add speculative prefill (aka sparse prefill) support to vLLM: use a lightweight draft model to score prompt token importance via attention patterns, then prefill only the top-k% most important tokens into the target model with position-preserving RoPE. This directly reduces TTFT, which is the primary latency bottleneck for long-context serving.

Motivation

TTFT scales at least linearly with prompt length, and worse once O(n²) attention dominates. It is the dominant latency bottleneck for long-context workloads such as coding assistants, document analysis, RAG with large retrieval contexts, and agentic tool-use pipelines. A 64K-token prompt on a 122B MoE model can take about 7 minutes to first token on consumer hardware, and even on datacenter GPUs the prefill phase dominates end-to-end latency at high concurrency.

vLLM already has complementary prefill optimizations — chunked prefill and disaggregated prefill — but both process all prompt tokens. Speculative prefill is orthogonal: it reduces the number of tokens the target model must prefill, and composes with both chunked and disaggregated prefill.

Prior Art

Academic + reference implementation:

  • SpecPrefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation (ICML 2025, from the LMSYS team). Demonstrated up to 7.66× TTFT improvement and 7× maximal end-to-end QPS improvement on Llama-3.1-405B-Instruct-FP8 with <5% accuracy loss. The authors provide a reference implementation built as a monkey patch on top of vLLM; it works with the vLLM v0 codebase via enable_prefill_spec(), applied before model loading. This shows the technique has already been validated on vLLM's architecture and could serve as a starting point for native integration.
  • Cross-Family Speculative Prefill (March 2026). Extends the technique across model families (Qwen, LLaMA, DeepSeek), retaining 90-100% of full-prompt baseline performance without requiring a same-family draft model.

Production implementations:

  • vllm-mlx PR #180 — merged March 21, 2026. First (and currently only) production implementation of speculative prefill in any inference runtime. Implemented by @Thump604 with accompanying paper. Results on Apple Silicon M2 Ultra with Qwen3.5-122B (2B draft, 20% keep):

    Prompt Length | Baseline TTFT | SpecPrefill TTFT | Speedup
    8K            | 45.0s         | 12.1s            | 3.71×
    16K           | 92.3s         | 22.5s            | 4.11×
    32K           | 186.3s        | 44.1s            | 4.23×
    64K           | 417.6s        | 92.8s            | 4.50×
    128K          | 19.3 min      | 3.5 min          | 5.45×

    Cross-architecture validation on Nemotron-H 120B (2.10–2.19×) and GPT-OSS 120B (1.24–1.28×) confirms the generality of the approach. Quality validation showed zero regressions across 16 adversarial tests (needle-in-haystack, structured extraction, mixed-language, etc.).

Algorithm

  1. Draft scoring: A small draft model (e.g. 2B–4B) prefills the full prompt at high throughput, then runs a few lookahead decode steps, capturing the post-RoPE query vectors from its attention layers during those steps.
  2. Importance scoring: softmax(Q_lookahead @ K_prompt^T / sqrt(d)) → average pool over chunks → max over layers/heads → mean over lookahead steps.
  3. Chunk selection: Group prompt into fixed-size chunks (e.g. 32 tokens), keep top-k% by average importance score.
  4. Sparse prefill: Target model prefills only the selected tokens. Original position IDs are preserved via manual RoPE patching so relative positional encoding remains correct during subsequent decode.
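
The scoring and selection steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SpecPrefill or vllm-mlx implementation: the function names (chunk_scores, select_positions), the nested-list tensor layout, and the choice to flatten layers and heads into one "heads" axis are all hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def chunk_scores(q_lookahead, k_prompt, chunk_size):
    """q_lookahead[step][head] is a query vector; k_prompt[head][token] is a key vector.
    Order follows the issue: softmax, average pool over chunks,
    max over heads, mean over lookahead steps."""
    n_tokens = len(k_prompt[0])
    n_chunks = math.ceil(n_tokens / chunk_size)
    totals = [0.0] * n_chunks
    for q_heads in q_lookahead:                      # lookahead steps
        best = [0.0] * n_chunks
        for q, keys in zip(q_heads, k_prompt):       # heads (layers flattened in)
            scale = math.sqrt(len(q))
            logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
            probs = softmax(logits)
            for c in range(n_chunks):
                chunk = probs[c * chunk_size:(c + 1) * chunk_size]
                best[c] = max(best[c], sum(chunk) / len(chunk))
        totals = [t + b for t, b in zip(totals, best)]
    return [t / len(q_lookahead) for t in totals]

def select_positions(scores, n_tokens, chunk_size, keep_pct):
    """Keep the top keep_pct of chunks; return ORIGINAL token positions, ascending."""
    n_keep = max(1, math.ceil(len(scores) * keep_pct))
    top = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:n_keep]
    positions = []
    for c in sorted(top):
        positions.extend(range(c * chunk_size, min((c + 1) * chunk_size, n_tokens)))
    return positions

# Toy input: 8 prompt tokens, 1 head, 1 lookahead step; tokens 4 and 5 match the query.
keys = [[0.0, 0.0] for _ in range(8)]
keys[4] = [5.0, 0.0]
keys[5] = [5.0, 0.0]
q_lookahead = [[[1.0, 0.0]]]
k_prompt = [keys]

scores = chunk_scores(q_lookahead, k_prompt, chunk_size=2)
kept = select_positions(scores, n_tokens=8, chunk_size=2, keep_pct=0.25)
print(kept)  # [4, 5]
```

The returned list holds original position IDs rather than compacted indices, which is what step 4 needs when the target model prefills only the selected tokens.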

The key insight is that RoPE is relative: for fixed query and key content, Q_m @ K_p^T depends on the positions only through the offset (m - p). Selected keys stored in the cache with their original RoPE angles therefore produce correct attention scores during generation.
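
The relative-position property can be checked numerically. The sketch below assumes the standard RoPE formulation (consecutive pairs rotated by pos * base^(-i/d)); the helper names rope and score are hypothetical and no vLLM code is involved.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by pos * theta_i (standard RoPE layout)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, k, m, p):
    """Dot product of a query rotated to position m and a key rotated to position p."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, p)))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

original = score(q, k, m=100, p=40)      # offset 60
shifted = score(q, k, m=1060, p=1000)    # same offset 60, different absolute positions
compacted = score(q, k, m=100, p=39)     # key moved to a compacted index, offset 61

print(abs(original - shifted) < 1e-9)    # same offset, same score
print(abs(original - compacted) > 1e-3)  # changed offset, different score
```

The last comparison illustrates why the kept keys must retain their original position IDs: re-indexing them after dropping tokens changes the offsets that decode-time queries see.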

Why this is especially impactful for vLLM

  • MoE models benefit most. The draft-to-target FLOP ratio is the dominant predictor of speedup. MoE models have high total params but low active params — a 2B draft against a 122B MoE (10B active) gives a ~50:1 FLOP ratio on prefill, making draft scoring overhead negligible. This aligns with the industry trend toward MoE architectures (Mixtral, DBRX, DeepSeek-V3, Qwen3.5, Gemma 4 MoE).
  • Composes with existing optimizations. Speculative prefill reduces token count before chunked prefill or disaggregated prefill processes them. In PD disaggregation, prefill nodes would see proportionally less compute per request.
  • Shares infrastructure with speculative decoding. vLLM already loads and manages draft models for speculative decoding. Speculative prefill can reuse the same draft model — score tokens during prefill, then use the same model for speculative decoding during generation. This amortizes the memory cost of the draft model across both phases.
  • GPU-specific advantages. On GPU, the draft model can run on a subset of TP ranks or on a dedicated small GPU in the node, fully overlapping draft scoring with target model idle time between batches. The high compute density of datacenter GPUs (A100/H100/B200) could yield even larger speedups than the Apple Silicon results, since the FLOP savings translate more directly when compute is the bottleneck.

Proposed API

Server CLI

vllm serve <model> \
    --speculative-prefill \
    --speculative-prefill-draft-model <draft-model> \
    --speculative-prefill-threshold 4096 \
    --speculative-prefill-keep-pct 0.2
  • --speculative-prefill-threshold: minimum prompt token count to activate (short prompts don't benefit enough to justify draft overhead)
  • --speculative-prefill-keep-pct: fraction of prompt tokens to retain (0.2 = keep 20%, drop 80%)
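
How the two options would interact can be sketched as a small planning function. This is a hypothetical helper, not vLLM code: plan_prefill and the chunk_size default of 32 (taken from the algorithm description) are assumptions, as is the choice to activate when the prompt length is greater than or equal to the threshold.

```python
import math

def plan_prefill(n_prompt_tokens, threshold=4096, keep_pct=0.2, chunk_size=32):
    """Return (use_sparse_prefill, tokens_the_target_model_prefills)."""
    if n_prompt_tokens < threshold:
        # Short prompt: draft overhead is not worth it, prefill everything.
        return False, n_prompt_tokens
    n_chunks = math.ceil(n_prompt_tokens / chunk_size)
    n_keep_chunks = max(1, math.ceil(n_chunks * keep_pct))
    return True, min(n_prompt_tokens, n_keep_chunks * chunk_size)

print(plan_prefill(2000))    # (False, 2000)
print(plan_prefill(65536))   # (True, 13120)
```

Because selection works on whole chunks, the number of retained tokens is rounded up to a multiple of the chunk size, so the effective keep rate can be slightly above keep_pct.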

Per-request override (OpenAI-compatible)

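from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server at this address; adjust base_url as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

client.chat.completions.create(
    model="my-model",
    messages=[...],
    # Per-request fields proposed in this issue.
    extra_body={
        "speculative_prefill": True,
        "speculative_prefill_keep_pct": 0.3,
    },
)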

Reuse with speculative decoding

When both --speculative-model and --speculative-prefill are configured, the same draft model could serve both purposes — scoring token importance during prefill and proposing tokens during decode.

Alternatives

Alternative | Limitation
Chunked prefill only | Reduces scheduling interference but still processes all tokens. Does not reduce total prefill FLOPs.
Disaggregated prefill only | Isolates prefill to dedicated nodes but still processes all tokens per request.
Prompt compression / summarization | Requires training or fine-tuning. Not plug-and-play. Changes the semantic content of the prompt.
Longer chunked prefill steps | Reduces overhead per chunk but does not reduce total compute.
KV cache reuse (prefix caching) | Only helps when prompts share prefixes. Does not help with novel long prompts.

Speculative prefill is orthogonal to all of the above and composes with chunked prefill, disaggregated prefill, and prefix caching.

Additional context

Quality considerations

The original SpecPrefill paper (ICML 2025) reports <5% accuracy loss on Llama-3.1-405B with 20% keep rate. The vllm-mlx implementation's adversarial testing showed 0/16 specprefill-specific regressions (14 BOTH_PASS, 2 BOTH_FAIL where baseline also failed). However, quality impact is task-dependent and should be configurable via keep_pct. Safety-critical or factual-extraction workloads may want higher keep rates (30-50%), while summarization and conversational workloads can tolerate 20% or lower.
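
One way a client could apply this guidance is to pick the keep rate per workload and pass it through the proposed per-request fields. The sketch below is hypothetical: the workload labels, the specific values chosen inside the 30-50% and 20% ranges, and the helper name are assumptions, and the request fields are the ones proposed in this issue.

```python
# Assumed mapping: conservative end of the 30-50% range for sensitive workloads,
# 20% for workloads the issue describes as more tolerant.
KEEP_PCT_BY_WORKLOAD = {
    "safety_critical": 0.5,
    "factual_extraction": 0.5,
    "summarization": 0.2,
    "conversation": 0.2,
}

def spec_prefill_extra_body(workload, default_keep_pct=0.3):
    """Build the extra_body dict for the proposed per-request override."""
    keep_pct = KEEP_PCT_BY_WORKLOAD.get(workload, default_keep_pct)
    return {
        "speculative_prefill": True,
        "speculative_prefill_keep_pct": keep_pct,
    }

print(spec_prefill_extra_body("summarization"))
```

The returned dict would be passed as extra_body to client.chat.completions.create, mirroring the per-request override shown earlier.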

Related issues and RFCs

  • #19038 — [RFC] Prefill-only optimizations for PD disaggregation (complementary — focuses on KV offloading and scheduling, not token reduction)
  • #5016 — Combine chunked prefill with speculative decoding (composability precedent)
  • Speculative decoding infrastructure (--speculative-model, draft model loading) — reusable for speculative prefill


extent analysis

TL;DR

To implement speculative prefill in vLLM, integrate a lightweight draft model to score token importance and prefill only the top-k% most important tokens, reducing TTFT latency.

Guidance

  1. Review the SpecPrefill paper and reference implementation: Understand the algorithm and its components, including draft scoring, importance scoring, chunk selection, and sparse prefill.
  2. Assess the applicability of speculative prefill to your use case: Consider the benefits of reducing TTFT latency, especially for MoE models, and evaluate the potential impact on your specific workload.
  3. Implement the proposed API: Integrate the --speculative-prefill flag and related options into your vLLM server CLI and per-request override, allowing for flexible configuration of speculative prefill.
  4. Evaluate quality considerations: Investigate the potential accuracy loss associated with speculative prefill and adjust the keep_pct parameter to balance speedup and quality for your specific task.

Example

client.chat.completions.create(
    model="my-model",
    messages=[...],
    extra_body={
        "speculative_prefill": True,
        "speculative_prefill_keep_pct": 0.2,
    }
)

Notes

The implementation of speculative prefill may require additional infrastructure and resources, such as a dedicated small GPU for draft model scoring. The quality impact of speculative prefill is task-dependent and should be carefully evaluated.

Recommendation

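Speculative prefill is proposed in this issue rather than available natively in vLLM, so the flags and request fields above describe the proposal, not a ready-made workaround. Until native support lands, the SpecPrefill reference implementation (a monkey patch for the vLLM v0 codebase) is the closest starting point. The reported results are up to 7.66× TTFT improvement with <5% accuracy loss on Llama-3.1-405B-Instruct-FP8.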
