vllm - 💡(How to fix) Fix [Feature]: Speculative Prefill — Draft-Assisted Sparse Prefill for TTFT Reduction [3 comments, 3 participants]

vllm, 2026-04-06 06:11:41


vllm-project/vllm#39060•Fetched 2026-04-08 02:52:43


Add speculative prefill (aka sparse prefill) support to vLLM: use a lightweight draft model to score prompt token importance via attention patterns, then prefill only the top-k% most important tokens into the target model with position-preserving RoPE. This directly reduces TTFT, which is the primary latency bottleneck for long-context serving.


🚀 The feature, motivation and pitch

Summary

Motivation

TTFT scales linearly (or worse, due to O(n²) attention) with prompt length and is the dominant latency bottleneck for long-context workloads — coding assistants, document analysis, RAG with large retrieval contexts, and agentic tool-use pipelines. A 64K-token prompt on a 122B MoE model can take 7+ minutes to first token on consumer hardware, and even on datacenter GPUs the prefill phase dominates end-to-end latency at high concurrency.

vLLM already has complementary prefill optimizations — chunked prefill and disaggregated prefill — but both process all prompt tokens. Speculative prefill is orthogonal: it reduces the number of tokens the target model must prefill, and composes with both chunked and disaggregated prefill.

Prior Art

Academic + reference implementation:

SpecPrefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation (ICML 2025, from LMSYS team). Demonstrated up to 7.66× TTFT improvement and 7× maximal end-to-end QPS improvement on Llama-3.1-405B-Instruct-FP8 with <5% accuracy loss. The authors provide a reference implementation built as a monkey patch on top of vLLM — it works with vLLM v0 codebase via enable_prefill_spec() applied before model loading. This demonstrates the technique is already validated on vLLM's architecture and could serve as a starting point for native integration.
Cross-Family Speculative Prefill (March 2026). Extends the technique across model families (Qwen, LLaMA, DeepSeek), retaining 90-100% of full-prompt baseline performance without requiring a same-family draft model.

Production implementations:

vllm-mlx PR #180 — merged March 21, 2026. First (and currently only) production implementation of speculative prefill in any inference runtime. Implemented by @Thump604 with accompanying paper. Results on Apple Silicon M2 Ultra with Qwen3.5-122B (2B draft, 20% keep):

Prompt Length	Baseline TTFT	SpecPrefill TTFT	Speedup
8K	45.0s	12.1s	3.71×
16K	92.3s	22.5s	4.11×
32K	186.3s	44.1s	4.23×
64K	417.6s	92.8s	4.50×
128K	19.3 min	3.5 min	5.45×

Cross-architecture validation on Nemotron-H 120B (2.10–2.19×) and GPT-OSS 120B (1.24–1.28×) confirms the generality of the approach. Quality validation showed zero regressions across 16 adversarial tests (needle-in-haystack, structured extraction, mixed-language, etc.).

Algorithm

Draft scoring: Small draft model (e.g. 2B–4B) prefills the full prompt at high throughput, capturing post-RoPE query vectors from attention layers during a few lookahead decode steps.
Importance scoring: softmax(Q_lookahead @ K_prompt^T / sqrt(d)) → average pool over chunks → max over layers/heads → mean over lookahead steps.
Chunk selection: Group prompt into fixed-size chunks (e.g. 32 tokens), keep top-k% by average importance score.
Sparse prefill: Target model prefills only the selected tokens. Original position IDs are preserved via manual RoPE patching so relative positional encoding remains correct during subsequent decode.
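
The scoring and chunk-selection steps above can be sketched in NumPy. This is an illustrative sketch, not the reference implementation: the function names, tensor layouts, and the choice to round the number of kept chunks upward are assumptions made here, and a real integration would read the query and key tensors from the draft model's attention layers.

```python
import numpy as np

def chunk_importance(q_lookahead, k_prompt, chunk_size=32):
    """Score prompt chunks from draft-model attention (hypothetical helper).

    q_lookahead: (steps, layers, heads, d) post-RoPE queries from lookahead steps
    k_prompt:    (layers, heads, n, d) post-RoPE keys of the n prompt tokens
    Returns one score per chunk, shape (ceil(n / chunk_size),).
    """
    steps, layers, heads, d = q_lookahead.shape
    n = k_prompt.shape[2]

    # softmax(Q_lookahead @ K_prompt^T / sqrt(d)) over the prompt axis
    logits = np.einsum("slhd,lhtd->slht", q_lookahead, k_prompt) / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Average pool over fixed-size chunks; NaN padding keeps a ragged
    # final chunk from being diluted by the padding.
    n_chunks = -(-n // chunk_size)
    pad = n_chunks * chunk_size - n
    padded = np.pad(probs, [(0, 0)] * 3 + [(0, pad)], constant_values=np.nan)
    pooled = np.nanmean(
        padded.reshape(steps, layers, heads, n_chunks, chunk_size), axis=-1
    )

    # Max over layers and heads, then mean over lookahead steps
    return pooled.max(axis=(1, 2)).mean(axis=0)

def select_token_positions(scores, n, chunk_size=32, keep_pct=0.2):
    """Return the original position IDs of tokens in the top-scoring chunks."""
    n_keep = max(1, int(np.ceil(keep_pct * len(scores))))  # rounding up is an assumption
    top_chunks = np.sort(np.argsort(scores)[-n_keep:])     # keep prompt order
    return np.concatenate(
        [np.arange(c * chunk_size, min((c + 1) * chunk_size, n)) for c in top_chunks]
    )
```

The returned positions are the original token indices, which is what the sparse prefill step needs in order to apply RoPE at the original positions rather than at compacted ones.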

The key insight is that RoPE is relative — Q_m @ K_p^T depends only on (m - p). Selected keys stored in the cache with their original RoPE angles produce correct attention during generation.
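
A small numerical check can make the relative-position property concrete. The sketch below assumes the interleaved-pair RoPE layout and a base of 10000; real models may use a different layout or base, but the property being checked is the same. The helper names are illustrative.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of a 1-D float vector x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def attention_logit(q, k, m, p):
    """Unscaled attention logit between a query at position m and a key at position p."""
    return rope(q, m) @ rope(k, p)

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same offset (m - p = 963) at two different absolute positions
same_offset_a = attention_logit(q, k, 1000, 37)
same_offset_b = attention_logit(q, k, 1500, 537)
print(np.isclose(same_offset_a, same_offset_b))
```

Because the logit depends only on the offset, a key that was rotated with its original position keeps producing the right logit even when it sits in a compacted cache slot, as long as decode queries are also rotated with their true positions.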

Why this is especially impactful for vLLM

MoE models benefit most. The draft-to-target FLOP ratio is the dominant predictor of speedup. MoE models have high total params but low active params — a 2B draft against a 122B MoE (10B active) gives a ~50:1 FLOP ratio on prefill, making draft scoring overhead negligible. This aligns with the industry trend toward MoE architectures (Mixtral, DBRX, DeepSeek-V3, Qwen3.5, Gemma 4 MoE).
Composes with existing optimizations. Speculative prefill reduces token count before chunked prefill or disaggregated prefill processes them. In PD disaggregation, prefill nodes would see proportionally less compute per request.
Shares infrastructure with speculative decoding. vLLM already loads and manages draft models for speculative decoding. Speculative prefill can reuse the same draft model — score tokens during prefill, then use the same model for speculative decoding during generation. This amortizes the memory cost of the draft model across both phases.
GPU-specific advantages. On GPU, the draft model can run on a subset of TP ranks or on a dedicated small GPU in the node, fully overlapping draft scoring with target model idle time between batches. The high compute density of datacenter GPUs (A100/H100/B200) could yield even larger speedups than the Apple Silicon results, since the FLOP savings translate more directly when compute is the bottleneck.
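
To see why the draft-to-target FLOP ratio matters, a rough back-of-the-envelope model helps. The sketch below is a simplification introduced here, not a formula from the issue or the paper: it treats prefill cost as linear in token count, charges the draft for the full prompt and the target for only the kept fraction, and ignores quadratic attention cost, lookahead decode steps, memory bandwidth, and any overlap between draft and target.

```python
def estimated_speedup(flop_ratio, keep_pct):
    """Estimate TTFT speedup under a linear cost model (illustrative only).

    flop_ratio: target prefill FLOPs per token divided by draft prefill FLOPs per token
    keep_pct:   fraction of prompt tokens the target model still prefills
    """
    if flop_ratio <= 0:
        raise ValueError("flop_ratio must be positive")
    if not 0.0 < keep_pct <= 1.0:
        raise ValueError("keep_pct must be in (0, 1]")
    draft_cost = 1.0 / flop_ratio  # draft processes the whole prompt
    target_cost = keep_pct         # target processes only the kept tokens
    return 1.0 / (draft_cost + target_cost)

for ratio in (5, 50):
    print(ratio, round(estimated_speedup(ratio, 0.2), 2))
```

Under this model the speedup can never exceed 1 / keep_pct, so a measured speedup above that bound at long prompt lengths would point to effects the linear model leaves out, such as attention cost growing faster than linearly with prompt length.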

Proposed API

Server CLI

vllm serve <model> \
    --speculative-prefill \
    --speculative-prefill-draft-model <draft-model> \
    --speculative-prefill-threshold 4096 \
    --speculative-prefill-keep-pct 0.2

--speculative-prefill-threshold: minimum prompt token count to activate (short prompts don't benefit enough to justify draft overhead)
--speculative-prefill-keep-pct: fraction of prompt tokens to retain (0.2 = keep 20%, drop 80%)
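
How the two options might interact can be sketched as a small planning function. Everything here is hypothetical: the function name, the 32-token chunk size taken from the example in the algorithm description, and the decision to round the kept chunk count upward are assumptions, not behaviour defined by the proposal.

```python
import math

def plan_sparse_prefill(prompt_len, threshold=4096, keep_pct=0.2, chunk_size=32):
    """Return how many chunks to keep, or None when the full prompt should be prefilled."""
    if prompt_len < threshold:
        # Short prompts skip draft scoring entirely
        return None
    n_chunks = math.ceil(prompt_len / chunk_size)
    return max(1, math.ceil(keep_pct * n_chunks))

print(plan_sparse_prefill(1000))  # below threshold, full prefill
print(plan_sparse_prefill(8192))  # 256 chunks, keep 20% of them
```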

Per-request override (OpenAI-compatible)

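from openai import OpenAI

# base_url and api_key are placeholders for a local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The extra_body fields below are the per-request options proposed in this issue.
client.chat.completions.create(
    model="my-model",
    messages=[...],
    extra_body={
        "speculative_prefill": True,
        "speculative_prefill_keep_pct": 0.3,
    }
)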
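
On the server side, a per-request override implies merging request fields with the server defaults. The following is a minimal sketch of one way that could work. The `SpecPrefillSettings` class and `resolve_settings` function are hypothetical names, and the validation rule is an assumption; only the two `extra_body` keys come from the proposal.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SpecPrefillSettings:
    """Server-level defaults, mirroring the proposed CLI options (hypothetical)."""
    enabled: bool = False
    keep_pct: float = 0.2
    threshold: int = 4096

def resolve_settings(defaults, extra_body):
    """Apply per-request overrides from extra_body on top of the server defaults."""
    enabled = bool(extra_body.get("speculative_prefill", defaults.enabled))
    keep_pct = float(extra_body.get("speculative_prefill_keep_pct", defaults.keep_pct))
    if not 0.0 < keep_pct <= 1.0:
        raise ValueError("speculative_prefill_keep_pct must be in (0, 1]")
    # threshold stays a server-level setting in this sketch
    return replace(defaults, enabled=enabled, keep_pct=keep_pct)

defaults = SpecPrefillSettings(enabled=True, keep_pct=0.2, threshold=4096)
per_request = resolve_settings(
    defaults,
    {"speculative_prefill": True, "speculative_prefill_keep_pct": 0.3},
)
print(per_request)
```

Keeping the settings object immutable means one request's override cannot leak into the defaults used by the next request.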

Reuse with speculative decoding

When both --speculative-model and --speculative-prefill are configured, the same draft model could serve both purposes — scoring token importance during prefill and proposing tokens during decode.

Alternatives

Alternative	Limitation
Chunked prefill only	Reduces scheduling interference but still processes all tokens. Does not reduce total prefill FLOPs.
Disaggregated prefill only	Isolates prefill to dedicated nodes but still processes all tokens per request.
Prompt compression / summarization	Requires training or fine-tuning. Not plug-and-play. Changes the semantic content of the prompt.
Longer chunked prefill steps	Reduces overhead per chunk but does not reduce total compute.
KV cache reuse (prefix caching)	Only helps when prompts share prefixes. Does not help with novel long prompts.

Speculative prefill is orthogonal to all of the above and composes with chunked prefill, disaggregated prefill, and prefix caching.

Additional context

Quality considerations

The original SpecPrefill paper (ICML 2025) reports <5% accuracy loss on Llama-3.1-405B with 20% keep rate. The vllm-mlx implementation's adversarial testing showed 0/16 specprefill-specific regressions (14 BOTH_PASS, 2 BOTH_FAIL where baseline also failed). However, quality impact is task-dependent and should be configurable via keep_pct. Safety-critical or factual-extraction workloads may want higher keep rates (30-50%), while summarization and conversational workloads can tolerate 20% or lower.

Related issues and RFCs

#19038 — [RFC] Prefill-only optimizations for PD disaggregation (complementary — focuses on KV offloading and scheduling, not token reduction)
#5016 — Combine chunked prefill with speculative decoding (composability precedent)
Speculative decoding infrastructure (--speculative-model, draft model loading) — reusable for speculative prefill

References

Liu et al., "Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation," ICML 2025. https://arxiv.org/abs/2502.02789
"Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models," March 2026. https://arxiv.org/html/2603.02631v3
Green, "SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for LLMs on Apple Silicon," 2026. https://doi.org/10.5281/zenodo.19120919
vllm-mlx implementation: https://github.com/waybarrios/vllm-mlx/pull/180


extent analysis

TL;DR

To implement speculative prefill in vLLM, integrate a lightweight draft model to score token importance and prefill only the top-k% most important tokens, reducing TTFT latency.

Guidance

Review the SpecPrefill paper and reference implementation: Understand the algorithm and its components, including draft scoring, importance scoring, chunk selection, and sparse prefill.
Assess the applicability of speculative prefill to your use case: Consider the benefits of reducing TTFT latency, especially for MoE models, and evaluate the potential impact on your specific workload.
Prototype the proposed API: The --speculative-prefill flag and its related options are proposals in this issue, not existing vLLM options. A native integration would add them to the server CLI and accept per-request overrides, allowing flexible configuration of speculative prefill.
Evaluate quality considerations: Investigate the potential accuracy loss associated with speculative prefill and adjust the keep_pct parameter to balance speedup and quality for your specific task.

Example

client.chat.completions.create(
    model="my-model",
    messages=[...],
    extra_body={
        "speculative_prefill": True,
        "speculative_prefill_keep_pct": 0.2,
    }
)

Notes

The implementation of speculative prefill may require additional infrastructure and resources, such as a dedicated small GPU for draft model scoring. The quality impact of speculative prefill is task-dependent and should be carefully evaluated.

Recommendation

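Speculative prefill is a proposed feature, not an available workaround. The results cited in the issue (up to 7.66× TTFT improvement with <5% accuracy loss on Llama-3.1-405B-Instruct-FP8, as reported for the SpecPrefill reference implementation) make it worth evaluating on your own workload while native support is discussed.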
