vllm - [Feature]: SubSpec - Lossless Training-Free Speculative Decoding for CPU-Offloaded LLMs via Quantized Substitute Draft

vllm-project/vllm#39427 (fetched 2026-04-10 03:40:42)

🚀 The feature, motivation and pitch

vLLM's --cpu-offload-gb flag enables large models on memory-limited GPUs, but in practice the PCIe bottleneck makes it painful for interactive use — every forward pass stalls waiting for offloaded weights to come back from CPU RAM. Speculative decoding is the natural fix to amortize this cost, but --speculative-model requires a separate draft model — which either doesn't exist for custom-trained targets, or eats up the VRAM we're trying to save in the first place.

I'd like to implement SubSpec, from our lab's NeurIPS 2025 paper:

Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding
Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu
https://arxiv.org/abs/2509.18344

Instead of a separate model, SubSpec builds a draft by replacing the CPU-offloaded layers with low-bit quantized substitutes on GPU, while sharing the GPU-resident layers and KV cache with the target model:

CPU RAM:   [Offloaded layers, BF16]       ← verify path (unchanged)
GPU VRAM:  [GPU-resident layers, BF16]    ← shared by draft and target
           [Quantized substitutes, 4-bit] ← draft only, replaces offloaded layers
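
The layout above can be sketched as a toy layer-assembly routine. This is only an illustration of the idea, not vLLM code: `Layer` and `build_draft_stack` are hypothetical names, and the "layers" are plain records rather than real modules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    index: int
    device: str  # "gpu" or "cpu"
    dtype: str   # "bf16" or "int4"


def build_draft_stack(target_layers):
    """Build the draft's layer list from the target's layer list.

    GPU-resident layers are reused as the same objects (shared with the
    target); CPU-offloaded layers are replaced by 4-bit GPU substitutes.
    """
    draft = []
    for layer in target_layers:
        if layer.device == "gpu":
            draft.append(layer)  # shared, no extra memory
        else:
            draft.append(Layer(layer.index, "gpu", "int4"))  # draft-only substitute
    return draft


# Hypothetical 4-layer target with the two middle layers offloaded to CPU.
target = [
    Layer(0, "gpu", "bf16"),
    Layer(1, "cpu", "bf16"),
    Layer(2, "cpu", "bf16"),
    Layer(3, "gpu", "bf16"),
]
draft = build_draft_stack(target)
```

The draft never touches CPU memory, so its forward pass avoids PCIe transfers, while the target's verify path still uses the original BF16 weights.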

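This is lossless (the unmodified target model still verifies every drafted token) and training-free (the substitutes are obtained by quantizing the existing weights). The paper reports a 9.1× speedup for Qwen2.5-7B and a 12.5× speedup for Qwen2.5-32B.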

Pre-implementation experiments

To validate feasibility in the current vLLM/HQQ/GemLite stack, I ran two experiments on Qwen2.5-7B-Instruct on an RTX 3090 Ti.

Experiment 1 — Operator verification: CPU BF16 weights and GPU 4-bit HQQ weights coexist in the same process without conflicts. The more surprising result was GemLite's A16Wn kernel hitting ~4640 tok/s vs. 744 tok/s for plain HQQ dequant+matmul — a 6.2× gap. That margin is large enough to absorb draft overhead and still net a speedup.

original_weight: device=cpu dtype=bfloat16   ← confirmed
max_abs_diff (HQQ vs GemLite output): 0.042969
hqq_tokens_per_sec:     744.19
gemlite_tokens_per_sec: 4640.87              ← 6.2× faster
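
For readers unfamiliar with what a 4-bit substitute stores, here is a minimal sketch of group-wise round-to-nearest quantization in pure Python. It is deliberately not HQQ (which chooses its quantization parameters differently) and has nothing to do with GemLite's kernels; the function names are made up for illustration, and the group size of 64 is borrowed from Experiment 2.

```python
import math


def quantize_group(values, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    levels = (1 << bits) - 1           # 15 distinct steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def quantize(weights, group_size=64, bits=4):
    """Split a flat weight list into groups and quantize each group."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]


def dequantize(packed):
    """Reconstruct approximate floats from (codes, scale, zero) groups."""
    out = []
    for codes, scale, zero in packed:
        out.extend(c * scale + zero for c in codes)
    return out


# Synthetic "weights" purely for demonstration.
weights = [0.1 * math.sin(i) for i in range(128)]
packed = quantize(weights)
restored = dequantize(packed)
max_abs_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is reconstructed to within half a quantization step of its group, which is the kind of bounded error the draft path tolerates because the target model has the final say.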

Experiment 2 — Output alignment: KL divergence vs. a conventional small draft model (group_size=64, 3 prompts):

Draft                                     KL(Original ‖ Draft)
SubSpec 4-bit (same model, quantized)     0.1176
Qwen2.5-1.5B (traditional small draft)    0.5899

The quantized substitute's KL divergence is about 5× lower (0.1176 vs. 0.5899), which should translate into a higher token acceptance rate and a larger speedup. The 4-bit 7B model uses only 5.08 GB of GPU memory.
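
As a reference for how a number like KL(Original ‖ Draft) is computed, here is a small self-contained sketch over next-token logits. The logits are invented for illustration and are unrelated to the measurements above; `softmax` and `kl_divergence` are plain helper functions, not part of any library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P ‖ Q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits: one draft tracks the target closely, the other does not.
target_p = softmax([2.0, 1.0, 0.1, -1.0])
close_q = softmax([1.9, 1.1, 0.0, -1.1])
far_q = softmax([0.5, 2.0, 1.0, 0.0])

kl_close = kl_divergence(target_p, close_q)
kl_far = kl_divergence(target_p, far_q)
```

In practice the divergence would be averaged over every token position of the evaluation prompts; a lower value means the draft's next-token distribution is closer to the target's.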

Implementation plan

I would like to implement SubstituteModelProposer targeting vLLM v1's spec decode module, following the unified propose() interface direction from #36219/#37399. A few integration points I've identified:

  • Weight loading: use AutoWeightsLoader to intercept layer weights at load time and generate 4-bit HQQ substitutes for the offloaded layers, without duplicating the full loading path.
  • KV cache sharing: leverage the existing shared_kv_cache_layers mechanism in vllm/v1/worker/utils.py to map draft layers to target KV cache groups — no new cache memory needed.
  • Async prefetch: hook into PrefetchOffloader (vllm/model_executor/offloader/prefetch.py) so the CPU→GPU weight transfer for the target path overlaps with the draft forward pass.
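
To make the draft/verify interaction concrete, here is a toy greedy speculative decoding loop. It is a sketch under simplifying assumptions, not the proposed `SubstituteModelProposer` and not vLLM's `propose()` interface: `target_next` and `draft_next` are hypothetical callables returning the next token for a prefix, verification is written as a sequential loop (a real engine would score all drafted positions in one batched target pass), and only greedy decoding is covered.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: plain greedy decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # Draft phase: the cheap model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # Verify phase: always emit the target's own token; stop at the
        # first disagreement with the draft (or when enough tokens exist).
        for drafted in proposal:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != drafted or len(tokens) >= goal:
                break
        else:
            # All k drafts accepted: the target contributes one bonus token.
            tokens.append(target_next(tokens))
    return tokens


# Toy deterministic "models": the draft disagrees on every third position.
def toy_target(prefix):
    return (sum(prefix) * 31 + 7) % 50


def toy_draft(prefix):
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


baseline = greedy_decode(toy_target, [1, 2, 3], 20)
speculated = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
```

Because every emitted token comes from the target, the output matches plain greedy decoding exactly; the draft only determines how many tokens each verification step yields.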

If the community's focus shifts to Model Runner v2, I'm happy to adapt the implementation accordingly.

Alternatives

No response

Additional context

This is my first time opening an issue here and I would be happy to receive any feedback on the proposal or the format.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Implementing Substitute Speculative Decoding (SubSpec) from the NeurIPS 2025 paper could mitigate the PCIe bottleneck of CPU offloading in vLLM by drafting with low-bit quantized substitutes on the GPU in place of the offloaded layers.

Guidance

  • Review the NeurIPS 2025 paper and its implementation details to understand the SubSpec method and its potential benefits for vLLM.
  • Investigate the feasibility of integrating SubSpec with the current vLLM/HQQ/GemLite stack, considering the results of the pre-implementation experiments.
  • Identify the key integration points, such as weight loading, KV cache sharing, and async prefetch, to ensure a seamless implementation of SubSpec.
  • Consider how a shift in the community's focus to Model Runner v2 would affect the implementation of SubSpec.

Example

No code snippet is provided as the issue is focused on proposing an implementation plan rather than providing a specific code solution.

Notes

The implementation of SubSpec may require significant changes to the vLLM codebase, and its success depends on various factors, including the feasibility of integrating with the current stack and the potential benefits of the method.

Recommendation

SubSpec is a proposed feature, not a workaround available in vLLM today. The pre-implementation experiments are promising, so the proposal is worth pursuing as a way to reduce the cost of the PCIe bottleneck for CPU-offloaded models.
