vllm - 💡(How to fix) Fix [Bug]: Step-3.5/3.7-Flash MTP speculative decoding fails to load on NVFP4 (drafter quantizes mtp_block, can't keep unquantized MTP weights)

vllm · 2026-05-31 01:04:47


On an NVFP4 checkpoint, Step-3.5/3.7-Flash MTP speculative decoding fails to load the draft model with an AssertionError (weight shape mismatch). The MTP drafter builds its mtp_block (and shared_head) using the model's NVFP4 quant config even when the MTP weights are unquantized (BF16). The draft model's quant config is filtered to the target model's modules, so exclude_modules in hf_quant_config.json cannot reach the MTP layers, and there is currently no way to keep them unquantized.

Distinct from the existing Step-3.5 MTP issues (#38339 low acceptance, #40000 v0.19 start failure, #38498 ROCm): this is an NVFP4-specific quantization/load mismatch.

Fix / Workaround

Let the MTP drafter keep mtp_block and shared_head unquantized when the MTP weights are not in the quantized format, e.g. by honoring exclude_modules for the draft's own modules, or by skipping quantization for the MTP block when its checkpoint tensors are unquantized. A minimal NVFP4-only workaround is to build both modules with quant_config=None whenever the quantization method name contains "fp4": the grafted MTP block is BF16, while FP8 checkpoints ship FP8 MTP weights and are left unchanged.
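The second option above (skip quantization when the MTP block's checkpoint tensors are unquantized) could be sketched roughly as follows. This is only an illustration, not vLLM code: the helper name `mtp_block_is_unquantized`, the tensor names, and the layer index 45 are hypothetical, and the input is assumed to be a name-to-dtype mapping such as one read from a safetensors header (where dtypes appear as strings like "BF16" or "U8").

```python
# Sketch: decide from checkpoint metadata whether the MTP block is unquantized.
# Hypothetical helper and tensor names; not part of vLLM.
FULL_PRECISION_DTYPES = {"BF16", "F16", "F32"}


def mtp_block_is_unquantized(tensor_dtypes: dict[str, str], mtp_prefix: str) -> bool:
    """Return True if every *.weight tensor under mtp_prefix is full precision."""
    weights = {
        name: dtype
        for name, dtype in tensor_dtypes.items()
        if name.startswith(mtp_prefix) and name.endswith(".weight")
    }
    if not weights:
        # No MTP weights in the checkpoint at all: nothing to decide.
        return False
    return all(dtype in FULL_PRECISION_DTYPES for dtype in weights.values())


# Hypothetical checkpoint metadata: a BF16 MTP block grafted onto an NVFP4 model.
grafted = {
    "model.layers.0.mlp.gate_up_proj.weight": "U8",  # packed FP4 target layer
    "model.layers.45.mtp_block.mlp.gate_up_proj.weight": "BF16",
    "model.layers.45.mtp_block.self_attn.qkv_proj.weight": "BF16",
}
# Hypothetical fully quantized checkpoint, for contrast.
packed = {
    "model.layers.45.mtp_block.mlp.gate_up_proj.weight": "U8",
    "model.layers.45.mtp_block.mlp.gate_up_proj.weight_scale": "F8_E4M3",
}

print(mtp_block_is_unquantized(grafted, "model.layers.45."))
print(mtp_block_is_unquantized(packed, "model.layers.45."))
```

A check like this keys off the weights themselves rather than the quantization method name, so it would not need a special case per format, at the cost of inspecting checkpoint metadata before the draft modules are built.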


Where

vllm/model_executor/models/step3p5_mtp.py, Step3p5AMultiTokenPredictorLayer.__init__:

quant_config = vllm_config.quant_config
...
self.shared_head = SharedHead(config=config, quant_config=quant_config)
self.mtp_block = Step3p5DecoderLayer(
    vllm_config,            # carries the NVFP4 quant_config
    prefix=f"{prefix}.mtp_block",
)

So mtp_block's linears are created as NVFP4 (packed) and expect packed FP4 weights, while a checkpoint whose MTP block is unquantized (BF16) provides full-width weights:

parameter.py: assert param_data.shape == loaded_weight.shape  -> AssertionError
# e.g. mtp_block.mlp.gate_up_proj: param=(11264, 2048) [NVFP4-packed] vs loaded=(11264, 4096) [BF16]
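The factor of two in the shapes above is consistent with 4-bit values being packed two per byte along the last dimension. A small sketch of that arithmetic, assuming this packing layout (the helper name `nvfp4_packed_shape` is made up for illustration and is not a vLLM function):

```python
# Sketch: why a packed FP4 parameter is half as wide as its BF16 counterpart.
# Assumes two 4-bit values share one byte along the last dimension.

def nvfp4_packed_shape(shape: tuple[int, int]) -> tuple[int, int]:
    """Shape of a byte tensor holding two 4-bit values per element of the last dim."""
    rows, cols = shape
    if cols % 2 != 0:
        raise ValueError("last dimension must be even to pack two values per byte")
    return (rows, cols // 2)


bf16_shape = (11264, 4096)                    # loaded_weight shape from the BF16 MTP block
param_shape = nvfp4_packed_shape(bf16_shape)  # shape the NVFP4 linear allocates

print(param_shape)                # (11264, 2048)
print(param_shape == bf16_shape)  # False, which is what trips the shape assert
```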

Why `exclude_modules` doesn't help

The draft model (Step3p5MTP) gets its own quant config, and vLLM maps/filters exclude_modules against the target model's module tree. The MTP layers (model.layers.{num_hidden_layers .. +n} → .mtp_block) are not target-model modules, so any exclude_modules entry for them is dropped before it reaches the draft. (The same general failure mode has been reported for other MTP models on NVFP4, e.g. Qwen3-Next, where fused names aren't in the ignore list.)
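A toy model of the filtering described above may make the failure easier to see. This is not vLLM's actual code: `filter_excludes`, the four-layer module tree, and the module names are all hypothetical, and it only illustrates how an entry that matches nothing in the target model's module tree gets dropped.

```python
# Toy illustration: exclude entries are kept only if they match a module
# that exists in the target model, so MTP-layer entries disappear.
from fnmatch import fnmatch


def filter_excludes(exclude_modules: list[str], known_modules: list[str]) -> list[str]:
    """Keep only patterns that match at least one known module name."""
    return [
        pattern
        for pattern in exclude_modules
        if any(fnmatch(module, pattern) for module in known_modules)
    ]


NUM_HIDDEN_LAYERS = 4  # hypothetical; the real model has many more layers
target_modules = [f"model.layers.{i}.mlp" for i in range(NUM_HIDDEN_LAYERS)]

excludes = [
    "model.layers.0.mlp",                          # a target-model module
    f"model.layers.{NUM_HIDDEN_LAYERS}.mtp_block*",  # an MTP layer, beyond the target tree
]

kept = filter_excludes(excludes, target_modules)
print(kept)  # ['model.layers.0.mlp']
```

The MTP entry never survives the filter, so the draft model still sees the full NVFP4 quant config for mtp_block.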

Reproduction

1. An NVFP4 Step-3.7-Flash checkpoint whose MTP (next-n predict) weights are kept BF16. (Note: the stock stepfun-ai/Step-3.7-Flash-NVFP4 export ships no MTP weights at all, so this requires grafting the BF16 MTP block from stepfun-ai/Step-3.7-Flash, but the vLLM-side bug below is independent of how the BF16 MTP weights got there.)
2. Serve with --speculative-config '{"method":"mtp","num_speculative_tokens":3}'.
3. Draft model load → AssertionError (shape mismatch) on mtp_block.mlp.gate_up_proj / self_attn.qkv_proj.

Environment: vLLM 0.21–0.22, NVIDIA DGX Spark (GB10 / sm_121a), -tp 2, --quantization modelopt, --kv-cache-dtype fp8.
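For convenience, the serve invocation from the steps and environment above can be assembled like this. It is a sketch only: the checkpoint path is a placeholder for wherever the grafted NVFP4 checkpoint lives, and the flags are just the ones listed in this report. Building the JSON with `json.dumps` avoids shell-quoting mistakes in the --speculative-config value.

```python
# Sketch: assemble the `vllm serve` command line used in the reproduction.
import json
import shlex

MODEL_PATH = "/path/to/Step-3.7-Flash-NVFP4-with-bf16-mtp"  # placeholder path

speculative_config = {"method": "mtp", "num_speculative_tokens": 3}

cmd = [
    "vllm", "serve", MODEL_PATH,
    "--quantization", "modelopt",
    "--kv-cache-dtype", "fp8",
    "-tp", "2",
    "--speculative-config", json.dumps(speculative_config),
]

# Print a copy-pasteable shell command; pass `cmd` to subprocess.run to launch it.
print(shlex.join(cmd))
```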

Suggested fix

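# Minimal NVFP4-only override: treat the MTP block as unquantized whenever
# the quantization method name contains "fp4".
qname = str(quant_config.get_name()) if quant_config is not None else ""
mtp_unquant = "fp4" in qname.lower()
# build shared_head + mtp_block with quant_config=None when mtp_unquant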

This restores MTP on NVFP4 (verified on dual GB10, TP=2: mean acceptance length ~2.4–2.6 tokens/step). Glad to send a PR once a maintainer confirms the preferred approach (minimal NVFP4 override vs. honoring exclude_modules for the draft generally).
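The name-based override can be factored into a small helper so the decision is easy to test in isolation. The sketch below mirrors the snippet in this report but is not vLLM code: `select_mtp_quant_config` and the stand-in `FakeQuantConfig` class are hypothetical, and the method-name strings used in the demo are assumed examples.

```python
# Sketch: pick the quant config to use for shared_head and mtp_block.
from typing import Optional


class FakeQuantConfig:
    """Stand-in exposing only get_name(), for demonstration purposes."""

    def __init__(self, name: str) -> None:
        self._name = name

    def get_name(self) -> str:
        return self._name


def select_mtp_quant_config(quant_config: Optional[FakeQuantConfig]) -> Optional[FakeQuantConfig]:
    """Return None (unquantized) for FP4 methods, else pass the config through."""
    qname = str(quant_config.get_name()) if quant_config is not None else ""
    mtp_unquant = "fp4" in qname.lower()
    return None if mtp_unquant else quant_config


nvfp4 = FakeQuantConfig("modelopt_fp4")  # assumed example name
fp8 = FakeQuantConfig("fp8")             # assumed example name

print(select_mtp_quant_config(nvfp4))        # None: MTP block built unquantized
print(select_mtp_quant_config(fp8) is fp8)   # True: FP8 path unchanged
print(select_mtp_quant_config(None))         # None: no quantization configured
```

Note the trade-off: this assumes every FP4 checkpoint carries BF16 MTP weights, which holds for the grafted setup described here but would need revisiting if an export ever ships FP4-quantized MTP weights.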
