[Bug] DSV4 MTP draft model inherits main quantization scheme; can't load artifacts with BF16 MTP block

Summary

When serving a DSV4-class quantized artifact with --speculative-config '{"method":"mtp","num_speculative_tokens":N}', vLLM constructs the MTP draft model (DeepSeekV4MTP / DeepseekV4MTPModel / DeepSeekV4MultiTokenPredictorLayer) by passing the SAME vllm_config as the main model — including its quant_config. That means the MTP block's DeepseekV4DecoderLayer constructs its DSV4MoE FFN and DSV4Attention with the main model's quantization scheme.

Some artifacts intentionally leave the MTP block unquantized on disk. This is a common pattern because of vllm-project/llm-compressor#2745 (the Inplace update to inference tensor crash that fires at MTP qparam writeback during calibration). For these artifacts, the draft model expects quantized parameter names (w13_weight_packed, weight_scale_inv, etc.), while the checkpoint holds only plain ones (w1.weight, w2.weight, no scales). As a result, either weight loading fails with KeyError or the attention forward fails with AttributeError: ColumnParallelLinear has no attribute weight_scale.

Reproducer

Take any DSV4 artifact whose calibration recipe excluded MTP modules — e.g. an llm-compressor recipe with ignore=[..., r"re:.*mtp\..*"]. Such artifacts have:

  • mtp.0.attn.{wq_a,wq_b,wkv,wo_a,wo_b}.weight (BF16, no scales)
  • mtp.0.ffn.experts.{0..255}.{w1,w2,w3}.weight (BF16, no scales)

Serve with --speculative-config method=mtp:

vllm serve <artifact> --tensor-parallel-size 4 --kv-cache-dtype fp8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Result, depending on which scheme the main model uses:

  • If NVFP4 MoE main scheme: KeyError: 'model.layers.43.mtp_block.ffn.experts.w13_weight' at load (MTP draft model allocates w13_weight_packed, loader looks for w13_weight)
  • If FP8_BLOCK attn main scheme: AttributeError: 'ColumnParallelLinear' object has no attribute 'weight_scale_inv' (or weight_scale) at forward (MTP wo_a is unquantized, no scale parameters exist)

The full traceback originates at vllm/v1/spec_decode/llm_base_proposer.py:1171, in get_model(vllm_config=draft_vllm_config, model_config=self.speculative_config.draft_model_config, ...), which constructs the MTP module with the main model's quant_config flowing through.

Why the workaround (ignore list with re:.*mtp_block.* or re:.*layers.43.*) is unreliable

The should_ignore_layer matcher in vllm/model_executor/layers/quantization/compressed_tensors/utils.py is called with the construction-time prefix (e.g. model.layers.43.ffn.experts). The module-attribute-traversal name (model.layers.43.mtp_block.ffn.experts) differs from it by the inserted .mtp_block. segment. Regex authors who try to skip MTP at the config level therefore have to know the EXACT construction-time prefix format, which is brittle: it depends on vLLM's internal naming and may change.

For our case, re:.*mtp_block.* does NOT match because the construction prefix has no mtp_block segment. re:.*layers\.43\..* DOES match for V4-Flash (where MTP sits at layer 43 = num_hidden_layers), but it only works if the config author knows num_hidden_layers at calibration time, which is not generally the case.
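The mismatch can be reproduced with plain regular expressions. The sketch below is only an illustration: matches_ignore is a hypothetical stand-in written for this example, not vLLM's should_ignore_layer, and it assumes that "re:"-prefixed entries are matched from the start of the name while other entries must match exactly.

```python
import re


def matches_ignore(name: str, ignore: list[str]) -> bool:
    """Hypothetical ignore-list matcher (NOT vLLM's should_ignore_layer).

    Assumption: entries starting with "re:" are regexes anchored at the
    start of the name; all other entries must equal the name exactly.
    """
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            return True
    return False


# The two names quoted in this report.
construction_prefix = "model.layers.43.ffn.experts"
traversal_name = "model.layers.43.mtp_block.ffn.experts"

by_block_name = ["re:.*mtp_block.*"]
by_layer_index = [r"re:.*layers\.43\..*"]

# The block-name pattern only matches the traversal name,
# which is not the name the matcher is called with.
print(matches_ignore(construction_prefix, by_block_name))   # False
print(matches_ignore(traversal_name, by_block_name))        # True

# The layer-index pattern matches, but hard-codes num_hidden_layers.
print(matches_ignore(construction_prefix, by_layer_index))  # True
```

A pattern that names the block is the portable choice for a recipe author, yet under this naming it is exactly the one that never fires.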

Proposed fix

Inside the MTP draft model construction, either:

  1. Override quant_config to None for the MTP block when the artifact has no MTP scale tensors on disk. The MTP loader at vllm/models/deepseek_v4/nvidia/mtp.py:288 could pre-scan the artifact's safetensor keys, detect mtp.\d+.attn.*.weight without companion weight_scale* keys, and instantiate the mtp_block with a stripped quant_config that excludes the MTP-block prefix patterns.

  2. Add an explicit speculative_config.draft_quant_config option to vLLM CLI so users can override the draft model's quant_config independently of the main model. Default behavior unchanged.

  3. At the DeepseekV4DecoderLayer constructor (which serves both main and MTP layers), accept an optional quant_config_override=None parameter that the MTP path can use to force unquantized FFN + attention construction.

The first option is automatic (matches the artifact's actual shape); the second is most explicit (user opt-in); the third is targeted (no global impact).
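As a rough sketch of the detection step in option 1, the check can be done on tensor key names alone. Everything here is an assumption made for illustration: the function names are hypothetical, nothing in this block is existing vLLM code, and keys_from_index assumes the artifact ships a sharded-checkpoint index file (model.safetensors.index.json with a weight_map entry).

```python
import json
import re
from typing import Iterable

# Key shapes taken from this report: mtp.<n>.attn.<module>.weight,
# with optional companion keys named weight_scale or weight_scale_inv.
MTP_ATTN_WEIGHT = re.compile(r"^(mtp\.\d+\.attn\..+)\.weight$")
ANY_WEIGHT_SCALE = re.compile(r"^(.+)\.weight_scale\w*$")


def keys_from_index(index_path: str) -> list[str]:
    """Read tensor names from a sharded-checkpoint index (assumed layout)."""
    with open(index_path, "r", encoding="utf-8") as f:
        return list(json.load(f)["weight_map"].keys())


def mtp_attn_is_unquantized(keys: Iterable[str]) -> bool:
    """True if MTP attention weights exist and none has a scale tensor."""
    key_list = list(keys)
    # Modules that own at least one weight_scale* tensor.
    scaled = {m.group(1) for k in key_list if (m := ANY_WEIGHT_SCALE.match(k))}
    # MTP attention modules that own a plain .weight tensor.
    mtp_modules = {m.group(1) for k in key_list if (m := MTP_ATTN_WEIGHT.match(k))}
    return bool(mtp_modules) and not (mtp_modules & scaled)


bf16_mtp = [
    "layers.0.attn.wq_a.weight",
    "layers.0.attn.wq_a.weight_scale_inv",
    "mtp.0.attn.wq_a.weight",
    "mtp.0.attn.wo_a.weight",
]
quantized_mtp = bf16_mtp + ["mtp.0.attn.wo_a.weight_scale_inv"]

print(mtp_attn_is_unquantized(bf16_mtp))       # True
print(mtp_attn_is_unquantized(quantized_mtp))  # False
```

A loader could use a result of True as the signal to build the MTP block without the main model's quant_config; how that signal would be wired into the draft-model construction is left open here.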

Why this matters

DSV4-class artifacts with BF16 MTP are a real shipping pattern — the upstream llm-compressor#2745 crash effectively forces any production recipe to exclude MTP from the quantization scope. Without this fix in vLLM, those artifacts cannot be served with --speculative-config method=mtp, defeating the entire point of preserving the MTP block in the first place.

Workaround at the artifact level (currently impossible)

The only artifact-side workaround would be to re-calibrate WITH MTP attn and experts in scope. We tried this (commit https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/commit/6f6899c with a narrower MTP ignore that excluded only embed/e_proj/h_proj). The calibration completed subgraphs 1–43 successfully (~73 min on B300, 1-rank) and then crashed at subgraph 44 (MTP block) with the SAME Inplace update to inference tensor outside InferenceMode is not allowed error from vllm-project/llm-compressor#2745. The inference-mode marking propagates through the entire MTP block forward graph, not just the shared-embed lookup. So any MTP-block module quantization triggers the crash.

This means the only currently shippable recipe for DSV4 MTP-preserving artifacts is to exclude the entire MTP block from quant scope, which in turn triggers the vLLM-side issue this report describes.

Related

  • vllm-project/llm-compressor#2745 — the upstream cause that forces MTP exclusion
  • vLLM PR #43290 (this org) — weight_scale_inv/weight_scale fallback at attention.py:334; partial unblock for the AttributeError variant but still requires wo_a to be quantized
  • The artifact this affects: canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP (the first MTP-preserving NVFP4-FP8 DSV4 quant; GSM8K 0.9181 strict / 0.9515 flexible beats RedHat 0.910, full repo at https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp)
