vllm: [deepseek_v4] DeepSeekV4MTP loader silently skips top-level head.weight + embed.weight → 0% MTP draft acceptance with no error

Summary

vllm.models.deepseek_v4.nvidia.mtp.DeepSeekV4MTP.load_weights silently skips top-level head.weight and embed.weight when the saved artifact stores them at the top level (not as mtp.0.head.weight / mtp.0.emb.tok_emb.weight). The MTP layer's shared_head.head (ParallelLMHead) and embed_tokens (VocabParallelEmbedding) stay uninitialized → MTP draft head emits garbage logits → 0% MTP acceptance with no load-time error. Speculative decoding produces draft tokens that are 100% rejected by the verifier.

Mechanism

DeepSeekV4MTP.load_weights runs this loop:

for name, loaded_weight in weights:
    name = name.replace("mtp.0.", "")  # no-op on top-level keys like "head.weight"
    spec_layer = get_spec_layer_idx(name)
    if spec_layer is None:
        continue  # ← top-level head.weight, embed.weight die here
    ...

For a key like head.weight:

  • name.replace("mtp.0.", "") returns head.weight unchanged.
  • get_spec_layer_idx("head.weight") returns None.
  • The loop hits continue, so the key is never routed to the MTP layer.
  • The MTP layer constructs shared_head.head as a ParallelLMHead, but no weight is ever assigned to it, so it keeps its initial (effectively random) values.
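
The control flow above can be reproduced in isolation. The following is a minimal sketch, not vLLM's code: `get_spec_layer_idx` is a hypothetical stand-in, and the sub-module prefixes and `mtp.0.*` key names are made up for illustration. Only `head.weight`, `embed.weight`, and the `mtp.0.` prefix come from the report.

```python
from typing import Iterable, List, Optional, Tuple

# Made-up sub-module prefixes, for illustration only; the real lookup in vLLM
# is not reproduced here.
MTP_SUBMODULE_PREFIXES = ("block.", "norm.")


def get_spec_layer_idx(name: str) -> Optional[int]:
    """Hypothetical stand-in: return 0 for keys that look like MTP-layer keys."""
    return 0 if name.startswith(MTP_SUBMODULE_PREFIXES) else None


def split_routed_and_skipped(names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Mirror the loop's control flow and report which keys hit `continue`."""
    routed, skipped = [], []
    for original in names:
        name = original.replace("mtp.0.", "")  # no-op on top-level keys
        if get_spec_layer_idx(name) is None:
            skipped.append(original)  # silently dropped in the real loop
            continue
        routed.append(original)
    return routed, skipped


checkpoint_keys = [
    "mtp.0.block.attn.weight",  # illustrative MTP key
    "mtp.0.norm.weight",        # illustrative MTP key
    "head.weight",              # top-level key from the report
    "embed.weight",             # top-level key from the report
]
routed, skipped = split_routed_and_skipped(checkpoint_keys)
print("routed: ", routed)
print("skipped:", skipped)
```

Nothing in this path logs or raises, which is why the dropped keys go unnoticed until acceptance is measured.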

Repro

  1. Save a DSv4-Flash artifact via transformers.save_pretrained (the canonical path). MTP weights land at upstream naming mtp.0.* but head.weight + embed.weight stay at top level.
  2. Serve with --speculative-config '{"method":"mtp","num_speculative_tokens":1}'.
  3. Watch vllm:spec_decode_num_accepted_tokens_total / vllm:spec_decode_num_draft_tokens_total. Result: drafts produced, accepted = 0.
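
To turn the two counters from step 3 into a single number, one option is to parse the server's Prometheus metrics text and divide. This is a rough sketch: the metric names are the ones quoted above, while the sample text, its label, and its values are invented for illustration and are not real server output.

```python
import re

ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"


def counter_total(metrics_text: str, metric: str) -> float:
    """Sum every sample of `metric` in Prometheus text exposition format."""
    # Matches "<metric>{optional labels} <value>" at the start of a line;
    # "# HELP" / "# TYPE" comment lines do not match.
    pattern = re.compile(
        "^" + re.escape(metric) + r"(?:\{[^}]*\})?\s+(\S+)", re.MULTILINE
    )
    return sum(float(value) for value in pattern.findall(metrics_text))


def acceptance_rate(metrics_text: str) -> float:
    """Accepted / drafted tokens, or NaN when nothing has been drafted yet."""
    drafted = counter_total(metrics_text, DRAFTED)
    accepted = counter_total(metrics_text, ACCEPTED)
    return accepted / drafted if drafted else float("nan")


# Invented sample, shaped like the failure described above.
sample = """\
# TYPE vllm:spec_decode_num_draft_tokens_total counter
vllm:spec_decode_num_draft_tokens_total{model_name="demo"} 1200.0
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
vllm:spec_decode_num_accepted_tokens_total{model_name="demo"} 0.0
"""
print(f"acceptance rate: {acceptance_rate(sample):.2%}")
```

A rate that stays at zero while the draft counter keeps growing matches the symptom in this report.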

We hit this exact scenario in canada-quant/dsv4-flash-w4a16-fp8-mtp iteration 9. The smoke artifact initially had 797 mtp.0.* keys and silently got 0% acceptance. Diagnosis trace in FINDINGS_FOR_SIBLING.md §C14.

Workaround (production-validated)

A postprocessing step injects mtp.0.head.weight (an FP32 copy of head.weight) and mtp.0.emb.tok_emb.weight (a BF16 copy of embed.weight) as full duplicates. Working example: scripts/fixup_artifact.py. The sibling artifact canada-quant/DeepSeek-V4-Flash-NVFP4-FP8-MTP applies the same pattern (799 mtp.* keys vs our pre-fix 797; the two extra keys are these aliases).
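
The aliasing step can be sketched on a plain key-to-value mapping. This is not the contents of scripts/fixup_artifact.py, which is not shown here. It is a dependency-free illustration that leaves out tensor I/O and the FP32/BF16 casts mentioned above; the function name `add_mtp_aliases` and the placeholder values are hypothetical.

```python
from typing import Any, Dict

# Top-level source key -> MTP alias key, as described above.
ALIASES = {
    "head.weight": "mtp.0.head.weight",
    "embed.weight": "mtp.0.emb.tok_emb.weight",
}


def add_mtp_aliases(state: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `state` with the two MTP alias keys added.

    Dtype conversion is intentionally omitted; a real script would cast
    the copies before writing the artifact back out.
    """
    patched = dict(state)
    for source, alias in ALIASES.items():
        if source not in state:
            raise KeyError(f"expected top-level key {source!r} in the artifact")
        # setdefault keeps an alias that is already present in the artifact.
        patched.setdefault(alias, state[source])
    return patched


# Strings stand in for tensors so the sketch runs without extra packages.
state = {"head.weight": "HEAD", "embed.weight": "EMBED", "mtp.0.norm.weight": "NORM"}
patched = add_mtp_aliases(state)
print(sorted(key for key in patched if key.startswith("mtp.")))
```

Counting the mtp.* keys before and after is a cheap sanity check; in this toy input the count goes from one to three.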

After the workaround, MTP acceptance lands at 69.94% over 200 random prompts (raw data in benchmarks/phase2/acc_*.json).

Suggested fix

Option A: DeepSeekV4MTP.load_weights should explicitly route top-level head.weight to mtp.{N}.shared_head.head.weight and top-level embed.weight to mtp.{N}.embed_tokens.weight, with a short block placed ahead of the existing spec-layer check:

# Top-level keys the MTP draft tower needs, mapped to their per-layer suffix.
top_level_routes = {
    "head.weight": "shared_head.head.weight",
    "embed.weight": "embed_tokens.weight",
}

for name, loaded_weight in weights:
    # Re-route top-level head/embed to every MTP slot. These model-level keys
    # would otherwise be filtered out by the spec-layer check.
    suffix = top_level_routes.get(name)
    if suffix is not None:
        for layer_off in range(self.config.num_nextn_predict_layers):
            # weight_loader is a placeholder for however the loader assigns
            # a tensor to a named parameter.
            weight_loader(f"mtp.{layer_off}.{suffix}", loaded_weight)
        continue
    # ... existing mtp.0. routing
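
The same routing can also be written as a standalone pre-pass over the (name, tensor) stream, which makes it testable without a model. This is a sketch under assumptions: `reroute_top_level` is a hypothetical helper, and it is not confirmed here that the rest of the loader accepts the rewritten names unchanged.

```python
from typing import Any, Iterable, Iterator, Tuple

# Top-level key -> per-layer suffix, following the naming in Option A.
TOP_LEVEL_ROUTES = {
    "head.weight": "shared_head.head.weight",
    "embed.weight": "embed_tokens.weight",
}


def reroute_top_level(
    weights: Iterable[Tuple[str, Any]], num_mtp_layers: int
) -> Iterator[Tuple[str, Any]]:
    """Yield weights, fanning top-level head/embed out to every MTP layer."""
    for name, tensor in weights:
        suffix = TOP_LEVEL_ROUTES.get(name)
        if suffix is None:
            yield name, tensor  # everything else passes through untouched
            continue
        for layer in range(num_mtp_layers):
            yield f"mtp.{layer}.{suffix}", tensor


# Strings stand in for tensors; the mtp.0.norm.weight key is illustrative.
weights = [("mtp.0.norm.weight", "NORM"), ("head.weight", "HEAD"), ("embed.weight", "EMBED")]
rerouted = list(reroute_top_level(weights, num_mtp_layers=1))
for name, _ in rerouted:
    print(name)
```

Keeping the mapping in one dictionary means an additional shared key is a one-line change.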

Option B: raise at construction time when shared_head.head or embed_tokens ends up uninitialized. Either option is fine; silent 0% acceptance is the worst possible failure mode for spec-decode.
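
A fail-fast check along the lines of Option B might look like the sketch below. It assumes the loader can produce the set of parameter names it actually assigned and that the check runs once loading has finished. The function name is hypothetical, and the parameter names follow the Option A naming rather than a verified model layout.

```python
from typing import List, Set

# Per-layer parameters that must receive a weight (naming taken from Option A).
REQUIRED_SUFFIXES = ("shared_head.head.weight", "embed_tokens.weight")


def find_missing_mtp_params(loaded: Set[str], num_mtp_layers: int) -> List[str]:
    """Return the required MTP parameter names absent from `loaded`."""
    return [
        f"mtp.{layer}.{suffix}"
        for layer in range(num_mtp_layers)
        for suffix in REQUIRED_SUFFIXES
        if f"mtp.{layer}.{suffix}" not in loaded
    ]


def assert_mtp_params_loaded(loaded: Set[str], num_mtp_layers: int) -> None:
    """Raise instead of serving a draft head with unassigned weights."""
    missing = find_missing_mtp_params(loaded, num_mtp_layers)
    if missing:
        raise ValueError(
            "MTP parameters were never loaded: " + ", ".join(missing)
        )


# Example: only the embedding was loaded, so the head is reported as missing.
loaded_names = {"mtp.0.embed_tokens.weight"}
print(find_missing_mtp_params(loaded_names, num_mtp_layers=1))
```

Raising here turns a silent 0% acceptance rate into a load-time error that names the missing parameters.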

Severity

Anyone who quantizes DSv4-Flash and ships an artifact via the standard transformers.save_pretrained path will hit this silently. The sibling NVFP4 team hit it and worked around it; we hit it independently and worked around it; future MTP-preserving quantizations will hit it too unless this is fixed in the loader.
