transformers: [deepseek_v4] conversion_mapping doesn't cover mtp.* paths; MTP keys silently random-init even after _keys_to_ignore is empty

transformers · 2026-05-20 21:39:47

transformers.conversion_mapping.get_checkpoint_conversion_mapping("deepseek_v4") returns 41 WeightRenaming entries that rename upstream-internal naming to HF naming (attn. → self_attn., ffn. → mlp., attn_norm. → input_layernorm., attn.wq_a. → self_attn.q_a_proj., attn.attn_sink → self_attn.sinks, etc.).

Entries 6–38 are anchored at ^layers\.(\d+)\. — they only fire on main-layer keys. None cover mtp.\d+.* paths.

Combined with the existing _keys_to_ignore_on_load_unexpected = [r"(^|\.)mtp\..*"] regex on DeepseekV4PreTrainedModel (filed separately as huggingface/transformers#46127), mtp.* keys never reach the model at all. Even after that regex is dropped (as #46127 does), the MTP keys arrive in upstream form (mtp.0.attn.wq_a.weight), but the MTP submodules expect HF naming (mtp.0.self_attn.q_a_proj.weight). The checkpoint keys are therefore flagged as "unexpected", the submodule parameters are treated as "uninitialized", and _initialize_weights falls through to _init_weights, where init.normal_ fills the MTP block with random values.
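To make the anchoring problem concrete, here is a minimal sketch that uses only Python's re module. The pattern and replacement are copied from one of the renames quoted in this issue; the helper name apply_rename is hypothetical and is not the transformers implementation.

```python
import re

# One of the existing renames quoted in this issue, anchored at ^layers.
LAYER_PATTERN = r"^layers\.(\d+)\.attn\."
LAYER_TARGET = r"layers.\1.self_attn."


def apply_rename(key: str, pattern: str, target: str) -> str:
    """Rewrite the first match of `pattern` in `key` (hypothetical helper)."""
    return re.sub(pattern, target, key, count=1)


main_key = "layers.0.attn.wq_a.weight"
mtp_key = "mtp.0.attn.wq_a.weight"

# The main-layer key is renamed; the MTP key does not satisfy the ^layers
# anchor, so it passes through in upstream form.
renamed_main = apply_rename(main_key, LAYER_PATTERN, LAYER_TARGET)
renamed_mtp = apply_rename(mtp_key, LAYER_PATTERN, LAYER_TARGET)
print(renamed_main)
print(renamed_mtp)
```

The unchanged MTP key is what later fails to line up with the HF-named submodule parameters.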

Summary

Entries 6–38 are anchored at ^layers\.(\d+)\. — they only fire on main-layer keys. None cover mtp.\d+.* paths.

Symptom

The model appears to load successfully: from_pretrained returns without errors, no missing-key warnings are emitted once the regex is dropped, and model.model.mtp[0] exists with the right structure. However, model.model.mtp[0].self_attn.q_a_proj.weight holds random Gaussian values rather than the values stored in the safetensors file. The MTP draft head is silently corrupted, so any downstream calibration, quantization, or inference that uses the MTP block produces garbage.
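Because the load itself reports nothing, comparing key names directly is one way to surface the mismatch. The sketch below is a hedged illustration in plain Python: it assumes you have already collected the checkpoint key names (for example from the safetensors shards) and the model's state_dict key names. The helper names mtp_suffixes and diff_mtp_keys are hypothetical, and the "model." prefix on the model-side keys is an assumption based on the model.model.mtp[0] access path used in the repro.

```python
import re

# Matches the "mtp.…" tail of a key, with or without a leading module prefix.
_MTP_KEY = re.compile(r"(?:^|\.)(mtp\..*)$")


def mtp_suffixes(keys):
    """Collect the 'mtp.…' part of every key that has one."""
    found = set()
    for key in keys:
        match = _MTP_KEY.search(key)
        if match:
            found.add(match.group(1))
    return found


def diff_mtp_keys(checkpoint_keys, model_keys):
    """Return (only_in_checkpoint, only_in_model) for MTP keys, both sorted."""
    ckpt = mtp_suffixes(checkpoint_keys)
    model = mtp_suffixes(model_keys)
    return sorted(ckpt - model), sorted(model - ckpt)


# Illustrative key names taken from this issue; the "model." prefix is assumed.
checkpoint_keys = ["layers.0.attn.wq_a.weight", "mtp.0.attn.wq_a.weight"]
model_keys = [
    "model.layers.0.self_attn.q_a_proj.weight",
    "model.mtp.0.self_attn.q_a_proj.weight",
]
only_in_checkpoint, only_in_model = diff_mtp_keys(checkpoint_keys, model_keys)
print("never renamed:", only_in_checkpoint)
print("never filled:", only_in_model)
```

If both lists are non-empty after loading, the MTP keys were not renamed and the corresponding parameters were left to the initializer.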

Repro

# Assumes huggingface/transformers#46127 is applied: DeepseekV4NextNPredictor
# exists and _keys_to_ignore_on_load_unexpected = [].
from pathlib import Path

from safetensors import safe_open
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("<DSv4-Flash BF16 with mtp.* keys>")

# Compare the loaded MTP weight against the tensor stored in the checkpoint.
loaded_w = model.model.mtp[0].self_attn.q_a_proj.weight
source_w = None
for shard in sorted(Path("<path>").glob("model-*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        if "mtp.0.attn.wq_a.weight" in f.keys():
            source_w = f.get_tensor("mtp.0.attn.wq_a.weight")
            break
assert source_w is not None, "mtp.0.attn.wq_a.weight not found in any shard"

diff = (loaded_w.cpu().float() - source_w.cpu().float()).abs().max().item()
print(f"max_diff = {diff}")
# Without a conversion mapping for mtp.*: diff is in the random Gaussian range (e.g. 0.1+)
# With the mtp.* mapping extension: diff ≈ 0

Proposed fix

Add 33 mtp.\d+.* equivalents mirroring the existing ^layers\.(\d+)\. entries to _checkpoint_conversion_mapping for the deepseek_v4 architecture. The 6 model-level entries (embed., head., norm., hc_head_*) do NOT need to be mirrored — MTP doesn't have its own copy of those (it shares embed_tokens and lm_head with the main model).

Specifically, for each of these patterns, add a parallel entry anchored at ^mtp\.(\d+)\.:

^layers\.(\d+)\.attn_norm\.                      → layers.\1.input_layernorm.
^layers\.(\d+)\.ffn_norm\.                       → layers.\1.post_attention_layernorm.
^layers\.(\d+)\.hc_attn_fn$                       → layers.\1.attn_hc.fn
^layers\.(\d+)\.hc_attn_base$                     → layers.\1.attn_hc.base
^layers\.(\d+)\.hc_attn_scale$                    → layers.\1.attn_hc.scale
^layers\.(\d+)\.hc_ffn_fn$                        → layers.\1.ffn_hc.fn
^layers\.(\d+)\.hc_ffn_base$                      → layers.\1.ffn_hc.base
^layers\.(\d+)\.hc_ffn_scale$                     → layers.\1.ffn_hc.scale
^layers\.(\d+)\.attn\.                            → layers.\1.self_attn.
^layers\.(\d+)\.ffn\.                             → layers.\1.mlp.
^layers\.(\d+)\.self_attn\.attn_sink$             → layers.\1.self_attn.sinks
^layers\.(\d+)\.self_attn\.(.*?)\.wq_a\.          → layers.\1.self_attn.\2.q_a_proj.
^layers\.(\d+)\.self_attn\.(.*?)\.wq_b\.          → layers.\1.self_attn.\2.q_b_proj.
^layers\.(\d+)\.self_attn\.(.*?)\.wkv\.           → layers.\1.self_attn.\2.kv_proj.
^layers\.(\d+)\.self_attn\.(.*?)\.wgate\.         → layers.\1.self_attn.\2.gate_proj.
^layers\.(\d+)\.self_attn\.(.*?)\.wo_a\.          → layers.\1.self_attn.\2.o_a_proj.
^layers\.(\d+)\.self_attn\.(.*?)\.wo_b\.          → layers.\1.self_attn.\2.o_b_proj.
^layers\.(\d+)\.self_attn\.wq_a\.                 → layers.\1.self_attn.q_a_proj.
^layers\.(\d+)\.self_attn\.wq_b\.                 → layers.\1.self_attn.q_b_proj.
^layers\.(\d+)\.self_attn\.wkv\.                  → layers.\1.self_attn.kv_proj.
^layers\.(\d+)\.self_attn\.wo_a\.                 → layers.\1.self_attn.o_a_proj.
^layers\.(\d+)\.self_attn\.wo_b\.                 → layers.\1.self_attn.o_b_proj.
^layers\.(\d+)\.self_attn\.q_norm\.               → layers.\1.self_attn.q_a_norm.
^layers\.(\d+)\.mlp\.gate\.bias$                  → layers.\1.mlp.gate.e_score_correction_bias
^layers\.(\d+)\.mlp\.shared_experts\.w1\.         → layers.\1.mlp.shared_experts.gate_proj.
^layers\.(\d+)\.mlp\.shared_experts\.w2\.         → layers.\1.mlp.shared_experts.down_proj.
^layers\.(\d+)\.mlp\.shared_experts\.w3\.         → layers.\1.mlp.shared_experts.up_proj.

The entries at indexes 17–22 (compressor/indexer renames) only need to be mirrored if MTP can be configured with a compressed_sparse_attention or heavily_compressed_attention layer_type. For DSv4-Flash, MTP uses sliding_attention (compressor = None; see the #46127 discussion), so those 6 entries are not required. Mirroring them anyway is harmless, because the regexes simply never match.
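The mirroring is mechanical, which the following sketch tries to show with plain re and a handful of (source, target) pairs copied from the list above. It assumes that renames are applied in list order, each one to the result of the previous one; that ordering is an assumption about the loader, not something confirmed here. The helpers mirror_pair and rename_key are hypothetical names.

```python
import re

# A few (source, target) pairs copied from the list above, in the same order.
LAYER_RENAMES = [
    (r"^layers\.(\d+)\.attn_norm\.", r"layers.\1.input_layernorm."),
    (r"^layers\.(\d+)\.attn\.", r"layers.\1.self_attn."),
    (r"^layers\.(\d+)\.self_attn\.attn_sink$", r"layers.\1.self_attn.sinks"),
    (r"^layers\.(\d+)\.self_attn\.(.*?)\.wq_a\.", r"layers.\1.self_attn.\2.q_a_proj."),
    (r"^layers\.(\d+)\.self_attn\.wq_a\.", r"layers.\1.self_attn.q_a_proj."),
]

LAYER_SRC_PREFIX = r"^layers\.(\d+)\."
LAYER_TGT_PREFIX = r"layers.\1."


def mirror_pair(source: str, target: str) -> tuple[str, str]:
    """Re-anchor one layers.N rename at mtp.N (hypothetical helper)."""
    assert source.startswith(LAYER_SRC_PREFIX)
    assert target.startswith(LAYER_TGT_PREFIX)
    return (
        r"^mtp\.(\d+)\." + source[len(LAYER_SRC_PREFIX):],
        r"mtp.\1." + target[len(LAYER_TGT_PREFIX):],
    )


MTP_RENAMES = [mirror_pair(s, t) for s, t in LAYER_RENAMES]


def rename_key(key: str, renames) -> str:
    """Apply each rename in order to the running result (assumed behaviour)."""
    for pattern, target in renames:
        key = re.sub(pattern, target, key)
    return key


for upstream_key in (
    "mtp.0.attn.wq_a.weight",
    "mtp.0.attn.attn_sink",
    "mtp.0.attn_norm.weight",
):
    print(upstream_key, "->", rename_key(upstream_key, MTP_RENAMES))
```

Note that the mirrored (.*?) entry never fires on these keys, which matches the point above that mirroring the compressor/indexer entries is harmless when MTP has no such submodules.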

Runtime workaround for downstream users

Until the fix lands upstream, the following snippet mirrors the per-layer entries at runtime. Run it before calling from_pretrained:

from transformers.conversion_mapping import (
    get_checkpoint_conversion_mapping,
    register_checkpoint_conversion_mapping,
)

existing = get_checkpoint_conversion_mapping("deepseek_v4")
added = []
for entry in existing:
    sp = getattr(entry, "source_patterns", None)
    tp = getattr(entry, "target_patterns", None)
    if sp is None or tp is None:
        continue
    # Patterns may be a single string or a list; normalize both to lists.
    sp_list = sp if isinstance(sp, (list, tuple)) else [sp]
    tp_list = tp if isinstance(tp, (list, tuple)) else [tp]
    new_sp, new_tp = [], []
    for s, t in zip(sp_list, tp_list):
        # Mirror only the per-layer entries; model-level entries are skipped.
        if isinstance(s, str) and s.startswith(r"^layers\.(\d+)\."):
            new_sp.append(s.replace(r"^layers\.(\d+)\.", r"^mtp\.(\d+)\.", 1))
            new_tp.append(t.replace("layers.\\1.", "mtp.\\1.", 1))
    if new_sp:
        # Build an entry of the same type, re-anchored at mtp.N.
        added.append(type(entry)(
            source_patterns=new_sp if len(new_sp) > 1 else new_sp[0],
            target_patterns=new_tp if len(new_tp) > 1 else new_tp[0],
        ))
# Replace the registered mapping with the original entries plus the mirrors.
register_checkpoint_conversion_mapping(
    "deepseek_v4", list(existing) + added, overwrite=True)

Detection — value-verification assertion

A 50-line fixture catches this regression class (and the related layer_type bug in #46127) by comparing a loaded MTP tensor to its source. Its core check is:

from pathlib import Path

from safetensors import safe_open

# `model` is the loaded model and `model_path` is the checkpoint directory.
loaded_w = model.model.mtp[0].self_attn.q_a_proj.weight
source_w = None
for shard in sorted(Path(model_path).glob("model-*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        if "mtp.0.attn.wq_a.weight" in f.keys():
            source_w = f.get_tensor("mtp.0.attn.wq_a.weight")
            break
assert source_w is not None, "mtp.0.attn.wq_a.weight not found in any shard"
# Max absolute difference between the loaded weight and the checkpoint tensor.
diff = (loaded_w.cpu().float() - source_w.cpu().float()).abs().max().item()
assert diff < 1e-4, f"MTP weight mismatch: {diff} (silent random-init?)"

This belongs as a test under tests/models/deepseek_v4/ paired with #46127.

Related

huggingface/transformers#46127: adds the DeepseekV4NextNPredictor class, the Model.mtp ModuleList, and the sliding_attention layer_type for MTP (the class shim PR). This issue is its companion: even with the class shim, the conversion mapping must be extended for MTP keys to actually load into the new submodules.
vllm-project/llm-compressor#2735: calibration-side rollup of both issues.
vllm-project/llm-compressor#2739: companion mapping extension PR (for the ARCH_TO_2D_MAPPINGS that lives on llm-compressor's side).
