transformers - ✅ (Solved) Fix Qwen3.5: DeepSpeed ZeRO-3 fails to load weights for language_model [1 pull request, 1 comment, 2 participants]

GitHub stats
huggingface/transformers#45313 (fetched 2026-04-09 07:50:47)
Comments: 1 · Participants: 2 · Timeline events: 2 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #45314: Conversion for LLM class loading with VLM ckpt

Description (problem / solution / changelog)

What does this PR do?

fixes https://github.com/huggingface/transformers/issues/45216 and https://github.com/huggingface/transformers/issues/45310 and https://github.com/huggingface/transformers/issues/45313

To be honest, load-save-load works for the model on the main branch, which is why the tests are not failing; the problem is only that the saved state dict is completely weird and incorrect. Something also goes wrong when loading with DeepSpeed, but I didn't really check.

This works with from_pretrained only because we replace all matches with original_key.replace, so the whole language_model.language_model.language_model part is replaced.
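
The replace-all behaviour described above can be illustrated with a minimal sketch. The malformed key and the source/target strings below are hypothetical stand-ins, not the actual patterns from conversion_mapping.py; the point is only that str.replace rewrites every occurrence, which can hide a repeated prefix on reload.

```python
# Hypothetical malformed key with a repeated prefix, as a badly saved
# state dict might contain. Not taken from a real checkpoint.
WEIRD_KEY = "model.language_model.language_model.language_model.norm.weight"


def rename_all(key: str, source: str, target: str) -> str:
    """Replace every non-overlapping occurrence of source (plain str.replace)."""
    return key.replace(source, target)


def rename_first(key: str, source: str, target: str) -> str:
    """Replace only the first occurrence, shown for contrast."""
    return key.replace(source, target, 1)


# Replace-all collapses the whole repeated prefix, so the key looks fine again.
collapsed = rename_all(WEIRD_KEY, "language_model.", "")
# Replacing a single match would leave the leftover prefixes visible.
partial = rename_first(WEIRD_KEY, "language_model.", "")

print(collapsed)
print(partial)
```

Under these assumptions, a round trip can succeed even though the intermediate saved keys are wrong, which would be consistent with the tests not catching the problem.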

Changed files

  • src/transformers/conversion_mapping.py (modified, +12/-8)
  • src/transformers/models/gemma3n/modeling_gemma3n.py (modified, +1/-0)
  • src/transformers/models/gemma3n/modular_gemma3n.py (modified, +1/-1)
  • src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +1/-0)
  • src/transformers/models/qwen3_5/modular_qwen3_5.py (modified, +1/-0)
  • src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +1/-0)
  • src/transformers/models/qwen3_5_moe/modular_qwen3_5_moe.py (modified, +1/-0)
  • tests/models/gemma3n/test_modeling_gemma3n.py (modified, +0/-6)
  • tests/models/qwen3_5/test_modeling_qwen3_5.py (modified, +0/-6)
  • tests/models/qwen3_5_moe/test_modeling_qwen3_5_moe.py (modified, +0/-6)

System Info

  • transformers version: 5.4.0
  • Platform: Linux (H200 x4)
  • Python version: 3.12.0
  • DeepSpeed version: 0.18.5
  • PyTorch version: 2.8.0+cu128 (CUDA)

Problem

When loading Qwen/Qwen3.5-27B (also tested with 9B) with DeepSpeed ZeRO-3, language_model parameters are reported as MISSING in the load report.

Key                                                                  | Status  | Details
---------------------------------------------------------------------+---------+--------
model.language_model.layers.{0...63}.post_attention_layernorm.weight | MISSING |        
model.language_model.layers.{0...62}.linear_attn.out_proj.weight     | MISSING |        
model.language_model.norm.weight                                     | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_qkv.weight  | MISSING |        
model.language_model.layers.{0...62}.linear_attn.conv1d.weight       | MISSING |        
model.language_model.layers.{3...63}.self_attn.q_norm.weight         | MISSING |        
model.language_model.layers.{0...62}.linear_attn.dt_bias             | MISSING |        
model.language_model.layers.{0...62}.linear_attn.A_log               | MISSING |        
model.language_model.layers.{0...62}.linear_attn.norm.weight         | MISSING |        
model.language_model.layers.{0...63}.mlp.gate_proj.weight            | MISSING |        
model.language_model.layers.{0...63}.mlp.down_proj.weight            | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_a.weight    | MISSING |        
model.language_model.layers.{3...63}.self_attn.q_proj.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.k_proj.weight         | MISSING |        
model.language_model.layers.{0...63}.input_layernorm.weight          | MISSING |        
model.language_model.layers.{0...63}.mlp.up_proj.weight              | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_b.weight    | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_z.weight    | MISSING |        
model.language_model.layers.{3...63}.self_attn.k_norm.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.o_proj.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.v_proj.weight         | MISSING |        
model.language_model.embed_tokens.weight                             | MISSING |

Cause hypothesis

In conversion_mapping.py, the language_model weight keys are remapped to the prefix "model" only, rather than "model.language_model". This conversion is applied in _load_pretrained_model when DeepSpeed ZeRO-3 is turned on:
https://github.com/huggingface/transformers/blob/d081c718b8825036a7662ec819313e5141dc34b5/src/transformers/conversion_mapping.py#L155-L157

The problem disappears when setting target_patterns="model.language_model" or using ZeRO-2.
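
A toy model of this hypothesis, assuming a single regex-style renaming rule: the source pattern, the helper names, and the two sample keys are illustrative and are not the actual entries or API of conversion_mapping.py. It shows why a target of "model" would leave every model.language_model.* parameter unmatched, while "model.language_model" would not.

```python
import re

# Two sample keys, used both as checkpoint keys and as the keys the model
# class expects (illustrative subset of the load report).
CHECKPOINT_KEYS = [
    "model.language_model.embed_tokens.weight",
    "model.language_model.norm.weight",
]
EXPECTED_KEYS = set(CHECKPOINT_KEYS)

# Hypothetical source pattern; the real one lives in conversion_mapping.py.
SOURCE_PATTERN = r"^model\.language_model"


def missing_after(target_pattern: str) -> list[str]:
    """Rename checkpoint keys, then list expected keys that found no match."""
    renamed = {re.sub(SOURCE_PATTERN, target_pattern, k) for k in CHECKPOINT_KEYS}
    return sorted(EXPECTED_KEYS - renamed)


# Target "model": keys become model.embed_tokens.weight etc., so everything
# the model expects under model.language_model.* is reported as missing.
print(missing_after("model"))
# Target "model.language_model": keys are unchanged and nothing is missing.
print(missing_after("model.language_model"))
```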

Reproduction

from transformers import Qwen3_5ForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

ds_cfg = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 3,
    }
}
dschf = HfDeepSpeedConfig(ds_cfg)
model = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-27B", torch_dtype="auto")

Expected Behavior

Model weights should load correctly with DeepSpeed ZeRO-3.
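
One way to check this expectation programmatically is to inspect the loading info rather than read the printed report. The helper below is a sketch with a hypothetical name; it assumes from_pretrained is called with output_loading_info=True and that the returned info dict carries a "missing_keys" entry in the installed version.

```python
def missing_under_prefix(loading_info: dict, prefix: str = "model.language_model.") -> list[str]:
    """Return the missing keys that fall under the given parameter prefix."""
    return sorted(k for k in loading_info.get("missing_keys", []) if k.startswith(prefix))


# Intended usage (not executed here, it needs the checkpoint and DeepSpeed):
#   model, info = Qwen3_5ForConditionalGeneration.from_pretrained(
#       "Qwen/Qwen3.5-27B", torch_dtype="auto", output_loading_info=True
#   )
#   assert not missing_under_prefix(info)

# Stand-in loading info shaped like the report in the Problem section.
example_info = {
    "missing_keys": [
        "model.language_model.norm.weight",
        "model.language_model.embed_tokens.weight",
    ],
    "unexpected_keys": [],
}
print(missing_under_prefix(example_info))
```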

Related issues

  • #45310
  • #45216

extent analysis

TL;DR

The issue is marked as fixed by PR #45314. Before that fix, the workaround reported in the issue is to change target_patterns in the conversion mapping (conversion_mapping.py) to model.language_model when using DeepSpeed ZeRO-3.

Guidance

  • Verify that the issue is indeed related to the conversion mapping in conversion_mapping.py by checking if setting target_patterns="model.language_model" resolves the problem.
  • Consider using ZeRO-2 as a temporary workaround if adjusting target_patterns is not feasible.
  • Review the conversion_mapping.py file to understand how weight keys are remapped and how this affects the loading of language_model parameters.
  • Check the related issues (#45310, #45216) for any additional information or potential fixes.
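
The ZeRO-2 workaround from the guidance above amounts to changing one field of the DeepSpeed config used in the reproduction. A minimal sketch follows; the helper name with_zero_stage is hypothetical, and the resulting dict would be passed to HfDeepSpeedConfig exactly as in the reproduction.

```python
import copy

# Same config as the reproduction, which uses ZeRO stage 3.
BASE_DS_CFG = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {"stage": 3},
}


def with_zero_stage(cfg: dict, stage: int) -> dict:
    """Return a deep copy of cfg with the ZeRO stage overridden."""
    new_cfg = copy.deepcopy(cfg)
    new_cfg.setdefault("zero_optimization", {})["stage"] = stage
    return new_cfg


# Temporary workaround config: identical except for the ZeRO stage.
zero2_cfg = with_zero_stage(BASE_DS_CFG, 2)
print(zero2_cfg["zero_optimization"])

# Intended usage (not executed here):
#   dschf = HfDeepSpeedConfig(zero2_cfg)
```

Note that ZeRO-2 does not partition parameters, so this trades the loading problem for higher per-GPU memory use.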

Example

from transformers import Qwen3_5ForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

ds_cfg = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 3,
    }
}
dschf = HfDeepSpeedConfig(ds_cfg)
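# target_patterns is set on the mapping entry in conversion_mapping.py (see Cause hypothesis);
# it is not passed to from_pretrained. With that entry changed to "model.language_model",
# the load call itself stays the same as in the reproduction:
model = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-27B", torch_dtype="auto")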

Notes

This fix assumes that the issue is indeed related to the conversion mapping and that adjusting target_patterns will resolve the problem. If this does not work, further investigation into the conversion_mapping.py file and related issues may be necessary.

Recommendation

Prefer a transformers build that includes PR #45314. Until then, set target_patterns="model.language_model" in the conversion mapping (conversion_mapping.py) or fall back to ZeRO-2; the issue reporter observed that either change makes the problem disappear.
