transformers - ✅(Solved) Fix Qwen3.5: DeepSpeed ZeRO-3 fails to load weights for language_model [1 pull requests, 1 comments, 2 participants]

transformers · 2026-04-08 11:44:10

GitHub stats

huggingface/transformers#45313 • Fetched 2026-04-09 07:50:47

Author

debOliveira

Participants

debOliveira

zucchini-nlp

Timeline (top)

commented ×1, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: Conversion for LLM class loading with VLM ckpt (https://github.com/huggingface/transformers/pull/45314)

PR fix notes

PR #45314: Conversion for LLM class loading with VLM ckpt

Repository: huggingface/transformers
Author: zucchini-nlp
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45314

Description (problem / solution / changelog)

What does this PR do?

fixes https://github.com/huggingface/transformers/issues/45216 and https://github.com/huggingface/transformers/issues/45310 and https://github.com/huggingface/transformers/issues/45313

To be honest, load-save-load works for the model on the main branch, which is why the tests are not failing; the problem is only that the saved state dict is completely wrong. Something also goes wrong when loading with DeepSpeed, but I didn't really check.

This works with from_pretrained only because we replace all matches with original_key.replace, so the whole language_model.language_model.language_model part is replaced.
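
The remark about original_key.replace relies on general Python string behavior, which the following minimal sketch illustrates. It is not the library's code: the key below is a hypothetical example with the prefix repeated, and the anchored regex is only shown for contrast.

```python
import re

# Hypothetical key with the prefix repeated, as mentioned in the PR description.
key = "model.language_model.language_model.language_model.norm.weight"

# str.replace substitutes every non-overlapping occurrence of the substring.
replaced_all = key.replace(".language_model", "")

# An anchored regex with count=1 rewrites only the leading occurrence.
replaced_once = re.sub(r"^model\.language_model", "model", key, count=1)

print(replaced_all)   # model.norm.weight
print(replaced_once)  # model.language_model.language_model.norm.weight
```

Because every occurrence is replaced at once, a malformed repeated prefix can be masked on reload, which would be consistent with the tests passing.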

Changed files

src/transformers/conversion_mapping.py (modified, +12/-8)
src/transformers/models/gemma3n/modeling_gemma3n.py (modified, +1/-0)
src/transformers/models/gemma3n/modular_gemma3n.py (modified, +1/-1)
src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +1/-0)
src/transformers/models/qwen3_5/modular_qwen3_5.py (modified, +1/-0)
src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +1/-0)
src/transformers/models/qwen3_5_moe/modular_qwen3_5_moe.py (modified, +1/-0)
tests/models/gemma3n/test_modeling_gemma3n.py (modified, +0/-6)
tests/models/qwen3_5/test_modeling_qwen3_5.py (modified, +0/-6)
tests/models/qwen3_5_moe/test_modeling_qwen3_5_moe.py (modified, +0/-6)

System Info

transformers version: 5.4.0
Platform: Linux (H200 x4)
Python version: 3.12.0
DeepSpeed version: 0.18.5
PyTorch version: 2.8.0+cu128 (CUDA)

Problem

When loading Qwen/Qwen3.5-27B (also tested with 9B) with DeepSpeed ZeRO-3, language_model parameters are reported as MISSING in the load report.

Key                                                                  | Status  | Details
---------------------------------------------------------------------+---------+--------
model.language_model.layers.{0...63}.post_attention_layernorm.weight | MISSING |        
model.language_model.layers.{0...62}.linear_attn.out_proj.weight     | MISSING |        
model.language_model.norm.weight                                     | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_qkv.weight  | MISSING |        
model.language_model.layers.{0...62}.linear_attn.conv1d.weight       | MISSING |        
model.language_model.layers.{3...63}.self_attn.q_norm.weight         | MISSING |        
model.language_model.layers.{0...62}.linear_attn.dt_bias             | MISSING |        
model.language_model.layers.{0...62}.linear_attn.A_log               | MISSING |        
model.language_model.layers.{0...62}.linear_attn.norm.weight         | MISSING |        
model.language_model.layers.{0...63}.mlp.gate_proj.weight            | MISSING |        
model.language_model.layers.{0...63}.mlp.down_proj.weight            | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_a.weight    | MISSING |        
model.language_model.layers.{3...63}.self_attn.q_proj.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.k_proj.weight         | MISSING |        
model.language_model.layers.{0...63}.input_layernorm.weight          | MISSING |        
model.language_model.layers.{0...63}.mlp.up_proj.weight              | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_b.weight    | MISSING |        
model.language_model.layers.{0...62}.linear_attn.in_proj_z.weight    | MISSING |        
model.language_model.layers.{3...63}.self_attn.k_norm.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.o_proj.weight         | MISSING |        
model.language_model.layers.{3...63}.self_attn.v_proj.weight         | MISSING |        
model.language_model.embed_tokens.weight                             | MISSING |

Cause hypothesis

In conversion_mapping.py, the language_model weight keys are remapped to model only. This conversion is called in _load_pretrained_model when DeepSpeed ZeRO-3 is turned on. https://github.com/huggingface/transformers/blob/d081c718b8825036a7662ec819313e5141dc34b5/src/transformers/conversion_mapping.py#L155-L157

The problem disappears when setting target_patterns="model.language_model" or using ZeRO-2.
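
The hypothesis can be illustrated with a small, self-contained sketch. It does not use the transformers loading code: rename_keys and missing_keys are hypothetical helpers, and the visual key is an invented placeholder. It only shows how renaming model.language_model to model would leave the keys expected by the model unmatched.

```python
import re

def rename_keys(checkpoint_keys, source_pattern, target):
    """Apply one regex rename to every checkpoint key (hypothetical helper)."""
    return [re.sub(source_pattern, target, key) for key in checkpoint_keys]

def missing_keys(expected, loaded):
    """Return the expected keys that are absent from the loaded keys."""
    return sorted(set(expected) - set(loaded))

# Keys the model expects; the last one is an invented placeholder.
expected = [
    "model.language_model.embed_tokens.weight",
    "model.language_model.norm.weight",
    "model.visual.example.weight",
]
checkpoint = list(expected)

# Remapping to "model" only drops the language_model segment.
bad = rename_keys(checkpoint, r"^model\.language_model", "model")
# Remapping to "model.language_model" keeps the keys aligned.
good = rename_keys(checkpoint, r"^model\.language_model", "model.language_model")

print(missing_keys(expected, bad))
print(missing_keys(expected, good))
```

With the first mapping, both language_model keys end up missing; with the second, nothing is missing, which matches the behavior reported above.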

Reproduction

from transformers import Qwen3_5ForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

ds_cfg = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 3,
    }
}
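# Create the config object before from_pretrained and keep a reference to it,
# so that ZeRO-3 is detected while the model is being loaded.
dschf = HfDeepSpeedConfig(ds_cfg)
model = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-27B", torch_dtype="auto")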

Expected Behavior

Model weights should load correctly with DeepSpeed ZeRO-3.

Related issues

#45310
#45216

extent analysis

TL;DR

The most likely fix is to adjust the target_patterns in the conversion mapping to include model.language_model when using DeepSpeed ZeRO-3.

Guidance

Verify that the issue is indeed related to the conversion mapping in conversion_mapping.py by checking if setting target_patterns="model.language_model" resolves the problem.
Consider using ZeRO-2 as a temporary workaround if adjusting target_patterns is not feasible.
Review the conversion_mapping.py file to understand how weight keys are remapped and how this affects the loading of language_model parameters.
Check the related issues (#45310, #45216) for any additional information or potential fixes.
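
To check whether every missing key really falls under model.language_model, the keys can be grouped by prefix. The sketch below is a hypothetical helper that works on a plain list of key names. It assumes you obtained that list yourself, for example from the load report or, if your version supports it, from from_pretrained(..., output_loading_info=True).

```python
from collections import Counter

def summarize_missing(missing_keys, depth=2):
    """Count missing keys per dotted prefix of the given depth."""
    prefixes = (".".join(key.split(".")[:depth]) for key in missing_keys)
    return dict(Counter(prefixes))

# Sample keys taken from the load report in this issue.
sample = [
    "model.language_model.norm.weight",
    "model.language_model.embed_tokens.weight",
    "model.language_model.layers.0.mlp.up_proj.weight",
]

print(summarize_missing(sample))
```

A single model.language_model bucket would be consistent with the conversion mapping hypothesis; other prefixes would point to a different cause.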

Example

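from transformers import Qwen3_5ForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

# Note: target_patterns belongs to the conversion mapping entry in
# src/transformers/conversion_mapping.py; it is not a from_pretrained()
# argument and cannot be passed here. This example instead uses the other
# workaround reported in the issue: ZeRO-2 instead of ZeRO-3.
ds_cfg = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 2,  # workaround: the issue reports correct loading with ZeRO-2
    }
}
# Keep a reference to this object while the model is being loaded.
dschf = HfDeepSpeedConfig(ds_cfg)
model = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-27B", torch_dtype="auto")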

Notes

This fix assumes that the issue is indeed related to the conversion mapping and that adjusting target_patterns will resolve the problem. If this does not work, further investigation into the conversion_mapping.py file and related issues may be necessary.

Recommendation

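Until the fix in PR #45314 is available, apply one of the two workarounds reported by the issue author: change the mapping entry referenced in the issue (in conversion_mapping.py) so that target_patterns="model.language_model", or use ZeRO-2 instead of ZeRO-3. Note that target_patterns is a field of the conversion mapping, not an argument to from_pretrained.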
