transformers - ✅(Solved) Fix Qwen3_5MoeForConditionalGeneration missing _tp_plan for tensor parallelism [1 pull requests, 1 participants]

Q: Expected behavior

`lm_head` should be sharded with `colwise_gather_output` when using `tp_plan="auto"`, consistent with `Qwen3_5MoeForCausalLM`. Fix: huggingface/transformers#45124

transformers · 2026-03-30 16:33:07


GitHub stats

huggingface/transformers#45125 • Fetched 2026-04-08 01:52:42


Author

danielquintas8

Participants

danielquintas8

Timeline (top)

mentioned ×2, subscribed ×2, closed ×1, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: [Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration (https://github.com/huggingface/transformers/pull/45124)

PR fix notes

PR #45124: [Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration

Repository: huggingface/transformers
Author: danielquintas8
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45124

Description (problem / solution / changelog)

What does this PR do?

Adds _tp_plan = {"lm_head": "colwise_gather_output"} to Qwen3_5MoeForConditionalGeneration (the VL wrapper class).

The text-only Qwen3_5MoeForCausalLM already had _tp_plan, but the VL variant was missing it. This meant that when using tp_plan="auto", the lm_head on the VL model was not sharded — each GPU held a full copy and the all-gather behavior (colwise_gather_output) was not applied, which could produce incorrect logits under tensor parallelism.
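To make the colwise_gather_output behavior concrete, here is a minimal pure-Python sketch of the idea: the output (vocabulary) dimension of lm_head is split across ranks, each rank computes its slice of the logits, and the slices are concatenated. This is an illustration only, not the transformers or PyTorch implementation; the function names, shapes, and values are hypothetical.

```python
# Pure-Python stand-in for a column-wise sharded output projection.
# Shapes are tiny and hypothetical; real tensor parallelism runs across GPUs.

def matvec(weight_rows, hidden):
    """Compute logits = W @ h, where W is a list of rows (one row per vocab entry)."""
    return [sum(w * x for w, x in zip(row, hidden)) for row in weight_rows]

def shard_rows(weight_rows, world_size):
    """Split the output (vocab) dimension evenly across ranks."""
    per_rank = len(weight_rows) // world_size
    return [weight_rows[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

def colwise_gather(weight_rows, hidden, world_size):
    """Each rank computes its slice of the logits; the slices are then concatenated."""
    shards = shard_rows(weight_rows, world_size)
    partial_logits = [matvec(shard, hidden) for shard in shards]  # one list per rank
    return [value for part in partial_logits for value in part]   # the "gather" step

# lm_head weight with vocab_size=4 and hidden_size=3 (illustrative values).
lm_head = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
hidden_state = [2.0, 4.0, 6.0]

full_logits = matvec(lm_head, hidden_state)
gathered_logits = colwise_gather(lm_head, hidden_state, world_size=2)
```

In this sketch each rank stores only half of the weight rows, yet the gathered result matches the unsharded projection.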

Change: Applied in modular_qwen3_5_moe.py (source of truth) and regenerated modeling_qwen3_5_moe.py.

Already in place (no changes needed):

base_model_tp_plan on Qwen3_5MoeTextConfig covers full attention (q/k/v/o_proj, q/k_norm), MoE experts, and shared experts.

Out of scope (future work):

Linear attention (GatedDeltaNet) TP — blocked on causal_conv1d DTensor support.
Vision block TP — pending path resolution investigation.

Fixes #45125

Code Agent Policy

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@3outeille @ArthurZucker (distributed / model loading)

Changed files

src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +1/-0)
src/transformers/models/qwen3_5_moe/modular_qwen3_5_moe.py (modified, +2/-0)

Code Example

import torch
from transformers import Qwen3_5MoeForConditionalGeneration

# Before the fix: lm_head is NOT sharded; it is replicated on every GPU.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B", tp_plan="auto", torch_dtype=torch.bfloat16
)


System Info

transformers main branch (post-Qwen3.5 MoE addition)
Any platform with multi-GPU setup

Who can help?

@3outeille @ArthurZucker

Information

My own modified scripts

Tasks

My own task or dataset (give details below)

Reproduction

Qwen3_5MoeForConditionalGeneration (the VL wrapper) is missing _tp_plan, while the text-only Qwen3_5MoeForCausalLM already has _tp_plan = {"lm_head": "colwise_gather_output"}.

import torch
from transformers import Qwen3_5MoeForConditionalGeneration

# Before the fix: lm_head is NOT sharded; it is replicated on every GPU.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B", tp_plan="auto", torch_dtype=torch.bfloat16
)

The lm_head Linear layer is not included in any TP plan for this class, so under tp_plan="auto" it remains replicated instead of being sharded with colwise_gather_output. This wastes memory and may produce incorrect logits since the all-gather is not applied.
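As a rough illustration of the memory cost, the sketch below compares the per-GPU size of a replicated lm_head with an evenly sharded one. The vocabulary size, hidden size, and GPU count are hypothetical placeholders, not values read from the Qwen3.5 MoE config; substitute the real numbers from your model's config.

```python
def lm_head_bytes_per_gpu(vocab_size, hidden_size, bytes_per_param, world_size, sharded):
    """Bytes of lm_head weight held by one GPU, replicated or evenly sharded."""
    total = vocab_size * hidden_size * bytes_per_param
    return total // world_size if sharded else total

# Hypothetical values for illustration only; not taken from the model config.
VOCAB_SIZE = 150_000
HIDDEN_SIZE = 2_048
BF16_BYTES = 2   # bfloat16 stores each parameter in 2 bytes
WORLD_SIZE = 4   # number of GPUs in the tensor-parallel group

replicated = lm_head_bytes_per_gpu(VOCAB_SIZE, HIDDEN_SIZE, BF16_BYTES, WORLD_SIZE, sharded=False)
sharded = lm_head_bytes_per_gpu(VOCAB_SIZE, HIDDEN_SIZE, BF16_BYTES, WORLD_SIZE, sharded=True)

print(f"replicated: {replicated / 2**20:.1f} MiB per GPU")
print(f"sharded:    {sharded / 2**20:.1f} MiB per GPU")
```

Under these assumptions, the replicated layer costs each GPU WORLD_SIZE times as much memory as the sharded one.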

Expected behavior

lm_head should be sharded with colwise_gather_output when using tp_plan="auto", consistent with Qwen3_5MoeForCausalLM.

Fix: huggingface/transformers#45124

extent analysis

Fix Plan

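To fix the issue, add a _tp_plan attribute to the Qwen3_5MoeForConditionalGeneration class so that lm_head is sharded with colwise_gather_output. The upstream fix does this in modular_qwen3_5_moe.py and regenerates modeling_qwen3_5_moe.py.

Until that fix is available in your installed version, a local subclass should be able to carry the same attribute (a workaround sketch; the subclass name is arbitrary):

from transformers import Qwen3_5MoeForConditionalGeneration

# Subclass under a distinct name so the original class is not shadowed.
class Qwen3_5MoeForConditionalGenerationTP(Qwen3_5MoeForConditionalGeneration):
    _tp_plan = {"lm_head": "colwise_gather_output"}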

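Alternatively, you may be able to pass an explicit plan through the tp_plan argument. Whether from_pretrained accepts a dict here, and whether that dict extends or replaces the plan derived from the config, depends on the installed transformers version, so check the resulting sharding before relying on it:

import torch
from transformers import Qwen3_5MoeForConditionalGeneration

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B",
    tp_plan={"lm_head": "colwise_gather_output"},
    torch_dtype=torch.bfloat16,
)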

Verification

To verify the fix, check that lm_head is actually sharded rather than replicated: on N GPUs, the per-GPU memory held by lm_head should drop to roughly 1/N of its replicated size, and the logits produced under tensor parallelism should match a single-device run within numerical tolerance.
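A quick first check is whether the class declares an lm_head entry in its _tp_plan at all. The helper below is a minimal sketch of that check; lm_head_tp_style and the Fake* classes are hypothetical stand-ins so the example runs without transformers installed.

```python
def lm_head_tp_style(model_cls):
    """Return the TP style declared for lm_head on a class, or None if absent."""
    plan = getattr(model_cls, "_tp_plan", None) or {}
    return plan.get("lm_head")

# Stand-in classes (hypothetical) mimicking the class attribute described in the issue.
class FakeCausalLM:
    _tp_plan = {"lm_head": "colwise_gather_output"}

class FakeConditionalGenerationBeforeFix:
    _tp_plan = None  # no plan declared: lm_head would stay replicated

class FakeConditionalGenerationAfterFix:
    _tp_plan = {"lm_head": "colwise_gather_output"}

for cls in (FakeCausalLM, FakeConditionalGenerationBeforeFix, FakeConditionalGenerationAfterFix):
    print(cls.__name__, "->", lm_head_tp_style(cls))
```

With transformers installed, the same helper can be called on Qwen3_5MoeForConditionalGeneration itself. It only inspects the declared plan; it does not confirm how the weights were actually placed at load time.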

Extra Tips

Once the fix (PR #45124) is available in a released version, upgrade transformers to a version that includes it.
If you maintain a custom model class, add a matching lm_head entry to its _tp_plan.
tp_plan="auto" only shards lm_head when the model class declares it in _tp_plan; on versions without the fix, the lm_head of Qwen3_5MoeForConditionalGeneration stays replicated.


#model save/load #optimization #mixed precision #training loop #GPU setup
