transformers: Fix Qwen3_5MoeForConditionalGeneration missing _tp_plan for tensor parallelism (solved; 1 pull request, 1 participant)

GitHub stats
huggingface/transformers#45125 (fetched 2026-04-08 01:52:42)
Comments: 0
Participants: 1
Timeline events: 6
Reactions: 1
Timeline (top): mentioned ×2, subscribed ×2, closed ×1, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #45124: [Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration

Description (problem / solution / changelog)

What does this PR do?

Adds _tp_plan = {"lm_head": "colwise_gather_output"} to Qwen3_5MoeForConditionalGeneration (the VL wrapper class).

The text-only Qwen3_5MoeForCausalLM already had _tp_plan, but the VL variant was missing it. This meant that when using tp_plan="auto", the lm_head on the VL model was not sharded — each GPU held a full copy and the all-gather behavior (colwise_gather_output) was not applied, which could produce incorrect logits under tensor parallelism.
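To make the colwise_gather_output idea concrete, the following is a minimal pure-Python sketch of a linear head whose output dimension is split across ranks and whose per-rank results are concatenated afterwards. It is a toy simulation, not the transformers implementation: the helpers linear and shard_rows, the 4x3 weight, and the two simulated ranks are all hypothetical. It assumes the weight is stored as (out_features, in_features), so that "column-wise" sharding splits the rows of the stored weight.

```python
# Toy simulation of column-wise sharding with a gathered output.
# Not the transformers implementation; every name here is illustrative.

def linear(weight, x):
    """Compute weight @ x for a weight stored as a list of rows (out_features x in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def shard_rows(weight, world_size):
    """Split the output dimension (rows) into contiguous, equally sized shards, one per rank."""
    assert len(weight) % world_size == 0, "toy example assumes an even split"
    step = len(weight) // world_size
    return [weight[i * step:(i + 1) * step] for i in range(world_size)]

# A tiny stand-in for lm_head: vocabulary size 4, hidden size 3.
lm_head_weight = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [3.0, 1.0, 1.0],
    [2.0, 2.0, 0.0],
]
hidden = [1.0, 2.0, 3.0]

# Reference: the full, unsharded projection.
full_logits = linear(lm_head_weight, hidden)

# Each simulated rank holds half of the rows and computes its slice of the logits.
shards = shard_rows(lm_head_weight, world_size=2)
partial_logits = [linear(shard, hidden) for shard in shards]

# The "gather output" step: concatenate the per-rank slices in rank order.
gathered_logits = [value for part in partial_logits for value in part]

assert gathered_logits == full_logits
```

In this sketch each rank stores only half of the weight, yet the concatenated result matches the unsharded projection, which is the memory saving the PR is after for lm_head.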

Change: Applied in modular_qwen3_5_moe.py (source of truth) and regenerated modeling_qwen3_5_moe.py.

Already in place (no changes needed):

  • base_model_tp_plan on Qwen3_5MoeTextConfig covers full attention (q/k/v/o_proj, q/k_norm), MoE experts, and shared experts.

Out of scope (future work):

  • Linear attention (GatedDeltaNet) TP — blocked on causal_conv1d DTensor support.
  • Vision block TP — pending path resolution investigation.

Fixes #45125

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@3outeille @ArthurZucker (distributed / model loading)

Changed files

  • src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +1/-0)
  • src/transformers/models/qwen3_5_moe/modular_qwen3_5_moe.py (modified, +2/-0)


System Info

  • transformers main branch (post-Qwen3.5 MoE addition)
  • Any platform with multi-GPU setup

Who can help?

@3outeille @ArthurZucker

Information

  • My own modified scripts

Tasks

  • My own task or dataset (give details below)

Reproduction

Qwen3_5MoeForConditionalGeneration (the VL wrapper) is missing _tp_plan, while the text-only Qwen3_5MoeForCausalLM already has _tp_plan = {"lm_head": "colwise_gather_output"}.

import torch
from transformers import Qwen3_5MoeForConditionalGeneration

# Before the fix: lm_head is NOT sharded, it is replicated on every GPU
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B", tp_plan="auto", torch_dtype=torch.bfloat16
)

The lm_head Linear layer is not included in any TP plan for this class, so under tp_plan="auto" it remains replicated instead of being sharded with colwise_gather_output. This wastes memory and may produce incorrect logits since the all-gather is not applied.
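One way to confirm the gap described above is to check whether a model class declares a plan entry for lm_head. The sketch below uses a hypothetical helper, tp_plan_entry, and two stand-in classes instead of the real transformers classes so that it runs without the library; it assumes only that the plan lives in a class attribute named _tp_plan, as the issue states.

```python
def tp_plan_entry(model_cls, module_name):
    """Return the sharding style declared for module_name, or None if the class declares none."""
    # A missing attribute and an attribute set to None are both treated as "no plan".
    plan = getattr(model_cls, "_tp_plan", None) or {}
    return plan.get(module_name)

# Stand-ins for illustration only; these are not the real transformers classes.
class TextOnlyStandIn:
    _tp_plan = {"lm_head": "colwise_gather_output"}

class VLWrapperStandIn:
    _tp_plan = None  # mirrors the state described in the issue: no plan for lm_head

print(tp_plan_entry(TextOnlyStandIn, "lm_head"))
print(tp_plan_entry(VLWrapperStandIn, "lm_head"))
```

With transformers installed, the same helper could be pointed at Qwen3_5MoeForCausalLM and Qwen3_5MoeForConditionalGeneration. Note that _tp_plan is a private attribute, so its name and semantics may change between releases.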

Expected behavior

lm_head should be sharded with colwise_gather_output when using tp_plan="auto", consistent with Qwen3_5MoeForCausalLM.

Fix: huggingface/transformers#45124

extent analysis

Fix Plan

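The fix is to define _tp_plan on Qwen3_5MoeForConditionalGeneration so that lm_head is sharded with colwise_gather_output. PR #45124 makes this change upstream in modular_qwen3_5_moe.py; the options below are local workarounds for installs that do not yet include it.

  • Subclass the model and add the missing attribute. A distinct subclass name avoids shadowing the imported class:
from transformers import Qwen3_5MoeForConditionalGeneration

class PatchedQwen3_5MoeForConditionalGeneration(Qwen3_5MoeForConditionalGeneration):
    # Same plan that the text-only Qwen3_5MoeForCausalLM already declares
    _tp_plan = {"lm_head": "colwise_gather_output"}
  • Alternatively, pass an explicit plan when loading the model. Whether from_pretrained accepts a dict for tp_plan depends on the installed transformers version, and a dict that lists only lm_head may not carry over the rest of the model's sharding plan, so check this against your version before relying on it:
import torch
from transformers import Qwen3_5MoeForConditionalGeneration

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B",
    tp_plan={"lm_head": "colwise_gather_output"},
    torch_dtype=torch.bfloat16,
)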

Verification

To verify that the fix worked, you can check the memory usage and the output of the model. The lm_head layer should now be sharded with colwise_gather_output, which should reduce memory usage and produce correct logits.
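A shape-based check can make the verification step less subjective. The sketch below is a hypothetical helper, classify_layout, that takes the full shape of lm_head.weight and the local shape observed on each rank and reports whether the weight looks replicated or split along the output dimension. How the per-rank local shapes are collected depends on your setup and is not shown; the sizes used here are toy values, not those of the real model.

```python
def classify_layout(full_shape, local_shapes):
    """Classify a weight as 'replicated', 'sharded_dim0', or 'unknown' from per-rank local shapes."""
    full = tuple(full_shape)
    locals_ = [tuple(shape) for shape in local_shapes]

    # Every rank holds the whole tensor.
    if all(shape == full for shape in locals_):
        return "replicated"

    # Every rank holds a slice of dim 0 and the slices add up to the full size.
    same_trailing_dims = all(shape[1:] == full[1:] for shape in locals_)
    total_rows = sum(shape[0] for shape in locals_)
    if same_trailing_dims and total_rows == full[0]:
        return "sharded_dim0"

    return "unknown"

# Toy numbers: vocabulary size 1000, hidden size 64, 4 ranks.
before_fix = classify_layout((1000, 64), [(1000, 64)] * 4)
after_fix = classify_layout((1000, 64), [(250, 64)] * 4)
print(before_fix, after_fix)
```

Under these assumptions, a replicated result on a multi-GPU run would match the behavior described in the issue, while sharded_dim0 would match the expected behavior after the fix.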

Extra Tips

  • Upgrade to a transformers version that includes PR #45124; installs that predate it still lack _tp_plan on Qwen3_5MoeForConditionalGeneration.
  • If you maintain a custom model class with its own lm_head, define _tp_plan on that class as well.
  • tp_plan="auto" relies on the model class declaring _tp_plan. With the fix in place it shards lm_head automatically; on a class that omits the attribute, lm_head stays replicated.
