transformers - ✅(Solved) Fix [Qwen3MoE] Potentially a bug on `Qwen3MoeSparseMoeBlock` [1 pull requests, 1 participants]

transformers · 2026-04-03 05:26:11


GitHub stats

huggingface/transformers#45208 • Fetched 2026-04-08 02:33:09


Author

KbKuuhaku

Participants

KbKuuhaku

Timeline (top)

referenced ×2, cross-referenced ×1

Error Message

TypeError: unsupported operand type(s) for +: 'Tensor' and 'tuple'

Fix Action

Fixed

Fixed by PR: [Qwen3MoE] Fix wrong return type annotation in Qwen3MoeSparseMoeBlock.forward (https://github.com/huggingface/transformers/pull/45211)

PR fix notes

PR #45211: [Qwen3MoE] Fix wrong return type annotation in Qwen3MoeSparseMoeBlock.forward

Repository: huggingface/transformers
Author: matdou
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45211

Description (problem / solution / changelog)

Fixes #45208

What does this PR do?

This PR corrects an incorrect return type in Qwen3MoeSparseMoeBlock.forward.

The method was annotated as returning tuple[torch.Tensor, torch.Tensor], while the implementation returns a torch.Tensor:

return final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)

In particular, downstream usage (e.g. in Qwen3MoeDecoderLayer) expects a tensor:

hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states

This PR updates the annotation to:

-> torch.Tensor

in both:

modeling_qwen3_moe.py
modular_qwen3_moe.py

No functional changes are introduced; this is a typing correction only.

Checklist

I confirm that this is not a pure code agent PR.
This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@ArthurZucker @Cyrilvallez

Changed files

src/transformers/models/qwen3_moe/modeling_qwen3_moe.py (modified, +1/-1)
src/transformers/models/qwen3_moe/modular_qwen3_moe.py (modified, +1/-1)
src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py (modified, +1/-1)
src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py (modified, +1/-1)

Code Example

class Qwen3MoeSparseMoeBlock(nn.Module):
    def __init__(self, config: Qwen3MoeConfig):
        super().__init__()
        self.experts = Qwen3MoeExperts(config)
        self.gate = Qwen3MoeTopKRouter(config)

    def forward(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        hidden_states_reshaped = hidden_states.view(-1, hidden_dim)
        _, routing_weights, selected_experts = self.gate(hidden_states_reshaped)
        final_hidden_states = self.experts(hidden_states_reshaped, selected_experts, routing_weights)
        return final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)

---

# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
return hidden_states

---

TypeError: unsupported operand type(s) for +: 'Tensor' and 'tuple'
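
The error above can be reproduced without PyTorch. The following is a minimal sketch, assuming a hypothetical stand-in class named Tensor (not torch.Tensor) and two hypothetical helper functions. It shows why the residual addition would fail only if the block actually returned a tuple, and why it succeeds when a single tensor comes back.

```python
class Tensor:
    """Hypothetical stand-in for torch.Tensor; supports only Tensor + Tensor."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        if not isinstance(other, Tensor):
            # Defer to Python, which raises TypeError because tuple
            # cannot handle the addition either.
            return NotImplemented
        return Tensor(a + b for a, b in zip(self.values, other.values))


def mlp_returns_tensor(x):
    # Mirrors the real behaviour: one tensor comes back.
    return Tensor(v * 2 for v in x.values)


def mlp_returns_tuple(x):
    # Mirrors what the old annotation claimed: a pair of tensors.
    return Tensor(v * 2 for v in x.values), Tensor(x.values)


residual = Tensor([1.0, 2.0])

# Residual connection with a single tensor: works.
ok = residual + mlp_returns_tensor(residual)

# Residual connection with a tuple: fails with the message from the issue.
try:
    residual + mlp_returns_tuple(residual)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok.values)
print(error_message)
```

Because the stand-in class is also named Tensor, Python produces the same wording as the message quoted in the issue. The real block never takes the failing path, which is why the fix is limited to the annotation.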

Original issue text

Hi,

I found a typing mismatch on Qwen3MoeSparseMoeBlock:

class Qwen3MoeSparseMoeBlock(nn.Module):
    def __init__(self, config: Qwen3MoeConfig):
        super().__init__()
        self.experts = Qwen3MoeExperts(config)
        self.gate = Qwen3MoeTopKRouter(config)

    def forward(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        hidden_states_reshaped = hidden_states.view(-1, hidden_dim)
        _, routing_weights, selected_experts = self.gate(hidden_states_reshaped)
        final_hidden_states = self.experts(hidden_states_reshaped, selected_experts, routing_weights)
        return final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)

If the code is correct, the return type of forward should be torch.Tensor. However, I don't know whether routing_weights also needs to be returned. In addition, Qwen3MoeSparseMoeBlock is used in Qwen3MoeDecoderLayer as self.mlp, and there is a residual connection after self.mlp(hidden_states):

# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
return hidden_states

If we returned a tuple of tensors, hidden_states = residual + hidden_states would raise this error:

TypeError: unsupported operand type(s) for +: 'Tensor' and 'tuple'

Did I miss something? Should we also return routing_weights for computing the loss during training?

extent analysis

TL;DR

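The forward method of Qwen3MoeSparseMoeBlock already returns a single torch.Tensor, which is what the residual connection in Qwen3MoeDecoderLayer requires. Only the return annotation, tuple[torch.Tensor, torch.Tensor], is wrong; the TypeError would occur only if a tuple were actually returned.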

Guidance

Compare the return annotation of forward in Qwen3MoeSparseMoeBlock with the value it actually returns; the residual connection needs a single tensor.
Check whether returning routing_weights is necessary for computing loss during training; if so, Qwen3MoeDecoderLayer would have to unpack the tuple before the residual addition.
Change the return annotation of forward to torch.Tensor; the implementation already returns one value, so there are no extra return values to remove.
Review the usage of Qwen3MoeSparseMoeBlock in Qwen3MoeDecoderLayer to ensure that the return value is handled correctly.
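
The first guidance item can be checked mechanically. The following is a rough sketch, assuming hypothetical stand-in classes (FakeTensor, MisannotatedBlock, FixedBlock) and a hypothetical helper return_matches_annotation; none of these exist in transformers. It compares the declared return annotation with the type of the value actually returned.

```python
from typing import get_origin, get_type_hints


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""


class MisannotatedBlock:
    # Declares a tuple but returns a single value, like the annotation in the issue.
    def forward(self, hidden_states: FakeTensor) -> tuple[FakeTensor, FakeTensor]:
        return hidden_states


class FixedBlock:
    # Annotation and implementation agree.
    def forward(self, hidden_states: FakeTensor) -> FakeTensor:
        return hidden_states


def return_matches_annotation(block, sample):
    """Return True if block.forward(sample) is an instance of the declared return type."""
    declared = get_type_hints(type(block).forward)["return"]
    # For generic aliases such as tuple[X, Y], check against the origin type (tuple).
    expected = get_origin(declared) or declared
    return isinstance(block.forward(sample), expected)


print(return_matches_annotation(MisannotatedBlock(), FakeTensor()))
print(return_matches_annotation(FixedBlock(), FakeTensor()))
```

A static type checker would normally catch this kind of mismatch as well; the runtime check here only illustrates the idea on a small scale.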

Example

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # ...
    return final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)

Notes

The necessity of returning routing_weights depends on the specific requirements of the model and the training process. If routing_weights are not needed for computing loss, the return type can be simplified to torch.Tensor.
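
If routing information did have to be returned, the caller would need to unpack the tuple before the residual addition. The following is a purely illustrative sketch using plain Python lists in place of tensors. The function names, the second return value, and the arithmetic are all assumptions made for the example; this is not how transformers exposes router outputs.

```python
def sparse_moe_block(hidden_states):
    """Hypothetical variant that returns (output, router_logits) instead of one value."""
    router_logits = [0.5 * h for h in hidden_states]  # placeholder routing scores
    output = [2.0 * h for h in hidden_states]         # placeholder expert output
    return output, router_logits


def decoder_layer(hidden_states):
    """Hypothetical caller that unpacks the tuple before the residual connection."""
    residual = hidden_states
    mlp_out, router_logits = sparse_moe_block(hidden_states)
    # Element-wise residual addition on the unpacked output only.
    hidden_states = [r + m for r, m in zip(residual, mlp_out)]
    return hidden_states, router_logits


new_hidden, logits = decoder_layer([1.0, 2.0])
print(new_hidden, logits)
```

The design point is that the block's return type and the decoder layer must change together: widening one without the other is what would produce the TypeError discussed in the issue.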

Recommendation

Apply fix: Correct the return annotation of the forward method to torch.Tensor, as PR #45211 does. No runtime change is needed, because the method already returns a single tensor. If routing_weights turn out to be necessary for computing loss during training, the decoder layer would also have to handle the tuple.
