transformers - ✅(Solved) Fix [Qwen3MoE] Potentially a bug on `Qwen3MoeSparseMoeBlock` [1 pull requests, 1 participants]

GitHub stats
huggingface/transformers#45208 · Fetched 2026-04-08 02:33:09
Comments: 0
Participants: 1
Timeline: 3
Reactions: 0
Timeline (top): referenced ×2, cross-referenced ×1

Error Message

TypeError: unsupported operand type(s) for +: 'Tensor' and 'tuple'

Fix Action

Fixed

PR fix notes

PR #45211: [Qwen3MoE] Fix wrong return type annotation in Qwen3MoeSparseMoeBlock.forward

Description (problem / solution / changelog)

Fixes #45208

What does this PR do?

This PR corrects an incorrect return type in Qwen3MoeSparseMoeBlock.forward.

The method was annotated as returning tuple[torch.Tensor, torch.Tensor], while the implementation returns a torch.Tensor:

return final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)

In particular, downstream usage (e.g. in Qwen3MoeDecoderLayer) expects a tensor:

hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states

This PR updates the annotation to:

-> torch.Tensor

in both:

  • modeling_qwen3_moe.py
  • modular_qwen3_moe.py

No functional changes are introduced; this is a typing correction only.
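A minimal sketch of why a wrong return annotation can go unnoticed at runtime: Python does not check annotations when a function is called, so a mislabeled function behaves exactly like a correctly labeled one. The function names below are hypothetical and plain floats stand in for tensors; this is not code from transformers.

```python
from typing import get_type_hints


def forward_mislabeled(x: float) -> tuple[float, float]:
    # Annotation claims a 2-tuple, but a single float is returned.
    return x * 2.0


def forward_fixed(x: float) -> float:
    # Annotation now matches the value that is actually returned.
    return x * 2.0


# The declared return types differ...
declared_before = get_type_hints(forward_mislabeled)["return"]
declared_after = get_type_hints(forward_fixed)["return"]

# ...but both calls succeed and produce the same value, because
# annotations are metadata and are not enforced at call time.
result_before = forward_mislabeled(1.5)
result_after = forward_fixed(1.5)

print(declared_before, type(result_before).__name__)
print(declared_after, type(result_after).__name__)
```

A static type checker would be the usual way to catch this kind of mismatch, since it compares the annotation against the returned expression without running the code.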


Checklist

  • I confirm that this is not a pure code agent PR.
  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @Cyrilvallez

Changed files

  • src/transformers/models/qwen3_moe/modeling_qwen3_moe.py (modified, +1/-1)
  • src/transformers/models/qwen3_moe/modular_qwen3_moe.py (modified, +1/-1)
  • src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py (modified, +1/-1)
  • src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py (modified, +1/-1)

Issue body

Hi,

I found a typing mismatch on Qwen3MoeSparseMoeBlock:

class Qwen3MoeSparseMoeBlock(nn.Module):
    def __init__(self, config: Qwen3MoeConfig):
        super().__init__()
        self.experts = Qwen3MoeExperts(config)
        self.gate = Qwen3MoeTopKRouter(config)

    def forward(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        hidden_states_reshaped = hidden_states.view(-1, hidden_dim)
        _, routing_weights, selected_experts = self.gate(hidden_states_reshaped)
        final_hidden_states = self.experts(hidden_states_reshaped, selected_experts, routing_weights)
        return final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)

If the code is correct, the return type of forward should be torch.Tensor. However, I don't know whether routing_weights also needs to be returned. Also, Qwen3MoeSparseMoeBlock is used in Qwen3MoeDecoderLayer as self.mlp, and there is a residual connection after self.mlp(hidden_states):

# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
return hidden_states

If we returned a tuple of tensors, hidden_states = residual + hidden_states would raise this error:

TypeError: unsupported operand type(s) for +: 'Tensor' and 'tuple'

Did I miss something? Should we also return the routing_weights for computing the loss during training?
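The failure mode described above can be reproduced without PyTorch. The sketch below uses a hypothetical stand-in class named Tensor (not torch.Tensor) that only supports addition with another Tensor, plus two hypothetical MLP functions, one returning a single value and one returning a tuple.

```python
class Tensor:
    """Hypothetical stand-in for torch.Tensor: supports + only with another Tensor."""

    def __init__(self, value: float):
        self.value = value

    def __add__(self, other):
        if not isinstance(other, Tensor):
            # Returning NotImplemented lets Python raise its standard TypeError.
            return NotImplemented
        return Tensor(self.value + other.value)


def mlp_returning_tensor(hidden_states: Tensor) -> Tensor:
    # Mirrors the current behaviour: a single value comes back.
    return Tensor(hidden_states.value * 0.5)


def mlp_returning_tuple(hidden_states: Tensor) -> tuple[Tensor, Tensor]:
    # Hypothetical variant that also returns a second value (e.g. routing weights).
    return Tensor(hidden_states.value * 0.5), Tensor(1.0)


residual = Tensor(2.0)

# Residual add works when the MLP returns a single Tensor.
ok = residual + mlp_returning_tensor(residual)

# Residual add fails when the MLP returns a tuple.
try:
    residual + mlp_returning_tuple(residual)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok.value)
print(error_message)
```

Because the residual add in the decoder layer runs without error, the block must already be returning a single tensor, which points at the annotation rather than the implementation.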

extent analysis

TL;DR

The forward method of Qwen3MoeSparseMoeBlock already returns a single torch.Tensor, which is what the residual connection in Qwen3MoeDecoderLayer expects. Only the return annotation, tuple[torch.Tensor, torch.Tensor], was wrong.

Guidance

  • Verify the return type of the forward method in Qwen3MoeSparseMoeBlock to ensure it matches the expected input type for the residual connection.
  • Check if returning routing_weights is necessary for computing loss during training, and if so, consider modifying the Qwen3MoeDecoderLayer to handle the tuple return type.
  • Change the return annotation of the forward method to torch.Tensor; the implementation already returns a single tensor, so no return values need to be removed.
  • Review the usage of Qwen3MoeSparseMoeBlock in Qwen3MoeDecoderLayer to ensure that the return type is handled correctly.
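If routing weights did need to be returned, the calling layer would have to unpack the tuple before the residual add. The following is a rough sketch under that assumption only: the function names are hypothetical, plain lists of floats stand in for tensors, and it does not reflect how transformers handles router outputs.

```python
def moe_block_with_router_output(
    hidden_states: list[float],
) -> tuple[list[float], list[float]]:
    # Hypothetical variant: returns (mixed hidden states, routing weights).
    mixed = [0.5 * h for h in hidden_states]
    routing_weights = [1.0 for _ in hidden_states]
    return mixed, routing_weights


def decoder_layer_tail(
    hidden_states: list[float],
) -> tuple[list[float], list[float]]:
    residual = hidden_states
    # Unpack first so the residual add only ever sees the hidden states.
    mlp_out, routing_weights = moe_block_with_router_output(hidden_states)
    hidden_states = [r + m for r, m in zip(residual, mlp_out)]
    return hidden_states, routing_weights


out, weights = decoder_layer_tail([2.0, 4.0])
print(out, weights)
```

The point of the sketch is the ordering: the tuple is split before any arithmetic, so the residual connection never receives a tuple operand.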

Example

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # ...
    return final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)

Notes

The necessity of returning routing_weights depends on the specific requirements of the model and the training process. If routing_weights are not needed for computing loss, the return type can be simplified to torch.Tensor.

Recommendation

Update the return annotation of the forward method to torch.Tensor, as PR #45211 does. The method already returns a single tensor, so the residual connection is unaffected. Returning routing_weights as well would be a separate change and would require the decoder layer to unpack the result.
