When training Qwen 3.6 (which uses the `qwen3_5_moe` architecture) under FSDP2 with `CPUOffloadPolicy(...)` keeping `inv_freq` on the host, the rotary-embedding forward fails with a mixed-device `bmm`:

```none
RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cuda:1, different from other tensors on cpu (when checking argument in method wrapper_CUDA_bmm)
```

The non-MoE [`Qwen3RotaryEmbedding.forward` (qwen3/modeling_qwen3.py:L138)](https://github.com/huggingface/transformers/blob/v5.5.4/src/transformers/models/qwen3/modeling_qwen3.py#L138) guards against this by appending `.to(x.device)` to `inv_freq_expanded`. The MoE variant [`Qwen3_5MoeTextRotaryEmbedding.forward` (qwen3_5_moe/modeling_qwen3_5_moe.py:L145)](https://github.com/huggingface/transformers/blob/v5.5.4/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py#L145) does not, so the CPU-resident `inv_freq_expanded` is multiplied with position IDs that live on the GPU.

The fix should be a one-line patch appending `.to(x.device)` to `inv_freq_expanded` in the MoE forward.

transformers: `Qwen3_5MoeTextRotaryEmbedding.forward` is not compatible with CPU offload

transformers · 2026-05-09 05:46:24

Error Message

RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cuda:1, different from other tensors on cpu (when checking argument in method wrapper_CUDA_bmm)

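Fix / Workaround

The fix should be a one-line patch appending `.to(x.device)` to `inv_freq_expanded` in `Qwen3_5MoeTextRotaryEmbedding.forward`, matching what the non-MoE `Qwen3RotaryEmbedding.forward` already does.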
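To make the guard concrete, here is a minimal sketch of the pattern in a toy module. `TinyRotary` and its dimensions are hypothetical stand-ins and not the transformers class; the only point is where `.to(x.device)` sits relative to the batched matmul. The script falls back to CPU when no CUDA device is present, in which case the guard is a no-op.

```python
import torch
from torch import nn


class TinyRotary(nn.Module):
    """Hypothetical stand-in for a rotary embedding; not the transformers class."""

    def __init__(self, dim: int = 16, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        # Non-persistent buffer: it stays on the device the module was built on.
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    @torch.no_grad()
    def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
        # (dim/2,) -> (batch, dim/2, 1); the trailing .to(x.device) is the guard.
        inv_freq_expanded = (
            self.inv_freq[None, :, None]
            .float()
            .expand(position_ids.shape[0], -1, 1)
            .to(x.device)
        )
        # (batch, seq) -> (batch, 1, seq)
        position_ids_expanded = position_ids[:, None, :].float()
        # Batched matmul: (batch, dim/2, 1) @ (batch, 1, seq) -> (batch, dim/2, seq)
        freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos().to(x.dtype), emb.sin().to(x.dtype)


device = "cuda" if torch.cuda.is_available() else "cpu"
rotary = TinyRotary()  # inv_freq is created on the CPU
x = torch.randn(1, 8, 16, device=device)
position_ids = torch.arange(8, device=device)[None, :]
cos, sin = rotary(x, position_ids)
print(cos.shape, cos.device)
```

Without the `.to(x.device)` call, the same script would hit the mixed-device matmul on a CUDA machine, because `inv_freq` stays on the CPU while `position_ids` is on the GPU.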

Code Example

import sys

import torch
from transformers.models.qwen3_5_moe.modeling_qwen3_5_moe import (
    Qwen3_5MoeTextConfig,
    Qwen3_5MoeTextRotaryEmbedding,
)

if not torch.cuda.is_available():
    raise NotImplementedError("This reproducer needs at least one CUDA device")

# Default config; the bug is in the rotary forward, so the precise model
# dims don't matter. Using defaults avoids any checkpoint download
cfg = Qwen3_5MoeTextConfig()
rotary = Qwen3_5MoeTextRotaryEmbedding(cfg)
print(f"rotary.inv_freq.device = {rotary.inv_freq.device}")
assert rotary.inv_freq.device.type == "cpu", (
    "Expected inv_freq to land on CPU when no device is passed; got"
    f" {rotary.inv_freq.device}. Bug premise no longer holds."
)

# GPU activation, simulating what FSDP2 forward routes into the rotary call
bsz, seq_len = 1, 8
x = torch.randn(bsz, seq_len, cfg.hidden_size, device="cuda", dtype=torch.bfloat16)
position_ids = torch.arange(seq_len, device="cuda")[None, :]

try:
    cos, sin = rotary(x, position_ids)
except RuntimeError as exc:
    msg = str(exc)
    if (
        "Expected all tensors to be on the same device" in msg
        and "wrapper_CUDA_bmm" in msg
    ):
        print(f"Reproduced: {type(exc).__name__}: {msg}")
        sys.exit(0)
    raise NotImplementedError(
        f"Unexpected RuntimeError (not the device-mismatch we expect): {msg}"
    ) from exc

raise NotImplementedError(
    f"Did not reproduce: rotary forward returned {cos.device=} {sin.device=}."
)
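
For runs pinned to a release without the fix, one possible stopgap is a forward pre-hook that moves `inv_freq` to the device of the incoming activation. The sketch below shows the idea on a hypothetical toy module (`UnguardedRotary`) that omits the guard; the hook name is also made up for illustration. It assumes the activation is passed as the first positional argument, and it has not been checked against FSDP2's own buffer handling, so treat it as an assumption to verify rather than a confirmed workaround.

```python
import torch
from torch import nn


class UnguardedRotary(nn.Module):
    """Hypothetical toy module that, like the MoE forward, omits the device guard."""

    def __init__(self, dim: int = 16, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    @torch.no_grad()
    def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
        # No .to(x.device) here, so a CPU inv_freq would clash with CUDA position_ids.
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()
        freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos().to(x.dtype), emb.sin().to(x.dtype)


def move_inv_freq_to_input_device(module: nn.Module, args: tuple) -> None:
    """Forward pre-hook: put `inv_freq` on the device of the first positional input."""
    if args and module.inv_freq.device != args[0].device:
        # Assigning to a registered buffer name replaces that buffer entry.
        module.inv_freq = module.inv_freq.to(args[0].device)


device = "cuda" if torch.cuda.is_available() else "cpu"
toy = UnguardedRotary()  # inv_freq starts on the CPU
handle = toy.register_forward_pre_hook(move_inv_freq_to_input_device)

x = torch.randn(1, 8, 16, device=device, dtype=torch.bfloat16)
position_ids = torch.arange(8, device=device)[None, :]
cos, sin = toy(x, position_ids)
print(cos.device, toy.inv_freq.device)

handle.remove()  # detach the hook once the upstream fix is available
```

The same hook could in principle be registered on the real rotary module. The trade-off is that `inv_freq` then lives on the GPU after the first call instead of on the host, which is a small tensor but does change where the buffer resides.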

System Info

transformers version: 5.5.4
Platform: Linux-6.8.0-1043-nvidia-x86_64-with-glibc2.35
Python version: 3.12.13
Huggingface_hub version: 1.11.0
Safetensors version: 0.7.0
Accelerate version: 1.13.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.10.0+cu129 (CUDA)
Using distributed or parallel set-up in script?: No
Using GPU in script?: Yes
GPU type: NVIDIA H100 80GB HBM3

Who can help?

@ArthurZucker @Cyrilvallez @3outeille @zucchini-nlp

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
