transformers: `Qwen3_5MoeTextRotaryEmbedding.forward` is not compatible with CPU offload

Error Message

RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cuda:1, different from other tensors on cpu (when checking argument in method wrapper_CUDA_bmm)

Fix Action

Fix / Workaround

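The fix should be a one-line patch appending .to(x.device) to inv_freq_expanded in Qwen3_5MoeTextRotaryEmbedding.forward, matching what Qwen3RotaryEmbedding.forward already does.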
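Until a patched release is available, one possible stopgap is a forward pre-hook that moves inv_freq to the device of the incoming activation. The sketch below is an assumption-laden illustration, not part of transformers: the helper names `_align_inv_freq` and `patch_rotary_modules` are hypothetical, the class-name suffix match is a heuristic, and it assumes the rotary module receives the activation as its first positional argument (as in the reproducer). Whether this interacts cleanly with FSDP2's offload bookkeeping has not been verified here.

```python
import torch
from torch import nn


def _align_inv_freq(module: nn.Module, args: tuple) -> None:
    """Forward pre-hook (hypothetical helper): align `inv_freq` with the input device."""
    x = args[0]  # assumption: the activation is the first positional argument
    if module.inv_freq.device != x.device:
        # Assigning a tensor to a registered buffer name replaces that buffer.
        module.inv_freq = module.inv_freq.to(x.device)


def patch_rotary_modules(model: nn.Module, class_name_suffix: str = "RotaryEmbedding") -> int:
    """Attach the hook to every submodule that looks like a rotary embedding.

    Returns the number of modules patched so the caller can sanity-check it.
    """
    patched = 0
    for module in model.modules():
        looks_like_rotary = type(module).__name__.endswith(class_name_suffix)
        if looks_like_rotary and hasattr(module, "inv_freq"):
            module.register_forward_pre_hook(_align_inv_freq)
            patched += 1
    return patched
```

Calling `patch_rotary_modules(model)` once after wrapping the model should be enough. The hook only moves a small buffer, so the cost is negligible, but it is a workaround rather than a replacement for the upstream one-line fix.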

Code Example

import sys

import torch
from transformers.models.qwen3_5_moe.modeling_qwen3_5_moe import (
    Qwen3_5MoeTextConfig,
    Qwen3_5MoeTextRotaryEmbedding,
)

if not torch.cuda.is_available():
    raise NotImplementedError("This reproducer needs at least one CUDA device")

# Default config; the bug is in the rotary forward, so the precise model
# dims don't matter. Using defaults avoids any checkpoint download
cfg = Qwen3_5MoeTextConfig()
rotary = Qwen3_5MoeTextRotaryEmbedding(cfg)
print(f"rotary.inv_freq.device = {rotary.inv_freq.device}")
assert rotary.inv_freq.device.type == "cpu", (
    "Expected inv_freq to land on CPU when no device is passed; got"
    f" {rotary.inv_freq.device}. Bug premise no longer holds."
)

# GPU activation, simulating what FSDP2 forward routes into the rotary call
bsz, seq_len = 1, 8
x = torch.randn(bsz, seq_len, cfg.hidden_size, device="cuda", dtype=torch.bfloat16)
position_ids = torch.arange(seq_len, device="cuda")[None, :]

try:
    cos, sin = rotary(x, position_ids)
except RuntimeError as exc:
    msg = str(exc)
    if (
        "Expected all tensors to be on the same device" in msg
        and "wrapper_CUDA_bmm" in msg
    ):
        print(f"Reproduced: {type(exc).__name__}: {msg}")
        sys.exit(0)
    raise NotImplementedError(
        f"Unexpected RuntimeError (not the device-mismatch we expect): {msg}"
    ) from exc

raise NotImplementedError(
    f"Did not reproduce: rotary forward returned {cos.device=} {sin.device=}."
)

System Info

  • transformers version: 5.5.4
  • Platform: Linux-6.8.0-1043-nvidia-x86_64-with-glibc2.35
  • Python version: 3.12.13
  • Huggingface_hub version: 1.11.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cu129 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@ArthurZucker @Cyrilvallez @3outeille @zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Expected behavior

When training Qwen 3.6 (loaded as qwen3_5_moe) under FSDP2 with CPUOffloadPolicy(...), which keeps inv_freq on the host, the rotary-embedding forward fails with a mixed-device bmm:

RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cuda:1, different from other tensors on cpu (when checking argument in method wrapper_CUDA_bmm)

The non-MoE Qwen3RotaryEmbedding.forward (qwen3/modeling_qwen3.py:L138) guards against this by appending .to(x.device) to inv_freq_expanded. The MoE variant Qwen3_5MoeTextRotaryEmbedding.forward (qwen3_5_moe/modeling_qwen3_5_moe.py:L145) does not.

The fix should be a one-line patch appending .to(x.device).
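To illustrate the guard described above, here is a simplified, self-contained rotary module written from scratch. It is not the transformers implementation: `SketchRotaryEmbedding`, its `head_dim` and `base` parameters, and the plain 2-D position_ids layout are assumptions made for the example, and the real Qwen3_5Moe forward may differ in shape handling. The only point is where .to(x.device) sits relative to the matmul.

```python
import torch
from torch import nn


class SketchRotaryEmbedding(nn.Module):
    """Simplified stand-in for a rotary embedding; NOT the transformers class."""

    def __init__(self, head_dim: int = 8, base: float = 10000.0):
        super().__init__()
        # One inverse frequency per pair of channels.
        exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
        inv_freq = 1.0 / (base ** exponents)
        # Non-persistent buffer; it may stay on CPU while activations live on GPU.
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    @torch.no_grad()
    def forward(self, x: torch.Tensor, position_ids: torch.Tensor):
        # The guard: move the expanded frequencies to the activation's device
        # before the batched matmul, so both operands share a device.
        inv_freq_expanded = (
            self.inv_freq[None, :, None]
            .float()
            .expand(position_ids.shape[0], -1, 1)
            .to(x.device)
        )
        position_ids_expanded = position_ids[:, None, :].float().to(x.device)
        # (batch, dim/2, 1) @ (batch, 1, seq) -> (batch, dim/2, seq),
        # then transpose to (batch, seq, dim/2).
        freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos().to(x.dtype), emb.sin().to(x.dtype)


# Usage: inv_freq is created on CPU; activations go wherever is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
rotary = SketchRotaryEmbedding()
x = torch.randn(1, 8, 16, device=device)
position_ids = torch.arange(8, device=device)[None, :]
cos, sin = rotary(x, position_ids)
```

Without the .to(x.device) call, the same forward would attempt a matmul between a CPU tensor and a CUDA tensor whenever inv_freq stays on the host, which is the failure reported here.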
