transformers - 💡(How to fix) Fix `Qwen3_5MoeTextRotaryEmbedding.inv_freq` reads uninitialized memory after `meta → to_empty(cuda)` materialization [1 pull requests]

transformers, 2026-05-11 23:01:34

Fix Action

Fixed

Fixed by PR: Fix M-RoPE inv_freq device and meta → to_empty re-init in Qwen3-VL family (https://github.com/huggingface/transformers/pull/45903)

System Info

transformers version: 5.6.2
Platform: Linux-6.8.0-1043-nvidia-x86_64-with-glibc2.35
Python version: 3.12.13
Huggingface_hub version: 1.13.0
Safetensors version: 0.7.0
Accelerate version: 1.13.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.10.0+cu129 (CUDA)
Using distributed or parallel set-up in script?: No
Using GPU in script?: Yes
GPU type: NVIDIA H100 80GB HBM3

Who can help?

@ArthurZucker @Cyrilvallez @zucchini-nlp

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Qwen3_5MoeTextRotaryEmbedding.__init__ registers two persistent=False buffers, inv_freq and original_inv_freq, whose values come from rope_init_fn(self.config, device). When the rotary is constructed on the meta device (as SkyRL does) and then materialized on CUDA via Module.to_empty(device="cuda"), neither buffer is re-initialized: to_empty allocates uninitialized GPU memory, and nothing re-runs rope_init_fn against the new device. Because the buffers are non-persistent, they are also absent from the state dict, so loading a checkpoint does not overwrite them. forward then reads self.inv_freq and computes cos/sin from whatever bytes occupy that GPU storage.

The first read of self.inv_freq therefore returns whatever bytes the caching allocator's free-list happens to hold for that allocation, not the canonical 1.0 / (base ** (arange(0, dim, 2) / dim)) value.

Through an extensive debug script I saw all 32 bins of inv_freq returning a consistent -1.468446e+34 across ref-worker processes, which then overflowed in inv_freq * position_ids at position_id >= 23173, producing NaN cos/sin and a policy_kl: nan downstream in the training loop.
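The reported overflow position can be sanity-checked with plain arithmetic. The sketch below assumes the garbage value is exactly the 7-digit figure quoted above and ignores the half-ulp rounding margin at the top of the float32 range; FLT_MAX and GARBAGE are names introduced here for illustration only.

```python
import math

# Largest finite float32 value, roughly 3.4028235e38.
FLT_MAX = (2.0 - 2.0**-23) * 2.0**127

# Magnitude of the garbage inv_freq value quoted in this report
# (only 7 significant digits are known, so this is an approximation).
GARBAGE = 1.468446e34

# Smallest integer position whose product with |inv_freq| exceeds FLT_MAX.
first_overflow = math.floor(FLT_MAX / GARBAGE) + 1
print(first_overflow)
```

Once the product overflows to infinity, cos and sin of that angle are NaN under IEEE 754, which is consistent with the NaN cos/sin described above.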

import sys

import torch
from transformers import AutoConfig
from transformers.models.qwen3_5_moe.modeling_qwen3_5_moe import (
    Qwen3_5MoeTextRotaryEmbedding,
)

MODEL = "Qwen/Qwen3.6-35B-A3B"
BIN = 6  # Arbitrary inv_freq index for spot-check
CORRUPTION = -1.468446e34  # Exact fp32 value ref-worker processes saw in inv_freq[*]

cfg = AutoConfig.from_pretrained(MODEL, trust_remote_code=False)
text_cfg = cfg.text_config if hasattr(cfg, "text_config") else cfg

# Canonical: what rope_init_fn would compute fresh on cuda.
fresh_inv_freq, _ = Qwen3_5MoeTextRotaryEmbedding.compute_default_rope_parameters(
    text_cfg, device=torch.device("cuda:0")
)
canonical = float(fresh_inv_freq[BIN].item())
print(f"=== canonical inv_freq[{BIN}] computed fresh on cuda:0: {canonical:.10e}")

# --- Variant A: construct directly on cuda (no meta), as a baseline ---
direct = Qwen3_5MoeTextRotaryEmbedding(text_cfg, device=torch.device("cuda:0"))
print(
    f"--- variant A (direct cuda init): "
    f"inv_freq[{BIN}]={float(direct.inv_freq[BIN].item()):.10e}  "
    f"original_inv_freq[{BIN}]={float(direct.original_inv_freq[BIN].item()):.10e}"
)

# --- Variant B: construct on meta, then .to_empty(cuda) (the suspect path)
rotary = Qwen3_5MoeTextRotaryEmbedding(text_cfg, device="meta")
rotary.to_empty(device="cuda")
inv_v = float(rotary.inv_freq[BIN].item())
orig_v = float(rotary.original_inv_freq[BIN].item())
print(
    f"--- variant B (meta → to_empty(cuda)): "
    f"inv_freq[{BIN}]={inv_v:.10e}  original_inv_freq[{BIN}]={orig_v:.10e}"
)

# --- Variant C: pre-fill the small-bucket allocator pool with a specific fp32 pattern,
# free without empty_cache(), then meta → to_empty.
# Demonstrates that the materialized buffer's value is whatever bytes
# the caching allocator's free-list happens to return
for n_elems in (32, 64, 128, 256):
    for _ in range(8):
        j = torch.full((n_elems,), CORRUPTION, device="cuda:0", dtype=torch.float32)
        del j
    rotary_c = Qwen3_5MoeTextRotaryEmbedding(text_cfg, device="meta")
    rotary_c.to_empty(device="cuda")
    n_corrupt = sum(
        1 for v in rotary_c.inv_freq.tolist()
        if abs(v - CORRUPTION) < 1e30 and abs(v) > 1.0
    )
    print(
        f"   variant C  junk_size={n_elems:>4}  "
        f"inv_freq[{BIN}]={float(rotary_c.inv_freq[BIN].item()):.6e}  "
        f"n_inv_bins_holding_corruption={n_corrupt}/32"
    )

Output:

=== canonical inv_freq[6] computed fresh on cuda:0: 4.8696752638e-02
--- variant A (direct cuda init): inv_freq[6]=4.8696752638e-02  original_inv_freq[6]=4.8696752638e-02
--- variant B (meta → to_empty(cuda)): inv_freq[6]=4.8696752638e-02  original_inv_freq[6]=0.0000000000e+00
   variant C  junk_size=  32  inv_freq[6]=-1.468446e+34  n_inv_bins_holding_corruption=32/32
   variant C  junk_size=  64  inv_freq[6]=-1.468446e+34  n_inv_bins_holding_corruption=32/32
   variant C  junk_size= 128  inv_freq[6]=-1.468446e+34  n_inv_bins_holding_corruption=32/32
   variant C  junk_size= 256  inv_freq[6]=-1.468446e+34  n_inv_bins_holding_corruption=32/32

So we see:

- Variant A's direct-on-cuda init is correct on both buffers.
- Variant B shows the bug clearly without any pre-filling: original_inv_freq[6] = 0.0, not the canonical 0.0487.
  - The inv_freq[6] = 0.0487 reading is coincidental aliasing: the allocator's free-list happens to hold bytes from the canonical computation done earlier in the script.
- Variant C confirms the read-uninitialized-memory mechanism by pre-filling the allocator's small-bucket pool with a specific fp32 pattern and showing it survives into the materialized buffer at every bucket size tried.
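As a cross-check, the canonical 0.0487 figure can be reproduced from the formula 1.0 / (base ** (arange(0, dim, 2) / dim)) in pure Python. The base and dim used below are assumptions: the report does not state them, and they were picked only because they yield 32 bins and agree with the printed canonical value. ROPE_BASE, ROTARY_DIM, and canonical_inv_freq are illustrative names, not part of transformers.

```python
# Assumed values, NOT stated in the report: rope base 1e7 and rotary dim 64.
ROPE_BASE = 1.0e7
ROTARY_DIM = 64


def canonical_inv_freq(base: float, dim: int) -> list[float]:
    """Return inv_freq[i] = 1 / base ** (2i / dim) for i in 0 .. dim/2 - 1."""
    return [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]


inv_freq = canonical_inv_freq(ROPE_BASE, ROTARY_DIM)

# Number of bins and the spot-checked bin used in the reproduction script.
print(len(inv_freq), f"{inv_freq[6]:.6e}")
```

Every canonical value lies in (0, 1], so a buffer holding magnitudes around 1e34 cannot have come from this formula.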

Expected behavior

Qwen3_5MoeTextRotaryEmbedding(cfg, device="meta").to_empty(device="cuda") should produce a rotary embedding whose inv_freq and original_inv_freq match the canonical rope_init_fn(cfg, device=cuda) values. As-is, both buffers contain whatever the caching allocator's free-list holds at allocation time. This is the same buffer-init-versus-device-materialization gap that #45861 (https://github.com/huggingface/transformers/pull/45861) fixes for the wrong-device case; the present report covers the same gap when the buffer lands on the right device but with uninitialized storage.

Two natural fixes:

1. Qwen3_5MoeTextRotaryEmbedding.forward (no new contract): detect that self.inv_freq is on a different device than its config implies it should be, or has been materialized from meta, and re-run rope_init_fn(self.config, x.device) once on first forward. The companion original_inv_freq follows.
2. Qwen3_5MoeTextRotaryEmbedding._init_from_config (cleaner but more invasive): register a _register_load_state_dict_pre_hook or override _apply so that when the module is migrated off meta (via .to_empty or .to), the buffer values are recomputed via rope_init_fn against the new device.

The same buffer-init pattern is used in the sibling Qwen3 families (Qwen3VL text rotary, Qwen3_5 omni rotary, etc.).
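The _apply variant of the second option could look roughly like the following. This is a minimal sketch on a toy module, not the transformers class and not necessarily what the linked PR does: ToyRotary, default_inv_freq, and reset_rope_buffers are hypothetical names, and _apply is a private nn.Module hook whose signature may differ between PyTorch versions. It materializes on CPU so it can be tried without a GPU.

```python
import torch
from torch import nn


def default_inv_freq(dim: int, base: float, device) -> torch.Tensor:
    """Canonical RoPE frequencies: 1 / base ** (arange(0, dim, 2) / dim)."""
    exponents = torch.arange(0, dim, 2, dtype=torch.float32, device=device) / dim
    return 1.0 / (base ** exponents)


class ToyRotary(nn.Module):
    """Hypothetical stand-in for a rotary embedding with non-persistent buffers."""

    def __init__(self, dim: int = 64, base: float = 10000.0, device=None):
        super().__init__()
        self.dim, self.base = dim, base
        inv_freq = default_inv_freq(dim, base, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.register_buffer("original_inv_freq", inv_freq.clone(), persistent=False)

    def reset_rope_buffers(self) -> None:
        """Recompute both buffers in place on the device they currently live on."""
        fresh = default_inv_freq(self.dim, self.base, self.inv_freq.device)
        with torch.no_grad():
            self.inv_freq.copy_(fresh)
            self.original_inv_freq.copy_(fresh)

    def _apply(self, fn, *args, **kwargs):
        # Remember whether the buffers were on meta before the migration.
        was_meta = self.inv_freq.is_meta
        module = super()._apply(fn, *args, **kwargs)
        # Leaving meta means the new storage holds no real values yet,
        # so refill it from the canonical formula.
        if was_meta and not self.inv_freq.is_meta:
            self.reset_rope_buffers()
        return module


rotary = ToyRotary(device="meta")
rotary.to_empty(device="cpu")
print(torch.equal(rotary.inv_freq, default_inv_freq(64, 10000.0, "cpu")))
```

Overriding _apply rather than to_empty means the refill should also happen when a parent module is materialized, since parents call _apply on their children rather than to_empty.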
