transformers - 💡(How to fix) Fix `Qwen3_5MoeTextRotaryEmbedding.inv_freq` reads uninitialized memory after `meta → to_empty(cuda)` materialization [1 pull requests]

System Info

  • transformers version: 5.6.2
  • Platform: Linux-6.8.0-1043-nvidia-x86_64-with-glibc2.35
  • Python version: 3.12.13
  • Huggingface_hub version: 1.13.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cu129 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@ArthurZucker @Cyrilvallez @zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

`Qwen3_5MoeTextRotaryEmbedding.__init__` registers two `persistent=False` buffers, `inv_freq` and `original_inv_freq`, whose values come from `rope_init_fn(self.config, device)`. When the rotary is constructed on the meta device (as SkyRL does here) and then materialized to CUDA via `Module.to_empty(device="cuda")`, neither buffer's storage is re-initialized: `to_empty` allocates uninitialized GPU memory, and nothing re-runs `rope_init_fn` against the new device. `forward` then reads `self.inv_freq` and produces cos/sin from whatever garbage is sitting in that GPU storage.

The first read of `self.inv_freq` therefore returns whatever bytes the caching allocator's free-list happens to hold for that allocation, not the canonical `1.0 / (base ** (arange(0, dim, 2) / dim))` value.
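For reference, the canonical formula can be evaluated by hand. The sketch below is plain Python, not the library's implementation; `BASE = 1e7` and `DIM = 64` are assumptions inferred from the output further down (32 bins, `inv_freq[6]` near 4.87e-2), and the real values come from the model config.

```python
# Assumed values, inferred from the reported output; the real ones live in the config.
BASE = 1e7  # assumption: RoPE base
DIM = 64    # assumption: rotary dimension, giving DIM // 2 = 32 bins


def canonical_inv_freq(base: float, dim: int) -> list[float]:
    """Return 1 / base ** (k / dim) for k = 0, 2, 4, ..., dim - 2."""
    return [1.0 / (base ** (k / dim)) for k in range(0, dim, 2)]


inv_freq = canonical_inv_freq(BASE, DIM)
print(len(inv_freq))         # number of bins
print(f"{inv_freq[6]:.4e}")  # compare with the canonical inv_freq[6] in the output
```

Under these assumptions, bin 6 agrees with the canonical `inv_freq[6]` reported below to within fp32 rounding.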

Using a debug script, I observed all 32 bins of `inv_freq` holding the same value, -1.468446e+34, across ref-worker processes. The product `inv_freq * position_ids` then overflowed fp32 at `position_id >= 23173`, producing NaN cos/sin and a `policy_kl: nan` downstream in the training loop.
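The 23173 threshold can be checked with a little arithmetic. This is a minimal sketch in plain Python; it assumes the garbage value is exactly the printed -1.468446e34 and ignores round-to-nearest effects at the fp32 boundary. The helper name `first_overflow_position` is hypothetical.

```python
import math

FLT_MAX = (2.0 - 2.0**-23) * 2.0**127  # largest finite fp32 value, about 3.4028235e38
CORRUPTION = -1.468446e34              # garbage value observed in every inv_freq bin


def first_overflow_position(inv_freq_value: float) -> int:
    """Smallest integer position p such that |inv_freq_value * p| exceeds FLT_MAX."""
    return math.floor(FLT_MAX / abs(inv_freq_value)) + 1


threshold = first_overflow_position(CORRUPTION)
print(threshold)

# One position below the threshold the product is still representable in fp32;
# at the threshold it is not, so the fp32 product becomes -inf.
print(abs(CORRUPTION * (threshold - 1)) <= FLT_MAX)
print(abs(CORRUPTION * threshold) > FLT_MAX)
```

Once the product is infinite, cos and sin of it are NaN, which is consistent with the NaN cos/sin seen from that position onward.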

```python
import torch
from transformers import AutoConfig
from transformers.models.qwen3_5_moe.modeling_qwen3_5_moe import (
    Qwen3_5MoeTextRotaryEmbedding,
)

MODEL = "Qwen/Qwen3.6-35B-A3B"
BIN = 6  # Arbitrary inv_freq index for spot-check
CORRUPTION = -1.468446e34  # Exact fp32 value ref-worker processes saw in inv_freq[*]

cfg = AutoConfig.from_pretrained(MODEL, trust_remote_code=False)
text_cfg = cfg.text_config if hasattr(cfg, "text_config") else cfg

# Canonical: what rope_init_fn would compute fresh on cuda.
fresh_inv_freq, _ = Qwen3_5MoeTextRotaryEmbedding.compute_default_rope_parameters(
    text_cfg, device=torch.device("cuda:0")
)
canonical = float(fresh_inv_freq[BIN].item())
print(f"=== canonical inv_freq[{BIN}] computed fresh on cuda:0: {canonical:.10e}")

# --- Variant A: construct directly on cuda (no meta), as a baseline ---
direct = Qwen3_5MoeTextRotaryEmbedding(text_cfg, device=torch.device("cuda:0"))
print(
    f"--- variant A (direct cuda init): "
    f"inv_freq[{BIN}]={float(direct.inv_freq[BIN].item()):.10e}  "
    f"original_inv_freq[{BIN}]={float(direct.original_inv_freq[BIN].item()):.10e}"
)

# --- Variant B: construct on meta, then .to_empty(cuda) (the suspect path) ---
rotary = Qwen3_5MoeTextRotaryEmbedding(text_cfg, device="meta")
rotary.to_empty(device="cuda")
inv_v = float(rotary.inv_freq[BIN].item())
orig_v = float(rotary.original_inv_freq[BIN].item())
print(
    f"--- variant B (meta → to_empty(cuda)): "
    f"inv_freq[{BIN}]={inv_v:.10e}  original_inv_freq[{BIN}]={orig_v:.10e}"
)

# --- Variant C: pre-fill the small-bucket allocator pool with a specific fp32 pattern,
# free without empty_cache(), then meta → to_empty.
# Demonstrates that the materialized buffer's value is whatever bytes
# the caching allocator's free-list happens to return.
for n_elems in (32, 64, 128, 256):
    for _ in range(8):
        # Allocate and immediately release a block filled with the marker value.
        j = torch.full((n_elems,), CORRUPTION, device="cuda:0", dtype=torch.float32)
        del j
    rotary_c = Qwen3_5MoeTextRotaryEmbedding(text_cfg, device="meta")
    rotary_c.to_empty(device="cuda")
    # Count bins whose value is (approximately) the marker value.
    n_corrupt = sum(
        1 for v in rotary_c.inv_freq.tolist()
        if abs(v - CORRUPTION) < 1e30 and abs(v) > 1.0
    )
    print(
        f"   variant C  junk_size={n_elems:>4}  "
        f"inv_freq[{BIN}]={float(rotary_c.inv_freq[BIN].item()):.6e}  "
        f"n_inv_bins_holding_corruption={n_corrupt}/32"
    )
```

Output:

```
=== canonical inv_freq[6] computed fresh on cuda:0: 4.8696752638e-02
--- variant A (direct cuda init): inv_freq[6]=4.8696752638e-02  original_inv_freq[6]=4.8696752638e-02
--- variant B (meta → to_empty(cuda)): inv_freq[6]=4.8696752638e-02  original_inv_freq[6]=0.0000000000e+00
   variant C  junk_size=  32  inv_freq[6]=-1.468446e+34  n_inv_bins_holding_corruption=32/32
   variant C  junk_size=  64  inv_freq[6]=-1.468446e+34  n_inv_bins_holding_corruption=32/32
   variant C  junk_size= 128  inv_freq[6]=-1.468446e+34  n_inv_bins_holding_corruption=32/32
   variant C  junk_size= 256  inv_freq[6]=-1.468446e+34  n_inv_bins_holding_corruption=32/32
```

So we see:

  • Variant A's direct-on-cuda init is correct on both buffers.
  • Variant B shows the bug clearly without any pre-filling: `original_inv_freq[6]` = 0.0, not the canonical 0.0487.
    • The `inv_freq[6]` = 0.0487 reading is coincidental aliasing: the allocator's free-list happens to hold bytes from the canonical computation done earlier in the script.
  • Variant C confirms the read-uninitialized-memory mechanism by pre-filling the allocator's small-bucket pool with a specific fp32 pattern and showing it survives into the materialized buffer at every bucket size tried.

Expected behavior

`Qwen3_5MoeTextRotaryEmbedding(cfg, device="meta").to_empty(device="cuda")` should produce a rotary embedding whose `inv_freq` and `original_inv_freq` match the canonical `rope_init_fn(cfg, device=cuda)` values. As-is, both buffers contain whatever the caching allocator's free-list holds at allocation time. This is the same buffer-init-versus-device-materialization gap that #45861 fixes for the wrong-device case; the present report covers the same gap when the buffer lands on the right device but with uninitialized storage.

Two natural fixes:

  1. `Qwen3_5MoeTextRotaryEmbedding.forward` (no new contract): detect that `self.inv_freq` is on a different device than its config implies, or that it has been materialized from meta, and re-run `rope_init_fn(self.config, x.device)` once on the first forward. The companion `original_inv_freq` is recomputed the same way.
  2. `Qwen3_5MoeTextRotaryEmbedding._init_from_config` (cleaner but more invasive): register a hook via `_register_load_state_dict_pre_hook`, or override `_apply`, so that when the module is migrated off meta (via `.to_empty` or `.to`), the buffer values are recomputed via `rope_init_fn` against the new device.

The same buffer-init pattern is used in the sibling Qwen3 families (Qwen3VL text rotary, Qwen3_5 omni rotary, etc.).
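Until a fix lands, one possible user-side workaround is to recompute the buffers explicitly right after materialization. The sketch below illustrates the idea on a toy module, not on the real transformers class: `ToyRotary`, `compute_inv_freq`, and `reset_rope_buffers` are hypothetical names, `base=1e7` and `dim=64` are assumed values, and it materializes to CPU so it runs without a GPU.

```python
import torch
from torch import nn


def compute_inv_freq(base: float, dim: int, device) -> torch.Tensor:
    """Canonical 1 / base ** (arange(0, dim, 2) / dim), created on `device`."""
    exponents = torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim
    return 1.0 / (base ** exponents)


class ToyRotary(nn.Module):
    """Hypothetical stand-in that registers its buffers the way the report describes."""

    def __init__(self, base: float = 1e7, dim: int = 64, device=None):
        super().__init__()
        self.base, self.dim = base, dim
        inv_freq = compute_inv_freq(base, dim, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.register_buffer("original_inv_freq", inv_freq.clone(), persistent=False)

    def reset_rope_buffers(self) -> None:
        """Recompute both buffers in place on whatever device they now live on."""
        fresh = compute_inv_freq(self.base, self.dim, self.inv_freq.device)
        with torch.no_grad():
            self.inv_freq.copy_(fresh)
            self.original_inv_freq.copy_(fresh)


rotary = ToyRotary(device="meta")  # buffers have shape and dtype but no storage
rotary.to_empty(device="cpu")      # storage allocated, contents uninitialized
rotary.reset_rope_buffers()        # explicit re-initialization after materialization
print(rotary.inv_freq[6].item())
```

Either proposed fix would amount to making this re-initialization step automatic, so that callers of `to_empty` do not need to know about the buffers.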

