transformers - 💡(How to fix) Fix Gemma4 26B-A4B: cross-device errors with CPU offload (RoPE, inputs, layer_scalar, SDPA mask, mm_token_type_ids) [1 participants]

huggingface/transformers#45482 (fetched 2026-04-17 08:26:39)


Bug: Gemma4 cross-device tensor errors with accelerate CPU offload

Environment

  • transformers latest (Gemma4 support, modeling_gemma4.py)
  • Gemma4 26B-A4B-it (MoE, 4B active params)
  • accelerate device_map with CPU offload (layers overflow to RAM)
  • BnB INT8 + PEFT LoRA + Gradient Checkpointing
  • RTX 4090 (24GB) + 60GB CPU RAM

Problems

P4: RoPE embeddings on wrong device

apply_rotary_pos_emb computes cos/sin on CPU (from the offloaded embedding table) but q/k are already on CUDA. Fix: .to(q.device) on cos/sin before application.
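
As a rough illustration of the P4 fix, the sketch below applies rotary embeddings after moving cos/sin to the device of q. The helper names (rotate_half, apply_rope_device_safe) and the toy shapes are assumptions made for this example; this is not the code in modeling_gemma4.py, whose signature may differ.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map the two halves of the last dimension (x1, x2) to (-x2, x1)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_device_safe(q, k, cos, sin):
    """Hypothetical stand-in for a RoPE application step, not the library function."""
    # The fix described above: follow q's device instead of assuming cos/sin match it.
    cos = cos.to(q.device)
    sin = sin.to(q.device)
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot


# Toy shapes: (batch, heads, seq_len, head_dim). A zero angle means no rotation.
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
angles = torch.zeros(1, 1, 4, 8)
q_rot, k_rot = apply_rope_device_safe(q, k, angles.cos(), angles.sin())
```

When cos/sin already live on q's device, Tensor.to returns the same tensor, so the extra calls should cost nothing on setups without offload.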

P5: Input tensors not following device

In Gemma4TextModel.forward, position_ids and attention_mask sometimes stay on CPU when the layer has moved to CUDA mid-forward. Fix: explicit .to(hidden_states.device) at layer entry.
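
A minimal sketch of the P5 idea, assuming a hypothetical helper (align_inputs) called at layer entry. It moves every tensor argument to the device of hidden_states and passes None through unchanged, since optional inputs such as attention_mask may be absent.

```python
import torch


def align_inputs(hidden_states: torch.Tensor, **inputs):
    """Hypothetical helper: move tensor inputs to hidden_states.device."""
    target = hidden_states.device
    return {
        name: value.to(target) if isinstance(value, torch.Tensor) else value
        for name, value in inputs.items()
    }


# Toy inputs; on a CPU-only machine every tensor is already on the same device.
hidden_states = torch.zeros(1, 4, 8)
aligned = align_inputs(
    hidden_states,
    position_ids=torch.arange(4).unsqueeze(0),
    attention_mask=None,
)
```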

P7: layer_scalar cross-device

Gemma4DecoderLayer applies layer_scalar (a float scalar or 1-element tensor) to the output. When the layer is on CUDA but layer_scalar is on CPU, this raises a device mismatch. Fix: .to(hidden_states.device) before multiplication.
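
Because layer_scalar may be either a Python float or a tensor, a device-safe multiplication needs a type check before calling .to(). The function below is an illustrative sketch with a made-up name (scale_output), not the decoder layer's actual code.

```python
import torch


def scale_output(hidden_states: torch.Tensor, layer_scalar):
    """Hypothetical helper: multiply by layer_scalar on hidden_states' device."""
    if isinstance(layer_scalar, torch.Tensor):
        # Only tensors carry a device; plain floats need no handling.
        layer_scalar = layer_scalar.to(hidden_states.device)
    return hidden_states * layer_scalar


scaled_by_tensor = scale_output(torch.ones(2, 3), torch.tensor([0.5]))
scaled_by_float = scale_output(torch.ones(2, 3), 2.0)
```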

P6: SDPA attention_mask cross-device — integrations/sdpa_attention.py

scaled_dot_product_attention receives an attention_mask computed on CPU. SDPA requires mask and Q/K/V on the same device. Fix: .to(query_states.device) before the SDPA call.
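
The P6 fix can be sketched as a thin wrapper around torch.nn.functional.scaled_dot_product_attention (available in PyTorch 2.0 and later). The wrapper name sdpa_device_safe and the toy causal mask are assumptions for illustration; the real change would sit inside integrations/sdpa_attention.py.

```python
import torch
import torch.nn.functional as F


def sdpa_device_safe(query, key, value, attention_mask=None):
    """Hypothetical wrapper: align the mask with query before calling SDPA."""
    if attention_mask is not None:
        attention_mask = attention_mask.to(query.device)
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)


# Toy shapes: (batch, heads, seq_len, head_dim).
query = torch.randn(1, 2, 4, 8)
key = torch.randn(1, 2, 4, 8)
value = torch.randn(1, 2, 4, 8)
# Boolean mask: True marks positions that may be attended to (causal here).
causal_mask = torch.tril(torch.ones(4, 4)).bool()[None, None]
attn_out = sdpa_device_safe(query, key, value, causal_mask)
```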

P10: mm_token_type_ids required even for text-only training

Gemma4ForConditionalGeneration.forward unconditionally accesses mm_token_type_ids from the batch. For text-only fine-tuning (no images), this key is absent → KeyError or None-dereference.

Fix: guard with if mm_token_type_ids is not None: and skip the multimodal routing path when absent.
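
The guard can be illustrated without any model code. In the sketch below, select_routing_path and route_batch are hypothetical names, and the returned strings stand in for the two code paths; the point is only that a missing key is read with dict.get and treated as "text-only" rather than raising.

```python
from typing import Optional, Sequence


def select_routing_path(
    input_ids: Sequence[int],
    mm_token_type_ids: Optional[Sequence[int]] = None,
) -> str:
    """Hypothetical guard: skip multimodal routing when no type ids are given."""
    if mm_token_type_ids is None:
        return "text_only"
    if len(mm_token_type_ids) != len(input_ids):
        raise ValueError("mm_token_type_ids must match input_ids in length")
    return "multimodal"


def route_batch(batch: dict) -> str:
    # dict.get returns None for an absent key instead of raising KeyError.
    return select_routing_path(batch["input_ids"], batch.get("mm_token_type_ids"))


text_path = route_batch({"input_ids": [5, 6, 7]})
mm_path = route_batch({"input_ids": [5, 6, 7], "mm_token_type_ids": [0, 1, 0]})
```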

Workaround / patches

All 5 patches + a full training example (Gemma4 26B on RTX 4090, ~6.25s/step at 512 tokens):

https://github.com/sirfyyn/consumer-llm-patches

Patches are currently applied at runtime via apply_patches.py. Happy to contribute upstream PRs for any of these — particularly P10 (text-only training guard) and P6 (SDPA mask device) which seem most likely to affect others.
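
The contents of apply_patches.py are not shown here, so the following is only a guess at the general mechanism: wrapping a class's forward method at runtime. DecoderLayerStub, patch_forward, and mark_aligned are all invented for this sketch, and the stub stands in for a real model class.

```python
import functools


class DecoderLayerStub:
    """Stand-in for a model layer; the real target would be a transformers class."""

    def forward(self, hidden_states, position_ids=None):
        return {"hidden_states": hidden_states, "position_ids": position_ids}


def patch_forward(cls, transform_kwargs):
    """Wrap cls.forward so keyword arguments pass through transform_kwargs first."""
    original = cls.forward
    if getattr(original, "_patched", False):
        return  # already patched; keep the operation idempotent

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        return original(self, *args, **transform_kwargs(kwargs))

    wrapper._patched = True
    cls.forward = wrapper


def mark_aligned(kwargs):
    # Placeholder for real work such as moving tensors to the layer's device.
    return {**kwargs, "position_ids": ("aligned", kwargs.get("position_ids"))}


patch_forward(DecoderLayerStub, mark_aligned)
patch_forward(DecoderLayerStub, mark_aligned)  # second call is a no-op
result = DecoderLayerStub().forward("h", position_ids=[0, 1])
```

Patching the class rather than individual instances means layers created later are covered as well, which is one reason a runtime script might take this route.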

Benchmark context

With all patches applied, Gemma4 26B-A4B trains on a single RTX 4090 with BnB INT8 + LoRA + gradient checkpointing + CPU offload. Step time is nearly flat across sequence lengths (going from 64 to 512 tokens increases it by only 1.06×), which suggests that CPU→GPU transfer, not compute, dominates.
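
To make the "nearly flat" observation concrete, one can assume a simple linear model, step_time(n) = fixed + per_token * n, and solve it from the two reported figures (a 1.06× ratio between 64 and 512 tokens, and about 6.25 s/step at 512 tokens). The linear model is an assumption of this sketch, not something the measurements establish.

```python
def split_step_time(ratio, short_len, long_len, long_step_s):
    """Solve step_time(n) = fixed + per_token * n from a ratio and one timing."""
    # ratio = (fixed + per_token * long_len) / (fixed + per_token * short_len)
    per_token_over_fixed = (ratio - 1.0) / (long_len - ratio * short_len)
    fixed = long_step_s / (1.0 + per_token_over_fixed * long_len)
    per_token = per_token_over_fixed * fixed
    return fixed, per_token


fixed_s, per_token_s = split_step_time(
    ratio=1.06, short_len=64, long_len=512, long_step_s=6.25
)
# Fraction of the 512-token step that scales with sequence length.
length_dependent_share = per_token_s * 512 / 6.25
print(f"fixed: {fixed_s:.2f} s, length-dependent share: {length_dependent_share:.1%}")
```

Under that assumption the length-dependent part is roughly 6% of a 512-token step. The remaining fixed part is consistent with transfer overhead, although it would also absorb any other per-step cost that does not grow with sequence length.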

extent analysis

TL;DR

Apply patches to ensure tensors are on the correct device, addressing cross-device errors with accelerate CPU offload in Gemma4.

Guidance

  • Verify that all tensors are on the same device before performing operations, using .to(device) as needed.
  • Apply the provided patches (P4, P5, P6, P7, P10) to address specific cross-device issues in Gemma4.
  • Use the apply_patches.py script to apply patches at runtime, or consider contributing upstream PRs for a more permanent fix.
  • Test the patched model with the provided training example to ensure correct functionality.
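
One generic way to follow the first point without editing model source is a forward pre-hook that moves tensor arguments to a chosen device. The sketch below uses torch.nn.Module.register_forward_pre_hook with with_kwargs=True (PyTorch 2.0 or later). The target device is passed in explicitly, because under offload a module's own parameters may not yet be on the execution device when the hook runs; make_alignment_hook and the Scale module are invented for this example.

```python
import torch
from torch import nn


def make_alignment_hook(device):
    """Build a pre-hook that moves tensor args and kwargs to `device`."""

    def move(value):
        return value.to(device) if isinstance(value, torch.Tensor) else value

    def hook(module, args, kwargs):
        new_args = tuple(move(a) for a in args)
        new_kwargs = {name: move(v) for name, v in kwargs.items()}
        return new_args, new_kwargs

    return hook


class Scale(nn.Module):
    """Toy layer used only to exercise the hook."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.tensor([2.0]))

    def forward(self, x, bias=None):
        out = x * self.weight
        return out if bias is None else out + bias


layer = Scale()
handle = layer.register_forward_pre_hook(
    make_alignment_hook(torch.device("cpu")), with_kwargs=True
)
hooked_out = layer(torch.ones(3), bias=torch.ones(3))
handle.remove()  # hooks can be removed once they are no longer needed
```

Whether such a hook interacts cleanly with accelerate's own offload hooks has not been checked here, so it is best treated as a debugging aid rather than a replacement for the targeted patches.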

Example

# Example of applying patch P4: move cos/sin to q's device before applying RoPE
cos = cos.to(q.device)
sin = sin.to(q.device)

Notes

The patches are specific to Gemma4 and may not apply to other models. They are currently applied at runtime; a permanent fix would require upstream PRs.

Recommendation

Apply the workaround: the author reports that, with all five patches applied, Gemma4 26B-A4B trains on a single RTX 4090 with accelerate CPU offload. The device patches (P4, P5, P6, P7) move tensors to the device of the layer that consumes them, and P10 lets text-only batches skip the multimodal routing path.
