transformers - 💡(How to fix) Fix Gemma4 26B-A4B: cross-device errors with CPU offload (RoPE, inputs, layer_scalar, SDPA mask, mm_token_type_ids) [1 participants]

transformers · 2026-04-16 15:57:28


GitHub stats

huggingface/transformers#45482•Fetched 2026-04-17 08:26:39


Author

sirfyyn

Participants

sirfyyn


Bug: Gemma4 cross-device tensor errors with accelerate CPU offload

Environment

transformers latest (Gemma4 support, modeling_gemma4.py)
Gemma4 26B-A4B-it (MoE, 4B active params)
accelerate device_map with CPU offload (layers overflow to RAM)
BnB INT8 + PEFT LoRA + Gradient Checkpointing
RTX 4090 (24GB) + 60GB CPU RAM

Problems

P4: RoPE embeddings on wrong device

apply_rotary_pos_emb computes cos/sin on CPU (from the offloaded embedding table) but q/k are already on CUDA. Fix: .to(q.device) on cos/sin before application.
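A minimal sketch of the P4 idea, assuming the common rotate-half RoPE formulation. The names `rotate_half` and `apply_rotary_pos_emb_same_device` are illustrative and are not copied from modeling_gemma4.py; the only point is that cos/sin are moved to `q.device` before use.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map the two halves of the last dimension (x1, x2) to (-x2, x1)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_same_device(q, k, cos, sin, unsqueeze_dim=1):
    # P4 idea: cos/sin may arrive on CPU, so move them next to q first.
    cos = cos.to(q.device).unsqueeze(unsqueeze_dim)
    sin = sin.to(q.device).unsqueeze(unsqueeze_dim)
    q_embed = q * cos + rotate_half(q) * sin
    k_embed = k * cos + rotate_half(k) * sin
    return q_embed, k_embed

# Toy shapes: (batch, heads, seq, head_dim). Falls back to CPU without CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"
q = torch.randn(1, 2, 4, 8, device=device)
k = torch.randn(1, 2, 4, 8, device=device)

# Build cos/sin on CPU on purpose, to mimic the offloaded case.
positions = torch.arange(4, dtype=torch.float32)
inv_freq = 1.0 / (10000 ** (torch.arange(0, 8, 2, dtype=torch.float32) / 8))
freqs = torch.outer(positions, inv_freq)      # (seq, head_dim / 2)
emb = torch.cat((freqs, freqs), dim=-1)       # (seq, head_dim)
cos, sin = emb.cos()[None], emb.sin()[None]   # (1, seq, head_dim), on CPU

q_rot, k_rot = apply_rotary_pos_emb_same_device(q, k, cos, sin)
```

On a CUDA machine the same call without the two `.to(q.device)` moves would mix a CPU tensor with CUDA tensors, which is the failure P4 describes.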

P5: Input tensors not following device

In Gemma4TextModel.forward, position_ids and attention_mask sometimes stay on CPU when the layer has moved to CUDA mid-forward. Fix: explicit .to(hidden_states.device) at layer entry.
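The P5 fix can be sketched as a small helper called at layer entry. `move_like` is a hypothetical name, not part of transformers, and the tensors below are toy stand-ins for the real forward arguments.

```python
from typing import Optional

import torch

def move_like(reference: torch.Tensor, *tensors: Optional[torch.Tensor]):
    """Return the given tensors on reference.device; None passes through."""
    return tuple(t if t is None else t.to(reference.device) for t in tensors)

device = "cuda" if torch.cuda.is_available() else "cpu"
hidden_states = torch.zeros(1, 4, 8, device=device)
position_ids = torch.arange(4).unsqueeze(0)   # created on CPU
attention_mask = None                         # optional inputs stay None

# At layer entry: make the side inputs follow hidden_states.
position_ids, attention_mask = move_like(
    hidden_states, position_ids, attention_mask
)
```

`Tensor.to` returns the tensor unchanged when it is already on the target device, so calling this on every layer should add little overhead in the non-offloaded case.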

P7: `layer_scalar` cross-device

Gemma4DecoderLayer applies layer_scalar (a float scalar or 1-element tensor) to the output. When the layer is on CUDA but layer_scalar is on CPU, this raises a device mismatch. Fix: .to(hidden_states.device) before multiplication.
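A sketch of the P7 fix that handles both forms mentioned above, a Python float and a 1-element tensor. `apply_layer_scalar` is a hypothetical helper; the dtype cast is an extra assumption beyond the device move described in the issue.

```python
import torch

def apply_layer_scalar(hidden_states: torch.Tensor, layer_scalar):
    """Scale hidden_states by layer_scalar, aligning device (and dtype) first."""
    if isinstance(layer_scalar, torch.Tensor):
        layer_scalar = layer_scalar.to(
            device=hidden_states.device, dtype=hidden_states.dtype
        )
    # Plain Python floats carry no device, so they need no handling.
    return hidden_states * layer_scalar

device = "cuda" if torch.cuda.is_available() else "cpu"
hidden_states = torch.ones(2, 3, device=device)
scaled_by_tensor = apply_layer_scalar(hidden_states, torch.tensor([0.5]))  # CPU scalar
scaled_by_float = apply_layer_scalar(hidden_states, 2.0)
```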

P6: SDPA `attention_mask` cross-device — `integrations/sdpa_attention.py`

scaled_dot_product_attention receives an attention_mask computed on CPU. SDPA requires mask and Q/K/V on the same device. Fix: .to(query_states.device) before the SDPA call.
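The P6 fix can be sketched directly against `torch.nn.functional.scaled_dot_product_attention`. The wrapper name `sdpa_with_aligned_mask` is hypothetical and does not reproduce the code in integrations/sdpa_attention.py; it only shows the device move in front of the call.

```python
import torch
import torch.nn.functional as F

def sdpa_with_aligned_mask(query, key, value, attention_mask=None):
    """Call SDPA after moving the mask to the device of query."""
    if attention_mask is not None:
        attention_mask = attention_mask.to(query.device)
    return F.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask
    )

device = "cuda" if torch.cuda.is_available() else "cpu"
query = torch.randn(1, 2, 4, 8, device=device)
key = torch.randn(1, 2, 4, 8, device=device)
value = torch.randn(1, 2, 4, 8, device=device)

# Boolean causal mask built on CPU; True means "may attend".
causal_mask = torch.ones(4, 4, dtype=torch.bool).tril()

out = sdpa_with_aligned_mask(query, key, value, causal_mask)
```

With a causal mask the first position can only attend to itself, so its output equals the first value vector, which makes a convenient sanity check.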

P10: `mm_token_type_ids` required even for text-only training

Gemma4ForConditionalGeneration.forward unconditionally accesses mm_token_type_ids from the batch. For text-only fine-tuning (no images), this key is absent → KeyError or None-dereference.

Fix: guard with if mm_token_type_ids is not None: and skip the multimodal routing path when absent.
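A toy sketch of the P10 guard. It does not reproduce the real multimodal routing in `Gemma4ForConditionalGeneration.forward`, which is not shown in the issue; `select_forward_path` is a hypothetical function that only illustrates reading the key with a default and branching on `None`.

```python
from typing import Any, Mapping

def select_forward_path(batch: Mapping[str, Any]) -> str:
    """Pick the multimodal path only when mm_token_type_ids is present."""
    # .get() avoids the KeyError when the key is missing from a text-only batch.
    mm_token_type_ids = batch.get("mm_token_type_ids")
    if mm_token_type_ids is not None:
        return "multimodal"
    return "text_only"

text_batch = {"input_ids": [[1, 2, 3]]}
image_batch = {"input_ids": [[1, 2, 3]], "mm_token_type_ids": [[0, 1, 1]]}

text_path = select_forward_path(text_batch)
image_path = select_forward_path(image_batch)
```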

Workaround / patches

All 5 patches + a full training example (Gemma4 26B on RTX 4090, ~6.25s/step at 512 tokens):

https://github.com/sirfyyn/consumer-llm-patches

Patches are currently applied at runtime via apply_patches.py. Happy to contribute upstream PRs for any of these — particularly P10 (text-only training guard) and P6 (SDPA mask device) which seem most likely to affect others.
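The contents of apply_patches.py are not shown here, so the following is only a guess at what a runtime patch of this kind could look like: a class-level monkey-patch that wraps `forward` and moves tensor keyword arguments to the device of the first positional tensor. `patch_forward_devices` and `ToyLayer` are hypothetical names used for illustration.

```python
import functools

import torch
from torch import nn

def patch_forward_devices(cls):
    """Wrap cls.forward so tensor kwargs follow hidden_states.device."""
    original_forward = cls.forward
    if getattr(original_forward, "_device_patched", False):
        return cls  # already patched; keep the operation idempotent

    @functools.wraps(original_forward)
    def forward(self, hidden_states, *args, **kwargs):
        device = hidden_states.device
        kwargs = {
            name: value.to(device) if isinstance(value, torch.Tensor) else value
            for name, value in kwargs.items()
        }
        return original_forward(self, hidden_states, *args, **kwargs)

    forward._device_patched = True
    cls.forward = forward
    return cls

class ToyLayer(nn.Module):
    """Stand-in for a decoder layer; not a Gemma4 class."""
    def forward(self, hidden_states, position_ids=None):
        return hidden_states + position_ids.unsqueeze(-1).to(hidden_states.dtype)

patch_forward_devices(ToyLayer)
device = "cuda" if torch.cuda.is_available() else "cpu"
layer = ToyLayer()
out = layer(
    torch.zeros(1, 4, 2, device=device),
    position_ids=torch.arange(4).unsqueeze(0),  # created on CPU
)
```

Patching at the class level keeps the installed library files untouched, at the cost of having to re-check the patch whenever transformers is upgraded.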

Benchmark context

With all patches applied, Gemma4 26B-A4B trains on a single RTX 4090 with BnB INT8 + LoRA + GC + CPU offload. Step time is nearly flat across seq lengths (64→512 tok = 1.06× difference), indicating CPU→GPU transfer dominates, not compute.
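To make the "transfer dominates" reading concrete, here is a back-of-the-envelope fit. It assumes step time is a fixed cost plus a cost linear in sequence length, and that the ~6.25 s/step figure at 512 tokens and the 1.06× ratio come from the same runs; both are assumptions, not measurements from the issue.

```python
def fit_fixed_plus_linear(n_short, t_short, n_long, t_long):
    """Fit t(n) = fixed + per_token * n through two (tokens, seconds) points."""
    per_token = (t_long - t_short) / (n_long - n_short)
    fixed = t_long - per_token * n_long
    return fixed, per_token

t_long = 6.25              # seconds per step at 512 tokens (from the issue)
t_short = t_long / 1.06    # implied step time at 64 tokens

fixed, per_token = fit_fixed_plus_linear(64, t_short, 512, t_long)
fixed_share = fixed / t_long  # fraction of a 512-token step that is fixed cost

print(f"fixed ≈ {fixed:.2f} s, per token ≈ {per_token * 1000:.2f} ms, "
      f"fixed share ≈ {fixed_share:.0%}")
```

Under this simple model about 94% of a 512-token step does not depend on sequence length, which is consistent with offload traffic rather than compute setting the pace.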

extent analysis

TL;DR

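With accelerate CPU offload, Gemma4 26B-A4B hits four device-mismatch errors (RoPE cos/sin, layer inputs, layer_scalar, SDPA attention_mask) plus a missing mm_token_type_ids error in text-only training. The linked patches move the affected tensors to the compute device and guard the missing input.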

Guidance

Verify that all tensors are on the same device before performing operations, using .to(device) as needed.
Apply the provided patches (P4, P5, P6, P7, P10) to address specific cross-device issues in Gemma4.
Use the apply_patches.py script to apply patches at runtime, or consider contributing upstream PRs for a more permanent fix.
Test the patched model with the provided training example to ensure correct functionality.
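For the first point in the list above, a small debugging helper along these lines can report which tensor is on the wrong device before an operation fails. `assert_same_device` is a hypothetical name; the `meta` device is used only so the mismatch can be shown on a machine without a GPU.

```python
import torch

def assert_same_device(**tensors):
    """Raise RuntimeError naming each tensor's device if they disagree."""
    devices = {
        name: t.device for name, t in tensors.items() if isinstance(t, torch.Tensor)
    }
    if len(set(devices.values())) > 1:
        raise RuntimeError(f"device mismatch: {devices}")

# Passes: both tensors on CPU, None entries are ignored.
assert_same_device(q=torch.zeros(2), cos=torch.ones(2), mask=None)

# Fails: one tensor lives on the meta device.
try:
    assert_same_device(q=torch.zeros(2), cos=torch.empty(2, device="meta"))
    mismatch_detected = False
except RuntimeError as err:
    mismatch_detected = True
    message = str(err)
```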

Example

# Example of applying patch P4: inside apply_rotary_pos_emb, move the
# already-computed cos/sin to the device of q before they are applied.
cos = cos.to(q.device)
sin = sin.to(q.device)

Notes

The patches provided are specific to the Gemma4 model and may not be applicable to other models. Additionally, the patches are currently applied at runtime, and contributing upstream PRs may be necessary for a more permanent fix.

Recommendation

Apply workaround: apply the five patches at runtime. The author reports that, with all of them applied, Gemma4 26B-A4B trains on a single RTX 4090 with CPU offload; no independent confirmation is recorded in the issue so far.
