transformers - ✅(Solved) Fix `integrations/flash_attention.py` crashes with `AttributeError` on `s_aux=None` for sink-less models [4 pull requests, 1 comments, 2 participants]

GitHub stats
huggingface/transformers#45588 (fetched 2026-04-23 07:22:56)
Comments: 1
Participants: 2
Timeline: 12
Reactions: 0
Timeline (top): cross-referenced ×4, mentioned ×3, subscribed ×3, commented ×1

Error Message

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-E4B-it"  # Any sink-less model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()

inputs = tok("Hello", return_tensors="pt").to("cuda")
model(**inputs)  # AttributeError: 'NoneType' object has no attribute 'to'

Fix Action

Fixed

PR fix notes

PR #45589: Fix AttributeError on s_aux=None in flash_attention_forward

Description (problem / solution / changelog)

Fixes https://github.com/huggingface/transformers/issues/45588

@ArthurZucker @yonigozlan @molbap

Changed files

  • src/transformers/integrations/flash_attention.py (modified, +5/-1)

PR #45590: fix #45588: guard s_aux against None in flash_attention_forward

Description (problem / solution / changelog)

Fix for #45588

Bug

flash_attention_forward unconditionally calls s_aux.to(query.dtype), but s_aux defaults to None and sink-less models (e.g. Gemma 4) never pass s_aux, causing:

AttributeError: 'NoneType' object has no attribute 'to'

Fix

Add a guard to only convert s_aux when it is not None:

# Before
s_aux=s_aux.to(query.dtype)

# After
s_aux=s_aux.to(query.dtype) if s_aux is not None else None

This pattern was already used in flash_paged.py (see PR #40434).
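
The difference between the two lines can be reproduced without a GPU or the library itself. The sketch below is only an illustration: `FakeTensor`, `cast_unguarded`, and `cast_guarded` are hypothetical names invented here, and `FakeTensor` models nothing of `torch.Tensor` beyond a `.to(dtype)` method.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor; only models .to(dtype)."""

    def __init__(self, dtype):
        self.dtype = dtype

    def to(self, dtype):
        # Return a new object carrying the requested dtype.
        return FakeTensor(dtype)


def cast_unguarded(s_aux, query_dtype):
    # Mirrors the "Before" line: fails when s_aux is None.
    return s_aux.to(query_dtype)


def cast_guarded(s_aux, query_dtype):
    # Mirrors the "After" line: None passes through untouched.
    return s_aux.to(query_dtype) if s_aux is not None else None


if __name__ == "__main__":
    print(cast_guarded(None, "bfloat16"))
    print(cast_guarded(FakeTensor("float32"), "bfloat16").dtype)
    try:
        cast_unguarded(None, "bfloat16")
    except AttributeError as exc:
        print("unguarded call failed:", exc)
```

With the guard, a sink-less model that never passes `s_aux` simply forwards `None`, while a model that does pass a tensor still gets the dtype conversion.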

Testing

  • Python syntax check passed
  • Code follows existing pattern in codebase

Notes

  • Huggingface transformers v5.6.0
  • Affects sink-less models using flash_attention_2

Automated high-quality fix

Changed files

  • src/transformers/integrations/flash_attention.py (modified, +1/-1)

PR #2813: limit transformer version, until they fixed issues/45588

Description (problem / solution / changelog)

huggingface/transformers/issues/45588

Changed files

  • requirements.txt (modified, +1/-1)

PR #123: limit transformer version, until they fixed issues/45588

Description (problem / solution / changelog)

huggingface/transformers/issues/45588

Changed files

  • pyproject.toml (modified, +1/-1)

Code Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-E4B-it"  # Any sink-less model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()

inputs = tok("Hello", return_tensors="pt").to("cuda")
model(**inputs)  # AttributeError: 'NoneType' object has no attribute 'to'

---

File ".../transformers/integrations/flash_attention.py", line 84, in flash_attention_forward
    s_aux=s_aux.to(query.dtype),  # FA only accepts half precision
          ^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'to'

System Info

  • transformers version: 5.6.0
  • Platform: Linux-6.8.0-1043-nvidia-x86_64-with-glibc2.35
  • Python version: 3.12.13
  • Huggingface_hub version: 1.11.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cu129 (CUDA)
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@ArthurZucker @yonigozlan @molbap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-E4B-it"  # Any sink-less model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()

inputs = tok("Hello", return_tensors="pt").to("cuda")
model(**inputs)  # AttributeError: 'NoneType' object has no attribute 'to'

Expected behavior

flash_attention_forward unconditionally calls s_aux.to(query.dtype), even though s_aux: torch.Tensor | None = None is optional and defaults to None. Models that do not have attention sinks (e.g. Gemma 4) never pass s_aux= from their attention forward, so the keyword argument stays None and training/inference crashes.

Offending line: https://github.com/huggingface/transformers/blob/v5.6.0/src/transformers/integrations/flash_attention.py#L84

File ".../transformers/integrations/flash_attention.py", line 84, in flash_attention_forward
    s_aux=s_aux.to(query.dtype),  # FA only accepts half precision
          ^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'to'

This is the same bug that https://github.com/huggingface/transformers/pull/40434 fixed for flash_paged.py by adding a guard so s_aux is only forwarded when set.
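
One way to read "only forwarded when set" is to build the keyword arguments conditionally, so the kernel never sees `s_aux` at all for sink-less models. The sketch below shows that pattern under stated assumptions: `fake_kernel` and `call_kernel` are hypothetical names, and this is not taken from the actual code in flash_paged.py or PR #40434.

```python
def fake_kernel(query, **kwargs):
    """Hypothetical stand-in for an attention kernel; echoes the kwargs it got."""
    return kwargs


def call_kernel(query, s_aux=None):
    # Only include s_aux in the call when the model actually supplied it.
    extra = {}
    if s_aux is not None:
        extra["s_aux"] = s_aux
    return fake_kernel(query, **extra)


if __name__ == "__main__":
    print(call_kernel("q"))             # no s_aux forwarded
    print(call_kernel("q", s_aux=1.0))  # s_aux forwarded as given
```

Compared with passing `s_aux=None` explicitly, this variant also works with kernels that do not accept an `s_aux` argument at all.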

extent analysis

TL;DR

The most likely fix is to add a guard to check if s_aux is not None before calling to on it in the flash_attention_forward function.

Guidance

  • The error occurs because s_aux is None when the code calls s_aux.to(query.dtype); None has no to method, so Python raises AttributeError.
  • To fix this, add a conditional check so that to is called only when s_aux is not None.
  • Apply the fix to the flash_attention_forward function in src/transformers/integrations/flash_attention.py.
  • A similar fix was already applied to flash_paged.py in pull request #40434 and can serve as a reference.

Example

if s_aux is not None:
    s_aux = s_aux.to(query.dtype)

Notes

  • This fix assumes that s_aux can be safely ignored when it is None, which is the case for models without attention sinks.
  • The fix should be applied to the transformers library, specifically to the flash_attention.py file.

Recommendation

  • Apply the guard: check that s_aux is not None before calling to on it, as in the example code. The issue is confined to the flash_attention_forward function, so a single guard clause is enough.

