transformers: Feature request: Flash Attention 2 support for T5Gemma 2 [1 comment, 1 participant]

huggingface/transformers#45522 · Fetched 2026-04-20 11:58:45

Error Message

ValueError: T5Gemma2ForConditionalGeneration does not support Flash Attention 2 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/google/t5gemma-2-4b-4b/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new


Feature request

Add Flash Attention 2 support for T5Gemma2ForConditionalGeneration (and companion variants: encoder, decoder, etc., wherever attn_implementation="flash_attention_2" currently raises).

Currently, loading the model with FA2 fails at dispatch time:

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/t5gemma-2-4b-4b", attn_implementation="flash_attention_2"
)

ValueError: T5Gemma2ForConditionalGeneration does not support Flash Attention 2 yet.
Please request to add support where the model is hosted, on its model hub page:
https://huggingface.co/google/t5gemma-2-4b-4b/discussions/new
or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

Raised from transformers.modeling_utils._flash_attn_can_dispatch (invoked via _check_and_adjust_attn_implementation). Filing here per the error message's instruction.
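
Until FA2 support lands, one possible stopgap is to catch the dispatch error and retry with another attn_implementation value. The sketch below is not part of the original report: load_with_fallback and load_t5gemma2 are hypothetical helper names, and the order of fallbacks is an assumption. Note that the sdpa and eager paths are subject to the separate failure above ~4K tokens tracked in #45521, so this only helps for short inputs.

```python
def load_with_fallback(loader, implementations=("flash_attention_2", "sdpa", "eager")):
    """Try each attn_implementation in order; return (model, implementation used).

    `loader` is any callable that takes the implementation name and returns a model.
    """
    last_error = None
    for impl in implementations:
        try:
            return loader(impl), impl
        except (ValueError, ImportError) as err:
            # ValueError: the model class does not support this backend.
            # ImportError: the backend package is not installed.
            last_error = err
    raise RuntimeError("no attention implementation could be loaded") from last_error


def load_t5gemma2(impl):
    # Hypothetical loader; imported lazily so the helper above works without transformers.
    from transformers import AutoModelForSeq2SeqLM

    return AutoModelForSeq2SeqLM.from_pretrained(
        "google/t5gemma-2-4b-4b", attn_implementation=impl
    )


# Usage (downloads the checkpoint, so it is left commented out):
# model, impl = load_with_fallback(load_t5gemma2)
```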

Motivation

T5Gemma 2 is advertised as a 128K-context encoder-decoder model; the technical report and model card claim 128K context support and publish Ruler-128K / MRCR-128K benchmark results. At those sequence lengths, FA2 is effectively table stakes for practical inference: eager attention materializes the full attention score matrix, so its memory grows as O(seq²) and becomes infeasible past a few thousand tokens on modern GPUs.
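
To make the O(seq²) point concrete, here is a rough back-of-the-envelope sketch of the size of a single layer's attention score tensor when it is fully materialized. The head count of 8 and the 2-byte element size are illustrative assumptions, not values read from the T5Gemma 2 config, and real eager implementations may hold additional temporaries.

```python
def eager_scores_bytes(seq_len, num_heads, batch=1, bytes_per_el=2):
    """Bytes for one layer's (batch, heads, seq_len, seq_len) score tensor."""
    return batch * num_heads * seq_len * seq_len * bytes_per_el


NUM_HEADS = 8  # hypothetical head count, chosen only for illustration

for tokens in (4_096, 32_768, 131_072):
    gib = eager_scores_bytes(tokens, NUM_HEADS) / 2**30
    print(f"{tokens:>7} tokens: {gib:.2f} GiB per layer")
```

Under these assumptions the tensor grows 16x for every 4x increase in sequence length, which is the growth FA2 avoids by never storing the full score matrix.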

Related: we filed #45521 for a separate bug in the eager attention path (fails above ~4K tokens at batch=1). Even when that's fixed, eager/sdpa-only long-context inference will be memory-bound well before 128K. FA2 would unblock the advertised context window in practice.

Why this is non-trivial (and what might already be reusable)

T5Gemma 2 uses merged self+cross attention in the decoder (§2 of the paper, Section on "T5Gemma 2 architecture"): decoder self-attention and cross-attention to the encoder output are fused into a single joint attention op per layer. This is the novel architectural contribution and the main blocker vs. other Gemma-family models that already have FA2.

Pieces that should be reusable from related model integrations:

  • Interleaved local/global + sliding window (5:1 ratio, sliding_window: 1024, _sliding_window_pattern: 6 in config) — Gemma 3 already supports this with FA2; the T5Gemma 2 decoder inherits the same patterns
  • RoPE with split base frequencies (local=10k, global=1M) — also standard Gemma 3
  • QK-norm + GQA — standard
  • Encoder side (bidirectional) — straightforward FA2 varlen usage
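
As a minimal sketch of the first two bullets, the snippet below lays out a 5:1 local/global schedule and the matching RoPE base per layer. It assumes the global layer is the last of each group of six, which is how I understand the Gemma 3 convention; the function name, the layer-type labels, and the layer count of 12 are illustrative, not taken from the T5Gemma 2 code.

```python
SLIDING_WINDOW = 1024           # from the config quoted above
SLIDING_WINDOW_PATTERN = 6      # from the config quoted above
ROPE_BASE = {"sliding_attention": 10_000, "full_attention": 1_000_000}


def layer_types(num_layers, pattern=SLIDING_WINDOW_PATTERN):
    """Return one label per layer: five local layers, then one global layer."""
    return [
        "full_attention" if (i + 1) % pattern == 0 else "sliding_attention"
        for i in range(num_layers)
    ]


# Hypothetical 12-layer stack, printed with the RoPE base each layer would use.
for idx, kind in enumerate(layer_types(12)):
    print(idx, kind, ROPE_BASE[kind])
```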

The merged self+cross path is what needs new integration work — likely segment-ids or varlen concatenation of (past_self_KV ∥ encoder_KV) with a two-region mask (causal+SWA on the self part, full on the cross part) piped into FA2.
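
The two-region visibility rule described above can be written down as a small reference mask, for example to check a future FA2 path against. This is a sketch under stated assumptions: keys are ordered as self positions followed by encoder positions, the sliding window counts the current token (a key is visible when query position minus key position is less than the window), and merged_attention_mask is a hypothetical name. It is pure Python for readability and is not how an FA2 kernel would consume the mask, since FA2 kernels do not take an arbitrary dense mask.

```python
def merged_attention_mask(q_len, self_kv_len, enc_len, window):
    """Boolean mask of shape q_len x (self_kv_len + enc_len); True = may attend."""
    if q_len > self_kv_len:
        raise ValueError("queries must be the last q_len positions of the self keys")
    offset = self_kv_len - q_len  # absolute position of the first query
    mask = []
    for i in range(q_len):
        pos = offset + i
        # Self region: causal, and restricted to the sliding window.
        self_part = [j <= pos and pos - j < window for j in range(self_kv_len)]
        # Cross region: every encoder position is visible.
        cross_part = [True] * enc_len
        mask.append(self_part + cross_part)
    return mask


# Tiny example: 3 decoder tokens, 2 encoder tokens, window of 2.
for row in merged_attention_mask(q_len=3, self_kv_len=3, enc_len=2, window=2):
    print("".join("x" if ok else "." for ok in row))
```

For a global layer in this sketch, pass a window at least as large as self_kv_len so that only the causal constraint applies to the self region.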

Your contribution

Happy to test against a PR branch end-to-end on real long-context data (TReB English split has samples up to 28K tokens; we have an existing harness that covers 2.5K / 3.5K / 5K / 6.5K / 7.5K / 10K / 15K / 20K / 25K token lengths with known-passing expected outputs from sdpa/eager below the #45521 threshold). Can provide throughput + memory numbers on H100 NVL before/after.

Not volunteering to author the integration myself — don't have deep familiarity with the FA2 varlen / segment-id APIs and the merged-attention masking logic needs a reviewer who knows T5Gemma 2's design intent.

Related

  • huggingface/transformers#45521 — the eager/sdpa 4K crash (separate issue)
  • huggingface/transformers PR #41834 — original T5Gemma 2 integration
  • Gemma 3's FA2 implementation (pattern reference)

Thanks!

extent analysis

TL;DR

The most likely fix is to add Flash Attention 2 support for T5Gemma2ForConditionalGeneration by integrating the novel merged self+cross attention architecture in the decoder with FA2.

Guidance

  • The error message indicates that T5Gemma2ForConditionalGeneration does not currently support Flash Attention 2, so the first step is to add this support.
  • The merged self+cross attention path in the decoder needs new integration work, likely involving segment-ids or varlen concatenation of (past_self_KV ∥ encoder_KV) with a two-region mask.
  • Reusable pieces from related model integrations include interleaved local/global + sliding window, RoPE with split base frequencies, QK-norm + GQA, and encoder side (bidirectional) FA2 usage.
  • Testing against a PR branch end-to-end on real long-context data will be necessary to verify the fix.

Example

No code example is provided as the issue requires a deeper understanding of the FA2 varlen / segment-id APIs and the merged-attention masking logic.

Notes

The integration of Flash Attention 2 with the merged self+cross attention architecture in the decoder is non-trivial and requires careful consideration of the masking logic and segment-ids.

Recommendation

Add Flash Attention 2 support for T5Gemma2ForConditionalGeneration as described above. This is a feature addition rather than a workaround: it would unblock the advertised context window in practice and enable efficient inference on long-context sequences.
