transformers - Feature request: Flash Attention 2 support for T5Gemma 2 [1 comment, 1 participant]

Error Message

ValueError: T5Gemma2ForConditionalGeneration does not support Flash Attention 2 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/google/t5gemma-2-4b-4b/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

Code Example

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/t5gemma-2-4b-4b", attn_implementation="flash_attention_2"
)

Feature request

Add Flash Attention 2 support for T5Gemma2ForConditionalGeneration (and companion variants: encoder, decoder, etc., wherever attn_implementation="flash_attention_2" currently raises).

Currently, loading the model with FA2 fails at dispatch time:

from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/t5gemma-2-4b-4b", attn_implementation="flash_attention_2"
)

ValueError: T5Gemma2ForConditionalGeneration does not support Flash Attention 2 yet.
Please request to add support where the model is hosted, on its model hub page:
https://huggingface.co/google/t5gemma-2-4b-4b/discussions/new
or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

Raised from transformers.modeling_utils._flash_attn_can_dispatch (invoked via _check_and_adjust_attn_implementation). Filing here per the error message's instruction.
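Until FA2 is supported, a caller can at least avoid the hard failure by falling back to another backend. The following is a minimal sketch, not part of the issue: `load_with_fallback` and `load_t5gemma2` are hypothetical helper names, and the fallback order is an assumption. It only uses `from_pretrained` with the `attn_implementation` argument shown above, and it does not address long-context memory use.

```python
def load_with_fallback(loader, implementations=("flash_attention_2", "sdpa", "eager")):
    """Try each attention backend in order; return (model, backend_used).

    `loader` is any callable that takes an attn_implementation string and
    either returns a model or raises ValueError / ImportError.
    """
    last_error = None
    for impl in implementations:
        try:
            return loader(impl), impl
        except (ValueError, ImportError) as err:
            # ValueError is what the dispatch check raises for this model today.
            last_error = err
    raise RuntimeError("no attention implementation could be loaded") from last_error


def load_t5gemma2(attn_implementation):
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM

    return AutoModelForSeq2SeqLM.from_pretrained(
        "google/t5gemma-2-4b-4b", attn_implementation=attn_implementation
    )


# Usage (downloads the checkpoint, so it is left commented out):
# model, backend = load_with_fallback(load_t5gemma2)
```

With the current behavior this would be expected to land on sdpa, which is subject to the limits described under Motivation.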

Motivation

T5Gemma 2 is advertised as a 128K-context encoder-decoder model; the technical report and model card claim 128K context support and publish Ruler-128K / MRCR-128K benchmark results. At those sequence lengths, FA2 is effectively table stakes for practical inference: eager attention materializes the full score matrix, so its memory grows as O(seq²), and it becomes infeasible past a few thousand tokens on modern GPUs.
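To put a rough number on the O(seq²) term, here is a back-of-the-envelope sketch. The head count (8) and the 2-byte element size are assumptions for illustration, not values read from the T5Gemma 2 config, and the calculation covers only one square score matrix for one layer.

```python
def attention_scores_bytes(batch, num_heads, q_len, kv_len, bytes_per_element=2):
    """Bytes needed to materialize one layer's attention score matrix."""
    return batch * num_heads * q_len * kv_len * bytes_per_element


# Hypothetical setup: batch 1, 8 heads, bf16/fp16 scores.
for seq_len in (4096, 32768, 131072):
    gib = attention_scores_bytes(1, 8, seq_len, seq_len) / 2**30
    print(f"{seq_len:>7} tokens: {gib:,.2f} GiB per layer")
```

Under these assumptions a 128K-token sequence needs 256 GiB for a single layer's scores, which is the cost FA2 avoids by never materializing that matrix.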

Related: we filed #45521 for a separate bug in the eager attention path (fails above ~4K tokens at batch=1). Even when that's fixed, eager/sdpa-only long-context inference will be memory-bound well before 128K. FA2 would unblock the advertised context window in practice.

Why this is non-trivial (and what might already be reusable)

T5Gemma 2 uses merged self+cross attention in the decoder (§2 of the paper, Section on "T5Gemma 2 architecture"): decoder self-attention and cross-attention to the encoder output are fused into a single joint attention op per layer. This is the novel architectural contribution and the main blocker vs. other Gemma-family models that already have FA2.

Pieces that should be reusable from related model integrations:

Interleaved local/global + sliding window (5:1 ratio, sliding_window: 1024, _sliding_window_pattern: 6 in config) — Gemma 3 already supports this with FA2; the T5Gemma 2 decoder inherits the same patterns
RoPE with split base frequencies (local=10k, global=1M) — also standard Gemma 3
QK-norm + GQA — standard
Encoder side (bidirectional) — straightforward FA2 varlen usage
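For reference, the 5:1 interleaving and split RoPE bases listed above can be sketched as a per-layer plan. This is an illustration only: `decoder_layer_plan` is a hypothetical helper, and placing the global layer last in each group of six is an assumption about the convention, not something stated in this issue.

```python
def decoder_layer_plan(num_layers, sliding_window_pattern=6,
                       local_rope_base=10_000.0, global_rope_base=1_000_000.0):
    """Return a list of (attention_kind, rope_base) tuples, one per layer.

    Assumption: every `sliding_window_pattern`-th layer is global and the
    rest use sliding-window (local) attention, giving a 5:1 ratio for 6.
    """
    plan = []
    for layer_idx in range(num_layers):
        is_global = (layer_idx + 1) % sliding_window_pattern == 0
        kind = "global" if is_global else "local"
        rope_base = global_rope_base if is_global else local_rope_base
        plan.append((kind, rope_base))
    return plan


# Example: inspect the first twelve layers of a hypothetical stack.
for idx, (kind, base) in enumerate(decoder_layer_plan(12)):
    print(idx, kind, base)
```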

The merged self+cross path is what needs new integration work — likely segment-ids or varlen concatenation of (past_self_KV ∥ encoder_KV) with a two-region mask (causal+SWA on the self part, full on the cross part) piped into FA2.
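To make the two-region mask concrete, here is a pure-Python sketch of what the dense mask would look like over the concatenated (self ∥ encoder) keys. It is not the FA2 integration itself and uses no FA2 API; `merged_attention_mask` is a hypothetical name, and the window convention (a query sees itself plus the previous `sliding_window - 1` positions) is an assumption that may differ from the actual implementation.

```python
def merged_attention_mask(num_queries, self_len, enc_len, sliding_window=None):
    """Boolean mask of shape [num_queries][self_len + enc_len]; True = may attend.

    Keys are laid out as (decoder self keys ∥ encoder keys). The queries are
    assumed to be the last `num_queries` positions of the decoder sequence.
    """
    offset = self_len - num_queries
    mask = []
    for i in range(num_queries):
        q_pos = offset + i
        row = []
        for k_pos in range(self_len):
            allowed = k_pos <= q_pos  # causal region
            if sliding_window is not None:
                allowed = allowed and (q_pos - k_pos) < sliding_window
            row.append(allowed)
        row.extend([True] * enc_len)  # cross region: full visibility
        mask.append(row)
    return mask


# Example: 3 decoder positions, 2 encoder positions, window of 2.
for row in merged_attention_mask(3, 3, 2, sliding_window=2):
    print("".join("x" if allowed else "." for allowed in row))
```

A real integration would need to express the same pattern without building this dense mask, since building it reintroduces the quadratic memory cost.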

Your contribution

Happy to test against a PR branch end-to-end on real long-context data (TReB English split has samples up to 28K tokens; we have an existing harness that covers 2.5K / 3.5K / 5K / 6.5K / 7.5K / 10K / 15K / 20K / 25K token lengths with known-passing expected outputs from sdpa/eager below the #45521 threshold). Can provide throughput + memory numbers on H100 NVL before/after.

Not volunteering to author the integration myself — don't have deep familiarity with the FA2 varlen / segment-id APIs and the merged-attention masking logic needs a reviewer who knows T5Gemma 2's design intent.

Related

huggingface/transformers#45521: the eager/sdpa 4K crash (separate issue)
huggingface/transformers PR #41834: original T5Gemma 2 integration
Gemma 3's FA2 implementation (pattern reference)

Thanks!

extent analysis

TL;DR

The most likely fix is to add Flash Attention 2 support for T5Gemma2ForConditionalGeneration by integrating the novel merged self+cross attention architecture in the decoder with FA2.

Guidance

The error message indicates that T5Gemma2ForConditionalGeneration does not currently support Flash Attention 2, so the first step is to add this support.
The merged self+cross attention path in the decoder needs new integration work, likely involving segment-ids or varlen concatenation of (past_self_KV ∥ encoder_KV) with a two-region mask.
Reusable pieces from related model integrations include interleaved local/global + sliding window, RoPE with split base frequencies, QK-norm + GQA, and encoder side (bidirectional) FA2 usage.
Testing against a PR branch end-to-end on real long-context data will be necessary to verify the fix.

Example

No code example is provided as the issue requires a deeper understanding of the FA2 varlen / segment-id APIs and the merged-attention masking logic.

Notes

The integration of Flash Attention 2 with the merged self+cross attention architecture in the decoder is non-trivial and requires careful consideration of the masking logic and segment-ids.

Recommendation

Add Flash Attention 2 support for T5Gemma2ForConditionalGeneration as described above. This would unblock the advertised context window in practice and enable efficient inference on long-context sequences.
