transformers - ✅ (Solved) Fix Gemma4 `use_bidirectional_attention="all"` still builds causal attention masks [1 pull request, 1 participant]

GitHub stats
huggingface/transformers#46077
Fetched 2026-05-20 03:39:20
Comments: 0
Participants: 1
Timeline: 4
Reactions: 0
Timeline (top): cross-referenced ×1, labeled ×1, mentioned ×1, subscribed ×1

Fix Action

Fixed

PR fix notes

PR #46079: Fix Gemma4 use_bidirectional_attention="all" mask behavior

Description (problem / solution / changelog)

Fixes #46077

This makes Gemma4TextConfig(use_bidirectional_attention="all") set config.is_causal = False.

Gemma4 already marks attention modules as non-causal for "all", but mask creation goes through the shared masking helpers, which use config.is_causal to switch from causal masks to bidirectional masks. Without the config flag, "all" can still produce causal masks.
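The interaction described above can be sketched in plain Python. This is only an illustration: ToyTextConfig and build_full_mask are hypothetical stand-ins, not the real Gemma4TextConfig or the transformers masking helpers, and treating a missing is_causal flag as "causal" is an assumption about the helpers' default.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTextConfig:
    """Hypothetical stand-in for a text config; not the real Gemma4TextConfig."""

    use_bidirectional_attention: Optional[str] = None
    is_causal: bool = field(init=False, default=True)

    def __post_init__(self):
        # Assumed shape of the one-line fix: "all" flips the config-level flag.
        if self.use_bidirectional_attention == "all":
            self.is_causal = False


def build_full_mask(config, seq_len):
    """Return a seq_len x seq_len boolean mask; True means query i may attend to key j."""
    # Assumption: a config without the flag is treated as causal.
    is_causal = getattr(config, "is_causal", True)
    return [[(j <= i) or not is_causal for j in range(seq_len)] for i in range(seq_len)]


causal_mask = build_full_mask(ToyTextConfig(), 3)
bidirectional_mask = build_full_mask(ToyTextConfig(use_bidirectional_attention="all"), 3)
```

In this sketch, a config that never sets the flag yields a lower-triangular mask regardless of what the attention modules believe, which mirrors the mismatch the PR describes.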

The change is intentionally small: one config line plus a regression test that checks that Gemma4 eager attention actually leaves future tokens unmasked.
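The idea behind such a regression check can be sketched without the model. The helpers below (attention_weights, future_attention_mass) are hypothetical and are not the test added in the PR; they only show that causal masking leaves zero attention mass on future positions, while bidirectional masking leaves some.

```python
import math


def attention_weights(scores, is_causal):
    """Row-wise softmax over scores, hiding future keys (j > i) when is_causal is True."""
    weights = []
    for i, row in enumerate(scores):
        visible = [s if (not is_causal or j <= i) else -math.inf for j, s in enumerate(row)]
        row_max = max(visible)  # the diagonal is always visible, so this is finite
        exps = [math.exp(v - row_max) for v in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def future_attention_mass(weights):
    """Total attention placed on strictly future positions (j > i)."""
    return sum(w for i, row in enumerate(weights) for j, w in enumerate(row) if j > i)


scores = [[0.0] * 4 for _ in range(4)]  # uniform toy scores
causal_mass = future_attention_mass(attention_weights(scores, is_causal=True))
bidirectional_mass = future_attention_mass(attention_weights(scores, is_causal=False))

assert causal_mass == 0.0
assert bidirectional_mass > 0.0
```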

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

AI assistance was used to investigate and draft the patch. I reviewed the changed lines and ran the checks below.

Before submitting

  • This PR fixes a typo or improves the docs.
  • I read the contributor guideline Pull Request section.
  • This was discussed/approved via GitHub issue: #ISSUE_NUMBER
  • Documentation changes are not needed; this fixes behavior to match existing bidirectional attention docs.
  • I wrote a regression test.

Duplicate-work check

I did not find a direct duplicate. Related but different work:

  • #45201 / #45202: Gemma4 FlashAttention global_head_dim=512 compatibility.
  • #45482: Gemma4 CPU offload/device mismatch issues.
  • #43705: general merged is_causal support for decoder-only models as encoders.

Tests

TRANSFORMERS_TEST_DEVICE=cpu PYTHONPATH=src uv run --no-sync python -m pytest tests/models/gemma4/test_modeling_gemma4.py::Gemma4TextModelTest::test_all_bidirectional_attention_uses_bidirectional_mask -q

Result: 1 passed, 26 warnings in 4.23s.

make style

Result: passed.

Changed files

  • src/transformers/models/gemma4/configuration_gemma4.py (modified, +1/-0)
  • tests/models/gemma4/test_modeling_gemma4.py (modified, +15/-0)

Code Example

from transformers import Gemma4TextConfig

config = Gemma4TextConfig(use_bidirectional_attention="all")
print(config.use_bidirectional_attention)
print(hasattr(config, "is_causal"), getattr(config, "is_causal", None))

Current output:

all
False None

Expected after the fix:

config = Gemma4TextConfig(use_bidirectional_attention="all")
assert config.is_causal is False

System Info

  • transformers version: 5.8.1

Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Gemma4TextConfig(use_bidirectional_attention="all") makes each Gemma4TextAttention module non-causal, but it does not set config.is_causal = False. The mask helpers use config.is_causal to decide whether create_causal_mask/create_sliding_window_causal_mask should fall back to bidirectional masks, so the "all" setting can still build causal masks.

Minimal config check on current main:

from transformers import Gemma4TextConfig

config = Gemma4TextConfig(use_bidirectional_attention="all")
print(config.use_bidirectional_attention)
print(hasattr(config, "is_causal"), getattr(config, "is_causal", None))

Current output:

all
False None

Expected behavior

When Gemma4TextConfig(use_bidirectional_attention="all") is used, the config should set is_causal = False so the existing masking utilities create bidirectional full/sliding masks. In practice:

config = Gemma4TextConfig(use_bidirectional_attention="all")
assert config.is_causal is False
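For the sliding-window case, the difference between a causal and a bidirectional mask can be illustrated as follows. sliding_mask is a hypothetical helper, and the exact window convention (a symmetric |i - j| < window in the bidirectional case) is an assumption, not something confirmed by the issue or by the transformers masking utilities.

```python
def sliding_mask(seq_len, window, is_causal):
    """Boolean mask; True means query i may attend to key j within the window."""

    def visible(i, j):
        if is_causal:
            # Only the current token and the previous (window - 1) tokens.
            return 0 <= i - j < window
        # Assumed symmetric window: look back and ahead by up to (window - 1).
        return abs(i - j) < window

    return [[visible(i, j) for j in range(seq_len)] for i in range(seq_len)]


causal_sliding = sliding_mask(4, 2, is_causal=True)
bidirectional_sliding = sliding_mask(4, 2, is_causal=False)
```

Under these assumptions, the query at position 2 sees keys 1 and 2 in the causal case and keys 1, 2, and 3 in the bidirectional case.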
