transformers - ✅(Solved) Fix Diverging attention kernels due to `allow_is_bidirectional_skip` branching on torch.compile [1 pull requests, 9 comments, 6 participants]

transformers · 2026-02-20 21:01:05


GitHub stats

huggingface/transformers#44188•Fetched 2026-04-08 00:29:56


Fix Action

Fixed

Fixed by PR: Fix: bidirectional mask skip when attention dropout is active (#44188) (https://github.com/huggingface/transformers/pull/44202)

PR fix notes

PR #44202: Fix: bidirectional mask skip when attention dropout is active (#44188)

Repository: huggingface/transformers
Author: GS-GOAT
State: closed | merged: False
Link: https://github.com/huggingface/transformers/pull/44202

Description (problem / solution / changelog)

What does this PR do?

When torch.compile is used, _ignore_bidirectional_mask_sdpa behaves differently than in eager mode due to the is_tracing() check. In eager mode (no tracing), the bidirectional mask is skipped (returns None), allowing SDPA to dispatch to flash attention. Under torch.compile (is_tracing() returns True), the mask is materialized, causing SDPA to use the memory-efficient backend instead.

When attention dropout is active (e.g., BERT's default attention_probs_dropout_prob=0.1), these two backends handle dropout RNG differently, leading to large numerical divergences.

Fix: add a dropout condition to _ignore_bidirectional_mask_sdpa. When dropout > 0, mask creation is never skipped, ensuring the same SDPA backend is used in both eager and compiled modes. The dropout value is extracted from the model config (supporting both the attention_probs_dropout_prob and attention_dropout attribute names).
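The decision described above can be sketched in plain Python. This is an illustrative model, not the code from the PR: should_skip_mask, attention_dropout_from_config, and the boolean arguments are hypothetical names, and only the two config attribute names come from the PR description.

```python
from types import SimpleNamespace


def attention_dropout_from_config(config) -> float:
    """Read attention dropout from a config, trying both attribute names.

    Hypothetical helper: the PR description only says both names are supported.
    """
    for name in ("attention_probs_dropout_prob", "attention_dropout"):
        value = getattr(config, name, None)
        if value is not None:
            return float(value)
    return 0.0


def should_skip_mask(mask_is_all_true: bool, tracing: bool, dropout: float) -> bool:
    """Return True when the bidirectional mask may be dropped (passed as None)."""
    if dropout > 0:
        # With active dropout the mask is always built, so eager and compiled
        # runs make the same decision and hand SDPA the same kind of input.
        return False
    # Without dropout, keep the original behavior: skip only outside tracing.
    return (not tracing) and mask_is_all_true


# Stand-in configs (assumed shapes, not real transformers config objects).
bert_like = SimpleNamespace(attention_probs_dropout_prob=0.1)
no_dropout = SimpleNamespace(attention_dropout=0.0)

for cfg in (bert_like, no_dropout):
    p = attention_dropout_from_config(cfg)
    eager = should_skip_mask(True, tracing=False, dropout=p)
    compiled = should_skip_mask(True, tracing=True, dropout=p)
    print(f"dropout={p}: eager skip={eager}, compiled skip={compiled}")
```

In this model the eager and compiled decisions agree whenever dropout is positive. They can still differ when dropout is zero, which is the case where the PR description does not attribute the divergence to dropout RNG handling.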

Fixes #44188

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. - https://github.com/huggingface/transformers/issues/44188
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@vasqu @ArthurZucker @CyrilVallez

Changed files

src/transformers/masking_utils.py (modified, +26/-1)
tests/utils/test_masking_utils.py (modified, +37/-0)


System Info

Hi, while we were updating the PyTorch transformers pin to v5.2.0, our regression tests caught a numerics issue between eager and compiled modes. The difference is very substantial (3.3 vs. the typical accepted difference on the order of e-4). Digging into it (https://github.com/pytorch/pytorch/pull/175274#issuecomment-3930952666), we found the cause to be in these lines (added in https://github.com/huggingface/transformers/pull/41265):

https://github.com/huggingface/transformers/blob/147b7aa040812b079f467e777a2d2e1284167de0/src/transformers/masking_utils.py#L490-L491

We set allow_is_bidirectional_skip=True in a few places: https://github.com/huggingface/transformers/blob/147b7aa040812b079f467e777a2d2e1284167de0/src/transformers/masking_utils.py#L996-L997

And in _ignore_bidirectional_mask_sdpa, we branch logic on whether we compile or not: https://github.com/huggingface/transformers/blob/147b7aa040812b079f467e777a2d2e1284167de0/src/transformers/masking_utils.py#L324-L332
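To make the branch concrete, here is a toy model of the behavior reported in this issue. The function names and string labels are invented for illustration. The mapping from "mask skipped" to flash attention and from "mask materialized" to the memory-efficient backend is taken from the PR description, not from inspecting PyTorch's dispatcher.

```python
def mask_is_skipped(mask_is_all_true: bool, tracing: bool) -> bool:
    # Behavior as described in the issue: the skip only happens when we are
    # not tracing, i.e. in eager mode.
    return (not tracing) and mask_is_all_true


def sdpa_backend(mask_skipped: bool) -> str:
    # Labels follow the PR description; real dispatch also depends on dtype,
    # device, and other inputs, which this toy model ignores.
    return "flash" if mask_skipped else "memory_efficient"


eager_backend = sdpa_backend(mask_is_skipped(True, tracing=False))
compiled_backend = sdpa_backend(mask_is_skipped(True, tracing=True))
print(eager_backend, compiled_backend)
```

The same inputs reach two different kernels depending only on whether the call is traced, which is the divergence the regression tests picked up.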

This issue was found on BERT but it seems like it would affect other models too.

We've also verified that removing the branching fixes the numerical difference. I'm creating this issue to ask about the best way forward here. From the PR that added it, it looks like this was necessary specifically for executorch, but the algorithm difference also affects all other APIs that fall under is_tracing. Can we restrict the check?

Who can help?

@vasqu @ArthurZucker @Cyrilvallez

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

I believe the description is enough, but I can provide a simpler repro on request

Expected behavior

transformers users probably shouldn't run into large numeric differences when compiling, at least not by default

extent analysis

Fix Plan

Restrict the `allow_is_bidirectional_skip` check

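The divergence comes from how masking_utils.py handles allow_is_bidirectional_skip=True: _ignore_bidirectional_mask_sdpa skips the mask only when is_tracing is False, so eager and compiled runs can reach different SDPA backends. The reporter verified that removing this branching removes the numerical difference and asked whether the check can be restricted. PR #44202 instead proposes refusing the skip whenever attention dropout is active.

Step-by-Step Solution

Update masking_utils.py so the skip is refused when dropout is active (a sketch of the PR's approach, not its exact code):

if dropout > 0:
    # never skip mask creation, so eager and compiled use the same SDPA backend
    return False

Keep the tracing-dependent skip only for the dropout-free case:

# mask_is_all_true is a placeholder for the existing mask check
if not is_tracing and mask_is_all_true:
    return True  # mask can be skipped (returned as None)
return False

Read the dropout value from the model config, supporting both attribute names:

dropout = getattr(config, "attention_probs_dropout_prob", None)
if dropout is None:
    dropout = getattr(config, "attention_dropout", 0.0)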

Test the fix: Run the regression tests to verify that the fix works as expected.

Verification

Run the regression tests: Confirm that eager and compiled outputs agree within the usual tolerance (the issue cites e-4) instead of differing by 3.3.
Verify the fix: Check that the fix does not introduce any regressions in other parts of the codebase.
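A regression check of this kind boils down to comparing eager and compiled outputs against a tolerance. The sketch below uses plain Python lists so it runs without PyTorch. max_abs_diff and assert_numerics_match are hypothetical helpers and the sample values are made up. The 1e-4 tolerance mirrors the "typical e-4 accepted difference" mentioned in the issue.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def assert_numerics_match(eager_out, compiled_out, atol=1e-4):
    """Raise AssertionError when the two outputs differ by more than atol."""
    diff = max_abs_diff(eager_out, compiled_out)
    if diff > atol:
        raise AssertionError(
            f"eager/compiled diverge: max abs diff {diff:.4g} > {atol}"
        )
    return diff


# Made-up sample outputs, standing in for flattened model logits.
eager_out = [0.12, -1.50, 2.00]
close_out = [0.12, -1.50, 2.00005]  # within tolerance
far_out = [0.12, 1.80, 2.00]        # one element is off by about 3.3

assert_numerics_match(eager_out, close_out)
try:
    assert_numerics_match(eager_out, far_out)
except AssertionError as err:
    print(err)
```

With real tensors, torch.testing.assert_close with explicit atol and rtol plays the same role.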

Extra Tips

Make sure to update the documentation to reflect the change in behavior.
Consider adding a test case to cover the restricted check.
If you're using a CI/CD pipeline, update the pipeline to run the regression tests after the fix is applied.
