transformers - ✅(Solved) Fix Diverging attention kernels due to `allow_is_bidirectional_skip` branching on torch.compile [1 pull requests, 9 comments, 6 participants]

GitHub stats
huggingface/transformers#44188 (fetched 2026-04-08 00:29:56)
Comments: 9 · Participants: 6 · Timeline events: 30 · Reactions: 0
Timeline (top): commented ×9, subscribed ×8, mentioned ×6, cross-referenced ×4

Fix Action

Fixed

PR fix notes

PR #44202: Fix: bidirectional mask skip when attention dropout is active (#44188)

Description (problem / solution / changelog)

What does this PR do?

When torch.compile is used, _ignore_bidirectional_mask_sdpa behaves differently than in eager mode due to the is_tracing() check. In eager mode (no tracing), the bidirectional mask is skipped (returns None), allowing SDPA to dispatch to flash attention. Under torch.compile (is_tracing() returns True), the mask is materialized, causing SDPA to use the memory-efficient backend instead.
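To make the branching concrete, here is a toy model in plain Python. It is a sketch only: `build_mask_before_fix` and `backend_for` are hypothetical names, the "skip when nothing is masked" condition is an assumption, and the backend mapping simply restates the PR description (no mask leads to flash attention, an explicit mask leads to the memory-efficient backend).

```python
# Toy model of the pre-fix behavior. All names are hypothetical and do not
# mirror the library's real signatures.

def build_mask_before_fix(padding_mask, tracing):
    """Return None when the mask is skipped, otherwise a materialized mask."""
    # Assumption: the skip only applies when no position is masked out.
    if not tracing and all(padding_mask):
        return None  # eager path: SDPA receives no mask
    # Traced path: the mask contents are not inspected, so a mask is built.
    return list(padding_mask)


def backend_for(mask):
    """Backend choice as described in the PR notes (simplified)."""
    return "flash" if mask is None else "memory_efficient"


padding = [True, True, True, True]  # nothing is padded
eager_backend = backend_for(build_mask_before_fix(padding, tracing=False))
compiled_backend = backend_for(build_mask_before_fix(padding, tracing=True))
print(eager_backend, compiled_backend)
```

The same input ends up on two different backends depending only on whether tracing is active, which is the divergence the issue reports.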

When attention dropout is active (for example, BERT's default attention_probs_dropout_prob=0.1), these two backends handle dropout RNG differently, leading to large numerical divergences.
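As an analogy for why RNG handling matters, the sketch below applies dropout to the same matrix with the same seed and probability, but consumes the random stream in a different order. The two functions are hypothetical stand-ins and say nothing about how the real SDPA backends draw their random numbers.

```python
import random

# Analogy only: two dropout routines that share a seed and a probability
# but consume the random stream in a different order.

def dropout_rowwise(scores, p, seed):
    rng = random.Random(seed)
    return [[0.0 if rng.random() < p else s / (1 - p) for s in row]
            for row in scores]


def dropout_columnwise(scores, p, seed):
    rng = random.Random(seed)
    rows, cols = len(scores), len(scores[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):          # walk columns first, then rows
        for i in range(rows):
            keep = rng.random() >= p
            out[i][j] = scores[i][j] / (1 - p) if keep else 0.0
    return out


ones = [[1.0] * 4 for _ in range(4)]
a = dropout_rowwise(ones, p=0.5, seed=0)
b = dropout_columnwise(ones, p=0.5, seed=0)
# Both routines drop the same number of entries, but generally not the same ones.
print(sum(row.count(0.0) for row in a), sum(row.count(0.0) for row in b))
```

Both outputs are valid dropout samples, yet they are not element-wise comparable, so a tight eager-versus-compiled tolerance cannot hold once the two runs use different kernels.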

Fix: add a dropout condition to _ignore_bidirectional_mask_sdpa. When dropout > 0, mask creation is never skipped, which ensures the same SDPA backend is used in both eager and compiled modes. The dropout value is read from the model config (both the attention_probs_dropout_prob and attention_dropout attribute names are supported).
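A minimal sketch of the decision described above, assuming the config is any object with attributes. `get_attention_dropout` and `can_skip_mask` are hypothetical helper names, not the functions changed in the PR, and the "nothing is masked" condition is an assumption.

```python
from types import SimpleNamespace

# Sketch of a dropout-aware skip decision. Helper names are hypothetical.

def get_attention_dropout(config):
    """Read the dropout probability under either attribute name."""
    for name in ("attention_probs_dropout_prob", "attention_dropout"):
        value = getattr(config, name, None)
        if value is not None:
            return float(value)
    return 0.0  # assumption: a missing attribute means no dropout


def can_skip_mask(padding_mask, tracing, dropout):
    """Skip mask creation only when it cannot change the chosen backend."""
    if dropout > 0:
        return False  # never skip while dropout is active
    # Assumption: otherwise skip only outside tracing and when nothing is masked.
    return not tracing and all(padding_mask)


bert_like = SimpleNamespace(attention_probs_dropout_prob=0.1)
no_dropout = SimpleNamespace(attention_dropout=0.0)
padding = [True, True, True]

for cfg in (bert_like, no_dropout):
    p = get_attention_dropout(cfg)
    print(p, can_skip_mask(padding, False, p), can_skip_mask(padding, True, p))
```

With dropout active, eager and traced runs both materialize the mask. With dropout at zero, the eager-only skip is preserved.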

Fixes #44188

Before submitting

Who can review?

@vasqu @ArthurZucker @CyrilVallez

Changed files

  • src/transformers/masking_utils.py (modified, +26/-1)
  • tests/utils/test_masking_utils.py (modified, +37/-0)

System Info

Hi, while we were updating the PyTorch transformers pin to v5.2.0, our regression tests caught a numerical mismatch between eager and compiled execution. The difference is substantial (3.3 versus the typical accepted difference on the order of 1e-4). Digging into it (https://github.com/pytorch/pytorch/pull/175274#issuecomment-3930952666), we found the cause to be in these lines (added in https://github.com/huggingface/transformers/pull/41265):

https://github.com/huggingface/transformers/blob/147b7aa040812b079f467e777a2d2e1284167de0/src/transformers/masking_utils.py#L490-L491

We set allow_is_bidirectional_skip=True in a few places: https://github.com/huggingface/transformers/blob/147b7aa040812b079f467e777a2d2e1284167de0/src/transformers/masking_utils.py#L996-L997

And in _ignore_bidirectional_mask_sdpa, we branch logic on whether we compile or not: https://github.com/huggingface/transformers/blob/147b7aa040812b079f467e777a2d2e1284167de0/src/transformers/masking_utils.py#L324-L332

This issue was found on BERT but it seems like it would affect other models too.

We've also verified that removing the branching fixes the numerical difference. I'm creating this issue to ask about the best way forward here. From the PR that added it, it looks like this was necessary specifically for executorch, but the algorithm difference also affects all other APIs that fall under is_tracing. Can we restrict the check?

Who can help?

@vasqu @ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I believe the description is enough, but I can provide a simpler repro on request

Expected behavior

transformers users probably shouldn't run into large numeric differences when compiling, at least not by default

extent analysis

Fix Plan

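Make the bidirectional mask skip dropout-aware

The divergence comes from _ignore_bidirectional_mask_sdpa in masking_utils.py, which is reached when allow_is_bidirectional_skip=True. Outside tracing it skips the mask; under tracing it materializes the mask, so eager and compiled runs can land on different SDPA backends. The issue asked whether the is_tracing check could be restricted; the merged fix (PR #44202) instead never skips mask creation while attention dropout is active.

Step-by-Step Solution

  1. Read the dropout probability from the model config, accepting either attribute name:
# illustrative sketch; variable names are not taken from the PR
dropout = getattr(config, "attention_probs_dropout_prob", None)
if dropout is None:
    dropout = getattr(config, "attention_dropout", 0.0)
  2. Skip mask creation only when dropout is inactive:
# skip_allowed stands for the result of the existing skip check
if dropout > 0:
    skip_allowed = False
  3. Test the fix: Run the regression tests in both eager and compiled modes to verify that the outputs agree.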

Verification

  1. Run the regression tests: Make sure the tests pass without any numerical differences.
  2. Verify the fix: Check that the fix does not introduce any regressions in other parts of the codebase.
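The verification steps above boil down to comparing eager and compiled outputs against a tolerance. A minimal sketch on flat lists of floats follows; `max_abs_diff` and `outputs_match` are hypothetical helpers, and the 1e-4 default is taken from the "typical e-4" figure quoted in the issue rather than from any test suite.

```python
# Hypothetical helpers for an eager-vs-compiled regression check.

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def outputs_match(eager, compiled, atol=1e-4):
    """True when the two runs agree within the absolute tolerance."""
    return max_abs_diff(eager, compiled) <= atol


print(outputs_match([1.0, 2.0], [1.00005, 2.0]))  # small drift
print(outputs_match([0.0, 2.0], [3.3, 2.0]))      # a 3.3-sized divergence
```

A check of this shape only makes sense when both runs use the same seed and the same attention backend, which is the precondition the fix restores.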

Extra Tips

  • Make sure to update the documentation to reflect the change in behavior.
  • Consider adding a test case that compares eager and compiled outputs with attention dropout enabled.
  • If you're using a CI/CD pipeline, update the pipeline to run the regression tests after the fix is applied.
