transformers - 💡(How to fix) GPT2 attention scaling config is ignored when using SDPA / FlashAttention backends [3 comments, 4 participants]

GitHub stats
huggingface/transformers#44380 (fetched 2026-04-08 00:28:52)
Comments: 3
Participants: 4
Timeline: 17
Reactions: 0
Timeline (top): subscribed ×6, mentioned ×5, commented ×3, closed ×1

Fix Action

Fixed

  • Closed with commit: 8757098be4d2386d45900b30168855996ca22810

System Info

None

Who can help?

@ArthurZucker

Hi, I'm new to LLMs and am currently studying the GPT2 model. I found that the GPT2 attention configuration options

  • scale_attn_weights
  • scale_attn_by_inverse_layer_idx

are respected in eager attention mode but silently ignored when using AttentionInterface backends such as "sdpa" or "flash_attention_2".

In eager mode, the scaling logic is applied inside eager_attention_forward:

  • division by sqrt(head_dim) if scale_attn_weights=True
  • division by (layer_idx+1) if scale_attn_by_inverse_layer_idx=True
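The two eager-mode divisions combine into a single multiplier on the raw attention scores. Below is a minimal sketch in plain Python; the helper name `gpt2_attention_scale` is hypothetical and not part of transformers, and it only mirrors the two rules described in this issue.

```python
import math

def gpt2_attention_scale(head_dim, layer_idx, scale_attn_weights=True,
                         scale_attn_by_inverse_layer_idx=False):
    """Combined multiplier applied to q @ k^T under the eager rules above.

    Hypothetical helper for illustration; not a transformers API.
    """
    scale = 1.0
    if scale_attn_weights:
        scale /= math.sqrt(head_dim)      # division by sqrt(head_dim)
    if scale_attn_by_inverse_layer_idx:
        scale /= float(layer_idx + 1)     # division by (layer_idx + 1)
    return scale

default_scale = gpt2_attention_scale(head_dim=64, layer_idx=3)
layer_scaled = gpt2_attention_scale(head_dim=64, layer_idx=3,
                                    scale_attn_by_inverse_layer_idx=True)
print(default_scale, layer_scaled)  # prints "0.125 0.03125"
```

With both flags disabled the multiplier is 1.0, which is why a backend that must honor the config needs an explicit scale rather than its own default.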

However, when using sdpa, the traced call is

torch._C._nn.scaled_dot_product_attention(query_states_3, key_states_3, value_states_3, attn_mask = attention_mask_1, dropout_p = 0.0, scale = None, is_causal = False)

Here scale = None, so SDPA falls back to its own default of 1/sqrt(head_dim), which seems to ignore the above config.

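I realize that the default configuration (scale_attn_weights=True, scale_attn_by_inverse_layer_idx=False) produces the same results in both modes, since the SDPA default scale is also 1/sqrt(head_dim). The outputs only diverge for non-default settings, so I'm not sure whether this is intentional or should be considered a bug.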
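The point about defaults can be checked numerically. The sketch below is plain Python with made-up vectors and a hypothetical `attention_weights` helper; it assumes that `scale=None` resolves to a multiplier of 1/sqrt(head_dim), as PyTorch documents for scaled_dot_product_attention.

```python
import math

def attention_weights(query, keys, scale):
    """Softmax over scaled dot products of one query against several keys."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values only; not taken from a real model
query = [1.0, 2.0, 0.5, -1.0]
keys = [[0.5, 1.0, 0.0, 1.0], [2.0, -1.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
head_dim, layer_idx = len(query), 2

sdpa_default = 1.0 / math.sqrt(head_dim)           # what scale=None resolves to
eager_default = 1.0 / math.sqrt(head_dim)          # scale_attn_weights=True only
eager_by_layer = eager_default / (layer_idx + 1)   # plus scale_attn_by_inverse_layer_idx=True

same = attention_weights(query, keys, sdpa_default) == attention_weights(query, keys, eager_default)
differs = attention_weights(query, keys, sdpa_default) != attention_weights(query, keys, eager_by_layer)
print(same, differs)  # prints "True True"
```

The first comparison matches the default configuration; the second shows the attention weights changing once the layer-index division is enabled.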

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

None

Expected behavior

Different attention implementations should produce semantically equivalent results and respect model configuration parameters.

extent analysis

Fix Plan

Update AttentionInterface Backends

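To fix the issue, the AttentionInterface backends (sdpa and flash_attention_2) need to respect the attention configuration options. One way to do this is to compute the scale implied by scale_attn_weights and scale_attn_by_inverse_layer_idx and pass it to the backend explicitly instead of leaving scale=None.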

Code Changes

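# Sketch of an SDPA wrapper (the function name and extra arguments are illustrative, not the library's API)
import math
import torch.nn.functional as F

def sdpa_with_gpt2_scaling(query_states, key_states, value_states, attn_mask=None, dropout_p=0.0,
                           is_causal=False, scale_attn_weights=True,
                           scale_attn_by_inverse_layer_idx=False, layer_idx=0):
    # Start from 1.0 (no scaling): passing scale=None would make SDPA apply 1/sqrt(head_dim) by itself
    scale = 1.0
    # Divide by sqrt(head_dim) if scale_attn_weights=True
    if scale_attn_weights:
        scale /= math.sqrt(query_states.shape[-1])
    # Additionally divide by (layer_idx + 1) if scale_attn_by_inverse_layer_idx=True
    if scale_attn_by_inverse_layer_idx:
        scale /= float(layer_idx + 1)
    # Perform scaled dot-product attention with the explicit scale
    return F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask=attn_mask,
                                          dropout_p=dropout_p, is_causal=is_causal, scale=scale)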

Configuration Changes

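No new configuration fields are needed. GPT2Config already defines both options with the defaults below; the change is that the attention backends must read them.

# Simplified illustration of the two existing GPT2Config fields (not the full class)
class GPT2Config:
    def __init__(self, scale_attn_weights=True, scale_attn_by_inverse_layer_idx=False):
        self.scale_attn_weights = scale_attn_weights
        self.scale_attn_by_inverse_layer_idx = scale_attn_by_inverse_layer_idx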

Verification

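To verify that the fix worked, run the same GPT2 weights with different attention implementations and a non-default scaling flag, then check that the outputs agree numerically. The sketch below assumes a transformers version in which AutoModel.from_config accepts attn_implementation; the tiny model sizes are arbitrary.

# In main.py
import torch
from transformers import AutoModel, GPT2Config

# Use a non-default flag so that the two backends would differ without the fix
config = GPT2Config(n_layer=2, n_head=2, n_embd=64, scale_attn_by_inverse_layer_idx=True)
eager_model = AutoModel.from_config(config, attn_implementation="eager").eval()
sdpa_model = AutoModel.from_config(config, attn_implementation="sdpa").eval()
sdpa_model.load_state_dict(eager_model.state_dict())  # share identical weights

input_ids = torch.randint(0, config.vocab_size, (1, 8))
with torch.no_grad():
    eager_output = eager_model(input_ids).last_hidden_state
    sdpa_output = sdpa_model(input_ids).last_hidden_state

# Expected to report True once both backends apply the same scaling
print(torch.allclose(eager_output, sdpa_output, atol=1e-4))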
