transformers - 💡(How to fix) Fix GPT2 attention scaling config is ignored when using SDPA / FlashAttention backends [3 comments, 4 participants]

transformers • 2026-03-02 03:31:15

GitHub stats

huggingface/transformers#44380 • Fetched 2026-04-08 00:28:52

Timeline (top)

subscribed ×6, mentioned ×5, commented ×3, closed ×1

Fix Action

Fixed

Closed with commit: 8757098be4d2386d45900b30168855996ca22810

System Info

None

Who can help?

@ArthurZucker Hi, I'm new to LLMs and am currently learning the GPT2 model. I found that the following GPT2 attention configuration options:

• scale_attn_weights
• scale_attn_by_inverse_layer_idx

are respected in eager attention mode but silently ignored when using AttentionInterface backends such as "sdpa" or "flash_attention_2".

In eager mode, the scaling logic is applied inside eager_attention_forward:

• division by sqrt(head_dim) if scale_attn_weights=True
• division by (layer_idx+1) if scale_attn_by_inverse_layer_idx=True
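The two eager-mode rules above combine into a single multiplicative factor on the raw attention scores. A minimal sketch of that factor follows; `eager_scale` is a hypothetical helper written for illustration and is not a function in transformers.

```python
import math

def eager_scale(head_dim, layer_idx, scale_attn_weights=True,
                scale_attn_by_inverse_layer_idx=False):
    """Factor applied to the raw scores q @ k^T (hypothetical helper)."""
    scale = 1.0
    if scale_attn_weights:
        # division by sqrt(head_dim)
        scale /= math.sqrt(head_dim)
    if scale_attn_by_inverse_layer_idx:
        # division by (layer_idx + 1)
        scale /= float(layer_idx + 1)
    return scale

default = eager_scale(head_dim=64, layer_idx=3)
both = eager_scale(head_dim=64, layer_idx=3, scale_attn_by_inverse_layer_idx=True)
none = eager_scale(head_dim=64, layer_idx=3, scale_attn_weights=False)
print(default, both, none)  # 0.125 0.03125 1.0
```

The two divisions compose, so enabling both options on layer index 3 with head_dim 64 gives 1/8 divided by 4.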

However, when using sdpa, the call is:

torch._C._nn.scaled_dot_product_attention(query_states_3, key_states_3, value_states_3, attn_mask = attention_mask_1, dropout_p = 0.0, scale = None, is_causal = False)

which seems to ignore the above config.

I realize that the default configuration (scale_attn_weights=True, scale_attn_by_inverse_layer_idx=False) produces the same results with every backend, so I'm not sure whether this is intentional or should be considered a bug.
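To see why only the default configuration agrees, recall that PyTorch's scaled_dot_product_attention uses 1/sqrt(head_dim) when scale=None. The sketch below enumerates the four flag combinations and lists those whose eager factor differs from that default. HEAD_DIM and LAYER_IDX are illustrative values, and `eager_scale` is a hypothetical helper.

```python
import math
from itertools import product

HEAD_DIM, LAYER_IDX = 64, 3  # illustrative values, not taken from the issue

def eager_scale(scale_attn_weights, scale_attn_by_inverse_layer_idx):
    """Eager-mode factor for one layer (hypothetical helper)."""
    scale = 1.0
    if scale_attn_weights:
        scale /= math.sqrt(HEAD_DIM)
    if scale_attn_by_inverse_layer_idx:
        scale /= LAYER_IDX + 1
    return scale

# With scale=None, SDPA falls back to 1/sqrt(head_dim) regardless of the config.
sdpa_default_scale = 1.0 / math.sqrt(HEAD_DIM)

# (scale_attn_weights, scale_attn_by_inverse_layer_idx) pairs that disagree
mismatches = [
    flags for flags in product([True, False], repeat=2)
    if not math.isclose(eager_scale(*flags), sdpa_default_scale)
]
print(mismatches)  # [(True, True), (False, True), (False, False)]
```

Only (True, False), the default, matches. Note that at layer index 0 the (True, True) case would also match, since dividing by layer_idx+1 is then a no-op.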

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

None

Expected behavior

Different attention implementations should produce semantically equivalent results and respect model configuration parameters.

extent analysis

Fix Plan

Update AttentionInterface Backends

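To fix the issue, the sdpa and flash_attention_2 code paths need to respect the attention configuration options. One way to do this, assuming the backend accepts an explicit scale, is to compute the combined scaling factor from the GPT2 config and pass it to the backend instead of leaving scale=None.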

Code Changes

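# Sketch of a GPT2-aware SDPA wrapper (hypothetical helper, not the actual transformers code)
import math
import torch.nn.functional as F

def gpt2_sdpa_attention(query_states, key_states, value_states, attn_mask=None, dropout_p=0.0, is_causal=False,
                        scale_attn_weights=True, scale_attn_by_inverse_layer_idx=False, layer_idx=0):
    # Start from 1.0 so that scale_attn_weights=False really disables scaling;
    # passing scale=None would make SDPA fall back to 1/sqrt(head_dim).
    scale = 1.0
    if scale_attn_weights:
        scale /= math.sqrt(query_states.shape[-1])
    # Compose the per-layer factor with the existing scale instead of overwriting it
    if scale_attn_by_inverse_layer_idx:
        scale /= float(layer_idx + 1)
    # Perform scaled dot-product attention with the explicit scale (keyword arguments avoid ordering mistakes)
    return F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, scale=scale)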

Configuration Changes

No new configuration fields are needed. GPT2Config already exposes both options, with these defaults:

# Existing GPT2Config options and their default values
from transformers import GPT2Config

config = GPT2Config(scale_attn_weights=True, scale_attn_by_inverse_layer_idx=False)

Verification

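To verify that the fix worked, run the GPT2 model with a non-default scaling option under different attention implementations, using identical weights, and check that the outputs match within numerical tolerance.

# In main.py
import torch
from transformers import AutoModel, GPT2Config

# Use a non-default option so that a backend ignoring it gives different outputs
config = GPT2Config(n_layer=2, n_head=2, n_embd=64, scale_attn_by_inverse_layer_idx=True)
input_ids = torch.randint(0, config.vocab_size, (1, 8))

# Build the same model with the eager and sdpa attention implementations
eager_model = AutoModel.from_config(config, attn_implementation="eager").eval()
sdpa_model = AutoModel.from_config(config, attn_implementation="sdpa").eval()
sdpa_model.load_state_dict(eager_model.state_dict())  # identical weights

with torch.no_grad():
    eager_output = eager_model(input_ids).last_hidden_state
    sdpa_output = sdpa_model(input_ids).last_hidden_state
torch.testing.assert_close(eager_output, sdpa_output, rtol=1e-4, atol=1e-4)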
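For a dependency-free sanity check of why the two outputs diverge before the fix, the sketch below computes softmax attention weights for one query in pure Python, once with the scale that scale=None resolves to and once with the configured scale. The vectors, the layer index, and the helper names are all made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys, scale):
    """Softmax over scaled dot products of one query against all keys."""
    return softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])

q = [1.0, 2.0, 0.0, 1.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
head_dim, layer_idx = len(q), 3

sdpa_default = 1.0 / math.sqrt(head_dim)     # what scale=None resolves to
configured = sdpa_default / (layer_idx + 1)  # scale_attn_by_inverse_layer_idx=True

ignored = attention_weights(q, keys, sdpa_default)
respected = attention_weights(q, keys, configured)
max_diff = max(abs(a - b) for a, b in zip(ignored, respected))
print(max_diff > 1e-3)  # True
```

A smaller scale flattens the softmax, so a backend that ignores the per-layer factor produces sharper attention distributions than the configuration asks for.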
