pytorch - 💡(How to fix) Fix RuntimeError: Function 'ScaledDotProductEfficientAttentionBackward0' returned nan values in its 0th output when using attn_mask in SDPA backward [2 comments, 2 participants]

pytorch • 2026-03-24 07:52:28


GitHub stats

pytorch/pytorch#178251 • Fetched 2026-04-08 01:21:00

Author

Lyxien

Participants

github-actions[bot]

Lyxien

Timeline (top)

mentioned ×32 • subscribed ×32 • labeled ×6 • commented ×2

Error Message

RuntimeError: Function 'ScaledDotProductEfficientAttentionBackward0' returned nan values in its 0th output
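
The wording of this error matches what autograd's anomaly detection reports when a backward function returns NaN. The sketch below is not the SDPA case from the issue: it uses a deliberately NaN-producing `sqrt` (an illustrative assumption) only to show where such a message comes from and how to capture it.

```python
import torch


def backward_with_anomaly_detection() -> str:
    """Trigger a NaN gradient on purpose and return the error text that is raised."""
    x = torch.tensor([0.0], requires_grad=True)
    try:
        with torch.autograd.detect_anomaly():
            # The derivative of sqrt at 0 is infinite; the upstream gradient is 0,
            # so the backward of sqrt computes 0 / 0, which is NaN.
            loss = (torch.sqrt(x) * 0.0).sum()
            loss.backward()
    except RuntimeError as err:
        return str(err)
    return ""


message = backward_with_anomaly_detection()
print(message)
```

The backward function named in the message is the first one whose output contained NaN, which is why the report in the issue points at the SDPA backward rather than at a later layer.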


🐛 Describe the bug

Hi, I encountered a problem when using torch.nn.functional.scaled_dot_product_attention with attn_mask.

Problem

The forward pass looks normal, but during the backward pass the gradients of the q/k-related weights become NaN, and training fails with:

RuntimeError: Function 'ScaledDotProductEfficientAttentionBackward0' returned nan values in its 0th output

What I observed

using attn_mask
Forward output is normal
Backward produces NaN gradients
The issue seems to happen in SDPA backward
math backend is not practical in my case because it easily causes OOM

Important notes

I have already checked the common cause mentioned in related issues:

attn_mask does not contain NaN
attn_mask does not contain Inf
I am very sure there are no fully masked rows in the mask, so I believe the inputs are valid
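
For context on why fully masked rows are the usual suspect: a softmax over a row whose logits are all `-inf` evaluates to 0/0. The snippet below shows this with plain `torch.softmax`; it illustrates the general mechanism and makes no claim about what any particular SDPA kernel does internally.

```python
import torch

neg_inf = float("-inf")
# Two query rows over three keys; the second row masks out every key.
logits = torch.tensor([
    [0.5, neg_inf, 1.0],
    [neg_inf, neg_inf, neg_inf],
])
weights = torch.softmax(logits, dim=-1)
print(weights)
# Per-row flag: True where the row of attention weights contains NaN.
row_has_nan = torch.isnan(weights).any(dim=-1)
print(row_has_nan)
```

Only the fully masked row becomes NaN; the partially masked row still normalizes correctly.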

Minimal code

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Restrict SDPA to the memory-efficient backend.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=attn_mask,
        dropout_p=dropout_p,
        is_causal=causal
    )

if attn_mask is not None:
    # Summary statistics and shapes for debugging.
    print(
        f"q mean={q.mean().item()}, "
        f"k mean={k.mean().item()}, "
        f"v mean={v.mean().item()}, "
        f"out mean={out.mean().item()} "
        f"q.shape={q.shape} "
        f"k.shape={k.shape} "
        f"mask.shape={attn_mask.shape}"
    )

    # The mask itself must not contain NaN or Inf.
    assert not torch.isnan(attn_mask).any().item(), "attn_mask is nan"
    assert not torch.isinf(attn_mask).any().item(), "attn_mask is inf"

    # Every query row must be allowed to attend to at least one key.
    full_false_rows = torch.where(~attn_mask.any(dim=-1))[0]
    assert len(full_false_rows) == 0, "attn_mask has all-false row"
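
The all-false-row check above reads naturally for a boolean mask, where `True` means "may attend". For an additive float mask the convention differs (`0.0` means "attend", `-inf` means "masked"), so the same expression would misreport. The helper below is a sketch that handles both conventions; `find_fully_masked_rows` is a hypothetical name, and treating only `-inf` as masked is an assumption.

```python
import torch


def find_fully_masked_rows(attn_mask: torch.Tensor) -> torch.Tensor:
    """Return the index of every query row that cannot attend to any key."""
    if attn_mask.dtype == torch.bool:
        allowed = attn_mask  # True = this key may be attended to
    else:
        allowed = attn_mask != float("-inf")  # additive mask: -inf = masked
    # One index tuple per fully masked row, over all dims except the key dim.
    return (~allowed.any(dim=-1)).nonzero()


bool_mask = torch.ones(1, 1, 3, 3, dtype=torch.bool)
bool_mask[0, 0, 1] = False           # query row 1 sees no key
float_mask = torch.zeros(3, 3)
float_mask[2] = float("-inf")        # query row 2 sees no key

print(find_fully_masked_rows(bool_mask))   # tensor([[0, 0, 1]])
print(find_fully_masked_rows(float_mask))  # tensor([[2]])
```

Returning full index tuples rather than only the first dimension makes it easier to see which batch entry, head, and query position is affected.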

Question

How should this kind of issue be handled correctly?

Versions

PyTorch 2.5.1 + cu118, Python 3.11.2, OS: Ubuntu 18.04.6 LTS (x86_64)

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93 @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @drisspg @liangel-02 @howardzhang-cv

extent analysis

Fix Plan

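The symptoms point to numerical instability inside scaled_dot_product_attention, although the issue does not confirm a root cause. The following mitigations are worth trying:

Clip the values that feed the attention weights: scaled_dot_product_attention returns only the attention output, not the weights, so clamp q and k before the call to keep the attention logits in a bounded range.
Use a smaller dropout probability: lower dropout_p, or set it to 0.0 temporarily, to check whether dropout contributes to the NaN values.
Scale the input values: rescale q, k, and v to prevent overflow, which matters most in half precision. Rescaling changes the attention output, so treat it as a diagnostic.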
Here are the concrete steps and code snippets:

Code Changes

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Clamp a tensor to a bounded range. SDPA returns only the attention output,
# not the attention weights, so the clamp is applied to q and k before the call.
def clip_values(x, limit=100.0):
    return torch.clamp(x, min=-limit, max=limit)

# Scale q, k, v by their largest magnitude. clamp_min guards against division
# by zero. Note that this changes the attention output (softmax temperature).
def scale_input_values(q, k, v, eps=1e-6):
    q = q / q.abs().max().clamp_min(eps)
    k = k / k.abs().max().clamp_min(eps)
    v = v / v.abs().max().clamp_min(eps)
    return q, k, v

# Modify the original code
q, k = clip_values(q), clip_values(k)  # bound the attention logits
q, k, v = scale_input_values(q, k, v)
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=attn_mask,
        dropout_p=min(dropout_p, 0.1),  # reduce dropout probability
        is_causal=causal
    )
# out is a single tensor (the attention output); indexing it with out[1]
# would select a batch entry, not the attention weights.

Verification

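To verify that the fix worked, check the gradients of q and k, and of the q/k-related weights, after the backward pass. If the gradients are no longer NaN, the fix is successful.

# q and k are usually non-leaf tensors (outputs of projection layers), so call
# q.retain_grad() and k.retain_grad() before loss.backward(); otherwise
# their .grad attribute is None.
q_grad = q.grad
k_grad = k.grad
assert q_grad is not None and k_grad is not None, "call retain_grad() before backward"
assert not torch.isnan(q_grad).any().item(), "q grad is nan"
assert not torch.isnan(k_grad).any().item(), "k grad is nan"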
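
Because the reported symptom concerns the q/k-related weights rather than q and k themselves, it can also help to scan the parameters directly after the backward pass. A minimal sketch follows; `params_with_nonfinite_grad` is a hypothetical helper, and the `nn.Linear` layer stands in for whatever projection produces q or k in the real model.

```python
import torch
from torch import nn


def params_with_nonfinite_grad(module: nn.Module) -> list[str]:
    """Names of parameters whose gradient contains NaN or Inf."""
    return [
        name
        for name, p in module.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all().item()
    ]


proj = nn.Linear(4, 4)  # hypothetical stand-in for a q/k projection layer
proj(torch.randn(2, 4)).sum().backward()
healthy = params_with_nonfinite_grad(proj)
print(healthy)

# Corrupt one gradient entry by hand to show what a failure looks like.
proj.weight.grad[0, 0] = float("nan")
broken = params_with_nonfinite_grad(proj)
print(broken)
```

Calling the helper on the whole model after `loss.backward()` lists every affected parameter by name, which narrows down the layer where the NaN values first reach the weights.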

Extra Tips

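Test the model with several different inputs to confirm that the fix holds beyond a single batch.
If the issue persists, try reducing the learning rate or using a different optimizer, since diverging weights upstream can surface as NaN gradients in the attention backward pass.
Multi-head attention is itself built on scaled dot-product attention, so switching to it does not avoid this kernel. A more useful comparison is to run a small slice of the batch under the math backend, where memory allows, to see whether the NaN values are specific to the efficient kernel.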
