pytorch - 💡(How to fix) Fix RuntimeError: Function 'ScaledDotProductEfficientAttentionBackward0' returned nan values in its 0th output when using attn_mask in SDPA backward [2 comments, 2 participants]

GitHub stats
pytorch/pytorch#178251 · Fetched 2026-04-08 01:21:00
Comments: 2 · Participants: 2 · Timeline events: 72 · Reactions: 0
Timeline (top): mentioned ×32, subscribed ×32, labeled ×6, commented ×2

Error Message

RuntimeError: Function 'ScaledDotProductEfficientAttentionBackward0' returned nan values in its 0th output
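
For context, messages of this form are produced when autograd anomaly detection is enabled and a backward function returns NaN. The sketch below reproduces the same kind of error with a deliberately tiny graph that has nothing to do with attention; the helper name `backward_error_message` is made up for illustration.

```python
import torch


def backward_error_message() -> str:
    """Run a tiny graph whose backward pass yields NaN under anomaly detection."""
    x = torch.tensor(0.0, requires_grad=True)
    try:
        with torch.autograd.detect_anomaly():
            # The derivative of sqrt at 0 is infinite. Multiplying the result
            # by 0 sends a zero gradient into it, and 0 / 0 evaluates to NaN.
            (torch.sqrt(x) * 0.0).backward()
    except RuntimeError as err:
        return str(err)
    return ""


message = backward_error_message()
print(message)
```

The check only looks at the outputs of each backward function, so the function named in the message is where NaN first appears, which is not necessarily where the problem originates.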


🐛 Describe the bug

Hi, I encountered a problem when using torch.nn.functional.scaled_dot_product_attention with attn_mask.

Problem

The forward pass looks normal, but during the backward pass the gradients of the q/k-related weights become NaN, and training fails with:

RuntimeError: Function 'ScaledDotProductEfficientAttentionBackward0' returned nan values in its 0th output

What I observed

  • using attn_mask
  • Forward output is normal
  • Backward produces NaN gradients
  • The issue seems to happen in SDPA backward
  • math backend is not practical in my case because it easily causes OOM
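
Because the math backend runs out of memory at full size, one option is to compare gradients against a plain float32 reference on a small slice. The following is a minimal sketch under that assumption: `reference_attention`, `input_grads`, and the tensor sizes are illustrative, and it uses the default backend on whatever device the tensors live on rather than forcing the efficient kernel.

```python
import math

import torch
import torch.nn.functional as F


def reference_attention(q, k, v, attn_mask):
    """Plain float32 attention with a boolean mask (True = may attend)."""
    scores = q.float() @ k.float().transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(~attn_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v.float()


def sdpa(q, k, v, attn_mask):
    """Thin wrapper so both implementations share one call signature."""
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)


def input_grads(attn_fn, q, k, v, attn_mask, weight):
    """Backpropagate a fixed weighted sum and return the grads of q, k, v."""
    inputs = [t.detach().clone().requires_grad_(True) for t in (q, k, v)]
    out = attn_fn(*inputs, attn_mask)
    (out.float() * weight).sum().backward()
    return [t.grad.float() for t in inputs]


torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 6, 8) for _ in range(3))
mask = torch.ones(6, 6, dtype=torch.bool).tril()  # no fully masked rows
weight = torch.randn(1, 2, 6, 8)

got = input_grads(sdpa, q, k, v, mask, weight)
ref = input_grads(reference_attention, q, k, v, mask, weight)
all_finite = all(torch.isfinite(g).all().item() for g in got)
max_diff = max((g - r).abs().max().item() for g, r in zip(got, ref))
print(f"all_finite={all_finite}, max_diff={max_diff:.2e}")
```

To test the failing case, the random tensors would be replaced by a slice of the real q, k, v, and mask (for example one batch element and one head) saved from the step that raised the error.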

Important notes

I have already checked the common cause mentioned in related issues:

  • attn_mask does not contain NaN
  • attn_mask does not contain Inf
  • I am very sure there are no fully masked rows in the mask, so I believe the inputs are valid
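
One caveat on the fully-masked-row check: `attn_mask.any(dim=-1)` is only meaningful for a boolean mask. A hedged sketch of a check that also covers additive float masks follows; the helper `fully_masked_rows` and its rule for float masks (an entry counts as masked when it is -inf or equal to the dtype minimum) are assumptions for illustration, not part of PyTorch.

```python
import torch


def fully_masked_rows(attn_mask: torch.Tensor) -> torch.Tensor:
    """Return a boolean tensor that is True for query rows with no visible key."""
    if attn_mask.dtype == torch.bool:
        allowed = attn_mask  # True means the position may be attended to
    else:
        # Assumption: in an additive mask, -inf or the dtype minimum means masked.
        allowed = attn_mask > torch.finfo(attn_mask.dtype).min
    return ~allowed.any(dim=-1)


bool_mask = torch.tensor([[True, False], [False, False]])
float_mask = torch.tensor([[0.0, float("-inf")],
                           [float("-inf"), float("-inf")]])
print(fully_masked_rows(bool_mask))
print(fully_masked_rows(float_mask))
```

If the mask is combined with other constraints before the kernel sees it (padding plus causal, for instance), the check should be run on the final combined mask.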

Minimal code

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# q, k, v, attn_mask, dropout_p and causal come from the surrounding model code.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=attn_mask,
        dropout_p=dropout_p,
        is_causal=causal
    )

if attn_mask is not None:
    print(
        f"q mean={q.mean().item()}, "
        f"k mean={k.mean().item()}, "
        f"v mean={v.mean().item()}, "
        f"out mean={out.mean().item()} "
        f"q.shape={q.shape} "
        f"k.shape={k.shape} "
        f"mask.shape={attn_mask.shape}"
    )

    assert not torch.isnan(attn_mask).any().item(), "attn_mask is nan"
    assert not torch.isinf(attn_mask).any().item(), "attn_mask is inf"

    # Rows where no key position is visible (meaningful for a boolean mask).
    full_false_rows = torch.where(~attn_mask.any(dim=-1))[0]
    assert len(full_false_rows) == 0, "attn_mask has all-false row"

Question

  • How should this kind of issue be handled correctly?
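
One way to start narrowing this down is to check whether the gradient arriving at the attention output is already non-finite, or whether only the gradients leaving the attention call are. The sketch below does this with tensor hooks; the helper `watch`, the `report` dictionary, and the small random tensors are illustrative assumptions, not code from the issue.

```python
import torch
import torch.nn.functional as F


def watch(tensor: torch.Tensor, name: str, report: dict) -> None:
    """Record whether the gradient that reaches `tensor` is fully finite."""
    def hook(grad):
        report[name] = bool(torch.isfinite(grad).all())
        # Returning None leaves the gradient unchanged.
    tensor.register_hook(hook)


torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 4, 8, requires_grad=True) for _ in range(3))
mask = torch.ones(4, 4, dtype=torch.bool).tril()

report = {}
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
watch(out, "grad_out", report)  # gradient flowing into the SDPA backward
watch(q, "grad_q", report)      # gradients flowing out of it
watch(k, "grad_k", report)
out.sum().backward()
print(report)
```

If `grad_out` is already non-finite in the real model, the source is downstream of attention (for example the loss or loss scaling); if it is finite while `grad_q` is not, the attention backward itself is the place to look.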

Versions

PyTorch 2.5.1+cu118 · Python 3.11.2 · OS: Ubuntu 18.04.6 LTS (x86_64)

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93 @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @drisspg @liangel-02 @howardzhang-cv

extent analysis

Fix Plan

The issue may be related to numerical instability in the backward pass of scaled_dot_product_attention, but this is not confirmed. The forward output is finite and the mask has been checked, so the steps below are experiments to narrow the cause down rather than a verified fix:

  • Scale q and k: large q·k products can overflow, especially in half precision. Rescaling bounds the attention logits but also changes the model's output, so treat it as a diagnostic.
  • Keep dropout_p unchanged: nothing in the report points to dropout as the source of the NaN, so lowering it is unlikely to help.
  • Do not try to clip attention weights: F.scaled_dot_product_attention returns only the output tensor, so there are no weights to clip after the call.

The code below applies the first step:

Code Changes

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Experimental: rescale q and k so that their largest magnitude is 1.
# This bounds the attention logits but changes the model's output.
def scale_inputs(q, k, eps=1e-6):
    q = q / q.abs().amax().clamp_min(eps)
    k = k / k.abs().amax().clamp_min(eps)
    return q, k

# q, k, v, attn_mask, dropout_p and causal come from the surrounding model code.
q, k = scale_inputs(q, k)
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    # SDPA returns only the attention output, not the attention weights.
    out = F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=attn_mask,
        dropout_p=dropout_p,
        is_causal=causal
    )
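
The overflow concern behind input scaling can be illustrated with plain numbers. The issue does not state which dtype is used, so half precision is an assumption here; the sketch only shows how a finite float32 value becomes Inf in float16 and how a subtraction of the kind softmax performs then yields NaN.

```python
import torch

fp16_max = torch.finfo(torch.float16).max   # largest finite float16 value
logit = torch.tensor(256.0 * 256.0)         # 65536.0, finite in float32
logit_fp16 = logit.to(torch.float16)        # above fp16_max, so it becomes inf

# Softmax subtracts the row maximum before exponentiating; inf - inf is NaN.
shifted = logit_fp16.float() - logit_fp16.float()
print(fp16_max, logit_fp16.item(), shifted.item())
```

If the model runs in float32 or bfloat16, this particular overflow path is much less likely, and the other diagnostics are more relevant.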

Verification

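To verify a change, check the gradients of q and k after the backward pass. If they are finite for the batch that previously failed, the change addressed the symptom; it does not prove the root cause.

# q and k are usually non-leaf tensors (outputs of projection layers), so call
# q.retain_grad() and k.retain_grad() before backward(); otherwise .grad is None.
q_grad = q.grad
k_grad = k.grad
assert q_grad is not None and k_grad is not None, "call retain_grad() before backward()"
assert torch.isfinite(q_grad).all().item(), "q grad is NaN or Inf"
assert torch.isfinite(k_grad).all().item(), "k grad is NaN or Inf"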
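
Since the report is about the q/k-related weights, it can also help to scan parameter gradients directly after backward. This is a small sketch; `nonfinite_grad_params` is a hypothetical helper, and the `nn.Linear` layer stands in for the real projection modules.

```python
import torch
from torch import nn


def nonfinite_grad_params(module: nn.Module) -> list[str]:
    """Return the names of parameters whose gradient contains NaN or Inf."""
    bad = []
    for name, param in module.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


proj = nn.Linear(4, 4)  # stand-in for a q or k projection layer
proj(torch.randn(2, 4)).sum().backward()
clean = nonfinite_grad_params(proj)

proj.weight.grad[0, 0] = float("nan")  # simulate a poisoned gradient
poisoned = nonfinite_grad_params(proj)
print(clean, poisoned)
```

Calling the helper on the whole model right after `loss.backward()` shows which layers are affected first.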

Extra Tips

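  • Test with several batches, including the one that triggered the error, because the failure may depend on specific input values.
  • If the NaN appears only after many training steps, growing activations may be the trigger; in that case a lower learning rate or gradient clipping may help.
  • scaled_dot_product_attention is already the core of multi-head attention, so switching attention mechanisms is not a fix. It is more informative to retest on a PyTorch release newer than 2.5.1 and to compare against the math backend on a slice small enough to avoid OOM.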
