pytorch: `torch.compile` produces different output for single-head attention with `matmul → div → softmax → dropout → matmul` pattern (SFDP pattern 11) compared to eager mode [1 comment, 1 participant]

pytorch/pytorch#179578 · Fetched 2026-04-08 03:00:05


🐛 Describe the bug

torch.compile with the inductor backend produces numerically different results for a single-head Transformer model whose attention follows SFDP pattern 11: matmul → div(scale) → softmax → dropout → matmul. Inductor's scaled dot-product fusion pass recognizes this pattern and replaces it with a fused attention kernel (described here as flash attention). The fused kernel accumulates in a different floating-point order than the unfused eager ops, and the resulting differences propagate through the residual connections.
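Why accumulation order matters can be seen without any GPU. The sketch below is plain Python (not PyTorch) and only illustrates that floating-point addition is not associative; the tile size of 4 and the sample values are arbitrary choices made for illustration.

```python
import math

# Floating-point addition is not associative: regrouping changes the rounding.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)
print(left_first == right_first)          # False
print(abs(left_first - right_first))      # a tiny, nonzero gap

# The same effect on a longer reduction: one running sum vs. per-tile partial sums.
values = [1.0 / (i + 1) for i in range(64)]

sequential = 0.0
for v in values:
    sequential += v

tiled = 0.0
for start in range(0, len(values), 4):
    tiled += sum(values[start:start + 4])

# math.fsum gives a correctly rounded reference to measure both against.
reference = math.fsum(values)
print(f"sequential error: {abs(sequential - reference):.3e}")
print(f"tiled error:      {abs(tiled - reference):.3e}")
```

Both orderings are valid roundings of the same mathematical sum; neither is "the" correct one, which is why comparisons between kernels need a tolerance.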

Eager mode is perfectly deterministic (max_var=0 across runs), confirming this is a systematic computation difference introduced by the Inductor SFDP fusion, not GPU non-determinism.

Dropout is set to p=0.0 (no dropout) to eliminate stochasticity, ensuring the difference is purely from the kernel fusion.

Minimal reproducer

import os
os.environ["TRITON_BACKENDS_IN_TREE"] = "1"

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SingleHeadAttention(nn.Module):
    def __init__(self, embed_dim=256, dropout_p=0.0):
        super().__init__()
        self.embed_dim = embed_dim
        self.scale = math.sqrt(embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(p=dropout_p)

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        query = self.q_proj(x)
        key = self.k_proj(x)
        value = self.v_proj(x)

        # Single-head: unsqueeze head dim + permute to [B, 1, S, D]
        query = query.unsqueeze(1).permute(0, 1, 2, 3)  # [B, 1, S, D]
        key = key.unsqueeze(1).permute(0, 1, 2, 3)
        value = value.unsqueeze(1).permute(0, 1, 2, 3)

        # SFDP pattern 11: matmul → div(scale) → softmax → dropout → matmul
        attn_scores = torch.matmul(query, key.transpose(-2, -1))
        attn_scores = attn_scores / self.scale
        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context = torch.matmul(attn_weights, value)

        # Remove head dim and project
        context = context.squeeze(1)  # [B, S, D]
        return self.out_proj(context)


class SingleHeadTransformer(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, max_seq_len=64, dropout_p=0.0):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
        self.attn = SingleHeadAttention(embed_dim, dropout_p=dropout_p)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
        )

    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        positions = torch.arange(seq_len, device=token_ids.device).unsqueeze(0).expand(batch_size, -1)
        x = self.token_emb(token_ids) + self.pos_emb(positions)

        # Pre-norm residual attention
        x_norm = self.norm1(x)
        attn_out = self.attn(x_norm)
        x = x + attn_out

        # Pre-norm residual FFN
        x = x + self.ffn(self.norm2(x))
        return x


device = "cuda"
torch.manual_seed(42)
model = SingleHeadTransformer(
    vocab_size=10000, embed_dim=256, max_seq_len=64, dropout_p=0.0
).to(device).eval()
token_ids = torch.randint(0, 10000, (4, 16), device=device)

# Eager: deterministic
with torch.no_grad():
    ref = model(token_ids)
    ref2 = model(token_ids)
print(f"Eager deterministic: {(ref - ref2).abs().max().item():.6e}")

# Compiled
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    comp = compiled(token_ids)

diff = (ref.float() - comp.float()).abs()
print(f"Max diff: {diff.max().item():.6e}")
print(f"Mean diff: {diff.mean().item():.6e}")
print(f"Match (atol=1e-5): {torch.allclose(ref, comp, atol=1e-5, rtol=1e-4)}")
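To separate the attention block from the rest of the model, one option is to compare the unfused op sequence against `F.scaled_dot_product_attention` directly in eager mode. This is a sketch under the assumption that the fused path ends up in that op; the helper names `manual_attention` and `attention_gap` are made up for illustration, and the default `device="cpu"` is chosen only so the snippet runs anywhere (pass `"cuda"` to match the report).

```python
import math
import torch
import torch.nn.functional as F


def manual_attention(q, k, v):
    # Same op sequence as the reproducer (dropout omitted since p=0.0):
    # matmul -> div(scale) -> softmax -> matmul
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    return torch.matmul(F.softmax(scores, dim=-1), v)


def attention_gap(batch=4, seq_len=16, dim=256, device="cpu", seed=0):
    """Max absolute difference between unfused attention and SDPA in eager mode."""
    torch.manual_seed(seed)
    # Shapes follow the reproducer: [B, 1, S, D] with a single head.
    q, k, v = (torch.randn(batch, 1, seq_len, dim, device=device) for _ in range(3))
    with torch.no_grad():
        ref = manual_attention(q, k, v)
        # Default SDPA scale is 1/sqrt(D), matching the division above.
        fused = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0)
    return (ref - fused).abs().max().item()


print(f"max |manual - sdpa|: {attention_gap():.3e}")
```

If this gap is of the same order as the full-model mismatch, the difference is already present at the attention op and is not introduced elsewhere in the compiled graph.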

Behavior summary

Mode | Result | Notes
Eager | Reference output | Perfectly deterministic across runs (max_var=0)
torch.compile(backend="inductor") | Different output | SFDP fusion changes accumulation order; differences propagate through residuals

Notes

  • Eager mode is perfectly deterministic (max_var=0), ruling out GPU non-determinism.
  • Dropout is set to p=0.0 and model is in eval() mode, eliminating all stochasticity.
  • The matmul → div(scale) → softmax → dropout → matmul pattern is SFDP pattern 11 (single-head variant). Inductor recognizes and fuses this into an optimized flash attention kernel.
  • The fused kernel uses different floating-point accumulation order (e.g., tiled reduction in flash attention) compared to the sequential eager execution, causing numerical differences.
  • These differences compound through two residual connections (x + attn_out, x + ffn_out), amplifying the final output mismatch.
  • The single-head architecture (unsqueeze to create head dim) is the specific signature of SFDP pattern 11, distinct from multi-head patterns.
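The tiled reduction mentioned in the notes can be sketched in plain Python. This is not the kernel Inductor emits; it is a minimal illustration of the online-softmax idea for one query row, with scalar values per key and an arbitrary tile size of 4, to show that the tiled form computes the same quantity through a different sequence of roundings.

```python
import math


def attention_row_reference(scores, values):
    """Softmax-weighted sum for one query row, computed in a single pass."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / denom


def attention_row_tiled(scores, values, tile=4):
    """Same quantity, accumulated tile by tile with a running max."""
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated so far to the new running max.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc *= correction
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


scores = [3.0 * math.sin(i) for i in range(16)]
values = [math.cos(i) for i in range(16)]
ref = attention_row_reference(scores, values)
tiled = attention_row_tiled(scores, values)
print(f"reference: {ref!r}")
print(f"tiled:     {tiled!r}")
print(f"abs diff:  {abs(ref - tiled):.3e}")
```

The two functions agree to within rounding error but are not guaranteed to agree bit for bit, which is the kind of gap the notes attribute to the fused kernel.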

Error logs

No error — outputs silently differ:

Eager deterministic: 0.000000e+00
Max diff and mean diff show systematic numerical divergence from SFDP fusion

Versions

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @chauhang @penguinwu @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

TL;DR

The issue can be mitigated by disabling the SFDP fusion pass in the Inductor backend or using a different backend that does not fuse the SFDP pattern.

Guidance

  • Identify the specific SFDP pattern (in this case, pattern 11) that is causing the numerical differences.
  • Consider disabling the SFDP fusion pass in the Inductor backend to prevent the optimized flash attention kernel from being used.
  • Alternatively, use a backend that does not run Inductor's fusion passes, such as backend="aot_eager" or backend="eager", or skip torch.compile and run the model in eager mode.
  • Verify that the numerical differences are resolved by comparing the outputs of the modified model with the reference output.
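When verifying, it helps to know exactly what the tolerance check accepts. `torch.allclose` passes an element when `|input - other| <= atol + rtol * |other|`; the scalar helper below mirrors that formula in plain Python using the reproducer's tolerances. The name `is_close` and the sample numbers are illustrative only.

```python
def is_close(actual, expected, rtol=1e-4, atol=1e-5):
    """Scalar version of the elementwise test used by torch.allclose."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# For an output near 2.0 the allowed gap is 1e-5 + 1e-4 * 2.0, about 2.1e-4.
print(is_close(2.0, 2.0001))   # True: gap of about 1e-4 is within tolerance
print(is_close(2.0, 2.001))    # False: gap of about 1e-3 is too large

# Near zero the relative term vanishes and only atol applies.
print(is_close(0.0, 2e-5))     # False: gap of 2e-5 exceeds atol of 1e-5
```

Because the relative term scales with the magnitude of the compared value, elements close to zero are held to the absolute tolerance alone and are usually the first to fail the check.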

Example

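# Disable Inductor's pattern-matcher rewrites, which include the SFDP attention fusion
compiled = torch.compile(model, backend="inductor", options={"pattern_matcher": False})

Note: pattern_matcher is a general Inductor config flag (torch._inductor.config.pattern_matcher), not a dedicated SFDP switch. It turns off all of Inductor's pattern-matching rewrites, so the performance impact may be broader than the attention block. Inductor config names can change between releases, so confirm the flag exists in your build, for example with torch._inductor.list_options().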

Notes

  • The issue is specific to the Inductor backend and the SFDP pattern 11.
  • The numerical differences are caused by the different floating-point accumulation order used in the optimized flash attention kernel.
  • Disabling the SFDP fusion pass or using a different backend may impact performance.

Recommendation

Apply a workaround: disable the SFDP fusion pass in the Inductor backend, or use a different backend. The mismatch is specific to Inductor's handling of SFDP pattern 11, so either option avoids the fused kernel and with it the numerical differences.
