pytorch: `torch.compile` produces different output for single-head attention with `matmul → div → softmax → dropout → matmul` pattern (SFDP pattern 11) compared to eager mode [1 comment, 1 participant]

pytorch, 2026-04-07 13:19:31

GitHub stats

pytorch/pytorch#179578•Fetched 2026-04-08 03:00:05


Author

prorrice



🐛 Describe the bug

torch.compile with the inductor backend produces numerically different results from eager mode for a single-head Transformer model whose attention follows the sequence matmul → div(scale) → softmax → dropout → matmul, which this report identifies as SFDP pattern 11. Inductor's scaled dot-product attention fusion pass rewrites matched patterns into a single fused scaled dot-product attention call (described in this report as a flash attention kernel). The fused kernel accumulates in a different floating-point order than the separate eager ops, and the resulting differences propagate through the residual connections.

Eager mode is perfectly deterministic (max_var=0 across runs), confirming this is a systematic computation difference introduced by the Inductor SFDP fusion, not GPU non-determinism.

Dropout is set to p=0.0 (no dropout) to eliminate stochasticity, ensuring the difference is purely from the kernel fusion.

Minimal reproducer

import os
os.environ["TRITON_BACKENDS_IN_TREE"] = "1"

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SingleHeadAttention(nn.Module):
    def __init__(self, embed_dim=256, dropout_p=0.0):
        super().__init__()
        self.embed_dim = embed_dim
        self.scale = math.sqrt(embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(p=dropout_p)

    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        query = self.q_proj(x)
        key = self.k_proj(x)
        value = self.v_proj(x)

        # Single-head: unsqueeze a head dim to get [B, 1, S, D].
        # permute(0, 1, 2, 3) is the identity permutation, so the layout is unchanged.
        query = query.unsqueeze(1).permute(0, 1, 2, 3)  # [B, 1, S, D]
        key = key.unsqueeze(1).permute(0, 1, 2, 3)
        value = value.unsqueeze(1).permute(0, 1, 2, 3)

        # SFDP pattern 11: matmul → div(scale) → softmax → dropout → matmul
        attn_scores = torch.matmul(query, key.transpose(-2, -1))
        attn_scores = attn_scores / self.scale
        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context = torch.matmul(attn_weights, value)

        # Remove head dim and project
        context = context.squeeze(1)  # [B, S, D]
        return self.out_proj(context)


class SingleHeadTransformer(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, max_seq_len=64, dropout_p=0.0):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
        self.attn = SingleHeadAttention(embed_dim, dropout_p=dropout_p)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
        )

    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        positions = torch.arange(seq_len, device=token_ids.device).unsqueeze(0).expand(batch_size, -1)
        x = self.token_emb(token_ids) + self.pos_emb(positions)

        # Pre-norm residual attention
        x_norm = self.norm1(x)
        attn_out = self.attn(x_norm)
        x = x + attn_out

        # Pre-norm residual FFN
        x = x + self.ffn(self.norm2(x))
        return x


device = "cuda"
torch.manual_seed(42)
model = SingleHeadTransformer(
    vocab_size=10000, embed_dim=256, max_seq_len=64, dropout_p=0.0
).to(device).eval()
token_ids = torch.randint(0, 10000, (4, 16), device=device)

# Eager: deterministic
with torch.no_grad():
    ref = model(token_ids)
    ref2 = model(token_ids)
print(f"Eager deterministic: {(ref - ref2).abs().max().item():.6e}")

# Compiled
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    comp = compiled(token_ids)

diff = (ref.float() - comp.float()).abs()
print(f"Max diff: {diff.max().item():.6e}")
print(f"Mean diff: {diff.mean().item():.6e}")
print(f"Match (atol=1e-5, rtol=1e-4): {torch.allclose(ref, comp, atol=1e-5, rtol=1e-4)}")
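
The final check in the reproducer uses torch.allclose, which passes when every element satisfies |ref - comp| <= atol + rtol * |comp|. The pure-Python sketch below illustrates how the two tolerances combine. The helper name within_tolerance and the sample values are hypothetical and are not part of the report.

```python
def within_tolerance(ref, comp, atol=1e-5, rtol=1e-4):
    """Element-wise closeness test in the style of torch.allclose (hypothetical helper)."""
    return all(abs(r - c) <= atol + rtol * abs(c) for r, c in zip(ref, comp))


ref = [1.0, -2.5, 100.0]
small_drift = [1.000005, -2.5, 100.005]  # drift stays below atol + rtol * |comp|
large_drift = [1.001, -2.5, 100.0]       # first element drifts by 1e-3

print(within_tolerance(ref, small_drift))
print(within_tolerance(ref, large_drift))
```

With rtol=1e-4, the allowed difference for values near 1.0 is about 1.1e-4, so the relative term, not atol=1e-5, decides most comparisons on outputs of that magnitude.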

Behavior summary

Mode	Result	Notes
Eager	Reference output	Perfectly deterministic across runs (max_var=0)
`torch.compile(backend="inductor")`	Different output	SFDP fusion changes accumulation order; differences propagate through residuals

Notes

Eager mode is perfectly deterministic (max_var=0), ruling out GPU non-determinism.
Dropout is set to p=0.0 and the model is in eval() mode, so the dropout step is a no-op and contributes no stochasticity.
The matmul → div(scale) → softmax → dropout → matmul sequence is what this report identifies as SFDP pattern 11 (single-head variant). Inductor's attention fusion pass rewrites matched patterns into a single fused scaled dot-product attention call.
The fused kernel uses a different floating-point accumulation order (e.g., the tiled reduction used by flash-attention-style kernels) than the op-by-op eager execution, causing numerical differences.
These differences compound through two residual connections (x + attn_out, x + ffn_out), amplifying the final output mismatch.
The report treats the single-head layout (unsqueeze to create the head dim) as the signature of SFDP pattern 11, distinct from the multi-head patterns.
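
The accumulation-order point in the notes above can be illustrated without a GPU. The sketch below computes one attention output element twice in pure Python: once with a full-row softmax followed by a weighted sum, and once tile by tile with a running max and running sums, in the style of a tiled fused kernel. It is an illustration under simplifying assumptions (scalar values, float64 arithmetic, hypothetical function names) and is not Inductor's or any library's actual kernel.

```python
import math


def attention_row_sequential(scores, values):
    """Softmax over the full row, then a weighted sum, as separate passes."""
    row_max = max(scores)
    exps = [math.exp(s - row_max) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def attention_row_tiled(scores, values, tile=4):
    """Same quantity, accumulated tile by tile with a running max and running sums."""
    running_max = -math.inf
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc *= correction
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


scores = [3.0 * math.sin(i) for i in range(16)]
values = [math.cos(i) for i in range(16)]
seq = attention_row_sequential(scores, values)
tiled = attention_row_tiled(scores, values, tile=4)
print(f"sequential={seq!r} tiled={tiled!r} abs diff={abs(seq - tiled):.3e}")
```

Both functions compute the same mathematical value, but they round at different points, so the results agree only to within floating-point error rather than bit for bit in general. In float32 on a GPU the same effect is larger in absolute terms.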

Error logs

No error — outputs silently differ:

Eager deterministic: 0.000000e+00
Max diff and mean diff are nonzero, which the report attributes to SFDP fusion (no specific values are given).

Versions

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @chauhang @penguinwu @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

TL;DR

The issue can be mitigated by disabling the SFDP fusion pass in the Inductor backend or using a different backend that does not fuse the SFDP pattern.

Guidance

Confirm that the attention fusion actually fires for this model, and which SFDP pattern matches (the report names pattern 11), before attributing the difference to it.
Consider disabling the SFDP fusion pass in the Inductor backend so that the attention ops are not rewritten into a fused kernel.
Alternatively, compile with a backend that does not run Inductor's fusion passes, such as backend="aot_eager" or backend="eager".
Verify the result by comparing the output of the modified setup against the eager reference output, using the same tolerances as the reproducer.
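
The first and last guidance steps can be sketched as two small helpers. The first measures how far a given callable's output is from a float64 eager copy of the model, which shows whether the compiled output is less accurate than eager or merely rounded differently. The second reads the counter that Inductor's attention fusion is assumed to increment, under the key "fuse_attention"; that key name should be checked against the PyTorch build in use. Both helper names are hypothetical.

```python
import copy

import torch
from torch._dynamo.utils import counters


def error_vs_fp64(model, run_fn, inputs):
    """Max abs error of run_fn(inputs) against a float64 eager copy of model."""
    ref_model = copy.deepcopy(model).double()
    # Integer inputs (e.g. token ids) are passed through unchanged.
    ref_inputs = inputs.double() if inputs.is_floating_point() else inputs
    with torch.no_grad():
        ref64 = ref_model(ref_inputs)
        out = run_fn(inputs)
    return (out.double() - ref64).abs().max().item()


def attention_fusion_count():
    """Attention rewrites recorded since the last counters.clear() (assumed key name)."""
    return counters["inductor"]["fuse_attention"]


# Usage with the reproducer's objects (model, token_ids):
#   torch._dynamo.reset()
#   counters.clear()
#   compiled = torch.compile(model, backend="inductor")
#   eager_err = error_vs_fp64(model, model, token_ids)
#   compiled_err = error_vs_fp64(model, compiled, token_ids)
#   print(eager_err, compiled_err, attention_fusion_count())
```

If the compiled error against float64 is of the same order as the eager error, the mismatch is consistent with ordinary float32 rounding. A fusion count of zero would suggest that the difference comes from some other part of the compiled graph.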

Example

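# Coarse switch (assumed to cover the SFDP rewrite): disable all Inductor pattern-matcher passes
import torch._inductor.config
torch._inductor.config.pattern_matcher = False
compiled = torch.compile(model, backend="inductor")

Note: torch.compile has no disable_fusion parameter, so the example uses Inductor's pattern_matcher setting instead. That setting turns off the other pattern-matcher rewrites as well. Whether a narrower switch for SFDP fusion alone exists depends on the PyTorch build, so check torch/_inductor/config.py for the version in use.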

Notes

The issue is specific to the Inductor backend and the SFDP pattern 11.
The numerical differences are attributed to the different floating-point accumulation order used by the fused attention kernel.
Disabling the SFDP fusion pass or using a different backend may impact performance.

Recommendation

Apply a workaround: disable the SFDP fusion pass in the Inductor backend, or use a backend that does not perform it. The difference is specific to Inductor's rewrite of this attention pattern, so either option avoids the changed accumulation order, at a possible cost in performance.
