pytorch: `torch.compile` produces numerically different results from eager mode for the addmm fusion pattern (`mm + bias → addmm`) [1 comment, 1 participant]

pytorch/pytorch#179567, fetched 2026-04-08 03:00:21

🐛 Describe the bug

torch.compile with the inductor backend produces numerically different results from eager mode when the model uses the explicit addmm fusion patterns torch.mm(x, weight) + bias and bias + torch.mm(x, weight). The model is a CNN with pointwise (1×1) convolutions, GELU activation written out with erf, flattening, and 3 sequential fully-connected layers using the explicit mm+bias pattern.

The root cause is that Inductor fuses mm + add into a single addmm cuBLAS call with a different floating-point accumulation order. While individual layer differences are small, they compound through 3 sequential mm+bias layers with ReLU activations, pushing the final discrepancy beyond the acceptable float32 tolerance (rtol=1.3e-6, the float32 default of torch.testing.assert_close).
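
To make the accumulation-order argument concrete, here is a minimal pure-Python sketch. It is not Inductor's or cuBLAS's actual kernel: `f32`, `unfused_mm_bias`, and `fused_addmm` are hypothetical helpers that emulate float32 rounding with the standard `struct` module, and the sketch assumes the fused kernel seeds its accumulator with the bias, which the issue does not confirm. The inputs are deliberately extreme so that the effect shows up in a single dot product.

```python
import struct

def f32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def unfused_mm_bias(products, bias):
    """Sum the products first, then add the bias (the mm + add order)."""
    acc = 0.0
    for p in products:
        acc = f32(acc + p)
    return f32(acc + bias)

def fused_addmm(products, bias):
    """Start the accumulator at the bias, then add the products (assumed fused order)."""
    acc = f32(bias)
    for p in products:
        acc = f32(acc + p)
    return acc

# Elementwise products of one row/column pair, plus a bias that nearly cancels them.
products = [1e8, 1.0]
bias = -1e8

unfused = unfused_mm_bias(products, bias)
fused = fused_addmm(products, bias)
print(unfused, fused)  # 0.0 1.0
```

Both orders are valid float32 evaluations of the same expression; neither is the exact answer by construction. Real activations are far less extreme, so the per-layer difference in the model is typically a few units in the last place rather than a whole unit.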

Eager mode is perfectly deterministic (max_var=0 across runs), ruling out GPU non-determinism. The compiled output systematically differs.

Minimal reproducer

import torch
import torch.nn as nn

class DualPatternModel(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN feature extraction with pointwise convs
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv_pw1 = nn.Conv2d(32, 64, kernel_size=1)   # Pointwise 1x1
        self.conv_pw2 = nn.Conv2d(64, 128, kernel_size=1)   # Pointwise 1x1
        self.pool = nn.MaxPool2d(2, 2)
        # 3 sequential mm+bias layers (explicit addmm pattern)
        self.fc1_weight = nn.Parameter(torch.randn(128 * 4 * 4, 512))
        self.fc1_bias = nn.Parameter(torch.randn(512))
        self.fc2_weight = nn.Parameter(torch.randn(512, 256))
        self.fc2_bias = nn.Parameter(torch.randn(256))
        self.fc3_weight = nn.Parameter(torch.randn(256, 10))
        self.fc3_bias = nn.Parameter(torch.randn(10))

    def _gelu_erf(self, x):
        return 0.5 * x * (1.0 + torch.erf(x * 0.7071067811865476))

    def forward(self, x):
        # Conv + GELU via erf + pool
        x = self.pool(self._gelu_erf(self.conv1(x)))       # [4,32,16,16]
        x = self.pool(self._gelu_erf(self.conv_pw1(x)))    # [4,64,8,8]
        x = self.pool(self._gelu_erf(self.conv_pw2(x)))    # [4,128,4,4]
        x = x.flatten(1)                                    # [4, 2048]
        # addmm pattern 1: mm(x, W) + b
        x = torch.mm(x, self.fc1_weight) + self.fc1_bias
        x = torch.relu(x)
        # addmm pattern 2: b + mm(x, W)
        x = self.fc2_bias + torch.mm(x, self.fc2_weight)
        x = torch.relu(x)
        # addmm pattern 1 again
        x = torch.mm(x, self.fc3_weight) + self.fc3_bias
        return x

device = "cuda"
torch.manual_seed(42)
model = DualPatternModel().to(device).eval()
x = torch.randn(4, 3, 32, 32, device=device)

# Eager: deterministic
with torch.no_grad():
    ref1 = model(x)
    ref2 = model(x)
print(f"Eager deterministic: {(ref1 - ref2).abs().max().item():.6e}")

# Compiled: different result
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    out = compiled(x)

diff = (ref1.float() - out.float()).abs()
print(f"max_diff={diff.max().item():.6e}")
print(f"Output range: [{ref1.min().item():.2f}, {ref1.max().item():.2f}]")
print(f"Relative error: {diff.max().item() / ref1.abs().max().item():.6e}")

Behavior summary

| Mode | Result | Notes |
| --- | --- | --- |
| Eager | Reference output | Perfectly deterministic across runs (max_var=0) |
| torch.compile(backend="inductor") | Different output | Numerical values differ beyond float32 tolerance |

Notes

  • Eager mode is perfectly deterministic (max_var=0 across multiple runs), ruling out GPU non-determinism.
  • The difference compounds through 3 sequential mm+bias layers with ReLU activations — each layer amplifies small accumulation differences.
  • The mm + bias pattern is fused to addmm by Inductor, which may select a different cuBLAS algorithm with a different accumulation order.
  • The GELU written out with erf in the conv layers may contribute additional precision differences in the feature extraction pipeline.
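
The compounding described in these notes can be sketched with a scalar stand-in for the three layers. This is an illustration only: the weights, biases, and injected perturbation `eps` are made-up values (powers of two, so the arithmetic is exact), and `run_chain` is a hypothetical helper, not code from the issue.

```python
def relu(v):
    return v if v > 0.0 else 0.0

def run_chain(x, weights, biases, first_layer_error=0.0):
    """Apply w * x + b per layer, with ReLU after every layer except the last.

    first_layer_error models a tiny rounding difference introduced in layer 1 only.
    Returns the output of every layer.
    """
    outputs = []
    last = len(weights) - 1
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = w * x + b
        if i == 0:
            x += first_layer_error
        if i != last:
            x = relu(x)
        outputs.append(x)
    return outputs

weights = [4.0, 2.0, 8.0]
biases = [1.0, -0.5, 0.25]
eps = 2.0 ** -20

reference = run_chain(1.0, weights, biases)
perturbed = run_chain(1.0, weights, biases, first_layer_error=eps)
diffs = [p - r for p, r in zip(perturbed, reference)]
print(diffs)  # [2**-20, 2**-19, 2**-16]
```

Each later layer scales the incoming difference by its weight, so a difference of 2**-20 after layer 1 becomes 2**-16 at the output. In the real model every layer also contributes its own rounding difference, and the outputs grow along with the error, which is why the reproducer reports a relative error next to `max_diff`.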

Error logs

No error — the compiled model produces a result, but it differs numerically from eager:

Eager deterministic: 0.000000e+00
max_diff > 0 (systematic mismatch)
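
Because the log only states `max_diff > 0`, it helps to spell out the tolerance test. The sketch below is a pure-Python rendering of the elementwise criterion used by `torch.testing.assert_close`, `|actual - expected| <= atol + rtol * |expected|`, with that function's float32 defaults (`rtol=1.3e-6`, `atol=1e-5`). `within_tolerance` is a hypothetical helper and the sample values are invented, not taken from the issue.

```python
def within_tolerance(actual, expected, rtol=1.3e-6, atol=1e-5):
    """Return True if every pair satisfies |a - e| <= atol + rtol * |e|."""
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )

expected = [100.0, -2500.0, 0.5]
close_enough = [100.00005, -2500.002, 0.500001]  # small drift on every element
too_far = [100.01, -2500.0, 0.5]                 # one element off by 1e-2

print(within_tolerance(close_enough, expected))  # True
print(within_tolerance(too_far, expected))       # False
```

For outputs with large magnitudes the `rtol` term dominates the bound, so a nonzero `max_diff` alone does not show that the compiled result is out of tolerance.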

Versions

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @chauhang @penguinwu

extent analysis

TL;DR

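The most likely fix is to keep mm + add from being fused into a single addmm cuBLAS call, either by disabling that fusion in the Inductor backend or by rewriting the model to use one consistent addmm pattern. The issue reports both operand orders (mm + bias and bias + mm) as fusion patterns, so the rewrite alone may not remove the discrepancy.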

Guidance

  • Verify that the mm + add → addmm fusion is the cause: look in the Inductor documentation for an option that disables it, then re-run the comparison with the fusion turned off.
  • Consider modifying the model to use a single consistent pattern, such as torch.mm(x, weight) + bias, and check whether the accumulation differences shrink.
  • Check the PyTorch release notes and issue tracker for known issues or fixes that address this problem.
  • If possible, test the model with a different backend or a different PyTorch version to see whether the issue persists.

Example

# Consistent addmm pattern (mm(x, W) + b in every layer), ReLU kept between layers
x = torch.mm(x, self.fc1_weight) + self.fc1_bias
x = torch.relu(x)
x = torch.mm(x, self.fc2_weight) + self.fc2_bias
x = torch.relu(x)
x = torch.mm(x, self.fc3_weight) + self.fc3_bias

Notes

The issue may be specific to the Inductor backend and its addmm fusion, so modifying the model or disabling the fusion may resolve it. Either change may also affect the performance of the model.

Recommendation

Apply the consistent-pattern workaround first, since it is the more targeted change, then re-run the reproducer to confirm that the discrepancy actually shrinks. If it does not, the fusion itself is the remaining suspect.
