pytorch - 💡(How to fix) Fix `torch.compile` produces numerically different results for addmm fusion pattern (`mm + bias → addmm`) compared to eager mode [1 comments, 1 participants]

pytorch · 2026-04-07 12:57:21


GitHub stats

pytorch/pytorch#179567 • Fetched 2026-04-08 03:00:21

Author: himi1008

Participants: himi1008

Timeline (top): mentioned ×3, subscribed ×3, labeled ×2, commented ×1


Root Cause

The likely root cause is that Inductor fuses mm + add into a single addmm cuBLAS call, which can accumulate in a different floating-point order than the separate eager kernels. The per-layer differences are small, but they compound through the 3 sequential mm+bias layers with ReLU activations, so the final discrepancy exceeds the float32 tolerance the issue tests against (rtol=1.3e-6, the default relative tolerance of torch.testing.assert_close for float32).
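To make the compounding concrete, the sketch below runs the reproducer's three layer shapes through both forms in eager mode on CPU and prints the gap after each layer. It is an illustration under stated assumptions, not the issue's measurement: the random weights and input are made up, `torch.addmm` stands in for the fused kernel Inductor would call, and on some CPU builds the two paths may agree bit for bit.

```python
import torch

torch.manual_seed(0)

# Same fully-connected layer sizes as the reproducer; the values are made up.
shapes = [(2048, 512), (512, 256), (256, 10)]
layers = [(torch.randn(i, o), torch.randn(o)) for i, o in shapes]

# Non-negative input, as if it came out of a pooled activation.
start = torch.randn(4, 2048).relu()
x_unfused = start
x_fused = start

per_layer = []  # (max abs diff, diff relative to the largest activation)
for idx, (w, b) in enumerate(layers):
    x_unfused = torch.mm(x_unfused, w) + b   # two ops: matmul, then add
    x_fused = torch.addmm(b, x_fused, w)     # one op: bias folded into the GEMM

    diff = (x_unfused - x_fused).abs().max().item()
    scale = x_unfused.abs().max().item()
    per_layer.append((diff, diff / scale))
    print(f"layer {idx + 1}: max abs diff={diff:.3e}  relative={diff / scale:.3e}")

    if idx < len(layers) - 1:                # ReLU between layers, not after the last
        x_unfused = x_unfused.relu()
        x_fused = x_fused.relu()
```

Because the weights are drawn from `torch.randn`, the activations grow by more than an order of magnitude per layer, so any absolute gap grows with them. The relative figure is the one to compare against a tolerance.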

Code Example

import torch
import torch.nn as nn

class DualPatternModel(nn.Module):
    def __init__(self):
        super().__init__()
        # CNN feature extraction with pointwise convs
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv_pw1 = nn.Conv2d(32, 64, kernel_size=1)   # Pointwise 1x1
        self.conv_pw2 = nn.Conv2d(64, 128, kernel_size=1)   # Pointwise 1x1
        self.pool = nn.MaxPool2d(2, 2)
        # 3 sequential mm+bias layers (explicit addmm pattern)
        self.fc1_weight = nn.Parameter(torch.randn(128 * 4 * 4, 512))
        self.fc1_bias = nn.Parameter(torch.randn(512))
        self.fc2_weight = nn.Parameter(torch.randn(512, 256))
        self.fc2_bias = nn.Parameter(torch.randn(256))
        self.fc3_weight = nn.Parameter(torch.randn(256, 10))
        self.fc3_bias = nn.Parameter(torch.randn(10))

    def _gelu_erf(self, x):
        return 0.5 * x * (1.0 + torch.erf(x * 0.7071067811865476))

    def forward(self, x):
        # Conv + GELU via erf + pool
        x = self.pool(self._gelu_erf(self.conv1(x)))       # [4,32,16,16]
        x = self.pool(self._gelu_erf(self.conv_pw1(x)))    # [4,64,8,8]
        x = self.pool(self._gelu_erf(self.conv_pw2(x)))    # [4,128,4,4]
        x = x.flatten(1)                                    # [4, 2048]
        # addmm pattern 1: mm(x, W) + b
        x = torch.mm(x, self.fc1_weight) + self.fc1_bias
        x = torch.relu(x)
        # addmm pattern 2: b + mm(x, W)
        x = self.fc2_bias + torch.mm(x, self.fc2_weight)
        x = torch.relu(x)
        # addmm pattern 1 again
        x = torch.mm(x, self.fc3_weight) + self.fc3_bias
        return x

device = "cuda"
torch.manual_seed(42)
model = DualPatternModel().to(device).eval()
x = torch.randn(4, 3, 32, 32, device=device)

# Eager: deterministic
with torch.no_grad():
    ref1 = model(x)
    ref2 = model(x)
print(f"Eager deterministic: {(ref1 - ref2).abs().max().item():.6e}")

# Compiled: different result
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    out = compiled(x)

diff = (ref1.float() - out.float()).abs()
print(f"max_diff={diff.max().item():.6e}")
print(f"Output range: [{ref1.min().item():.2f}, {ref1.max().item():.2f}]")
print(f"Relative error: {diff.max().item() / ref1.abs().max().item():.6e}")

---

Eager deterministic: 0.000000e+00
max_diff > 0 (systematic mismatch)

---

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6


🐛 Describe the bug

torch.compile with the inductor backend produces numerically different results from eager mode when the model uses the explicit torch.mm(x, weight) + bias and bias + torch.mm(x, weight) patterns, both of which are candidates for addmm fusion. (The reproducer stores each weight as [in_features, out_features], so no transpose is applied.) The model is a CNN with pointwise (1×1) convolutions, GELU computed through its erf formulation, flattening, and 3 sequential fully-connected layers using the explicit mm+bias pattern.

Eager mode is perfectly deterministic (max_var=0 across runs), ruling out GPU non-determinism. The compiled output systematically differs.

Minimal reproducer

The reproducer is the same script listed under Code Example above.

Behavior summary

| Mode | Result | Notes |
| --- | --- | --- |
| Eager | Reference output | Perfectly deterministic across runs (max_var=0) |
| `torch.compile(backend="inductor")` | Different output | Numerical values differ beyond float32 tolerance |

Notes

Eager mode is perfectly deterministic (max_var=0 across multiple runs), ruling out GPU non-determinism.
The difference compounds through the 3 sequential mm+bias layers with ReLU activations: each layer amplifies small accumulation differences.
The mm + bias pattern is fused to addmm by Inductor, which may select a different cuBLAS algorithm with a different accumulation order.
The erf-based GELU in the conv layers may contribute additional precision differences in the feature extraction pipeline.
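One way to judge whether such a difference is rounding noise or a real miscompile is to compare both outputs against a float64 run of the same model. The helper below is a minimal sketch of that idea. The name `max_error_vs_fp64` and the toy model are hypothetical and do not come from the issue.

```python
import copy

import torch
import torch.nn as nn


def max_error_vs_fp64(model: nn.Module, x: torch.Tensor, candidate: torch.Tensor) -> float:
    """Max absolute error of `candidate` against a float64 run of `model`."""
    # Deep-copy first so the caller's float32 model is left untouched.
    ref_model = copy.deepcopy(model).double()
    with torch.no_grad():
        ref = ref_model(x.double())
    return (candidate.double() - ref).abs().max().item()


# Hypothetical CPU-sized stand-in for the reproducer's model.
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8)).eval()
inp = torch.randn(4, 64)

with torch.no_grad():
    eager_out = toy(inp)

eager_err = max_error_vs_fp64(toy, inp, eager_out)
print(f"eager float32 error vs float64: {eager_err:.3e}")
```

With the reproducer, the same helper could be called once with `ref1` and once with `out`. If the compiled error is of the same order as the eager error, the two float32 results are about equally far from the higher-precision answer, which points to accumulation order rather than a wrong computation.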

Error logs

No error — the compiled model produces a result, but it differs numerically from eager:

Eager deterministic: 0.000000e+00
max_diff > 0 (systematic mismatch)

Versions

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @chauhang @penguinwu

extent analysis

TL;DR

The most likely mitigation is to avoid fusing mm + add into a single addmm cuBLAS call, either by modifying the model to use a consistent addmm pattern or by disabling the fusion in the Inductor backend.

Guidance

Verify that the mismatch really comes from the mm + add → addmm fusion, and check the Inductor documentation for an option to disable that fusion.
Consider writing all three layers in one consistent form, such as torch.mm(x, weight) + bias. The issue describes both operand orders as addmm fusion patterns, so measure whether this actually changes the result.
Check the release notes for your PyTorch version (Inductor ships with it) for known issues or fixes in this area.
If possible, test the model with a different backend or a different PyTorch version to see whether the mismatch persists.

Example

# Consistent addmm pattern (ReLU between layers kept, as in the reproducer)
x = torch.mm(x, self.fc1_weight) + self.fc1_bias
x = torch.relu(x)
x = torch.mm(x, self.fc2_weight) + self.fc2_bias
x = torch.relu(x)
x = torch.mm(x, self.fc3_weight) + self.fc3_bias

Notes

The issue may be specific to the Inductor backend and the addmm fusion, so modifying the model or disabling the fusion may resolve the issue. However, this may also affect the performance of the model.

Recommendation

Start with the model-side workaround of using a consistent addmm pattern, since it is the more targeted change, and confirm with the reproducer that it actually reduces the mismatch.
