pytorch: `torch.compile` produces numerically different results for batch pointwise fusion pattern (`select → unsqueeze → expand → pointwise`) compared to eager mode [1 comment, 1 participant]

pytorch · 2026-04-07 13:00:21

GitHub stats

pytorch/pytorch#179568 · Fetched 2026-04-08 03:00:20

Author

himi1008

Error Message

Error logs

No error is raised. The compiled model returns a result, but that result differs numerically from the eager result.

Root Cause

Eager mode is deterministic (max_var=0 across runs), so the mismatch is not caused by run-to-run GPU non-determinism.
The reproducer is built to match BatchPointwiseMathOpsPostGradFusion, the Inductor post-grad pass that batches pointwise operations sharing a common aten.select parent. The report includes no log or graph dump confirming that the pass actually ran.
If the fusion reorders or combines the batch of pointwise operations, the order of rounding steps changes, and float32 results can differ in their last bits.
The div operation with small denominators (base.abs() + 1e-6) may amplify such differences: wherever base is close to zero, p3 is on the order of s2 × 10^6, so a tiny perturbation becomes a large absolute difference.
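The last two points can be illustrated without PyTorch. The sketch below uses plain Python floats (float64, not the float32 of the report) and made-up values for `s2` and `base`. It shows the mechanism only, not the numbers Inductor produces.

```python
# Regrouping a sum changes the rounding: float addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
reorder_diff = abs(left - right)

# Same shape as p3 = s2 / (base.abs() + 1e-6), with illustrative values.
eps = 1e-6
s2 = 1.0
base_a = 0.0    # base exactly zero
base_b = 1e-9   # base perturbed by a tiny absolute amount
p3_a = s2 / (abs(base_a) + eps)
p3_b = s2 / (abs(base_b) + eps)
amplified_diff = abs(p3_a - p3_b)

print(f"reorder_diff   = {reorder_diff:.3e}")
print(f"amplified_diff = {amplified_diff:.3e}")
```

The regrouped sums differ by about one unit in the last place, while a perturbation of 1e-9 in `base` moves `p3` by roughly a thousand. An absolute `max_diff` on the combined output can therefore look alarming even when each operation is only off by rounding.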

Code Example

import torch
import torch.nn as nn

class SelectParentPointwiseModel(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.scale = nn.Parameter(torch.randn(channels, 1, 1))
        self.shift = nn.Parameter(torch.randn(channels, 1, 1))

    def forward(self, x):
        base = self.bn(self.conv(x))  # [B, C, H, W]

        # Multiple aten.select on channel dim → unsqueeze → expand
        s0 = base.select(1, 0).unsqueeze(1).expand_as(base)
        s1 = base.select(1, 1).unsqueeze(1).expand_as(base)
        s2 = base.select(1, 2).unsqueeze(1).expand_as(base)
        s3 = base.select(1, 3).unsqueeze(1).expand_as(base)

        # Batch pointwise ops: add, mul, div, torch.add with alpha
        p1 = s0 + base                              # add
        p2 = s1 * base                              # mul
        p3 = s2 / (base.abs() + 1e-6)               # div
        p4 = torch.add(s3, base, alpha=0.5)          # add with alpha
        p5 = s0 * self.scale + self.shift            # scale + shift
        p6 = s1 + s2 * base                          # mixed

        combined = p1 + p2 + p3 + p4 + p5 + p6
        return combined

device = "cuda"
torch.manual_seed(42)
model = SelectParentPointwiseModel().to(device).eval()
x = torch.randn(4, 3, 32, 32, device=device)

# Eager: deterministic
with torch.no_grad():
    ref1 = model(x)
    ref2 = model(x)
print(f"Eager deterministic: {(ref1 - ref2).abs().max().item():.6e}")

# Compiled: different result
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    out = compiled(x)

diff = (ref1.float() - out.float()).abs()
print(f"max_diff={diff.max().item():.6e}")
print(f"Output range: [{ref1.min().item():.2f}, {ref1.max().item():.2f}]")
print(f"Match: {torch.allclose(ref1, out, atol=1e-5, rtol=1e-4)}")

---

Eager deterministic: 0.000000e+00
max_diff > 0 (systematic mismatch from post-grad batch pointwise fusion)
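A bare `max_diff > 0` does not say which side is less accurate. One common way to judge this, sketched here as an assumption rather than something the issue does, is to run the same model in float64 in eager mode and measure each float32 output against it. The helper name `error_vs_fp64` and the toy model are hypothetical; the toy model only exists so the sketch runs on CPU.

```python
import copy

import torch
import torch.nn as nn


def error_vs_fp64(model: nn.Module, x: torch.Tensor, outputs: dict) -> dict:
    """Max absolute and relative error of each output against a float64 eager run."""
    model64 = copy.deepcopy(model).double()  # leave the original model untouched
    with torch.no_grad():
        ref64 = model64(x.double())
    report = {}
    for name, out in outputs.items():
        abs_err = (out.double() - ref64).abs()
        # Clamp the denominator so near-zero reference values do not divide by zero.
        rel_err = abs_err / ref64.abs().clamp_min(1e-12)
        report[name] = (abs_err.max().item(), rel_err.max().item())
    return report


# Tiny stand-in model; replace it with the reproducer's model and outputs.
torch.manual_seed(0)
toy = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.BatchNorm2d(4)).eval()
x = torch.randn(2, 3, 8, 8)
with torch.no_grad():
    eager_out = toy(x)
print(error_vs_fp64(toy, x, {"eager": eager_out}))
```

With the reproducer, the call would be `error_vs_fp64(model, x, {"eager": ref1, "compiled": out})`. If both entries show errors of similar size, the eager/compiled gap is more likely rounding noise than a miscompilation.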

---

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

🐛 Describe the bug

torch.compile with inductor backend produces numerically different results compared to eager mode when the model performs multiple aten.select operations on the channel dimension of a Conv2d+BatchNorm2d output, followed by unsqueeze, expand, and batch pointwise math operations (add, mul, div, torch.add with alpha). The pattern targets Inductor's BatchPointwiseMathOpsPostGradFusion post-grad optimization pass.

Eager mode is perfectly deterministic (max_var=0 across runs), ruling out GPU non-determinism. The compiled output systematically differs due to the post-grad fusion of batch pointwise operations that share a common parent tensor from aten.select.
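To make the `select → unsqueeze → expand` chain concrete, the following minimal sketch runs it in eager mode on a small CPU tensor. The tensor shape and values are arbitrary and chosen only for illustration.

```python
import torch

# [B, C, H, W] with distinct values so channels are easy to tell apart.
base = torch.arange(2 * 3 * 2 * 2, dtype=torch.float32).reshape(2, 3, 2, 2)

# select drops the channel dim, unsqueeze restores it with size 1,
# and expand_as broadcasts that single channel across all C channels.
s1 = base.select(1, 1).unsqueeze(1).expand_as(base)

# Equivalent view built with slicing, which keeps the channel dim.
s1_sliced = base[:, 1:2].expand_as(base)

same_values = torch.equal(s1, s1_sliced)
is_view = s1.stride(1) == 0  # an expanded dim has stride 0: no data is copied
print(s1.shape, same_values, is_view)
```

Every channel of `s1` is the same view of channel 1 of `base`, so each `p*` expression in the reproducer is a broadcast pointwise operation between one channel and the full tensor.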

Behavior summary

| Mode | Result | Notes |
| --- | --- | --- |
| Eager | Reference output | Perfectly deterministic across runs (max_var=0) |
| `torch.compile(backend="inductor")` | Different output | Numerical values differ beyond float32 tolerance |

cc @chauhang @penguinwu @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

TL;DR

If BatchPointwiseMathOpsPostGradFusion is confirmed as the cause, disabling that post-grad optimization pass in the Inductor backend should restore agreement with eager. The issue itself does not confirm the cause.

Guidance

Confirm which stage introduces the difference; the suspected pass is BatchPointwiseMathOpsPostGradFusion.
If that pass is responsible, consider disabling it when using torch.compile with the Inductor backend so that compiled results stay close to eager. The concern here is parity with eager, not determinism.
Be aware that disabling this pass may impact performance, as it is designed to fuse and optimize batch pointwise operations.
Verify the fix by comparing the output of the compiled model with the eager mode output using torch.allclose, with tolerances suited to float32 and to the magnitude of the output.

Example

The issue does not include code for the workaround. Disabling a specific optimization pass means changing the Inductor backend configuration rather than the model.

Notes

The provided PyTorch version is a development version, and this issue may be resolved in a future release. Additionally, the inductor backend is still under development, and its behavior may change in future versions.

Recommendation

Apply the workaround only after confirming the cause: disable the BatchPointwiseMathOpsPostGradFusion pass when using torch.compile with the Inductor backend, since the report identifies it as the most likely source of the numerical differences. This should bring compiled results back in line with eager, possibly at some cost in performance.
