pytorch: `torch.compile` produces numerically different results for batch pointwise fusion pattern (`select → unsqueeze → expand → pointwise`) compared to eager mode

GitHub stats
pytorch/pytorch#179568 · Fetched 2026-04-08 03:00:20
Comments: 1 · Participants: 1 · Timeline events: 17 · Reactions: 0
Timeline (top): mentioned ×7, subscribed ×7, labeled ×2, commented ×1

Error Message

No error is raised. The compiled model produces a result, but it differs numerically from eager (the reproducer output is listed after the code under Code Example).

Root Cause

  • Eager mode is deterministic (max_var=0 across runs), so the mismatch is not caused by GPU non-determinism.
  • The reproducer is built to exercise BatchPointwiseMathOpsPostGradFusion, which fuses pointwise operations that share a common aten.select parent.
  • The reporter attributes the mismatch to this fusion reordering or combining the batched pointwise operations, which changes the order of floating-point evaluation and therefore the rounding. The issue does not show a run with the pass disabled, so this remains a hypothesis.
  • The div operation with small denominators (base.abs() + 1e-6) may amplify small differences, because a rounding-level difference in the numerator can be scaled by up to 1e6.
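
Both effects named in the last two bullets can be seen with plain Python floats. The following is a minimal sketch, not taken from the issue: it uses float64 scalars rather than float32 CUDA tensors, and `delta` is a hypothetical rounding-level perturbation chosen only for illustration.

```python
# Illustration only: Python floats are float64, while the reproducer runs in
# float32 on CUDA. The mechanism is the same; the magnitudes differ.

# 1) Reassociation: the same three addends, grouped in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print("same result after regrouping:", left_to_right == right_to_left)
print("difference:", abs(left_to_right - right_to_left))

# 2) Amplification by a small denominator.
eps = 1e-6      # the constant added to base.abs() in the reproducer
delta = 1e-7    # hypothetical rounding-level difference in the numerator
numerator = 1.0
amplified = abs((numerator + delta) / eps - numerator / eps)
print("numerator differs by:", delta)
print("quotient differs by: ", amplified)
```

The second part shows growth in the absolute difference only. The relative difference of the quotient is unchanged, which matters when reading a check that includes an rtol term.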

Code Example

import torch
import torch.nn as nn

class SelectParentPointwiseModel(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.scale = nn.Parameter(torch.randn(channels, 1, 1))
        self.shift = nn.Parameter(torch.randn(channels, 1, 1))

    def forward(self, x):
        base = self.bn(self.conv(x))  # [B, C, H, W]

        # Multiple aten.select on channel dim → unsqueeze → expand
        s0 = base.select(1, 0).unsqueeze(1).expand_as(base)
        s1 = base.select(1, 1).unsqueeze(1).expand_as(base)
        s2 = base.select(1, 2).unsqueeze(1).expand_as(base)
        s3 = base.select(1, 3).unsqueeze(1).expand_as(base)

        # Batch pointwise ops: add, mul, div, torch.add with alpha
        p1 = s0 + base                              # add
        p2 = s1 * base                              # mul
        p3 = s2 / (base.abs() + 1e-6)               # div
        p4 = torch.add(s3, base, alpha=0.5)          # add with alpha
        p5 = s0 * self.scale + self.shift            # scale + shift
        p6 = s1 + s2 * base                          # mixed

        combined = p1 + p2 + p3 + p4 + p5 + p6
        return combined

device = "cuda"
torch.manual_seed(42)
model = SelectParentPointwiseModel().to(device).eval()
x = torch.randn(4, 3, 32, 32, device=device)

# Eager: deterministic
with torch.no_grad():
    ref1 = model(x)
    ref2 = model(x)
print(f"Eager deterministic: {(ref1 - ref2).abs().max().item():.6e}")

# Compiled: different result
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    out = compiled(x)

diff = (ref1.float() - out.float()).abs()
print(f"max_diff={diff.max().item():.6e}")
print(f"Output range: [{ref1.min().item():.2f}, {ref1.max().item():.2f}]")
print(f"Match: {torch.allclose(ref1, out, atol=1e-5, rtol=1e-4)}")

---

Eager deterministic: 0.000000e+00
max_diff > 0 (systematic mismatch from post-grad batch pointwise fusion)
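
A max_diff greater than zero does not by itself say which side is less accurate. One common way to judge that is to compare both float32 outputs against a float64 eager run of the same model. The sketch below is not from the issue: `error_vs_fp64` is a hypothetical helper, and the small CPU model is a stand-in so the block runs without a GPU. With the reproducer, one would pass its `model`, `x`, and the compiled output `out` instead.

```python
import copy

import torch
import torch.nn as nn


def error_vs_fp64(model, x, candidate_out):
    """Return (eager_fp32_error, candidate_error) as max absolute differences
    against a float64 eager run of a copy of the same model."""
    with torch.no_grad():
        ref64 = copy.deepcopy(model).double()(x.double())
        eager32 = model(x)
    eager_err = (eager32.double() - ref64).abs().max().item()
    cand_err = (candidate_out.double() - ref64).abs().max().item()
    return eager_err, cand_err


# Hypothetical stand-in model and input (CPU only), not the issue's model.
torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.BatchNorm2d(8)).eval()
x = torch.randn(2, 3, 16, 16)

with torch.no_grad():
    candidate = model(x)  # in the reproducer this would be compiled(x)

eager_err, cand_err = error_vs_fp64(model, x, candidate)
print(f"eager fp32 vs fp64:     {eager_err:.3e}")
print(f"candidate fp32 vs fp64: {cand_err:.3e}")
```

If the candidate error is of the same order as the eager error, the mismatch is consistent with ordinary float32 rounding. If it is much larger, that points to a real accuracy problem in the compiled path.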

---

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

🐛 Describe the bug

torch.compile with inductor backend produces numerically different results compared to eager mode when the model performs multiple aten.select operations on the channel dimension of a Conv2d+BatchNorm2d output, followed by unsqueeze, expand, and batch pointwise math operations (add, mul, div, torch.add with alpha). The pattern targets Inductor's BatchPointwiseMathOpsPostGradFusion post-grad optimization pass.

Eager mode is perfectly deterministic (max_var=0 across runs), ruling out GPU non-determinism. The compiled output systematically differs due to the post-grad fusion of batch pointwise operations that share a common parent tensor from aten.select.
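
To make the pattern concrete, here is a small CPU sketch (not part of the issue) of what `select → unsqueeze → expand` produces. The tensor `base` below is a hypothetical stand-in for the Conv2d+BatchNorm2d output.

```python
import torch

# Stand-in for the [B, C, H, W] activation; values are 0..23 for readability.
base = torch.arange(2 * 3 * 2 * 2, dtype=torch.float32).reshape(2, 3, 2, 2)

# select(1, 0) drops the channel dim and keeps channel 0      -> [B, H, W]
# unsqueeze(1) puts a size-1 channel dim back                  -> [B, 1, H, W]
# expand_as(base) repeats that one channel across all C slots  -> [B, C, H, W]
s0 = base.select(1, 0).unsqueeze(1).expand_as(base)

print("shape: ", tuple(s0.shape))
print("stride:", s0.stride())  # channel stride is 0: expand returns a view, not a copy

# Every channel of s0 holds the same data as channel 0 of base.
same_as_channel0 = all(torch.equal(s0[:, c], base[:, 0]) for c in range(base.shape[1]))
print("all channels equal base[:, 0]:", same_as_channel0)
```

Because each selected tensor is a broadcast view of a single channel, every pointwise operation in the reproducer combines one channel with the full activation tensor.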

Behavior summary

Mode | Result | Notes
Eager | Reference output | Perfectly deterministic across runs (max_var=0)
torch.compile(backend="inductor") | Different output | Numerical values differ beyond float32 tolerance
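
The "Match" line in the reproducer uses `torch.allclose(ref1, out, atol=1e-5, rtol=1e-4)`. Per the PyTorch documentation, that check passes when every element satisfies |input − other| ≤ atol + rtol × |other|. The scalar sketch below restates the rule in plain Python to show how the allowed difference scales with magnitude. `within_tolerance` is a hypothetical helper, not a PyTorch function.

```python
def within_tolerance(actual, expected, atol=1e-5, rtol=1e-4):
    """Scalar version of the element-wise rule documented for torch.allclose."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Near 1.0 the allowed difference is about 1.1e-4, so a difference of 2e-4 fails.
small_case = within_tolerance(1.0, 1.0002)
# Near 1e5 the rtol term allows about 10, so a difference of 1 passes.
large_case = within_tolerance(1.0e5, 1.0e5 + 1.0)
# Near zero the atol term dominates, so a difference of 2e-5 fails.
zero_case = within_tolerance(0.0, 2.0e-5)

print(small_case, large_case, zero_case)
```

Since the reproducer divides by values as small as 1e-6, its outputs can span several orders of magnitude, so whether an element counts as a mismatch depends on which of the two terms dominates there.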

cc @chauhang @penguinwu @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

TL;DR

Disable the BatchPointwiseMathOpsPostGradFusion post-grad optimization pass in the inductor backend to avoid the numerical differences from eager.

Guidance

  • Identify the optimization pass suspected of causing the mismatch: BatchPointwiseMathOpsPostGradFusion.
  • Consider disabling this pass when using torch.compile with the inductor backend, so that compiled results stay consistent with eager.
  • Be aware that disabling this pass may cost performance, since it exists to fuse and optimize batches of pointwise operations.
  • Verify the fix by comparing the compiled output with the eager output using torch.allclose.

Example

The issue does not include a code example for the workaround. Disabling the pass is a change to the Inductor configuration used by torch.compile, not to the model code.

Notes

The provided PyTorch version is a development version, and this issue may be resolved in a future release. Additionally, the inductor backend is still under development, and its behavior may change in future versions.

Recommendation

Apply the workaround: disable the BatchPointwiseMathOpsPostGradFusion pass when using torch.compile with the inductor backend, since it is the most likely cause of the numerical differences. If the pass is responsible, this should bring the compiled output back in line with eager, possibly at some cost in performance.
