pytorch - ✅(Solved) Fix `torch.compile` produces significantly different output for model with multiple `amax`/`amin` reductions on same tensor compared to eager mode [1 pull request, 1 participant]

GitHub stats
pytorch/pytorch#179577 (fetched 2026-04-08 03:00:06)
Comments: 0
Participants: 1
Timeline events: 83
Reactions: 0
Timeline (top): mentioned ×38, subscribed ×38, labeled ×7

PR fix notes

PR #179719: Skip div-to-mul-reciprocal when division_rounding is enabled

Description (problem / solution / changelog)

When eager_numerics.division_rounding is enabled, the div-by-constant optimization in div_prim() converted f / constant to f * (1/constant) at the IR level. This eliminated the division before codegen could apply tl.div_rn, making the division_rounding flag ineffective for constant divisors.

Skip the reciprocal optimization when division_rounding is enabled so the division flows through to tl.div_rn in Triton codegen.
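As a rough illustration of why the rewrite matters numerically, the sketch below compares f / constant with f * (1/constant) in plain Python doubles. It does not use Inductor or Triton, and the divisor 49.0 and the input range are assumptions picked for the demonstration, not values from the PR.

```python
# Illustration in plain Python doubles; Inductor and Triton are not involved.
constant = 49.0
reciprocal = 1.0 / constant  # the reciprocal is rounded once here

mismatches = []
for i in range(1, 101):
    f = float(i)
    true_div = f / constant   # a single correctly rounded division
    via_mul = f * reciprocal  # rounded reciprocal followed by a rounded multiply
    if true_div != via_mul:
        mismatches.append(f)

print(f"{len(mismatches)} of 100 inputs differ")
print(f"49.0 / 49.0         = {49.0 / constant!r}")
print(f"49.0 * (1.0 / 49.0) = {49.0 * reciprocal!r}")
```

The two forms round differently for some inputs, which is why a correctly rounded division cannot be recovered once the IR has already replaced it with a multiply.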

Inspired by https://github.com/pytorch/pytorch/issues/179577. Note that this PR cannot fix that issue, because the mismatch there comes from the design of the model, which computes x/1e-8.
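The x/1e-8 remark can be made concrete with a small sketch. It assumes the global minimum of the ReLU output is exactly 0, so the denominator abs(global_min_val) + 1e-8 collapses to 1e-8, and it assumes the two numerators differ by one float32 ulp. Both numbers are hypothetical and are not measurements from the issue, so this shows only how such a division could amplify a tiny difference, not what the root cause is.

```python
# Hypothetical values chosen for illustration; they are not taken from the issue.
eps = 1e-8
global_min_val = 0.0  # assumption: the ReLU output contains an exact zero
denominator = abs(global_min_val) + eps

numerator_eager = 0.5
numerator_compiled = 0.5 + 2.0 ** -24  # assumption: one float32 ulp away at 0.5

input_gap = abs(numerator_compiled - numerator_eager)
output_gap = abs(numerator_compiled / denominator - numerator_eager / denominator)

print(f"gap before division: {input_gap:.3e}")
print(f"gap after division:  {output_gap:.3e}")
print(f"amplification:       {output_gap / input_gap:.3e}")
```

Under these assumptions a rounding-level gap grows by roughly eight orders of magnitude, so a large max_diff after the final layers would not by itself distinguish an algorithmic error from amplified rounding.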

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

  • test/inductor/test_cuda_repro.py (modified, +11/-0)
  • torch/_inductor/lowering.py (modified, +8/-2)

🐛 Describe the bug

torch.compile with inductor backend produces significantly different results for a CNN model that performs multiple reduction operations (torch.amax, torch.amin, torch.min) on the same feature tensor with different dimension arguments, then uses the reduction results for normalization and centering. The pattern targets Inductor's reuse_partial optimization, which attempts to reuse partial reduction results when multiple reductions share the same source tensor.

The difference is large (max_diff in the hundreds), indicating an algorithmic error in the fused reduction kernel — not mere floating-point precision drift.

Eager mode is perfectly deterministic (max_var=0 across runs), confirming this is a systematic computation difference introduced by the Inductor reuse_partial optimization.

Minimal reproducer

import os
os.environ["TRITON_BACKENDS_IN_TREE"] = "1"

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiReductionModel(nn.Module):
    def __init__(self, in_channels=3, hidden_dim=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden_dim, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(hidden_dim)
        self.conv2 = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.conv3 = nn.Conv2d(hidden_dim, hidden_dim // 2, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(hidden_dim // 2)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        features = self.relu(x)

        # Multiple reductions on the SAME tensor (triggers reuse_partial)
        spatial_amax = torch.amax(features, dim=[2, 3], keepdim=True)   # per-channel spatial max
        global_amax = torch.amax(features)                               # global max scalar
        channel_amin = torch.amin(features, dim=[1], keepdim=True)       # per-spatial-loc channel min
        global_min_val = torch.min(features)                             # global min scalar

        # Normalize and center using reduction results
        amax_normalized = features / (spatial_amax + 1e-8)
        amax_scaled = amax_normalized * global_amax
        amin_centered = features - channel_amin
        amin_scaled = amin_centered / (torch.abs(global_min_val) + 1e-8)

        # Combine normalized and centered features
        combined = amax_scaled + amin_scaled

        output = self.conv3(combined)
        output = self.bn2(output)
        output = self.relu(output)
        return output


device = "cuda"
torch.manual_seed(42)
model = MultiReductionModel().to(device).eval()
x = torch.randn(2, 3, 32, 32, device=device)

# Eager: deterministic
with torch.no_grad():
    ref = model(x)
    ref2 = model(x)
print(f"Eager deterministic: {(ref - ref2).abs().max().item():.6e}")

# Compiled
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    comp = compiled(x)

diff = (ref.float() - comp.float()).abs()
print(f"Max diff: {diff.max().item():.6f}")
print(f"Mean diff: {diff.mean().item():.6e}")
print(f"Match (atol=1e-5): {torch.allclose(ref, comp, atol=1e-5, rtol=1e-4)}")
# Expected: max_diff in the hundreds — very large mismatch

Behavior summary

| Mode | Result | Notes |
| --- | --- | --- |
| Eager | Reference output | Perfectly deterministic across runs (max_var=0) |
| torch.compile(backend="inductor") | Significantly different | max_diff in the hundreds: algorithmic error, not precision drift |

Notes

  • Eager mode is perfectly deterministic (max_var=0), ruling out GPU non-determinism.
  • The magnitude of the difference (hundreds, not 1e-5) indicates an algorithmic error in the reuse_partial optimization, not floating-point accumulation differences.
  • Inductor's reuse_partial optimization attempts to share partial reduction computations when multiple reductions operate on the same source tensor with overlapping dimension sets (e.g., amax(dim=[2,3]) and amax() global both reduce spatial dimensions).
  • The fused reduction kernel may incorrectly carry forward an intermediate partial result, causing the features / (spatial_amax + 1e-8) * global_amax arithmetic to produce wildly different values.
  • Different reduction dimension combinations on the same tensor (dim=[2,3], dim=[1], global) form the critical pattern that exposes this bug.
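The idea of sharing partial reductions described in these notes can be sketched in plain Python. This is a conceptual model only and is not Inductor's implementation; the nested list `features` is a made-up stand-in for a (channels, spatial) tensor.

```python
# Conceptual sketch only; this is NOT how Inductor implements reuse_partial.
# features[c][i] stands in for a (channels, spatial) tensor.
features = [
    [0.0, 2.5, 1.0],
    [4.0, 0.5, 3.0],
]

# Per-channel spatial max, analogous to amax(dim=[2, 3]).
spatial_amax = [max(channel) for channel in features]

# Global max computed directly, and again by reusing the per-channel partials.
global_direct = max(v for channel in features for v in channel)
global_reused = max(spatial_amax)

# Per-location channel min, analogous to amin(dim=[1]). It reduces over a
# different axis, so the spatial partials above cannot be reused for it.
channel_amin = [min(column) for column in zip(*features)]

print(spatial_amax, global_direct, global_reused, channel_amin)
```

A correct reuse must give the same global maximum as the direct reduction. Comparing the two values on a small input like this is one way to check whether sharing partials changes a result.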

Error logs

No error — outputs silently differ with very large max_diff:

Eager deterministic: 0.000000e+00
Max diff: ~896.0+ (very large systematic mismatch)

Versions

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

TL;DR

The most likely fix is to disable the reuse_partial optimization in the Inductor backend or update to a version where this issue is resolved.

Guidance

  • Verify that the issue is indeed caused by the reuse_partial optimization by disabling it and checking if the results match the eager mode output.
  • Check the PyTorch version and update to the latest version if possible, as this issue may have been fixed in a later release.
  • If updating is not possible, consider using a different backend or disabling the reuse_partial optimization as a temporary workaround.
  • Test the model with different reduction dimension combinations to ensure the issue is resolved.

Example

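# The issue does not name a switch for disabling reuse_partial.
# Switching to the "aot_eager" backend traces the model but skips Inductor code generation.
compiled = torch.compile(model, backend="aot_eager")

Note: This example changes the backend rather than disabling reuse_partial, because the issue does not provide a way to turn that optimization off. If the aot_eager output matches eager mode, the difference is introduced by Inductor.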

Notes

The issue report attributes the mismatch to an algorithmic error in the reuse_partial optimization of the Inductor backend. The note on PR #179719 instead attributes it to the model's own division by 1e-8. The report was filed against a nightly build (2.12.0.dev20260327+cu126), and the issue does not identify a release that contains a fix, so a definitive fix cannot be provided.

Recommendation

Apply workaround: Disable the reuse_partial optimization or use a different backend, as updating to a fixed version may not be possible. This will likely resolve the issue, but may impact performance.
