pytorch - ✅(Solved) Fix `torch.compile` produces wrong results for `adaptive_avg_pool2d` + `flatten` + `sum` fusion [1 pull request, 2 comments, 3 participants]

GitHub stats
pytorch/pytorch#180956 (fetched 2026-04-22 07:43:24)
Comments: 2
Participants: 3
Timeline events: 120
Reactions: 0
Timeline (top): subscribed ×55, mentioned ×54, labeled ×5, commented ×2

Fix Action

Fixed

PR fix notes

PR #180898: [inductor] Fix bug with contiguous checks and comprehensive_padding

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

  • -> #180898

Fixes #180848 Fixes #180956

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

  • test/inductor/test_torchinductor.py (modified, +16/-0)
  • torch/_inductor/ir.py (modified, +26/-3)


🐛 Describe the bug

torch.compile(backend='inductor') silently produces wrong results when adaptive_avg_pool2d is fused with flatten + sum (or mean). The bug triggers when batch > 1 and the flattened reduction dimension (C × pool_h × pool_w) exceeds 1024 with channels not a multiple of 16.

This is a silent correctness bug — no error is raised, the output is simply wrong. The error is large (up to 1e5 in max-autotune mode) and persists in float64, ruling out precision noise.

Minimal repro

import torch

x = torch.randn(3, 33, 8, 8, dtype=torch.float64, device='cuda')

def f(x):
    return torch.nn.functional.adaptive_avg_pool2d(x, 7).flatten(1).sum(dim=-1)

# Eager — correct
print(f"eager = {f(x).tolist()}")

# Inductor — wrong
torch._dynamo.reset()
compiled = torch.compile(f, backend='inductor')(x)
print(f"inductor = {compiled.tolist()}")
print(f"max_diff = {(f(x) - compiled).abs().max().item():.2e}")
# max_diff ~ 5e+00 to 8e+01 depending on seed

Expected behavior

Inductor output should match eager output. adaptive_avg_pool2d(x, 7).flatten(1).sum(dim=-1) is a standard global-pool-then-reduce pattern used in CNN classifiers.

Trigger conditions

| Condition | Required? | Notes |
|---|---|---|
| batch > 1 | Yes | batch=1 always correct; error grows with batch |
| C % 16 != 0 | Yes | C=16,32,64,128,256,512 all safe |
| C × pool_h × pool_w > 1024 | Yes | e.g. C=21, pool=7 → 21×49=1029 triggers; C=20 → 980 safe |
| adaptive_avg_pool2d fused with flatten+sum | Yes | Pool-only or flatten+sum-only are both correct separately |
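The shape-related conditions above can be written as a small predicate for screening model configurations. This is only a sketch of the reported conditions: the function name `matches_reported_trigger` is hypothetical, and it does not check whether the operations are actually fused.

```python
def matches_reported_trigger(batch: int, channels: int, pool_h: int, pool_w: int) -> bool:
    """Return True if a shape meets the reported trigger conditions.

    Encodes only the three shape conditions from the issue report; whether
    inductor fuses pool + flatten + sum is a separate question.
    """
    reduction_size = channels * pool_h * pool_w  # flattened reduction dimension
    return batch > 1 and channels % 16 != 0 and reduction_size > 1024


# Shapes taken from the issue report.
for shape in [(3, 33, 7, 7), (2, 21, 7, 7), (2, 20, 7, 7), (2, 32, 7, 7), (1, 33, 7, 7)]:
    print(shape, matches_reported_trigger(*shape))
```

A shape that passes this check is only a candidate; comparing eager and compiled outputs remains the actual test.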

Root cause analysis

The bug is not in adaptive_avg_pool2d itself — pool-only output is correct. It occurs when inductor fuses the pool with flatten + sum into a single reduction kernel. The tiling/indexing for the fused kernel appears to miscalculate offsets for batch elements beyond the first (batch[0] is always correct, batch[1+] are wrong).

The alignment to multiples of 16 and the 1024-element threshold suggest a Triton reduction tile size boundary issue in the fused kernel's index computation.
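To see why an offset error would leave batch[0] intact while corrupting every later batch element, here is a toy model in plain Python. It is not inductor's code and does not claim to reproduce the actual bug: it assumes, purely for illustration, that a buffer is written with rows padded to a multiple of 16 elements and then read back with the unpadded row length as the stride. The helper names are hypothetical.

```python
def padded_stride(row_len: int, align: int = 16) -> int:
    """Round row_len up to the next multiple of align (assumed alignment)."""
    return -(-row_len // align) * align


def write_rows(rows, stride):
    """Lay rows out in one flat buffer, each row starting at b * stride."""
    buf = [0.0] * (len(rows) * stride)
    for b, row in enumerate(rows):
        buf[b * stride : b * stride + len(row)] = row
    return buf


def row_sums(buf, n_rows, row_len, stride):
    """Sum row_len elements per row, assuming rows start at b * stride."""
    return [sum(buf[b * stride : b * stride + row_len]) for b in range(n_rows)]


row_len = 33 * 7 * 7  # C x pool_h x pool_w from the repro
rows = [[float(b + 1)] * row_len for b in range(3)]
stride = padded_stride(row_len)
buf = write_rows(rows, stride)

print(row_sums(buf, len(rows), row_len, stride))   # reader agrees with writer
print(row_sums(buf, len(rows), row_len, row_len))  # reader uses unpadded stride
```

With a mismatched stride the first row still starts at offset 0, so its sum is unaffected, while each later row is read from a shifted window. That matches the reported symptom, but the mechanism in the real kernel may differ.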

Additional observations

  • All compile modes affected: default, reduce-overhead, max-autotune (worst: diff up to 2.2e5)
  • float64 diffs range from 1e0 to 1e2 — not a precision issue
  • Replacing sum with mean also triggers the bug
  • pool + sum(spatial_dims) without flatten is correct — the flatten reshape changes the reduction structure
  • Non-square spatial sizes (e.g., 8×10) also trigger
  • Different pool output sizes (5, 7) all trigger under the same conditions
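The observations above can be re-checked on a given build with a small sweep that compares eager and compiled outputs over several shapes. This is a sketch, not the reporter's test script: the helper names `max_abs_diff` and `sweep` and the default shape lists are assumptions chosen to straddle the reported thresholds.

```python
import itertools

import torch
import torch._dynamo
import torch.nn.functional as F


def pool_flatten_sum(x, pool):
    """The pattern from the issue: adaptive pool, flatten, then sum."""
    return F.adaptive_avg_pool2d(x, pool).flatten(1).sum(dim=-1)


def max_abs_diff(batch, channels, pool, *, backend="inductor", device="cuda", size=8):
    """Largest absolute difference between eager and compiled output."""
    torch._dynamo.reset()  # start each case from a clean compile cache
    x = torch.randn(batch, channels, size, size, dtype=torch.float64, device=device)
    fn = lambda t: pool_flatten_sum(t, pool)
    expected = fn(x)
    actual = torch.compile(fn, backend=backend)(x)
    return (expected - actual).abs().max().item()


def sweep(batches=(1, 3), channels=(20, 21, 32, 33), pools=(5, 7), **kwargs):
    """Map each (batch, channels, pool) case to its max abs diff."""
    cases = itertools.product(batches, channels, pools)
    return {case: max_abs_diff(*case, **kwargs) for case in cases}
```

Calling `sweep()` on a CUDA machine and printing the entries with a large diff should, on an affected build, single out the batch > 1 cases that meet the trigger conditions. Passing `backend="eager", device="cpu"` is a quick way to sanity-check the harness itself.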

Versions

  • PyTorch: 2.13.0.dev20260420+cu126 (nightly)
  • CUDA: 12.6
  • GPU: Tesla T4
  • OS: Ubuntu 20.04, Python 3.11

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

This issue is marked as fixed by PR #180898, so the most reliable fix for the silent correctness bug in torch.compile(backend='inductor') is to move to a PyTorch build that includes that PR. On an affected build, prevent adaptive_avg_pool2d from being fused with flatten + sum, or use a different backend.

Guidance

  • Verify that the issue occurs only when the batch size is greater than 1, the flattened reduction dimension (C × pool_h × pool_w) exceeds 1024, and the channel count is not a multiple of 16.
  • Check whether a non-inductor backend, such as aot_eager, produces the correct results. Note that inductor is itself the default backend of torch.compile.
  • Consider preventing the fusion of adaptive_avg_pool2d with flatten + sum, for example by forcing a graph break between the pool and the reduction.
  • Test the code with different input sizes and shapes to confirm the issue is consistently reproduced.

Example

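import torch

x = torch.randn(3, 33, 8, 8, dtype=torch.float64, device='cuda')

# Same computation as the minimal repro, written as separate statements.
# torch.compile traces all three into one graph, so inductor can still fuse
# them; this form is for readability and is not a workaround by itself.
def f(x):
    pooled = torch.nn.functional.adaptive_avg_pool2d(x, 7)
    flat = pooled.flatten(1)
    return flat.sum(dim=-1)

# Eager: correct
print(f"eager = {f(x).tolist()}")

# Inductor: still wrong on affected builds
torch._dynamo.reset()
compiled = torch.compile(f, backend='inductor')(x)
print(f"inductor = {compiled.tolist()}")
print(f"max_diff = {(f(x) - compiled).abs().max().item():.2e}")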

Notes

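The reporter attributes the issue to the tiling/indexing of the fused kernel in the inductor backend, reading the alignment to multiples of 16 and the 1024-element threshold as signs of a Triton reduction tile size boundary issue. The title of the fix, PR #180898 ("Fix bug with contiguous checks and comprehensive_padding"), and its changes to torch/_inductor/ir.py suggest the underlying cause lies in contiguity checks and padding rather than in the tile size itself.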

Recommendation

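If you cannot move to a build that includes PR #180898, work around the bug by preventing inductor from fusing adaptive_avg_pool2d with flatten + sum. Writing the operations as separate Python statements, as in the example above, is not sufficient on its own, because torch.compile traces them into a single graph. A graph break between the pool and the reduction, or a non-inductor backend, is needed instead.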
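One way to keep the pool and the reduction in separate compiled graphs is an explicit graph break, sketched below. This relies on the reporter's observation that pool-only and flatten+sum-only are each correct; it has not been verified against an affected build here, and the function name `pool_sum_split` is hypothetical.

```python
import torch
import torch._dynamo
import torch.nn.functional as F


def pool_sum_split(x):
    """Pool and reduce in two separately compiled graphs."""
    pooled = F.adaptive_avg_pool2d(x, 7)
    # Force Dynamo to end the current graph here, so the backend compiles
    # the pool and the flatten + sum as two graphs and cannot fuse them.
    torch._dynamo.graph_break()
    return pooled.flatten(1).sum(dim=-1)


def check_against_eager(x, backend="inductor"):
    """Return the max abs diff between eager and the split, compiled version."""
    torch._dynamo.reset()
    expected = F.adaptive_avg_pool2d(x, 7).flatten(1).sum(dim=-1)
    actual = torch.compile(pool_sum_split, backend=backend)(x)
    return (expected - actual).abs().max().item()
```

For example, `check_against_eager(torch.randn(3, 33, 8, 8, dtype=torch.float64, device='cuda'))` should report a diff near zero if the split avoids the bug. The graph break costs an extra kernel launch and materializes the pooled tensor, so it is best removed once a fixed build is in use. The PR title also names comprehensive_padding; whether turning off the inductor option of that name avoids the bug is not stated in the source and is untested here.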
