pytorch - ✅(Solved) Fix `torch.compile` produces wrong results for `adaptive_avg_pool2d` + `flatten` + `sum` fusion [1 pull requests, 2 comments, 3 participants]

Q: Expected behavior

Inductor output should match eager output. `adaptive_avg_pool2d(x, 7).flatten(1).sum(dim=-1)` is a standard global-pool-then-reduce pattern used in CNN classifiers.

pytorch · 2026-04-21 03:42:29

GitHub stats

pytorch/pytorch#180956•Fetched 2026-04-22 07:43:24

Timeline events: 120 (top: subscribed ×55, mentioned ×54, labeled ×5, commented ×2)

Error Message

This is a silent correctness bug: no error is raised, the output is simply wrong. The error is large (up to 1e5 in max-autotune mode) and persists in float64, ruling out precision noise.

Root Cause

The bug is not in adaptive_avg_pool2d itself — pool-only output is correct. It occurs when inductor fuses the pool with flatten + sum into a single reduction kernel. The tiling/indexing for the fused kernel appears to miscalculate offsets for batch elements beyond the first (batch[0] is always correct, batch[1+] are wrong).

The alignment to multiples of 16 and the 1024-element threshold suggest a Triton reduction tile size boundary issue in the fused kernel's index computation.
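The symptom "batch[0] correct, batch[1+] wrong" is what a row-stride mismatch looks like in general. The following is a minimal sketch of that pattern in plain Python, not the confirmed mechanism of this bug: it assumes a hypothetical buffer whose rows are padded to a multiple of 16 while the reader indexes it as if it were contiguous. The names `padded_len`, `flat_offset`, and `mismatched_rows` are illustrative and do not exist in PyTorch.

```python
def padded_len(row_len, align=16):
    """Round row_len up to the next multiple of align (hypothetical padding rule)."""
    return -(-row_len // align) * align


def flat_offset(n, k, row_stride):
    """Offset of element k in row n for a 2-D buffer with the given row stride."""
    return n * row_stride + k


def mismatched_rows(batch, row_len, align=16):
    """Rows whose start offset differs between the contiguous and the padded layout."""
    stride = padded_len(row_len, align)
    return [
        n
        for n in range(batch)
        if flat_offset(n, 0, row_len) != flat_offset(n, 0, stride)
    ]


# Shape from the repro: C=33, pool=7 -> row length 1617, padded to 1632.
print(mismatched_rows(3, 33 * 7 * 7))  # [1, 2]
# C=32 gives row length 1568, already a multiple of 16, so nothing shifts.
print(mismatched_rows(3, 32 * 7 * 7))  # []
```

Row 0 starts at offset 0 under either layout, so only later rows drift. The sketch does not account for the 1024-element threshold reported in the issue.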

Fix Action

Fixed

Fixed by PR: [inductor] Fix bug with contiguous checks and comprehensive_padding (https://github.com/pytorch/pytorch/pull/180898)
Closed with commit: 420f50ff37787786821a4b9543ade688de165d8d

PR fix notes

PR #180898: [inductor] Fix bug with contiguous checks and comprehensive_padding

Repository: pytorch/pytorch
Author: jansel
State: closed | merged: False
Link: https://github.com/pytorch/pytorch/pull/180898

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

-> #180898

Fixes #180848 Fixes #180956

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

test/inductor/test_torchinductor.py (modified, +16/-0)
torch/_inductor/ir.py (modified, +26/-3)

🐛 Describe the bug

torch.compile(backend='inductor') silently produces wrong results when adaptive_avg_pool2d is fused with flatten + sum (or mean). The bug triggers when batch > 1 and the flattened reduction dimension (C × pool_h × pool_w) exceeds 1024 with channels not a multiple of 16.

This is a silent correctness bug — no error is raised, the output is simply wrong. The error is large (up to 1e5 in max-autotune mode) and persists in float64, ruling out precision noise.

Minimal repro

import torch

x = torch.randn(3, 33, 8, 8, dtype=torch.float64, device='cuda')

def f(x):
    return torch.nn.functional.adaptive_avg_pool2d(x, 7).flatten(1).sum(dim=-1)

# Eager — correct
print(f"eager = {f(x).tolist()}")

# Inductor — wrong
torch._dynamo.reset()
compiled = torch.compile(f, backend='inductor')(x)
print(f"inductor = {compiled.tolist()}")
print(f"max_diff = {(f(x) - compiled).abs().max().item():.2e}")
# max_diff ~ 5e+00 to 8e+01 depending on seed

Expected behavior

Inductor output should match eager output. adaptive_avg_pool2d(x, 7).flatten(1).sum(dim=-1) is a standard global-pool-then-reduce pattern used in CNN classifiers.
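To make "expected" concrete without relying on either eager or Inductor, the pattern can be checked against a small pure-Python reference. This is a sketch for tiny inputs only; it assumes the usual adaptive-pooling window rule (start = floor(i·in/out), end = ceil((i+1)·in/out)), and the helper names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def adaptive_avg_pool2d_ref(plane, out_h, out_w):
    """Adaptive average pooling of one H x W plane given as nested lists."""
    in_h, in_w = len(plane), len(plane[0])
    out = []
    for i in range(out_h):
        h0, h1 = (i * in_h) // out_h, ceil_div((i + 1) * in_h, out_h)
        row = []
        for j in range(out_w):
            w0, w1 = (j * in_w) // out_w, ceil_div((j + 1) * in_w, out_w)
            window = [plane[h][w] for h in range(h0, h1) for w in range(w0, w1)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def pool_flatten_sum_ref(x, out_size):
    """Reference for adaptive_avg_pool2d(x, out_size).flatten(1).sum(dim=-1).

    x is a nested list shaped [N][C][H][W]; returns one float per sample.
    """
    result = []
    for sample in x:
        total = 0.0
        for plane in sample:
            pooled = adaptive_avg_pool2d_ref(plane, out_size, out_size)
            total += sum(v for row in pooled for v in row)
        result.append(total)
    return result


# All-ones input: every pooled value is 1.0, so each sample sums to C * 7 * 7.
ones = [[[[1.0] * 8 for _ in range(8)] for _ in range(3)] for _ in range(2)]
print(pool_flatten_sum_ref(ones, 7))  # [147.0, 147.0]
```

Feeding the same values to the compiled function (via `tensor.tolist()`) gives an independent check of which side is wrong when eager and Inductor disagree.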

Trigger conditions

| Condition | Required? | Notes |
|---|---|---|
| `batch > 1` | Yes | `batch=1` always correct; error grows with batch |
| `C % 16 != 0` | Yes | C=16,32,64,128,256,512 all safe |
| `C × pool_h × pool_w > 1024` | Yes | e.g. C=21, pool=7 → 21×49=1029 triggers; C=20 → 980 safe |
| `adaptive_avg_pool2d` fused with `flatten+sum` | Yes | Pool-only or flatten+sum-only are both correct separately |
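The shape-related conditions in the table can be written as a small predicate, which is handy for scanning a model's layer shapes for at-risk configurations. This is a sketch of the reported conditions only; the function name `may_trigger` is hypothetical, and it cannot tell whether Inductor actually fuses the pool with the reduction.

```python
def may_trigger(batch, channels, pool_h, pool_w):
    """True if the shape matches all reported shape conditions for the bug."""
    reduction_len = channels * pool_h * pool_w
    return batch > 1 and channels % 16 != 0 and reduction_len > 1024


# Values taken from the report.
print(may_trigger(3, 33, 7, 7))  # True: repro shape, 33 * 49 = 1617
print(may_trigger(3, 21, 7, 7))  # True: 21 * 49 = 1029
print(may_trigger(3, 20, 7, 7))  # False: 20 * 49 = 980
print(may_trigger(3, 32, 7, 7))  # False: channels is a multiple of 16
print(may_trigger(1, 33, 7, 7))  # False: batch of 1
```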

Root cause analysis

The alignment to multiples of 16 and the 1024-element threshold suggest a Triton reduction tile size boundary issue in the fused kernel's index computation.

Additional observations

All compile modes affected: default, reduce-overhead, max-autotune (worst: diff up to 2.2e5)
float64 diffs range from 1e0 to 1e2 — not a precision issue
Replacing sum with mean also triggers the bug
pool + sum(spatial_dims) without flatten is correct — the flatten reshape changes the reduction structure
Non-square spatial sizes (e.g., 8×10) also trigger
Different pool output sizes (5, 7) all trigger under the same conditions

Versions

PyTorch: 2.13.0.dev20260420+cu126 (nightly)
CUDA: 12.6
GPU: Tesla T4
OS: Ubuntu 20.04, Python 3.11

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

This silent correctness bug in torch.compile(backend='inductor'), triggered when adaptive_avg_pool2d is fused with flatten + sum, is fixed by PR #180898. On builds without that fix, keep the pool and the reduction out of the same fused kernel, or use a non-Inductor backend.

Guidance

Verify that the issue occurs only when the batch size is greater than 1 and the flattened reduction dimension exceeds 1024 with channels not a multiple of 16.
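Check whether a non-Inductor backend, such as aot_eager, produces the correct results. Note that inductor is already the default backend of torch.compile, so omitting the backend argument does not avoid the bug.
Consider preventing the fusion of adaptive_avg_pool2d with flatten + sum. Writing the operations as separate Python statements inside one compiled function is not enough, because Inductor fuses across statements; the two parts need to end up in separate compiled graphs.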
Test the code with different input sizes and shapes to ensure the issue is consistently reproduced.

Example

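import torch
import torch.nn.functional as F

x = torch.randn(3, 33, 8, 8, dtype=torch.float64, device='cuda')

def f(x):
    pool = F.adaptive_avg_pool2d(x, 7)
    # Force a graph break so the pool and the reduction are compiled as
    # separate graphs and cannot be fused into one kernel.
    torch._dynamo.graph_break()
    return pool.flatten(1).sum(dim=-1)

# Eager reference
expected = f(x)
print(f"eager = {expected.tolist()}")

# Inductor with the graph break. Assumption, not verified on the affected
# nightly: this should match eager, since the report found the pool and the
# flatten + sum to be correct when compiled separately.
torch._dynamo.reset()
compiled = torch.compile(f, backend='inductor')(x)
print(f"inductor = {compiled.tolist()}")
print(f"max_diff = {(expected - compiled).abs().max().item():.2e}")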

Notes

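The issue appears to be related to the tiling/indexing for the fused kernel in the inductor backend. The alignment to multiples of 16 and the 1024-element threshold suggest a Triton reduction tile size boundary issue. The fixing PR (#180898) changes torch/_inductor/ir.py and is titled as a fix for contiguous checks and comprehensive_padding, which suggests that stride and padding handling is involved as well.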

Recommendation

Upgrade to a PyTorch build that includes the fix from PR #180898 (commit 420f50ff37787786821a4b9543ade688de165d8d). On affected builds, work around the bug by keeping adaptive_avg_pool2d and flatten + sum in separate compiled graphs; writing them as separate statements inside one compiled function does not by itself prevent fusion.
