pytorch - torch.compile: mean/sum on non-contiguous tensor is much slower than eager

GitHub stats
pytorch/pytorch#181442 (fetched 2026-04-25 06:02:31)
Comments: 0
Participants: 1
Timeline: 45
Reactions: 0
Timeline (top): mentioned ×19, subscribed ×19, labeled ×7


Summary

For a simple x.mean(dim=1) on a 640 KB fp32 tensor whose reduction axis is non-contiguous, torch.compile generates two Triton kernels (a split reduction) and runs about 3.3x slower than eager, which uses a single column-reduction kernel. The extra kernel launch and intermediate buffer, combined with an uncoalesced access pattern, produce a regression that scales with how small and how "outer" the reduction is. Running with TORCHINDUCTOR_SPLIT_REDUCTIONS=0 (or torch._inductor.config.split_reductions = False) makes Inductor emit a single kernel for the outer case, but the compiled version is still about 3.2x slower than eager.
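
To make the layout concrete, here is a minimal pure-Python sketch of the address arithmetic for the tensor in the repro. It only restates the shape and strides given in the issue; the helper name `element_offset` is made up for illustration and is not a PyTorch API.

```python
# Layout from the repro: shape (640, 256), strides (1, 640), in elements.
SHAPE = (640, 256)
STRIDES = (1, 640)
ITEMSIZE = 4  # bytes per float32 element


def element_offset(i, j, strides=STRIDES):
    """Linear offset (in elements) of x[i, j] from the start of storage."""
    return i * strides[0] + j * strides[1]


# One step along dim 1 (the reduction axis) skips 640 elements.
step_reduction_axis = element_offset(0, 1) - element_offset(0, 0)
# One step along dim 0 (the kept axis) moves to the adjacent element.
step_kept_axis = element_offset(1, 0) - element_offset(0, 0)

total_bytes = SHAPE[0] * SHAPE[1] * ITEMSIZE

print("bytes between consecutive reduced elements:", step_reduction_axis * ITEMSIZE)
print("bytes between neighbouring output rows:", step_kept_axis * ITEMSIZE)
print("total bytes read:", total_bytes)
```

Each output value therefore sums 256 inputs that sit 2560 bytes apart, while neighbouring outputs start 4 bytes apart. This is the "outer" reduction pattern the summary refers to.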

Minimal repro

import statistics

import torch

# Shape (640, 256) with strides (1, 640): dim 0 (size 640) is contiguous;
# dim 1 (size 256, the reduction axis) has stride 640.
x = torch.empty_strided((640, 256), (1, 640),
                        dtype=torch.float32, device="cuda").normal_()
compiled = torch.compile(lambda t: t.mean(dim=1),
                         dynamic=False, mode="reduce-overhead")
# Check that the compiled result matches eager before timing anything.
torch.testing.assert_close(compiled(x), x.mean(dim=1), rtol=1e-3, atol=1e-4)


def bench(fn, n=2000, w=200):
    """Return the median time per call in microseconds, using CUDA events."""
    for _ in range(w):  # warm-up iterations, not timed
        fn()
    torch.cuda.synchronize()
    s = [torch.cuda.Event(enable_timing=True) for _ in range(n)]
    e = [torch.cuda.Event(enable_timing=True) for _ in range(n)]
    for i in range(n):
        s[i].record()
        fn()
        e[i].record()
    torch.cuda.synchronize()
    # elapsed_time() returns milliseconds; convert to microseconds.
    return statistics.median(a.elapsed_time(b) * 1e3 for a, b in zip(s, e))


for name, fn in [("eager  ", lambda: x.mean(dim=1)),
                 ("compile", lambda: compiled(x))]:
    t = bench(fn)
    # Effective bandwidth: input bytes read divided by the median time.
    print(f"{name}  {t:8.2f} us   {640*256*4/(t*1e-6)/1e9:6.1f} GB/s")

output:

eager       19.23 us     34.1 GB/s
compile     64.51 us     10.2 GB/s

Running with TORCHINDUCTOR_SPLIT_REDUCTIONS=0:

eager       19.31 us     33.9 GB/s
compile     61.07 us     10.7 GB/s
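
The GB/s column and the size of the gap follow directly from the timings above. The sketch below redoes that arithmetic in plain Python using only the numbers reported in the issue; the function name `bandwidth_gbs` and the dictionary layout are illustrative choices, not part of the original repro.

```python
NBYTES = 640 * 256 * 4  # input size in bytes (float32), as in the repro


def bandwidth_gbs(time_us, nbytes=NBYTES):
    """Effective read bandwidth in GB/s for one call taking time_us microseconds."""
    return nbytes / (time_us * 1e-6) / 1e9


# Median timings (microseconds) copied from the two runs reported above.
timings_us = {
    "default": {"eager": 19.23, "compile": 64.51},
    "TORCHINDUCTOR_SPLIT_REDUCTIONS=0": {"eager": 19.31, "compile": 61.07},
}

slowdowns = {}
for label, t in timings_us.items():
    slowdowns[label] = t["compile"] / t["eager"]
    print(f"{label}: eager {bandwidth_gbs(t['eager']):.1f} GB/s, "
          f"compile {bandwidth_gbs(t['compile']):.1f} GB/s, "
          f"slowdown {slowdowns[label]:.2f}x")

# Relative improvement in compiled time from disabling split reductions.
gain = 1 - timings_us["TORCHINDUCTOR_SPLIT_REDUCTIONS=0"]["compile"] / timings_us["default"]["compile"]
print(f"compiled time reduced by {gain:.1%}")
```

With these numbers the compiled version is a little over three times slower than eager in both runs, and disabling split reductions cuts the compiled time by only about 5%.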

cc @jerryzh168 @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

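No configuration-only fix is reported. The issue shows torch.compile running about 3.3x slower than eager for a reduction over a non-contiguous axis, and disabling split reductions narrows the gap only slightly. The open question is why Inductor's generated kernel trails eager's single column-reduction kernel.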

Guidance

  • Disabling split reductions (TORCHINDUCTOR_SPLIT_REDUCTIONS=0) is not sufficient on its own: in the reported numbers it lowers the compiled time from 64.51 us to 61.07 us (about 5%), while eager takes about 19.3 us.
  • Verify whether the slowdown is specific to a non-contiguous reduction axis, for example by timing the same reduction on a contiguous tensor of the same shape.
  • Compare other torch.compile settings, such as dynamic=True or mode="default", to see whether any of them narrows the gap for this case; the issue does not report results for these.
  • The problem is already tracked as pytorch/pytorch#181442; follow that issue rather than filing a duplicate.
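
One way to run the comparison suggested above is sketched here. It reuses the tensor layout and the CUDA-event timing approach from the repro; the `VARIANTS` table and the `compare_variants` helper are hypothetical names introduced for illustration, and the sketch has not been run against the hardware used in the issue, so it makes no claim about which variant is faster.

```python
import statistics

# Hypothetical table of torch.compile keyword arguments to compare.
# "issue" is the configuration from the repro; the others are the
# alternatives named in the guidance.
VARIANTS = {
    "issue": {"dynamic": False, "mode": "reduce-overhead"},
    "default-mode": {"dynamic": False, "mode": "default"},
    "dynamic": {"dynamic": True, "mode": "reduce-overhead"},
}


def compare_variants(n=2000, warmup=200):
    """Return {name: median microseconds} for eager and each compiled variant."""
    import torch  # deferred so VARIANTS can be inspected without a GPU

    if not torch.cuda.is_available():
        raise RuntimeError("this comparison needs a CUDA device")

    # Same non-contiguous layout as the repro.
    x = torch.empty_strided((640, 256), (1, 640),
                            dtype=torch.float32, device="cuda").normal_()

    def bench(fn):
        for _ in range(warmup):  # warm-up, also triggers compilation
            fn()
        torch.cuda.synchronize()
        starts = [torch.cuda.Event(enable_timing=True) for _ in range(n)]
        ends = [torch.cuda.Event(enable_timing=True) for _ in range(n)]
        for s, e in zip(starts, ends):
            s.record()
            fn()
            e.record()
        torch.cuda.synchronize()
        # elapsed_time() is in milliseconds; convert to microseconds.
        return statistics.median(s.elapsed_time(e) * 1e3
                                 for s, e in zip(starts, ends))

    results = {"eager": bench(lambda: x.mean(dim=1))}
    for name, kwargs in VARIANTS.items():
        torch._dynamo.reset()  # drop cached compilations between variants
        compiled = torch.compile(lambda t: t.mean(dim=1), **kwargs)
        results[name] = bench(lambda: compiled(x))
    return results


# On a CUDA machine:
# for name, us in compare_variants().items():
#     print(f"{name:14s} {us:8.2f} us")
```

To include the split-reduction setting in the same comparison, set TORCHINDUCTOR_SPLIT_REDUCTIONS=0 in the environment before starting Python, as the issue does.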

Notes

The repro and timings show torch.compile taking about 3.3x as long as eager (64.51 us vs. 19.23 us) for a reduction over a non-contiguous axis. The issue has no comments yet, so it is unclear whether this is a fundamental limitation of the generated kernels or an optimization opportunity.

Recommendation

No confirmed workaround: none of the settings tried in the issue brings compiled performance close to eager. Until the issue is resolved, benchmark this reduction in both modes and keep it in eager if the compiled version is slower on your hardware.
