pytorch: `torch.scatter_reduce` CPU performance regression (~15–58×) first seen in the 2.13.0.dev20260512 nightly

🐛 Describe the bug

torch.scatter_reduce on CPU is 15–58× slower in the 2.13.0.dev20260512+cu130 nightly than in the 20260428 and 20260505 nightlies. GPU time is identical across all three builds, which isolates the regression to the CPU dispatch or algorithm path. Two shape and reduce-op combinations are affected (float32 amax and float16 mean). Both were fine on Apr 28 and May 5 and first regressed on May 12.

To Reproduce

import torch
import time
import statistics

def bench_cpu(fn, *args, n=30, **kwargs):
    fn(*args, **kwargs)  # warmup
    samples = [0.0] * n
    for i in range(n):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        samples[i] = time.perf_counter() - t0
    return statistics.median(samples) * 1e3  # ms

# Seeded generator; it must be passed to each random call to take effect.
rng = torch.Generator()
rng.manual_seed(880014)

# Shape A: float32 amax
input_a = torch.randn(128, 64, 128, generator=rng, dtype=torch.float32)
index_a = torch.randint(0, 64, (128, 256, 128), generator=rng, dtype=torch.int64)
src_a   = torch.randn(128, 256, 128, generator=rng, dtype=torch.float32)
t_a = bench_cpu(torch.scatter_reduce,
                input_a.clone(), 1, index_a, src_a,
                reduce="amax", include_self=False)
print(f"Shape A: {t_a:.2f} ms")   # expect ~1.6 ms; regressed to ~24 ms on May 12

# Shape B: float16 mean
input_b = torch.randn(32, 128, 512, generator=rng, dtype=torch.float16)
index_b = torch.randint(0, 128, (32, 256, 512), generator=rng, dtype=torch.int64)
src_b   = torch.randn(32, 256, 512, generator=rng, dtype=torch.float16)
t_b = bench_cpu(torch.scatter_reduce,
                input_b.clone(), 1, index_b, src_b,
                reduce="mean", include_self=True)
print(f"Shape B: {t_b:.2f} ms")   # expect ~1.3 ms; regressed to ~78 ms on May 12

Expected Behavior

CPU median latency consistent with the Apr 28 baseline:

| Shape | dtype | reduce | include_self | Expected |
|---|---|---|---|---|
| [128, 64, 128] input, [128, 256, 128] src, dim=1 | float32 | amax | False | ~1.6 ms |
| [32, 128, 512] input, [32, 256, 512] src, dim=1 | float16 | mean | True | ~1.3 ms |
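
For readers less familiar with the two configurations, the sketch below spells out what `reduce="amax", include_self=False` and `reduce="mean", include_self=True` compute along dim=1. It is a pure-Python illustration on 2-D nested lists rather than the 3-D tensors used in the report, and `scatter_reduce_dim1` is a hypothetical helper, not a PyTorch function. It follows the documented semantics as I understand them and says nothing about how the CPU kernel is implemented.

```python
# Hypothetical reference helper (not a PyTorch API): scatter-reduce along dim=1
# for 2-D nested lists, covering only the two reduce modes from this report.

def scatter_reduce_dim1(inp, index, src, reduce, include_self):
    """Reduce src[i][j] into out[i][index[i][j]]; untouched slots keep inp's value."""
    out = [row[:] for row in inp]
    for i, (idx_row, src_row) in enumerate(zip(index, src)):
        # Group the source values by the output slot they are scattered into.
        buckets = {}
        for target, value in zip(idx_row, src_row):
            buckets.setdefault(target, []).append(value)
        for target, values in buckets.items():
            if include_self:
                # The original element takes part in the reduction.
                values = [inp[i][target]] + values
            if reduce == "amax":
                out[i][target] = max(values)
            elif reduce == "mean":
                out[i][target] = sum(values) / len(values)
            else:
                raise ValueError(f"unsupported reduce: {reduce}")
    return out


inp = [[0.0, 10.0, 20.0]]
index = [[0, 0, 2, 2]]
src = [[1.0, 5.0, 3.0, 7.0]]

amax_out = scatter_reduce_dim1(inp, index, src, "amax", include_self=False)
mean_out = scatter_reduce_dim1(inp, index, src, "mean", include_self=True)
print(amax_out)  # [[5.0, 10.0, 7.0]]
print(mean_out)  # [[2.0, 10.0, 10.0]]
```

In the amax case slot 2 ends up as 7.0 even though the input held 20.0, because `include_self=False` drops the original value from any slot that receives source elements. Slot 1 receives nothing and keeps its input value in both modes.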

Actual Behavior

| Shape | Build | CPU time | Slowdown |
|---|---|---|---|
| Shape A (float32 amax) | 20260428 | 1.59 ms | |
| Shape A (float32 amax) | 20260505 | 1.57 ms | |
| Shape A (float32 amax) | 20260512 | 24.00 ms | 15× |
| Shape B (float16 mean) | 20260428 | 1.33 ms | |
| Shape B (float16 mean) | 20260505 | 1.31 ms | |
| Shape B (float16 mean) | 20260512 | 77.98 ms | 58× |
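
The slowdown figures can be checked from the medians listed above. The sketch below assumes the Apr 28 build is the baseline, as the Expected Behavior section suggests; the dictionary layout and the `slowdown` helper are illustrative only.

```python
# Median CPU times in milliseconds, copied from the measurements above.
timings_ms = {
    "Shape A (float32 amax)": {"20260428": 1.59, "20260505": 1.57, "20260512": 24.00},
    "Shape B (float16 mean)": {"20260428": 1.33, "20260505": 1.31, "20260512": 77.98},
}


def slowdown(times, baseline="20260428", regressed="20260512"):
    """Ratio of the regressed build's median time to the baseline build's."""
    return times[regressed] / times[baseline]


for name, times in timings_ms.items():
    print(f"{name}: {slowdown(times):.1f}x slower than the 20260428 build")
```

This gives about 15.1× for Shape A and 58.6× for Shape B, so the 15× and 58× in the report appear to be truncated rather than rounded.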

GPU time is ~0.14 ms across all three builds for both shapes, so the regression is CPU-only.

Versions

PyTorch version: 2.13.0.dev20260512+cu130
Python:          3.12.3 (GCC 13.3.0)
OS:              Linux 6.14.0-37-generic x86_64 (Ubuntu, glibc 2.39)
numpy:           2.4.4
GPU:             NVIDIA GeForce RTX 5090 (sm_120)
CUDA:            13.0

Bisect range: works in 2.13.0.dev20260505+cu130, broken in 2.13.0.dev20260512+cu130.
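
To narrow that range, each nightly between the last good and first bad build could be tested in turn. The sketch below only enumerates the candidate version strings. It assumes one nightly per calendar day with the same `2.13.0.dev<date>+cu130` naming, which may not hold for every date, and `candidate_nightlies` is a hypothetical helper.

```python
from datetime import date, timedelta


def candidate_nightlies(good, bad, prefix="2.13.0.dev", suffix="+cu130"):
    """Version strings for each day after `good`, up to and including `bad`."""
    days = (bad - good).days
    return [
        f"{prefix}{(good + timedelta(days=n)):%Y%m%d}{suffix}"
        for n in range(1, days + 1)
    ]


# Last known-good and first known-bad builds from the bisect range above.
candidates = candidate_nightlies(date(2026, 5, 5), date(2026, 5, 12))
for version in candidates:
    print(version)
```

That leaves seven builds to check, 20260506 through 20260512. Running the reproduction script on the middle candidate first would halve the remaining range at each step.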

cc @jerryzh168 @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @aditew01 @pragupta
