pytorch - Fix `torch.scatter_reduce` CPU performance regression (~15–58×) introduced in 2.13.0

Fix Action

Fix / Workaround

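`torch.scatter_reduce` on CPU is 15–58× slower in the 2.13.0.dev20260512+cu130 nightly than in the 20260428 and 20260505 nightlies. GPU time is essentially identical across all three builds (~0.14 ms), which isolates the regression to the CPU dispatch or algorithm path. Two tested configurations show the regression: float32 `amax` with `include_self=False` and float16 `mean` with `include_self=True`, both reducing along dim=1. Both ran at baseline speed on Apr 28 and May 5 and first regressed on May 12.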
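To see where the CPU time goes, the call can be wrapped in `torch.profiler`. This is a minimal sketch and not part of the original report: the helper name `profile_scatter_reduce` and its small default shapes are illustrative assumptions, and the shapes from the report can be passed in to profile the regressed case.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_scatter_reduce(in_shape=(8, 16, 32), src_shape=(8, 64, 32),
                           reduce="amax", include_self=False,
                           dtype=torch.float32, iters=5):
    """Profile torch.scatter_reduce along dim=1 on CPU; return a summary table.

    Hypothetical helper for illustration only.
    """
    gen = torch.Generator().manual_seed(0)
    # Sample in float32 and cast, so the helper works for any float dtype.
    inp = torch.randn(in_shape, generator=gen).to(dtype)
    src = torch.randn(src_shape, generator=gen).to(dtype)
    index = torch.randint(0, in_shape[1], src_shape,
                          dtype=torch.int64, generator=gen)

    # Warmup call, kept outside the profiled region.
    torch.scatter_reduce(inp, 1, index, src,
                         reduce=reduce, include_self=include_self)

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(iters):
            torch.scatter_reduce(inp, 1, index, src,
                                 reduce=reduce, include_self=include_self)

    # Operators sorted by time spent in the op itself (excluding children).
    return prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)


if __name__ == "__main__":
    print(profile_scatter_reduce())
```

Comparing the table from a 20260505 environment with one from 20260512 should show whether the extra time sits in the scatter kernel itself or in surrounding ops such as copies or dtype conversions.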

🐛 Describe the bug

To Reproduce

import statistics
import time

import torch

def bench_cpu(fn, *args, n=30, **kwargs):
    """Return the median wall-clock latency of fn(*args, **kwargs) in ms."""
    fn(*args, **kwargs)  # warmup, excluded from the timing
    samples = [0.0] * n
    for i in range(n):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        samples[i] = time.perf_counter() - t0
    return statistics.median(samples) * 1e3  # ms

# Seeded generator, passed to every random call so the inputs are reproducible.
rng = torch.Generator()
rng.manual_seed(880014)

# Shape A: float32 amax
input_a = torch.randn(128, 64, 128, dtype=torch.float32, generator=rng)
index_a = torch.randint(0, 64, (128, 256, 128), dtype=torch.int64, generator=rng)
src_a   = torch.randn(128, 256, 128, dtype=torch.float32, generator=rng)
# torch.scatter_reduce is out-of-place, so the cloned input is not modified between runs.
t_a = bench_cpu(torch.scatter_reduce,
                input_a.clone(), 1, index_a, src_a,
                reduce="amax", include_self=False)
print(f"Shape A: {t_a:.2f} ms")   # expect ~1.6 ms; regressed to ~24 ms on May 12

# Shape B: float16 mean
input_b = torch.randn(32, 128, 512, dtype=torch.float16, generator=rng)
index_b = torch.randint(0, 128, (32, 256, 512), dtype=torch.int64, generator=rng)
src_b   = torch.randn(32, 256, 512, dtype=torch.float16, generator=rng)
t_b = bench_cpu(torch.scatter_reduce,
                input_b.clone(), 1, index_b, src_b,
                reduce="mean", include_self=True)
print(f"Shape B: {t_b:.2f} ms")   # expect ~1.3 ms; regressed to ~78 ms on May 12
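One further diagnostic, offered as a hedged sketch: timing the same call at several intra-op thread counts can indicate whether the slowdown depends on parallelism. That the regression is thread-related is an assumption, not something the report establishes; the helper name `sweep_threads` and the small default shapes are illustrative.

```python
import statistics
import time

import torch


def sweep_threads(thread_counts=(1, 2, 4), in_shape=(16, 32, 64),
                  src_shape=(16, 128, 64), reduce="amax",
                  include_self=False, n=10):
    """Median scatter_reduce latency (ms) per intra-op thread count.

    Hypothetical helper; shapes default to small values so it runs quickly.
    """
    gen = torch.Generator().manual_seed(0)
    inp = torch.randn(in_shape, generator=gen)
    src = torch.randn(src_shape, generator=gen)
    index = torch.randint(0, in_shape[1], src_shape,
                          dtype=torch.int64, generator=gen)

    original = torch.get_num_threads()
    results = {}
    try:
        for count in thread_counts:
            torch.set_num_threads(count)
            # Warmup at this thread count, excluded from the timing.
            torch.scatter_reduce(inp, 1, index, src,
                                 reduce=reduce, include_self=include_self)
            samples = []
            for _ in range(n):
                t0 = time.perf_counter()
                torch.scatter_reduce(inp, 1, index, src,
                                     reduce=reduce, include_self=include_self)
                samples.append(time.perf_counter() - t0)
            results[count] = statistics.median(samples) * 1e3
    finally:
        # Restore the process-wide setting so later code is unaffected.
        torch.set_num_threads(original)
    return results


if __name__ == "__main__":
    for count, ms in sweep_threads().items():
        print(f"{count} thread(s): {ms:.3f} ms")
```

If the regressed build is slow even with one thread, the parallel scheduling is unlikely to be the cause; if it is slow only at higher counts, the parallel path is a reasonable place to look.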

Expected Behavior

CPU median latency consistent with the Apr 28 baseline:

| Shape | dtype | reduce | include_self | Expected CPU median |
| --- | --- | --- | --- | --- |
| `[128, 64, 128]` input, `[128, 256, 128]` src, dim=1 | float32 | amax | False | ~1.6 ms |
| `[32, 128, 512]` input, `[32, 256, 512]` src, dim=1 | float16 | mean | True | ~1.3 ms |

Actual Behavior

| Shape | Build | CPU time (median) | Slowdown vs. 20260428 |
| --- | --- | --- | --- |
| Shape A (float32 amax) | 20260428 | 1.59 ms | baseline |
| Shape A (float32 amax) | 20260505 | 1.57 ms | none |
| Shape A (float32 amax) | 20260512 | 24.00 ms | 15× |
| Shape B (float16 mean) | 20260428 | 1.33 ms | baseline |
| Shape B (float16 mean) | 20260505 | 1.31 ms | none |
| Shape B (float16 mean) | 20260512 | 77.98 ms | 58× |
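The slowdown factors follow from dividing each regressed median by its Apr 28 baseline. A short worked check, using only the numbers from the table above (the dictionary and function names are illustrative):

```python
# Median CPU latencies in ms, copied from the table above.
BASELINE_MS = {"A (float32 amax)": 1.59, "B (float16 mean)": 1.33}     # 20260428
REGRESSED_MS = {"A (float32 amax)": 24.00, "B (float16 mean)": 77.98}  # 20260512


def slowdown(baseline_ms, regressed_ms):
    """Ratio of regressed latency to baseline latency."""
    return regressed_ms / baseline_ms


factors = {name: slowdown(BASELINE_MS[name], REGRESSED_MS[name])
           for name in BASELINE_MS}

for name, factor in factors.items():
    print(f"Shape {name}: {factor:.1f}x slower")
# Shape A (float32 amax): 15.1x slower
# Shape B (float16 mean): 58.6x slower
```

The table reports these as 15× and 58×, that is, with the fractional part dropped.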

GPU time is ~0.14 ms across all three builds for both shapes, so the regression is CPU-only.
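The report does not include the script used for the GPU measurement. One plausible way to take it, sketched here as an assumption rather than the reporter's method, is to time the call with CUDA events; `bench_gpu` is a hypothetical helper that returns `None` when no CUDA device is present.

```python
import statistics

import torch


def bench_gpu(fn, *args, n=30, **kwargs):
    """Median GPU latency of fn(*args, **kwargs) in ms, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    fn(*args, **kwargs)          # warmup, excluded from the timing
    torch.cuda.synchronize()
    samples = []
    for _ in range(n):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn(*args, **kwargs)
        end.record()
        end.synchronize()        # wait until the work before `end` has finished
        samples.append(start.elapsed_time(end))  # elapsed_time reports ms
    return statistics.median(samples)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Shape A from the report, created on the GPU when one is available.
    input_a = torch.randn(128, 64, 128, dtype=torch.float32, device=device)
    index_a = torch.randint(0, 64, (128, 256, 128), dtype=torch.int64, device=device)
    src_a = torch.randn(128, 256, 128, dtype=torch.float32, device=device)
    t = bench_gpu(torch.scatter_reduce, input_a, 1, index_a, src_a,
                  reduce="amax", include_self=False)
    print("CUDA not available" if t is None else f"Shape A (GPU): {t:.3f} ms")
```

Event-based timing is used because CUDA kernels launch asynchronously, so wall-clock timing without synchronization would mostly measure launch overhead.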

Versions

PyTorch version: 2.13.0.dev20260512+cu130
Python:          3.12.3 (GCC 13.3.0)
OS:              Linux 6.14.0-37-generic x86_64 (Ubuntu, glibc 2.39)
numpy:           2.4.4
GPU:             NVIDIA GeForce RTX 5090 (sm_120)
CUDA:            13.0

Bisect range: works in 2.13.0.dev20260505+cu130, broken in 2.13.0.dev20260512+cu130.
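The bisect range can be narrowed by testing the nightlies between the two builds. The sketch below only plans the search: it assumes a nightly exists for every calendar date in the range, and the `is_bad` predicate (install a build, run the benchmark, report whether it is slow) is left to the reader as a hypothetical callback.

```python
from datetime import datetime, timedelta


def nightly_tags(good="20260505", bad="20260512"):
    """Date tags from the last good nightly to the first known-bad one, inclusive."""
    start = datetime.strptime(good, "%Y%m%d").date()
    end = datetime.strptime(bad, "%Y%m%d").date()
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]


def first_bad(tags, is_bad):
    """Binary search for the earliest bad tag.

    Assumes tags[0] is known good, tags[-1] is known bad, and every build
    after the first bad one is also bad.
    """
    lo, hi = 0, len(tags) - 1   # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(tags[mid]):
            hi = mid
        else:
            lo = mid
    return tags[hi]


if __name__ == "__main__":
    tags = nightly_tags()
    print("Candidate nightlies:", ", ".join(tags))
    # Each is_bad(tag) call corresponds to one manual install-and-benchmark step;
    # with 8 tags, at most 3 intermediate builds need testing.
```

Once the first slow nightly is known, the commits between it and the previous nightly form a much smaller range to inspect.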

cc @jerryzh168 @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @aditew01 @pragupta
