pytorch: `F.conv2d` CPU performance regression (~4–16×) for bfloat16 inputs introduced in 2.13.0



🐛 Describe the bug

torch.nn.functional.conv2d on CPU is 4–16× slower for bfloat16 inputs in the 2.13.0.dev20260512+cu130 nightly than in the 20260428 and 20260505 nightlies. GPU time is identical across all three builds, which isolates the regression to the CPU dispatch or kernel selection path.

Two independent conv configurations are affected: a dilated 7×7 NCHW conv (Shape A) and a grouped 3×3 channels_last conv (Shape B). These are presumed to exercise distinct algorithm paths (Winograd/FFT vs. NHWC grouped-conv), which suggests that a change landing between the May 5 and May 12 nightlies disabled multiple bfloat16 CPU fast paths at once.
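One way to check which kernels are actually selected, rather than inferring it from timings, is to record the op names under the CPU profiler on both a good and a bad nightly and compare them. The sketch below is an assumption about how one might do that; `cpu_op_names`, `conv_ops`, and `conv_ops_without_onednn` are hypothetical helper names, not part of the report, and the exact op names that appear depend on the build.

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile


def cpu_op_names(fn, warmup=2):
    """Run fn under the CPU profiler and return the set of recorded op names."""
    for _ in range(warmup):
        fn()  # keep one-time setup out of the profiled call
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        fn()
    return {evt.key for evt in prof.key_averages()}


def conv_ops(fn):
    """Keep only op names that mention convolution, sorted for stable diffs."""
    return sorted(name for name in cpu_op_names(fn) if "conv" in name.lower())


def conv_ops_without_onednn(fn):
    """Same as conv_ops, but with the oneDNN (mkldnn) backend disabled."""
    with torch.backends.mkldnn.flags(enabled=False):
        return conv_ops(fn)


if __name__ == "__main__":
    # Shape A from the report: dilated 7x7, NCHW, bfloat16.
    inp_a = torch.randn(8, 16, 112, 112, dtype=torch.bfloat16)
    w_a = torch.randn(32, 16, 7, 7, dtype=torch.bfloat16)
    print(conv_ops(lambda: F.conv2d(inp_a, w_a, dilation=2)))
```

If the list printed on the 20260512 nightly differs from the one printed on 20260505, the differing op is a candidate for the path that changed. Comparing against `conv_ops_without_onednn` on the older build may also show whether the regressed build behaves as if oneDNN were disabled; that is a hypothesis to test, not a conclusion from the report.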

To Reproduce

import torch
import torch.nn.functional as F
import time
import statistics

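def bench_cpu(fn, n=30):
    """Return the median wall-clock time of fn() in milliseconds over n runs."""
    fn()  # warmup: keeps one-time setup cost out of the samples
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples) * 1e3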

torch.manual_seed(0)

# Shape A: dilated 7×7, NCHW, bfloat16
inp_a = torch.randn(8, 16, 112, 112, dtype=torch.bfloat16)
w_a   = torch.randn(32, 16, 7, 7,    dtype=torch.bfloat16)
t_a   = bench_cpu(lambda: F.conv2d(inp_a, w_a, stride=1, padding=0, dilation=2))
print(f"Shape A: {t_a:.2f} ms")   # expect ~1.9 ms; regressed to ~29 ms on May 12

# Shape B: grouped 3×3, channels_last, bfloat16
inp_b = torch.randn(8, 256, 56, 56, dtype=torch.bfloat16).to(memory_format=torch.channels_last)
w_b   = torch.randn(512, 32, 3, 3,  dtype=torch.bfloat16).to(memory_format=torch.channels_last)
t_b   = bench_cpu(lambda: F.conv2d(inp_b, w_b, stride=1, padding=1, groups=8))
print(f"Shape B: {t_b:.2f} ms")   # expect ~3.0 ms; regressed to ~12 ms on May 12
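As a cross-check on the hand-rolled timer above, the same measurement can be taken with `torch.utils.benchmark.Timer`, which handles warmup and pins the intra-op thread count for the duration of the run. This is a sketch under the assumption that thread count should be held fixed when comparing nightlies; `median_ms` is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark


def median_ms(fn, num_threads=None, min_run_time=1.0):
    """Median time of fn() in milliseconds, measured by torch.utils.benchmark."""
    threads = num_threads or torch.get_num_threads()
    timer = benchmark.Timer(stmt="fn()", globals={"fn": fn}, num_threads=threads)
    # Measurement.median is reported in seconds per call.
    return timer.blocked_autorange(min_run_time=min_run_time).median * 1e3


if __name__ == "__main__":
    # Shape A: dilated 7x7, NCHW, bfloat16
    inp_a = torch.randn(8, 16, 112, 112, dtype=torch.bfloat16)
    w_a = torch.randn(32, 16, 7, 7, dtype=torch.bfloat16)
    # Shape B: grouped 3x3, channels_last, bfloat16
    inp_b = torch.randn(8, 256, 56, 56, dtype=torch.bfloat16).to(memory_format=torch.channels_last)
    w_b = torch.randn(512, 32, 3, 3, dtype=torch.bfloat16).to(memory_format=torch.channels_last)

    print(f"threads: {torch.get_num_threads()}")
    print(f"Shape A: {median_ms(lambda: F.conv2d(inp_a, w_a, dilation=2)):.2f} ms")
    print(f"Shape B: {median_ms(lambda: F.conv2d(inp_b, w_b, padding=1, groups=8)):.2f} ms")
```

Running it with `num_threads=1` as well as the default can help separate a per-kernel slowdown from a change in how work is parallelized.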

Expected Behavior

CPU median latency consistent with the Apr 28 baseline:

| Shape | Config | Expected CPU |
|---|---|---|
| [8, 16, 112, 112] input, [32, 16, 7, 7] weight | bf16, dilation=2, NCHW | ~1.9 ms |
| [8, 256, 56, 56] input, [512, 32, 3, 3] weight | bf16, groups=8, channels_last | ~3.0 ms |

Actual Behavior

| Shape | Build | CPU time | GPU time | Slowdown |
|---|---|---|---|---|
| Shape A (bf16, dilation=2, NCHW) | 20260428 | 1.86 ms | 0.193 ms | |
| Shape A (bf16, dilation=2, NCHW) | 20260505 | 1.92 ms | 0.193 ms | |
| Shape A (bf16, dilation=2, NCHW) | 20260512 | 29.40 ms | 0.193 ms | ~16× |
| Shape B (bf16, groups=8, channels_last) | 20260428 | 3.01 ms | 0.308 ms | |
| Shape B (bf16, groups=8, channels_last) | 20260505 | 3.01 ms | 0.308 ms | |
| Shape B (bf16, groups=8, channels_last) | 20260512 | 11.99 ms | 0.308 ms | ~4× |

GPU time is unchanged across all three builds, so the regression is CPU-only.
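The slowdown figures follow directly from the CPU times reported above. A short worked check, using only the numbers from the report (the `slowdown` helper is a hypothetical name for illustration):

```python
# CPU median latencies in milliseconds, copied from the report.
cpu_ms = {
    "A": {"20260428": 1.86, "20260505": 1.92, "20260512": 29.40},
    "B": {"20260428": 3.01, "20260505": 3.01, "20260512": 11.99},
}


def slowdown(shape, baseline="20260428", regressed="20260512"):
    """Ratio of regressed to baseline CPU time for one shape."""
    return cpu_ms[shape][regressed] / cpu_ms[shape][baseline]


for shape in cpu_ms:
    print(f"Shape {shape}: {slowdown(shape):.1f}x")
# Shape A: 15.8x
# Shape B: 4.0x
```

Measured against the 20260505 build instead, Shape A comes out at about 15.3×, so the choice of baseline does not change the picture.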

Versions

PyTorch version: 2.13.0.dev20260512+cu130
Python:          3.12.3 (GCC 13.3.0)
OS:              Linux 6.14.0-37-generic x86_64 (Ubuntu, glibc 2.39)
numpy:           2.4.4
GPU:             NVIDIA GeForce RTX 5090 (sm_120)
CUDA:            13.0

Bisect range: works in 2.13.0.dev20260505+cu130, broken in 2.13.0.dev20260512+cu130.
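To narrow the bisect range from nightly dates to commits, it may help to capture the build metadata on each nightly and compare the two. The sketch below assumes that the commit hash and build configuration exposed by the installed wheel are enough to start from; `build_snapshot` is a hypothetical helper name.

```python
import torch


def build_snapshot():
    """Collect build details worth diffing between a good and a bad nightly."""
    return {
        "version": torch.__version__,
        "git_version": torch.version.git_version,  # commit the wheel was built from
        "num_threads": torch.get_num_threads(),
        "mkldnn_available": torch.backends.mkldnn.is_available(),
        "config": torch.__config__.show(),  # compiler, BLAS, and oneDNN build info
    }


if __name__ == "__main__":
    for key, value in build_snapshot().items():
        print(f"{key}: {value}")
```

With the `git_version` values from the 20260505 and 20260512 wheels, the candidate commits are those between the two hashes in the PyTorch repository history.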

cc @jerryzh168 @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @aditew01
