pytorch - 💡(How to fix) Fix `F.conv2d` CPU performance regression (~4–16×) for bfloat16 inputs introduced in 2.13.0.

pytorch · 2026-05-14 23:36:21

Fix Action

Fix / Workaround

torch.nn.functional.conv2d on CPU is 4–16× slower for bfloat16 inputs in the 2.13.0.dev20260512+cu130 nightly compared to 20260428 and 20260505. GPU time is identical across all three builds, isolating the regression to the CPU dispatch or kernel selection path.
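One way to narrow down whether the oneDNN (mkldnn) backend is involved is to time the same call with that backend switched off. The sketch below is a diagnostic under assumptions, not a confirmed workaround: the helper name `time_conv` and the reduced tensor shapes are illustrative, and the report does not establish that the regression lives in the oneDNN path.

```python
import statistics
import time

import torch
import torch.nn.functional as F


def time_conv(inp, weight, n=10, **conv_kwargs):
    """Median latency of F.conv2d in ms (hypothetical helper, not from the report)."""
    F.conv2d(inp, weight, **conv_kwargs)  # warmup
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        F.conv2d(inp, weight, **conv_kwargs)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples) * 1e3


# Reduced version of Shape A so the check runs quickly (assumption: smaller
# shapes still hit the same code path; use the report's shapes to be sure).
inp = torch.randn(2, 16, 64, 64, dtype=torch.bfloat16)
w = torch.randn(32, 16, 7, 7, dtype=torch.bfloat16)

results = {"mkldnn_default": time_conv(inp, w, dilation=2)}
with torch.backends.mkldnn.flags(enabled=False):
    # Inside this context the oneDNN convolution path is not selected.
    results["mkldnn_disabled"] = time_conv(inp, w, dilation=2)

for name, ms in results.items():
    print(f"{name}: {ms:.2f} ms")
```

If the two timings are close on a regressed build but far apart on a known-good build, that would hint that the fast path is no longer being selected; it would not by itself identify the offending commit.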


🐛 Describe the bug

Two independent conv configurations are affected: a dilated 7×7 NCHW conv (Shape A) and a grouped 3×3 channels_last conv (Shape B). These likely exercise distinct kernel paths (dilated NCHW vs. NHWC grouped convolution), which suggests that a change landed between the May 5 and May 12 nightlies disabled more than one bfloat16 CPU fast path.
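To see which operators the dispatcher actually selects, rather than guessing at the kernel path, the call can be wrapped in the PyTorch profiler. This is a minimal sketch with a reduced shape (an assumption made for speed); the operator names it prints depend on the build and are not stated in the report.

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile

# Reduced version of Shape A (assumption: illustrative sizes only).
inp = torch.randn(2, 16, 64, 64, dtype=torch.bfloat16)
w = torch.randn(32, 16, 7, 7, dtype=torch.bfloat16)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    F.conv2d(inp, w, dilation=2)

# Collect the names of convolution-related operators recorded on CPU.
op_names = [evt.key for evt in prof.key_averages()]
conv_ops = sorted(name for name in op_names if "conv" in name)
print(conv_ops)
```

Running this on a known-good nightly and on the regressed nightly, then comparing the two lists, would show whether the selected backend operator changed between builds.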

To Reproduce

import torch
import torch.nn.functional as F
import time
import statistics

def bench_cpu(fn, n=30):
    """Return the median wall-clock latency of fn() in milliseconds over n runs."""
    fn()  # warmup
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples) * 1e3

torch.manual_seed(0)

# Shape A: dilated 7×7, NCHW, bfloat16
inp_a = torch.randn(8, 16, 112, 112, dtype=torch.bfloat16)
w_a   = torch.randn(32, 16, 7, 7,    dtype=torch.bfloat16)
t_a   = bench_cpu(lambda: F.conv2d(inp_a, w_a, stride=1, padding=0, dilation=2))
print(f"Shape A: {t_a:.2f} ms")   # expect ~1.9 ms; regressed to ~29 ms on May 12

# Shape B: grouped 3×3, channels_last, bfloat16
inp_b = torch.randn(8, 256, 56, 56, dtype=torch.bfloat16).to(memory_format=torch.channels_last)
w_b   = torch.randn(512, 32, 3, 3,  dtype=torch.bfloat16).to(memory_format=torch.channels_last)
t_b   = bench_cpu(lambda: F.conv2d(inp_b, w_b, stride=1, padding=1, groups=8))
print(f"Shape B: {t_b:.2f} ms")   # expect ~3.0 ms; regressed to ~12 ms on May 12
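As a cross-check on the hand-rolled timer, the same measurement can be taken with `torch.utils.benchmark.Timer`, which handles warmup and pins the thread count explicitly. This is a sketch under assumptions: the reduced shape and the 0.5 s minimum run time are illustrative choices, not values from the report.

```python
import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark

# Reduced version of Shape A (assumption: illustrative sizes only).
inp = torch.randn(2, 16, 64, 64, dtype=torch.bfloat16)
w = torch.randn(32, 16, 7, 7, dtype=torch.bfloat16)

timer = benchmark.Timer(
    stmt="F.conv2d(inp, w, stride=1, padding=0, dilation=2)",
    globals={"F": F, "inp": inp, "w": w},
    # Timer defaults to a single thread; use the current intra-op thread
    # count so the result is comparable to a plain script run.
    num_threads=torch.get_num_threads(),
    label="conv2d bf16 CPU",
)
measurement = timer.blocked_autorange(min_run_time=0.5)
median_ms = measurement.median * 1e3
print(f"median: {median_ms:.2f} ms")
```

Recording the thread count alongside the latency helps rule out a threading change as the cause when comparing nightlies.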

Expected Behavior

CPU median latency consistent with the Apr 28 baseline:

| Shape | Config | Expected CPU median |
| --- | --- | --- |
| `[8, 16, 112, 112]` input, `[32, 16, 7, 7]` weight | bf16, dilation=2, NCHW | ~1.9 ms |
| `[8, 256, 56, 56]` input, `[512, 32, 3, 3]` weight | bf16, groups=8, channels_last | ~3.0 ms |

Actual Behavior

| Shape | Build | CPU time | GPU time | Slowdown vs. 20260428 |
| --- | --- | --- | --- | --- |
| Shape A (bf16, dilation=2, NCHW) | 20260428 | 1.86 ms | 0.193 ms | baseline |
| Shape A (bf16, dilation=2, NCHW) | 20260505 | 1.92 ms | 0.193 ms | ~1.0× |
| Shape A (bf16, dilation=2, NCHW) | 20260512 | 29.40 ms | 0.193 ms | ~16× |
| Shape B (bf16, groups=8, channels_last) | 20260428 | 3.01 ms | 0.308 ms | baseline |
| Shape B (bf16, groups=8, channels_last) | 20260505 | 3.01 ms | 0.308 ms | ~1.0× |
| Shape B (bf16, groups=8, channels_last) | 20260512 | 11.99 ms | 0.308 ms | ~4× |

GPU time is unchanged across all three builds — the regression is CPU-only.
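The slowdown factors follow directly from the CPU medians in the table. A small sketch of the arithmetic, using only the numbers reported above (the function name `slowdown` is illustrative):

```python
# CPU medians in ms, copied from the table above.
cpu_ms = {
    "Shape A": {"20260428": 1.86, "20260505": 1.92, "20260512": 29.40},
    "Shape B": {"20260428": 3.01, "20260505": 3.01, "20260512": 11.99},
}


def slowdown(times, baseline="20260428", regressed="20260512"):
    """Ratio of the regressed build's latency to the baseline build's latency."""
    return times[regressed] / times[baseline]


ratios = {shape: slowdown(times) for shape, times in cpu_ms.items()}
for shape, ratio in ratios.items():
    print(f"{shape}: {ratio:.1f}x")
# Shape A: 15.8x
# Shape B: 4.0x
```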

Versions

PyTorch version: 2.13.0.dev20260512+cu130
Python:          3.12.3 (GCC 13.3.0)
OS:              Linux 6.14.0-37-generic x86_64 (Ubuntu, glibc 2.39)
numpy:           2.4.4
GPU:             NVIDIA GeForce RTX 5090 (sm_120)
CUDA:            13.0

Bisect range: works in 2.13.0.dev20260505+cu130, broken in 2.13.0.dev20260512+cu130.
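When testing further nightlies, it can help to classify a build against this bisect range from its version string. The helper below is a hypothetical sketch: `classify_nightly` is not a PyTorch API, and it assumes nightly version strings follow the `X.Y.Z.devYYYYMMDD+tag` pattern seen in this report.

```python
import re

LAST_GOOD = "20260505"  # last nightly reported as working
FIRST_BAD = "20260512"  # first nightly reported as regressed


def classify_nightly(version):
    """Place a nightly version string relative to the reported bisect range."""
    match = re.search(r"\.dev(\d{8})", version)
    if match is None:
        return "not a nightly"
    date = match.group(1)  # YYYYMMDD strings compare correctly as text
    if date <= LAST_GOOD:
        return "at or before last known-good"
    if date >= FIRST_BAD:
        return "at or after first known-bad"
    return "inside bisect range (untested)"


print(classify_nightly("2.13.0.dev20260512+cu130"))
```

In practice the argument would be `torch.__version__`. Builds that land inside the range are the ones worth benchmarking to shrink the bisect window.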

cc @jerryzh168 @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @aditew01
