pytorch - 💡(How to fix) Fix FSDP reduce-overhead regression from CUDA allocator block ordering

pytorch 2026-05-12 16:46:09


PyTorch PR #178362 (commit 3e263a46d03bbd64637b0607fe4d0d3c7ca0fa17) regressed FSDP with torch.compile(mode="reduce-overhead") by changing CUDA allocator block ordering from address-based to allocation-counter-based. The regression appears in FSDP-only multi-GPU CUDA graph workloads, where unstable static input pointers trigger excessive cudagraph re-recording. Before 3e263a46, 4xFSDP bf16 reduce-overhead saw 40 re-records and 240 pointer changes, and scaled from 237,167 tok/s at 1x to 760,795 tok/s at 4x (3.21x, 80.2% efficiency). With 3e263a46, it saw 144 re-records and 1,588 pointer changes, and scaled only from 226,767 tok/s at 1x to 598,963 tok/s at 4x (2.64x, 66.0% efficiency). This is not a permanent throughput ceiling: peak throughput eventually returns, but only after a substantially longer warm-up period with repeated CUDA graph re-recording.
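The scaling figures above can be reproduced from the raw throughput numbers. The following is a minimal sketch of that arithmetic, assuming efficiency is defined as speedup divided by GPU count; the function name scaling is a hypothetical helper and not part of PyTorch.

```python
def scaling(tok_s_1x: float, tok_s_nx: float, n_gpus: int) -> tuple[float, float]:
    """Return (speedup, efficiency) for an n-GPU run relative to a 1-GPU run."""
    speedup = tok_s_nx / tok_s_1x
    efficiency = speedup / n_gpus  # assumed definition: fraction of ideal linear scaling
    return speedup, efficiency


# Throughput numbers quoted in the report (tok/s at 1x and at 4x).
good_speedup, good_eff = scaling(237_167, 760_795, 4)
bad_speedup, bad_eff = scaling(226_767, 598_963, 4)

print(f"before 3e263a46: {good_speedup:.2f}x, {good_eff:.1%} efficiency")
print(f"with   3e263a46: {bad_speedup:.2f}x, {bad_eff:.1%} efficiency")

# Relative growth in the cudagraph counters quoted in the report.
rerecord_ratio = 144 / 40
pointer_change_ratio = 1_588 / 240
print(f"re-records: {rerecord_ratio:.1f}x, pointer changes: {pointer_change_ratio:.1f}x")
```

By the same numbers, re-records grow 3.6x and pointer changes grow roughly 6.6x across the commit.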

Root Cause


Summary

Repro

GitHub Gist: test_llama3_mlp_cudagraph_regression.py
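The gist itself is not reproduced here. As an independent illustration of what a "pointer change" means, the sketch below counts how many tracked buffers change address between consecutive steps. The helper count_pointer_changes and the sample addresses are hypothetical. In a real run the snapshots could be built from Tensor.data_ptr(), although the static inputs that CUDA graphs actually track may differ from the tensors chosen here.

```python
from typing import Dict, Iterable, List


def count_pointer_changes(steps: Iterable[Dict[str, int]]) -> List[int]:
    """For each step after the first, count buffers whose address
    differs from the address recorded in the previous step."""
    changes: List[int] = []
    prev = None
    for ptrs in steps:
        if prev is not None:
            changes.append(
                sum(1 for name, addr in ptrs.items()
                    if name in prev and prev[name] != addr)
            )
        prev = ptrs
    return changes


# With PyTorch, one snapshot per step could be taken as (assumption, untested here):
#   {name: p.data_ptr() for name, p in model.named_parameters()}
stable = [{"w": 0x1000, "b": 0x2000}] * 3
unstable = [
    {"w": 0x1000, "b": 0x2000},
    {"w": 0x3000, "b": 0x2000},  # "w" moved
    {"w": 0x1000, "b": 0x4000},  # both moved
]

print(count_pointer_changes(stable))    # [0, 0]
print(count_pointer_changes(unstable))  # [1, 2]
```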

Training Loss Curve - 4xFSDP NVFP4 Llama3 8B

Normally, throughput (tokens/sec) reaches a high level within 5 steps. After commit https://github.com/pytorch/pytorch/commit/3e263a46d03bbd64637b0607fe4d0d3c7ca0fa17, it can take ~2000 steps, or about 75M tokens.
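One way to quantify this warm-up is to record per-step throughput and find the first step that reaches a target. This is a minimal sketch: steps_to_reach is a hypothetical helper, and the throughput series is synthetic, not measured data. The tokens-per-step figure is only what the approximate "~2000 steps or 75M tokens" statement implies.

```python
from typing import Optional, Sequence


def steps_to_reach(throughputs: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based index of the first step whose throughput
    is at least `target`, or None if the target is never reached."""
    for step, tok_s in enumerate(throughputs, start=1):
        if tok_s >= target:
            return step
    return None


# Tokens per step implied by "~2000 steps or 75M tokens" (approximate).
tokens_per_step = 75_000_000 / 2000

# Synthetic series for illustration only: slow ramp, then a plateau.
synthetic = [100.0, 150.0, 200.0, 700.0, 750.0, 760.0]
warmup_steps = steps_to_reach(synthetic, target=700.0)

print(tokens_per_step)  # 37500.0
print(warmup_steps)     # 4
```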

AI Usage

Codex for git bisect; Claude for analysis

Versions

Field	Prior commit - Good	Failing container
PyTorch git	8c01604fcad5f08ffa39308e393e24ade29aa5eb	3e263a46d03bbd64637b0607fe4d0d3c7ca0fa17
CUDA	13.3	13.3
Python	3.12.3	3.12.3
GPU	4x NVIDIA GB200	4x NVIDIA GB200
Driver	580.65.06	580.65.06
Kernel	6.14.0-1007-nvidia-64k	6.11.0-1011-nvidia-64k
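Most of the fields in this table can be gathered programmatically when comparing environments. The sketch below is one way to do so; collect_versions is a hypothetical helper, the torch fields are filled only if PyTorch is importable, and the NVIDIA driver version is not collected because it is not read from these attributes.

```python
import platform


def collect_versions() -> dict:
    """Collect Python, kernel, and (if available) PyTorch/CUDA/GPU details."""
    info = {
        "python": platform.python_version(),
        "kernel": platform.release(),
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info

    info["torch"] = torch.__version__
    # git_version is present in standard builds; fall back to None otherwise.
    info["torch_git"] = getattr(torch.version, "git_version", None)
    info["cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        count = torch.cuda.device_count()
        info["gpu"] = f"{count}x {torch.cuda.get_device_name(0)}"
    return info


versions = collect_versions()
for field, value in versions.items():
    print(f"{field}\t{value}")
```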

cc @jerryzh168 @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @weifengpy @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360 @ppwwyyxx


