pytorch - Fix FSDP reduce-overhead regression from CUDA allocator block ordering


PyTorch PR #178362 (commit 3e263a46d03bbd64637b0607fe4d0d3c7ca0fa17) regressed FSDP under torch.compile(mode="reduce-overhead") by changing CUDA allocator block ordering from address-based to allocation-counter-based. The regression appears in FSDP-only multi-GPU CUDA graph workloads, where unstable static input pointers trigger excessive CUDA graph re-recording. Before 3e263a46, the 4xFSDP bf16 reduce-overhead run had 40 re-records and 240 pointer changes, and throughput scaled from 237,167 tok/s at 1x to 760,795 tok/s at 4x (3.21x, 80.2% efficiency). With 3e263a46, the same run had 144 re-records and 1,588 pointer changes, and throughput scaled only from 226,767 tok/s at 1x to 598,963 tok/s at 4x (2.64x, 66.0% efficiency). This is not a permanent throughput ceiling: peak throughput eventually returns, but only after a substantially longer warm-up period with repeated CUDA graph re-recording.
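The scaling figures quoted above can be re-derived from the raw throughput numbers. The short sketch below only redoes that arithmetic; the function name `scaling` is illustrative and does not come from the issue or the repro script, and efficiency is assumed to mean speedup divided by the number of GPUs.

```python
def scaling(tok_s_1x: float, tok_s_nx: float, n_gpus: int) -> tuple[float, float]:
    """Return (speedup, efficiency), with efficiency assumed to be speedup / n_gpus."""
    speedup = tok_s_nx / tok_s_1x
    return speedup, speedup / n_gpus


# Throughput numbers as reported in the issue (tok/s at 1x and at 4x).
before = scaling(237_167, 760_795, 4)   # prior commit
after = scaling(226_767, 598_963, 4)    # with 3e263a46

for label, (speedup, efficiency) in (("before", before), ("after", after)):
    print(f"{label}: {speedup:.2f}x, {efficiency:.1%} efficiency")

# Growth in re-recording activity between the two runs.
rerecord_ratio = 144 / 40       # re-records: 3.6x more
pointer_ratio = 1_588 / 240     # pointer changes: roughly 6.6x more
print(f"re-records x{rerecord_ratio:.1f}, pointer changes x{pointer_ratio:.1f}")
```

The 1x baseline also differs slightly between the two runs (237,167 vs 226,767 tok/s), so the efficiency figures compare each build against its own single-GPU number.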


Repro

GitHub Gist - test_llama3_mlp_cudagraph_regression.py
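For readers who want to track pointer stability in their own runs, the following is a minimal sketch of how "pointer changes" could be counted from per-step address snapshots. It is not the instrumentation used in the gist: the helper names, buffer names, and addresses are hypothetical, and in a real PyTorch run the addresses would presumably come from `tensor.data_ptr()` on whichever tensors are treated as static CUDA graph inputs.

```python
from typing import Dict, List

Snapshot = Dict[str, int]  # buffer name -> device address for one training step


def count_pointer_changes(snapshots: List[Snapshot]) -> int:
    """Count buffers whose address differs from the previous step, summed over steps."""
    changes = 0
    for prev, curr in zip(snapshots, snapshots[1:]):
        for name, ptr in curr.items():
            if name in prev and prev[name] != ptr:
                changes += 1
    return changes


def count_unstable_steps(snapshots: List[Snapshot]) -> int:
    """Count steps in which at least one buffer moved (candidates for a re-record)."""
    unstable = 0
    for prev, curr in zip(snapshots, snapshots[1:]):
        if any(name in prev and prev[name] != ptr for name, ptr in curr.items()):
            unstable += 1
    return unstable


# Made-up addresses for illustration only.
snapshots = [
    {"w1": 0x7000, "w2": 0x8000},
    {"w1": 0x7000, "w2": 0x8000},  # stable step
    {"w1": 0x9000, "w2": 0x8000},  # w1 moved
    {"w1": 0x9000, "w2": 0xA000},  # w2 moved
]

total_changes = count_pointer_changes(snapshots)
unstable_steps = count_unstable_steps(snapshots)
print(total_changes, unstable_steps)
```

Comparing these two counters between the prior commit and 3e263a46 on the same workload would give a rough view of how long the warm-up instability lasts.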

Training Loss Curve - 4xFSDP NVFP4 Llama3 8B

<img width="1484" height="883" alt="Image" src="https://github.com/user-attachments/assets/5c458f27-1902-4a0a-aa4b-0e5337ca4d06" />

AI Usage

  • Codex for git bisect; Claude for analysis

Versions

| Field | Prior commit - Good | Failing container |
| --- | --- | --- |
| PyTorch git | 8c01604fcad5f08ffa39308e393e24ade29aa5eb | 3e263a46d03bbd64637b0607fe4d0d3c7ca0fa17 |
| CUDA | 13.3 | 13.3 |
| Python | 3.12.3 | 3.12.3 |
| GPU | 4x NVIDIA GB200 | 4x NVIDIA GB200 |
| Driver | 580.65.06 | 580.65.06 |
| Kernel | 6.14.0-1007-nvidia-64k | 6.11.0-1011-nvidia-64k |

cc @jerryzh168 @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @weifengpy @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360 @ppwwyyxx
