pytorch - 💡(How to fix) Fix FSDP.optim_state_dict and FSDP.optim_state_dict_to_load called without group in torch.distributed.checkpoint.state_dict [1 comments, 2 participants]

GitHub stats
pytorch/pytorch#180449, fetched 2026-04-17 08:22:27
Comments: 1, Participants: 2, Timeline events: 7, Reactions: 0
Timeline (top): labeled ×2, mentioned ×2, subscribed ×2, commented ×1

Root Cause

get_state_dict / set_state_dict in torch.distributed.checkpoint.state_dict call FSDP.optim_state_dict and FSDP.optim_state_dict_to_load without forwarding the group argument. For an FSDP model initialized with a non-default process group, the collectives inside those two APIs therefore run on the default group instead of the model's group. The two call sites are lines 900 and 1109 of torch/distributed/checkpoint/state_dict.py at commit 0be2c4f2193516d3d25b83f225294c635f93081a.


🐛 Describe the bug

When using get_state_dict / set_state_dict from torch.distributed.checkpoint.state_dict with FSDP models that use a non-default process group, the internal calls to FSDP.optim_state_dict and FSDP.optim_state_dict_to_load fail because the group parameter is not forwarded.

Duplicate of #

In torch/distributed/checkpoint/state_dict.py, there are two call sites that invoke FSDP's optimizer state dict APIs without passing the group argument:

https://github.com/pytorch/pytorch/blob/0be2c4f2193516d3d25b83f225294c635f93081a/torch/distributed/checkpoint/state_dict.py#L900
https://github.com/pytorch/pytorch/blob/0be2c4f2193516d3d25b83f225294c635f93081a/torch/distributed/checkpoint/state_dict.py#L1109

Both FSDP.optim_state_dict and FSDP.optim_state_dict_to_load accept an optional group: dist.ProcessGroup | None = None parameter. When group=None, they fall back to the default process group. If the FSDP model was initialized with a custom (non-default) process group, the collective communication operations inside these functions then run on the wrong group, leading to errors or hangs.
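To make the mismatch concrete, the sketch below calls the two FSDP APIs directly and forwards the group explicitly, which is what the checkpoint helpers currently do not do. The helper names gather_optim_state and restore_optim_state and the fsdp_cls parameter are hypothetical and not part of PyTorch; fsdp_cls exists only so the helpers can be exercised with a stand-in class outside a distributed job. The sketch assumes group is the same process group that was given to the FSDP model at construction.

```python
from typing import Any, Optional


def _resolve_fsdp(fsdp_cls: Optional[Any]) -> Any:
    """Return the real FSDP class unless a stand-in was supplied."""
    if fsdp_cls is None:
        # Imported lazily so this module can be imported without torch installed.
        from torch.distributed.fsdp import FullyShardedDataParallel
        return FullyShardedDataParallel
    return fsdp_cls


def gather_optim_state(model, optim, group, fsdp_cls=None):
    """Collect the optimizer state, forwarding the model's own process group."""
    fsdp = _resolve_fsdp(fsdp_cls)
    return fsdp.optim_state_dict(model, optim, group=group)


def restore_optim_state(model, optim, saved_state, group, fsdp_cls=None):
    """Convert a saved optimizer state for this model and load it into `optim`."""
    fsdp = _resolve_fsdp(fsdp_cls)
    converted = fsdp.optim_state_dict_to_load(
        model, optim, saved_state, group=group
    )
    optim.load_state_dict(converted)
    return converted
```

In a real job, every rank in group would call these helpers at the same point, since both FSDP APIs involve collective communication.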

Versions

main branch

cc @LucasLLC @pradeepfn

extent analysis

TL;DR

Passing the custom process group to FSDP's optimizer state dict APIs is likely necessary to fix the issue with get_state_dict and set_state_dict in torch.distributed.checkpoint.state_dict.

Guidance

  • Identify the custom process group used to initialize the FSDP model and pass it to FSDP.optim_state_dict and FSDP.optim_state_dict_to_load.
  • Modify the call sites in torch/distributed/checkpoint/state_dict.py to forward the group parameter to FSDP's optimizer state dict APIs.
  • Verify that the custom process group actually reaches both APIs, for example by inspecting the group argument received by FSDP.optim_state_dict and FSDP.optim_state_dict_to_load.
  • Consider updating the torch.distributed.checkpoint.state_dict module to handle custom process groups by default.
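The verification step in the list above can be sketched with unittest.mock. Everything here is a stand-in: FakeFSDP, call_site_without_group, call_site_with_group, and forwards_group are hypothetical names that only mimic the shape of the reported call sites, not the actual code in torch/distributed/checkpoint/state_dict.py. The idea is to replace the optimizer state dict API with a spy and check which keyword arguments the call site passed.

```python
from unittest import mock


class FakeFSDP:
    """Stand-in with the same keyword shape as FSDP.optim_state_dict."""

    @staticmethod
    def optim_state_dict(model, optim, optim_state_dict=None, group=None):
        return {}


def call_site_without_group(model, optim, group):
    # Mirrors the reported bug: `group` is available but never forwarded.
    return FakeFSDP.optim_state_dict(model, optim)


def call_site_with_group(model, optim, group):
    # Mirrors the proposed fix: `group` is forwarded as a keyword argument.
    return FakeFSDP.optim_state_dict(model, optim, group=group)


def forwards_group(call_site, group="custom-group"):
    """Return True if `call_site` passes `group` through as a keyword."""
    with mock.patch.object(FakeFSDP, "optim_state_dict", return_value={}) as spy:
        call_site(model=object(), optim=object(), group=group)
    return spy.call_args.kwargs.get("group") == group
```

With these stand-ins, forwards_group(call_site_with_group) is True and forwards_group(call_site_without_group) is False. The same spy pattern could be pointed at the real FSDP class in a unit test, under the assumption that the code under test looks the method up on the class at call time.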

Example

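# Assuming `model` is the FSDP-wrapped module, `optim` is its optimizer,
# and `group` is the custom process group the model was initialized with
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
state_dict = FSDP.optim_state_dict(model, optim, group=group)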

Notes

The fix requires modifying the torch.distributed.checkpoint.state_dict module inside the PyTorch installation. It may not apply to every version of PyTorch, since the issue was reported against the main branch and the module may have changed since then.

Recommendation

Apply workaround: patch the two call sites in torch/distributed/checkpoint/state_dict.py so that they forward the group parameter to FSDP's optimizer state dict APIs. This is a targeted change that does not require updating the entire library.
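Where editing the installed file is not practical, one alternative (a different technique from the call-site patch recommended here) is a runtime monkeypatch that supplies the custom group whenever a caller omits it. The helper below is a hypothetical sketch, not a PyTorch API. It assumes the patched attributes are static methods, that callers pass group as a keyword or not at all, and that the checkpoint module looks the methods up on the class at call time; none of this has been verified against the reported commit.

```python
import functools


def _with_default_group(original, group):
    """Wrap `original` so a missing or None `group` keyword becomes `group`."""

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        # Only fills in the keyword form; a positional `group` is not handled.
        if kwargs.get("group") is None:
            kwargs["group"] = group
        return original(*args, **kwargs)

    return wrapper


def inject_default_group(cls, method_names, group):
    """Patch static methods on `cls` in place and return the originals.

    To undo the patch, re-assign each original wrapped in staticmethod().
    """
    originals = {}
    for name in method_names:
        original = getattr(cls, name)
        originals[name] = original
        setattr(cls, name, staticmethod(_with_default_group(original, group)))
    return originals
```

A possible use would be to call inject_default_group(FSDP, ["optim_state_dict", "optim_state_dict_to_load"], group) once, before get_state_dict / set_state_dict. Because the patch is process-wide, it only suits programs in which every FSDP model shares the same custom group.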
