pytorch - 💡(How to fix) Fix FSDP.optim_state_dict and FSDP.optim_state_dict_to_load called without group in torch.distributed.checkpoint.state_dict [1 comments, 2 participants]

pytorch, 2026-04-15 09:21:21

GitHub stats

pytorch/pytorch#180449•Fetched 2026-04-17 08:22:27

Author

OutisKwak

Participants

mathceo

OutisKwak

Timeline (top)

labeled ×2, mentioned ×2, subscribed ×2, commented ×1

Root Cause

When using get_state_dict / set_state_dict from torch.distributed.checkpoint.state_dict with FSDP models that use a non-default process group, the internal calls to FSDP.optim_state_dict and FSDP.optim_state_dict_to_load fail because the group parameter is not forwarded. Duplicate of #

In torch/distributed/checkpoint/state_dict.py, there are two call sites that invoke FSDP's optimizer state dict APIs without passing the group argument:

https://github.com/pytorch/pytorch/blob/0be2c4f2193516d3d25b83f225294c635f93081a/torch/distributed/checkpoint/state_dict.py#L900
https://github.com/pytorch/pytorch/blob/0be2c4f2193516d3d25b83f225294c635f93081a/torch/distributed/checkpoint/state_dict.py#L1109

🐛 Describe the bug

Both FSDP.optim_state_dict and FSDP.optim_state_dict_to_load accept an optional group: dist.ProcessGroup | None = None parameter. When group=None, they default to the default process group. However, if the FSDP model was initialized with a custom (non-default) process group, the collective communication operations inside these functions will operate on the wrong group, leading to errors or hangs.
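The failure mode is the familiar one of a wrapper that drops an optional keyword argument. The toy below illustrates it with plain Python stand-ins: ToyModel, toy_optim_state_dict and both wrapper functions are hypothetical names invented for this sketch, not PyTorch code, and the strings only stand in for process groups.

```python
DEFAULT_GROUP = "world"  # stand-in for the default process group


class ToyModel:
    """Stand-in for an FSDP-wrapped model built on a custom process group."""

    def __init__(self, process_group):
        self.process_group = process_group


def toy_optim_state_dict(model, optim, group=None):
    """Stand-in for FSDP.optim_state_dict: report which group would be used."""
    return {"group_used": DEFAULT_GROUP if group is None else group}


def get_state_dict_without_group(model, optim):
    # Mirrors the reported call sites: `group` is omitted, so the callee
    # silently falls back to the default group.
    return toy_optim_state_dict(model, optim)


def get_state_dict_with_group(model, optim):
    # One possible fix: forward the group the model was built with.
    return toy_optim_state_dict(model, optim, group=model.process_group)


model = ToyModel(process_group="subgroup")
print(get_state_dict_without_group(model, optim=None))  # {'group_used': 'world'}
print(get_state_dict_with_group(model, optim=None))     # {'group_used': 'subgroup'}
```

In the real library the mismatch is not a wrong string but collectives issued on a group that does not match the one the shards live on, which is why the symptom is an error or a hang rather than a visibly wrong value.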

Versions

main branch

cc @LucasLLC @pradeepfn

extent analysis

TL;DR

Passing the custom process group to FSDP's optimizer state dict APIs is likely necessary to fix the issue with get_state_dict and set_state_dict in torch.distributed.checkpoint.state_dict.

Guidance

Identify the custom process group used to initialize the FSDP model and pass it to FSDP.optim_state_dict and FSDP.optim_state_dict_to_load.
Modify the call sites in torch/distributed/checkpoint/state_dict.py to forward the group parameter to FSDP's optimizer state dict APIs.
Verify that the custom process group actually reaches the APIs, for example by inspecting the group argument received by FSDP.optim_state_dict and FSDP.optim_state_dict_to_load.
Consider updating the torch.distributed.checkpoint.state_dict module to handle custom process groups by default.
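The first and third steps above can be sketched as a small helper that looks up the group on the wrapped model and then passes it explicitly. This is a sketch under assumptions, not the upstream fix: it assumes the FSDP wrapper exposes its group as a process_group attribute, and the helper names (find_fsdp_group, gather_optim_state) are hypothetical. The FSDP class is passed in as fsdp_cls so the helpers can be exercised without initializing torch.distributed.

```python
def find_fsdp_group(model, fsdp_cls):
    """Return the process group of the first `fsdp_cls` module inside `model`.

    Assumption: FSDP instances expose their group as `.process_group`.
    Returns None when no FSDP module is found.
    """
    for module in model.modules():
        if isinstance(module, fsdp_cls):
            return module.process_group
    return None


def gather_optim_state(model, optim, fsdp_cls):
    """Call `fsdp_cls.optim_state_dict` with the model's own group forwarded."""
    group = find_fsdp_group(model, fsdp_cls)
    # Passing group=None keeps the library default, so non-custom setups
    # behave exactly as before.
    return fsdp_cls.optim_state_dict(model, optim, group=group)
```

In a real training script fsdp_cls would be torch.distributed.fsdp.FullyShardedDataParallel. A model that mixes several FSDP groups (for example hybrid sharding) would need more care than "first module found".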

Example

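# Assuming `model` is the FSDP-wrapped module, `optim` its optimizer,
# and `group` the custom process group the model was initialized with
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
state_dict = FSDP.optim_state_dict(model, optim, group=group)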
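The load path needs the same argument. Below is a minimal sketch of the counterpart call, assuming saved_state is an optimizer state dict previously produced by FSDP.optim_state_dict. The helper name load_optim_state is hypothetical, and the FSDP class is again passed in as fsdp_cls so the function can be checked without a distributed setup.

```python
def load_optim_state(fsdp_cls, model, optim, saved_state, group=None):
    """Convert a saved optimizer state dict for this rank and load it.

    `group` should be the process group the FSDP model was initialized
    with; None falls back to the library default.
    """
    # optim_state_dict_to_load may communicate across ranks, so it has to
    # use the same group as the model's shards.
    to_load = fsdp_cls.optim_state_dict_to_load(
        model, optim, saved_state, group=group
    )
    optim.load_state_dict(to_load)
    return to_load
```

With the real library this would be called as load_optim_state(FSDP, model, optim, saved_state, group=group), using the same group that was passed when the state dict was gathered.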

Notes

The fix requires changing torch/distributed/checkpoint/state_dict.py itself, so it means either patching the installed PyTorch source or moving to a build that includes the change. The call sites cited above are pinned to a specific commit on the main branch, so the line numbers may differ in other versions.

Recommendation

Apply workaround: forward the group parameter to FSDP's optimizer state dict APIs at the two call sites in torch/distributed/checkpoint/state_dict.py. The change is small and targeted, but it is still a patch to the library source and has to be carried until a fix lands upstream.
