pytorch - ✅(Solved) Fix [FSDP] Multi-GPU distributed tests crash with NCCL 2.29.7 NVLS multicast slot exhaustion on H200 [1 pull requests, 4 comments, 3 participants]

pytorch · 2026-03-30 07:56:52


pytorch/pytorch#178749•Fetched 2026-04-08 01:52:30


FSDP distributed tests fail on 8x H200 systems with NCCL 2.29.7 due to NVLink SHARP (NVLS) multicast slot exhaustion. NCCL creates multicast groups for every sub-communicator and exhausts the NVSwitch hardware limit of 128 multicast slots, causing a fatal crash. This worked with NCCL 2.28.9.

Error Message

torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:93, unhandled cuda error, NCCL version 2.29.7
ncclUnhandledCudaError: Call to CUDA function failed.
Failed to bind NVLink SHARP (NVLS) Multicast memory of size 2097152 : CUDA error 2 'out of memory'.
This is usually caused by a system or configuration error in the Fabric Manager or NVSwitches.
Disable NVLS (NCCL_NVLS_ENABLE=0) if you wish to avoid this error in the future.

Root Cause

We traced this through the Fabric Manager logs and confirmed the following:

NVSwitch hardware has a hard limit of 128 multicast slots
NCCL 2.29.7 creates NVLS multicast groups for every sub-communicator, including 2-rank FSDP sub-communicators
Each rank creates ~10 multicast groups per sub-communicator
With 8 GPUs and multiple FSDP sub-communicators, all 128 slots are exhausted
None of the groups are freed during the test — they are only released when communicators are destroyed
When a 129th group is requested (slots 0–127 are already taken), cuMulticastBindMem fails with CUDA error 2 (out of memory)
NCCL 2.29.7 treats this as a fatal error and crashes
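
The arithmetic behind the steps above can be sketched as follows. This is a rough model, not a description of NCCL internals: it uses the two figures from the issue (128 slots, roughly 10 groups per sub-communicator) and assumes every sub-communicator draws its groups from one shared pool that is never refilled. How groups are shared between ranks is not stated in the issue and is not modeled here.

```python
# Back-of-the-envelope model of NVSwitch multicast slot exhaustion.
# Both constants are figures reported in the issue; the "one shared pool,
# nothing freed" behavior is a simplifying assumption.

MULTICAST_SLOTS = 128      # NVSwitch hard limit reported in the issue
GROUPS_PER_SUBCOMM = 10    # approximate per-sub-communicator figure


def max_subcommunicators(slots: int = MULTICAST_SLOTS,
                         groups_per_comm: int = GROUPS_PER_SUBCOMM) -> int:
    """Number of sub-communicators that fit before the pool runs dry."""
    return slots // groups_per_comm


def first_failing_subcommunicator(slots: int = MULTICAST_SLOTS,
                                  groups_per_comm: int = GROUPS_PER_SUBCOMM) -> int:
    """1-based index of the first sub-communicator that cannot get all its groups."""
    return max_subcommunicators(slots, groups_per_comm) + 1


if __name__ == "__main__":
    print(max_subcommunicators())           # 12
    print(first_failing_subcommunicator())  # 13
```

Under these assumptions, about a dozen live sub-communicators are enough to exhaust the pool, which is consistent with a multi-group FSDP test hitting the limit.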

Fix / Workaround

Configuration	Result
`NCCL_NVLS_TREE_ENABLE=0`	FAILED — still exhausts 128 multicast slots
`NCCL_NVLS_NCHANNELS=1`	FAILED — still exhausts slots
`NCCL_NVLS_TREE_ENABLE=0` + `NCCL_NVLS_NCHANNELS=1` + `NCCL_MAX_NCHANNELS=2`	FAILED
`NCCL_NVLS_ENABLE=0`	PASSED (only working workaround)

Any PyTorch workload that creates many NCCL sub-communicators (FSDP with multiple parameter groups, hybrid parallelism, etc.) on NVSwitch-connected H200 systems with NCCL 2.29.7 will hit this crash. The only workaround (NCCL_NVLS_ENABLE=0) disables a performance optimization entirely.

PR fix notes

PR #179402: [FSDP] Cache post-forward DeviceMesh to deduplicate NCCL communicators

Repository: pytorch/pytorch
Author: weifengpy
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/179402

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

-> #179402

fix: https://github.com/pytorch/pytorch/issues/178749

When fully_shard is called with reshard_after_forward as an int, _get_post_forward_mesh_info creates a new DeviceMesh (and its underlying NCCL communicators) on every call, even when the mesh topology is identical. On NVSwitch-connected systems (e.g. H200), this exhausts the hardware limit of 128 multicast slots and crashes.

Cache the post-forward mesh info keyed on (reshard_after_forward, source mesh) so that repeated fully_shard calls reuse the same DeviceMesh and process groups. This reduces NCCL communicators per rank from O(n_layers) to O(1) for the post-forward mesh.
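
The caching idea can be illustrated with a small, self-contained sketch. This is not the code from the PR: FakeMesh, PostForwardMeshCache, and the _build placeholder are hypothetical stand-ins that only show how keying on (reshard_after_forward, source mesh) turns repeated calls into a single construction.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FakeMesh:
    """Hypothetical stand-in for a DeviceMesh; frozen so it is hashable."""
    ranks: Tuple[int, ...]


class PostForwardMeshCache:
    """Reuse one post-forward mesh per (reshard_after_forward, source mesh)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[int, FakeMesh], FakeMesh] = {}
        self.meshes_created = 0  # proxy for "communicators created"

    def get(self, reshard_after_forward: int, source: FakeMesh) -> FakeMesh:
        key = (reshard_after_forward, source)
        if key not in self._cache:
            # Only the first call for a given key builds a new mesh.
            self.meshes_created += 1
            self._cache[key] = self._build(reshard_after_forward, source)
        return self._cache[key]

    @staticmethod
    def _build(reshard_after_forward: int, source: FakeMesh) -> FakeMesh:
        # Placeholder construction: take the first N ranks of the source.
        # The real post-forward mesh is built differently.
        return FakeMesh(source.ranks[:reshard_after_forward])


if __name__ == "__main__":
    cache = PostForwardMeshCache()
    world = FakeMesh(tuple(range(8)))
    meshes = [cache.get(2, world) for _ in range(32)]  # 32 "layers"
    print(cache.meshes_created)  # 1
```

With the cache, 32 calls with the same arguments construct one mesh instead of 32, which is the O(n_layers) to O(1) reduction the description refers to.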

Authored with Claude.

Changed files

test/distributed/_composable/fsdp/test_fully_shard_training.py (modified, +45/-0)
torch/distributed/fsdp/_fully_shard/_fsdp_init.py (modified, +41/-10)
torch/distributed/fsdp/_fully_shard/_fully_shard.py (modified, +3/-0)


Summary

Environment

Hardware: 8x NVIDIA H200 (NV18 NVLink interconnect, 4x NVSwitch)
NCCL: 2.29.7+cuda12.8
CUDA: 13.0
Driver: 580.65.06
Fabric Manager: 580.65.06 (running, State: Completed, Status: Success)
OS: RHEL 9.6 (kernel 6.12.0)
PyTorch: upstream main (commit 5bfd4be)

Error

torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:93, unhandled cuda error, NCCL version 2.29.7
ncclUnhandledCudaError: Call to CUDA function failed.
Failed to bind NVLink SHARP (NVLS) Multicast memory of size 2097152 : CUDA error 2 'out of memory'.
This is usually caused by a system or configuration error in the Fabric Manager or NVSwitches.
Disable NVLS (NCCL_NVLS_ENABLE=0) if you wish to avoid this error in the future.

Reproducer

# Fails — NVLS enabled (default)
python test/distributed/_composable/fsdp/test_fully_shard_training.py \
  TestFullyShard1DTrainingCore.test_train_parity_multi_group

# Passes — NVLS disabled
NCCL_NVLS_ENABLE=0 python test/distributed/_composable/fsdp/test_fully_shard_training.py \
  TestFullyShard1DTrainingCore.test_train_parity_multi_group

Root Cause Analysis

We traced this through the Fabric Manager logs and confirmed the following:

NVSwitch hardware has a hard limit of 128 multicast slots
NCCL 2.29.7 creates NVLS multicast groups for every sub-communicator, including 2-rank FSDP sub-communicators
Each rank creates ~10 multicast groups per sub-communicator
With 8 GPUs and multiple FSDP sub-communicators, all 128 slots are exhausted
None of the groups are freed during the test — they are only released when communicators are destroyed
When a 129th group is requested (slots 0–127 are already taken), cuMulticastBindMem fails with CUDA error 2 (out of memory)
NCCL 2.29.7 treats this as a fatal error and crashes

Fabric Manager log evidence:

[INFO] multicast group 126 is allocated.
[INFO] multicast group 127 is allocated.
[ERROR] all the NVSwitch multicast resources/slot ids are used for partition id -1.
[ERROR] failed to allocated resource to the multicast team setup request id ...

128 groups allocated (groups 0–127) before failure
0 groups freed during active test
Peak concurrent active groups: 127
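
Counts like these can be pulled out of a Fabric Manager log with a short script. The sketch below matches only the two line shapes visible in the excerpt above. The wording of any other message (for example, a group being released) does not appear in the issue, so it is deliberately not parsed. The function name summarize is illustrative.

```python
import re
from typing import Iterable, Optional, Tuple

# Line shapes copied from the log excerpt in the issue. Other Fabric
# Manager messages are not shown there and are ignored here.
ALLOCATED = re.compile(r"multicast group (\d+) is allocated")
EXHAUSTED = "all the NVSwitch multicast resources/slot ids are used"


def summarize(lines: Iterable[str]) -> Tuple[int, Optional[int], bool]:
    """Return (allocation count, highest group id seen, exhaustion logged)."""
    count = 0
    highest: Optional[int] = None
    exhausted = False
    for line in lines:
        match = ALLOCATED.search(line)
        if match:
            count += 1
            group_id = int(match.group(1))
            highest = group_id if highest is None else max(highest, group_id)
        elif EXHAUSTED in line:
            exhausted = True
    return count, highest, exhausted


if __name__ == "__main__":
    sample = [
        "[INFO] multicast group 126 is allocated.",
        "[INFO] multicast group 127 is allocated.",
        "[ERROR] all the NVSwitch multicast resources/slot ids are used for partition id -1.",
    ]
    print(summarize(sample))  # (2, 127, True)
```

To process a real log, pass an open file object as lines. The log location depends on the Fabric Manager configuration and is not given in the issue.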

System verification:

Fabric Manager is running and healthy (State: Completed, Status: Success, CliqueId: 0)
All 4 NVSwitches present and functional (/dev/nvidia-nvswitch0-3)
NVLink active: 18 links per GPU at 26.5 GB/s
CUDA multicast attribute reports supported (value=1)
Multicast works correctly for the first 128 groups — this is purely a resource exhaustion issue

Regression from NCCL 2.28.9

NCCL 2.28.9 either:

Created fewer multicast groups (did not create per-pair 2-rank NVLS groups for sub-communicators), or
Handled the allocation failure gracefully by falling back to non-NVLS transport

Either way, NCCL 2.29.7 appears to use NVLS more aggressively for sub-communicators, to have dropped a graceful fallback path, or both, which makes the resource exhaustion fatal.

NCCL env vars tested

We tested all available NVLS-related NCCL knobs — none resolve the issue except fully disabling NVLS:

Configuration	Result
`NCCL_NVLS_TREE_ENABLE=0`	FAILED — still exhausts 128 multicast slots
`NCCL_NVLS_NCHANNELS=1`	FAILED — still exhausts slots
`NCCL_NVLS_TREE_ENABLE=0` + `NCCL_NVLS_NCHANNELS=1` + `NCCL_MAX_NCHANNELS=2`	FAILED
`NCCL_NVLS_ENABLE=0`	PASSED (only working workaround)

The available NVLS env vars in NCCL 2.29.7 are: NCCL_NVLS_ENABLE, NCCL_NVLS_TREE_ENABLE, NCCL_NVLS_NCHANNELS, NCCL_NVLS_CHUNKSIZE, NCCL_NVLSTREE_MAX_CHUNKSIZE. None provide a way to limit total multicast group allocation or disable NVLS selectively for sub-communicators.
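
A matrix like the one above can be rerun with a small driver. The sketch below is one possible way to do it, assuming it is launched from the root of a PyTorch checkout on a machine that reproduces the problem. The helper names (build_env, describe, run_matrix) are illustrative. Any NCCL variables already present in the calling environment are inherited, so start from a clean shell.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Sequence

# Configurations from the table in the issue, in the same order.
CONFIGS: List[Dict[str, str]] = [
    {"NCCL_NVLS_TREE_ENABLE": "0"},
    {"NCCL_NVLS_NCHANNELS": "1"},
    {"NCCL_NVLS_TREE_ENABLE": "0", "NCCL_NVLS_NCHANNELS": "1", "NCCL_MAX_NCHANNELS": "2"},
    {"NCCL_NVLS_ENABLE": "0"},
]

# Reproducer command from the issue.
TEST_CMD = [
    sys.executable,
    "test/distributed/_composable/fsdp/test_fully_shard_training.py",
    "TestFullyShard1DTrainingCore.test_train_parity_multi_group",
]


def build_env(overrides: Mapping[str, str], base: Mapping[str, str]) -> Dict[str, str]:
    """Copy the base environment and apply one configuration on top."""
    env = dict(base)
    env.update(overrides)
    return env


def describe(overrides: Mapping[str, str]) -> str:
    """Render a configuration the way it would be typed in a shell."""
    return " ".join(f"{key}={value}" for key, value in overrides.items())


def run_matrix(cmd: Sequence[str]) -> Dict[str, bool]:
    """Run cmd once per configuration; True means exit status 0."""
    results: Dict[str, bool] = {}
    for overrides in CONFIGS:
        proc = subprocess.run(list(cmd), env=build_env(overrides, os.environ))
        results[describe(overrides)] = proc.returncode == 0
    return results
```

Calling run_matrix(TEST_CMD) returns a mapping from each configuration string to a pass/fail flag. On the system described in the issue, only the last configuration passed.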

Impact

Questions for PyTorch team

Is there a way to limit the number of NCCL sub-communicators created by FSDP, or to reuse communicators across parameter groups?
Would it be feasible to set NCCL_NVLS_ENABLE=0 as a default or fallback when NVLS multicast allocation fails, rather than crashing?
Are there plans to coordinate with NVIDIA on graceful NVLS fallback behavior in NCCL?

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @pytorch/distributed

extent analysis

Fix Plan

To fix the issue of NVLink SHARP (NVLS) multicast slot exhaustion, we can try the following steps:

Disable NVLS: Set the environment variable NCCL_NVLS_ENABLE to 0 before running the PyTorch workload. This can be done using the command NCCL_NVLS_ENABLE=0 python your_script.py.
Limit NCCL sub-communicators: Reduce the number of NCCL sub-communicators that FSDP creates, for example by reusing communicators across parameter groups. PR #179402 takes this approach for the post-forward mesh by caching the DeviceMesh instead of rebuilding it on every fully_shard call.
Modify NCCL: Modify the NCCL library to handle NVLS multicast allocation failures more gracefully, such as by falling back to non-NVLS transport.

Example code to disable NVLS:

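import os

# Disable NVLS. Set this before NCCL is initialized (that is, before
# torch.distributed.init_process_group is called); setting it after the
# communicators exist has no effect on them.
os.environ['NCCL_NVLS_ENABLE'] = '0'

# Run your PyTorch workload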
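
Because the workaround gives up a performance optimization, it may be preferable to apply it only where the problem is known to occur. The sketch below is one hedged way to do that. The issue names only two data points (2.28.9 works, 2.29.7 fails), so matching 2.29.7 exactly is an assumption, and the helper names are illustrative.

```python
import os
from typing import Dict, MutableMapping, Sequence

# Only NCCL 2.29.7 is reported as failing in the issue. Whether other
# releases are affected is unknown, so the check is an exact match.
AFFECTED_NCCL = (2, 29, 7)


def nvls_workaround(nccl_version: Sequence[int]) -> Dict[str, str]:
    """Environment overrides to apply for the given NCCL version."""
    if tuple(nccl_version[:3]) == AFFECTED_NCCL:
        return {"NCCL_NVLS_ENABLE": "0"}
    return {}


def apply_workaround(nccl_version: Sequence[int],
                     environ: MutableMapping[str, str] = os.environ) -> bool:
    """Set the override unless the user already chose a value.

    Returns True if the environment was changed.
    """
    changed = False
    for key, value in nvls_workaround(nccl_version).items():
        if key not in environ:
            environ[key] = value
            changed = True
    return changed
```

In a training script this would be called before the process group is created, for example as apply_workaround(torch.cuda.nccl.version()), assuming that call returns a (major, minor, patch) tuple in the installed PyTorch build.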

Verification

To verify that the fix worked, run the PyTorch workload with the fix applied and check for the following:

The workload completes without crashing due to NVLS multicast slot exhaustion.
The performance of the workload is not significantly impacted by disabling NVLS.

Extra Tips

When disabling NVLS, monitor the performance of your workload to ensure it is not significantly impacted.
Consider modifying the FSDP implementation to limit the number of NCCL sub-communicators created or to reuse communicators across parameter groups.
If modifying NCCL, ensure that the changes are compatible with your specific use case and do not introduce any regressions.
