pytorch - ✅(Solved) Fix [FSDP] Multi-GPU distributed tests crash with NCCL 2.29.7 NVLS multicast slot exhaustion on H200 [1 pull request, 4 comments, 3 participants]

GitHub stats
pytorch/pytorch#178749 (fetched 2026-04-08 01:52:30)
Comments: 4
Participants: 3
Timeline: 56
Reactions: 0
Timeline (top): mentioned ×22, subscribed ×22, commented ×4, labeled ×4

FSDP distributed tests fail on 8x H200 systems with NCCL 2.29.7 due to NVLink SHARP (NVLS) multicast slot exhaustion. NCCL creates multicast groups for every sub-communicator and exhausts the NVSwitch hardware limit of 128 multicast slots, causing a fatal crash. This worked with NCCL 2.28.9.

Error Message

torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:93, unhandled cuda error, NCCL version 2.29.7
ncclUnhandledCudaError: Call to CUDA function failed.
Failed to bind NVLink SHARP (NVLS) Multicast memory of size 2097152 : CUDA error 2 'out of memory'.
This is usually caused by a system or configuration error in the Fabric Manager or NVSwitches.
Disable NVLS (NCCL_NVLS_ENABLE=0) if you wish to avoid this error in the future.

Root Cause

We traced this through the Fabric Manager logs and confirmed the following:

  1. NVSwitch hardware has a hard limit of 128 multicast slots
  2. NCCL 2.29.7 creates NVLS multicast groups for every sub-communicator, including 2-rank FSDP sub-communicators
  3. Each rank creates ~10 multicast groups per sub-communicator
  4. With 8 GPUs and multiple FSDP sub-communicators, all 128 slots are exhausted
  5. None of the groups are freed during the test — they are only released when communicators are destroyed
  6. When slot 128 is requested, cuMulticastBindMem fails with CUDA error 2 (out of memory)
  7. NCCL 2.29.7 treats this as a fatal error and crashes
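
The arithmetic behind these steps can be sketched roughly as follows. This is an illustration only: the 128-slot limit comes from the issue, but the assumption that each sub-communicator consumes about 10 slots in total is a simplification of the reported "~10 groups per rank per sub-communicator", and the function names are hypothetical.

```python
from typing import Optional

# Illustrative arithmetic only; GROUPS_PER_SUBCOMM is an assumed round number.
NVSWITCH_MULTICAST_SLOTS = 128  # hard limit reported in the issue
GROUPS_PER_SUBCOMM = 10         # assumption based on the reported "~10"


def subcomms_before_exhaustion(slots: int, groups_per_subcomm: int) -> int:
    """Number of sub-communicators that fit entirely within the slot budget."""
    return slots // groups_per_subcomm


def simulate(num_subcomms: int,
             slots: int = NVSWITCH_MULTICAST_SLOTS,
             groups_per_subcomm: int = GROUPS_PER_SUBCOMM) -> Optional[int]:
    """Return the 1-based index of the sub-communicator whose allocation
    fails, or None if every allocation fits.

    Nothing is freed along the way, mirroring the observation that groups
    are only released when communicators are destroyed.
    """
    used = 0
    for index in range(1, num_subcomms + 1):
        for _ in range(groups_per_subcomm):
            if used == slots:
                return index  # no free slot left for this request
            used += 1
    return None


if __name__ == "__main__":
    print(subcomms_before_exhaustion(NVSWITCH_MULTICAST_SLOTS, GROUPS_PER_SUBCOMM))
    print(simulate(20))
```

Under these assumed numbers, 12 sub-communicators fit and the 13th triggers the failure, which suggests why a test that builds one communicator per layer or parameter group can reach the limit quickly.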

Fix Action

Fix / Workaround

| Configuration | Result |
| --- | --- |
| NCCL_NVLS_TREE_ENABLE=0 | FAILED (still exhausts 128 multicast slots) |
| NCCL_NVLS_NCHANNELS=1 | FAILED (still exhausts slots) |
| NCCL_NVLS_TREE_ENABLE=0 + NCCL_NVLS_NCHANNELS=1 + NCCL_MAX_NCHANNELS=2 | FAILED |
| NCCL_NVLS_ENABLE=0 | PASSED (only working workaround) |

Any PyTorch workload that creates many NCCL sub-communicators (FSDP with multiple parameter groups, hybrid parallelism, etc.) on NVSwitch-connected H200 systems with NCCL 2.29.7 will hit this crash. The only workaround (NCCL_NVLS_ENABLE=0) disables a performance optimization entirely.

PR fix notes

PR #179402: [FSDP] Cache post-forward DeviceMesh to deduplicate NCCL communicators

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

  • -> #179402

fix: https://github.com/pytorch/pytorch/issues/178749

When fully_shard is called with reshard_after_forward as an int, _get_post_forward_mesh_info creates a new DeviceMesh (and its underlying NCCL communicators) on every call, even when the mesh topology is identical. On NVSwitch-connected systems (e.g. H200), this exhausts the 128 multicast slot hardware limit and crashes.

Cache the post-forward mesh info keyed on (reshard_after_forward, source mesh) so that repeated fully_shard calls reuse the same DeviceMesh and process groups. This reduces NCCL communicators per rank from O(n_layers) to O(1) for the post-forward mesh.

Authored with Claude.

Changed files

  • test/distributed/_composable/fsdp/test_fully_shard_training.py (modified, +45/-0)
  • torch/distributed/fsdp/_fully_shard/_fsdp_init.py (modified, +41/-10)
  • torch/distributed/fsdp/_fully_shard/_fully_shard.py (modified, +3/-0)


Environment

  • Hardware: 8x NVIDIA H200 (NV18 NVLink interconnect, 4x NVSwitch)
  • NCCL: 2.29.7+cuda12.8
  • CUDA: 13.0
  • Driver: 580.65.06
  • Fabric Manager: 580.65.06 (running, State: Completed, Status: Success)
  • OS: RHEL 9.6 (kernel 6.12.0)
  • PyTorch: upstream main (commit 5bfd4be)

Reproducer

# Fails — NVLS enabled (default)
python test/distributed/_composable/fsdp/test_fully_shard_training.py \
  TestFullyShard1DTrainingCore.test_train_parity_multi_group

# Passes — NVLS disabled
NCCL_NVLS_ENABLE=0 python test/distributed/_composable/fsdp/test_fully_shard_training.py \
  TestFullyShard1DTrainingCore.test_train_parity_multi_group

Root Cause Analysis

Fabric Manager log evidence:

[INFO] multicast group 126 is allocated.
[INFO] multicast group 127 is allocated.
[ERROR] all the NVSwitch multicast resources/slot ids are used for partition id -1.
[ERROR] failed to allocated resource to the multicast team setup request id ...
  • 128 groups allocated (groups 0–127) before failure
  • 0 groups freed during active test
  • Peak concurrent active groups: 127
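
A small script along these lines could help scan a Fabric Manager log for the same evidence. It is a sketch that assumes the log lines look like the excerpt above; the regular expression, the marker string, and the function name are illustrative and have not been checked against other Fabric Manager versions.

```python
import re
from typing import Iterable, List, Tuple

# Patterns taken from the log excerpt in the issue; other Fabric Manager
# versions may word these messages differently (assumption).
ALLOCATED_RE = re.compile(r"multicast group (\d+) is allocated")
EXHAUSTED_MARKER = "all the NVSwitch multicast resources/slot ids are used"


def summarize_fm_log(lines: Iterable[str]) -> Tuple[List[int], bool]:
    """Return (allocated group ids in order, whether exhaustion was logged)."""
    allocated: List[int] = []
    exhausted = False
    for line in lines:
        match = ALLOCATED_RE.search(line)
        if match:
            allocated.append(int(match.group(1)))
        elif EXHAUSTED_MARKER in line:
            exhausted = True
    return allocated, exhausted


SAMPLE = """\
[INFO] multicast group 126 is allocated.
[INFO] multicast group 127 is allocated.
[ERROR] all the NVSwitch multicast resources/slot ids are used for partition id -1.
[ERROR] failed to allocated resource to the multicast team setup request id ...
""".splitlines()

if __name__ == "__main__":
    ids, hit_limit = summarize_fm_log(SAMPLE)
    print(f"highest group id seen: {max(ids)}, exhaustion logged: {hit_limit}")
```

On a real system the input would be the lines of the Fabric Manager log file rather than the embedded sample.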

System verification:

  • Fabric Manager is running and healthy (State: Completed, Status: Success, CliqueId: 0)
  • All 4 NVSwitches present and functional (/dev/nvidia-nvswitch0-3)
  • NVLink active: 18 links per GPU at 26.5 GB/s
  • CUDA multicast attribute reports supported (value=1)
  • Multicast works correctly for the first 128 groups — this is purely a resource exhaustion issue

Regression from NCCL 2.28.9

NCCL 2.28.9 either:

  • Created fewer multicast groups (did not create per-pair 2-rank NVLS groups for sub-communicators), or
  • Handled the allocation failure gracefully by falling back to non-NVLS transport

NCCL 2.29.7 appears to use NVLS more aggressively for sub-communicators, to have removed the graceful fallback path, or both, which makes the resource exhaustion fatal.

NCCL env vars tested

We tested all available NVLS-related NCCL knobs — none resolve the issue except fully disabling NVLS:

| Configuration | Result |
| --- | --- |
| NCCL_NVLS_TREE_ENABLE=0 | FAILED (still exhausts 128 multicast slots) |
| NCCL_NVLS_NCHANNELS=1 | FAILED (still exhausts slots) |
| NCCL_NVLS_TREE_ENABLE=0 + NCCL_NVLS_NCHANNELS=1 + NCCL_MAX_NCHANNELS=2 | FAILED |
| NCCL_NVLS_ENABLE=0 | PASSED (only working workaround) |

The available NVLS env vars in NCCL 2.29.7 are: NCCL_NVLS_ENABLE, NCCL_NVLS_TREE_ENABLE, NCCL_NVLS_NCHANNELS, NCCL_NVLS_CHUNKSIZE, NCCL_NVLSTREE_MAX_CHUNKSIZE. None provide a way to limit total multicast group allocation or disable NVLS selectively for sub-communicators.
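
For anyone repeating this sweep, a harness along the following lines may be convenient. It is a sketch: the configurations and the reproducer command are copied from the issue, while `build_env`, `describe`, and the overall structure are hypothetical helpers. The actual test invocation is left commented out because it only makes sense on a multi-GPU NVSwitch machine with a PyTorch checkout.

```python
import os
from typing import Dict, List, Mapping, Optional

# The four configurations reported in the issue.
CONFIGS: List[Dict[str, str]] = [
    {"NCCL_NVLS_TREE_ENABLE": "0"},
    {"NCCL_NVLS_NCHANNELS": "1"},
    {"NCCL_NVLS_TREE_ENABLE": "0", "NCCL_NVLS_NCHANNELS": "1",
     "NCCL_MAX_NCHANNELS": "2"},
    {"NCCL_NVLS_ENABLE": "0"},
]

# Reproducer command from the issue.
REPRODUCER = [
    "python",
    "test/distributed/_composable/fsdp/test_fully_shard_training.py",
    "TestFullyShard1DTrainingCore.test_train_parity_multi_group",
]


def build_env(overrides: Mapping[str, str],
              base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the base environment (default: os.environ) and apply overrides."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def describe(overrides: Mapping[str, str]) -> str:
    """Render a configuration the way the table above writes it."""
    return " + ".join(f"{key}={value}" for key, value in overrides.items())


if __name__ == "__main__":
    for overrides in CONFIGS:
        print("would run with:", describe(overrides))
        # On the target machine, uncomment to run the reproducer:
        # import subprocess
        # subprocess.run(REPRODUCER, env=build_env(overrides))
```

Passing a fresh environment to each run keeps one configuration from leaking into the next, which matters when the variables under test change NCCL behavior at communicator creation.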

Questions for PyTorch team

  1. Is there a way to limit the number of NCCL sub-communicators created by FSDP, or to reuse communicators across parameter groups?
  2. Would it be feasible to set NCCL_NVLS_ENABLE=0 as a default or fallback when NVLS multicast allocation fails, rather than crashing?
  3. Are there plans to coordinate with NVIDIA on graceful NVLS fallback behavior in NCCL?
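
Until a fallback exists inside NCCL or PyTorch, the second question can be approximated outside the process. The sketch below is a launcher-level workaround, not an in-process fallback: it reruns the whole command with NCCL_NVLS_ENABLE=0 if the first attempt fails and the NVLS bind error text from the issue appears in stderr. The helper names and the assumption that the message reaches stderr are illustrative.

```python
import os
import subprocess
from typing import Callable, Dict, Mapping, Optional, Sequence, Tuple

# Error text quoted from the issue; assumed to show up in the job's stderr.
NVLS_BIND_ERROR = "Failed to bind NVLink SHARP (NVLS) Multicast memory"

Runner = Callable[[Sequence[str], Mapping[str, str]], Tuple[int, str]]


def subprocess_runner(cmd: Sequence[str], env: Mapping[str, str]) -> Tuple[int, str]:
    """Run the command and return (exit code, captured stderr)."""
    proc = subprocess.run(list(cmd), env=dict(env), capture_output=True, text=True)
    return proc.returncode, proc.stderr


def run_with_nvls_fallback(cmd: Sequence[str],
                           env: Optional[Mapping[str, str]] = None,
                           runner: Runner = subprocess_runner) -> Tuple[int, str]:
    """Run cmd once; on an NVLS bind failure, retry once with NVLS disabled."""
    base_env: Dict[str, str] = dict(os.environ if env is None else env)
    code, stderr = runner(cmd, base_env)
    nvls_already_off = base_env.get("NCCL_NVLS_ENABLE") == "0"
    if code != 0 and NVLS_BIND_ERROR in stderr and not nvls_already_off:
        retry_env = dict(base_env, NCCL_NVLS_ENABLE="0")
        code, stderr = runner(cmd, retry_env)
    return code, stderr
```

The `runner` parameter exists so the retry logic can be exercised without GPUs. The cost of this approach is a full restart of the job and the loss of the NVLS optimization on the retry.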

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @pytorch/distributed

extent analysis

Fix Plan

The following options address the NVLink SHARP (NVLS) multicast slot exhaustion:

  • Disable NVLS: Set the environment variable NCCL_NVLS_ENABLE to 0 before running the PyTorch workload. This can be done using the command NCCL_NVLS_ENABLE=0 python your_script.py.
  • Limit NCCL sub-communicators: If possible, limit the number of NCCL sub-communicators created by FSDP. This might involve modifying the FSDP implementation or configuring it to reuse communicators across parameter groups.
  • Modify NCCL: Modify the NCCL library to handle NVLS multicast allocation failures more gracefully, such as by falling back to non-NVLS transport.

Example code to disable NVLS:

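import os

# Disable NVLS. Set this before the NCCL process group is initialized
# (for example, before calling torch.distributed.init_process_group),
# so that NCCL sees it when it creates its communicators.
os.environ['NCCL_NVLS_ENABLE'] = '0'

# Run your PyTorch workload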

Verification

To verify that the fix worked, run the PyTorch workload with the fix applied and check for the following:

  • The workload completes without crashing due to NVLS multicast slot exhaustion.
  • The performance of the workload is not significantly impacted by disabling NVLS.

Extra Tips

  • When disabling NVLS, monitor the performance of your workload to ensure it is not significantly impacted.
  • Consider modifying the FSDP implementation to limit the number of NCCL sub-communicators created or to reuse communicators across parameter groups.
  • If modifying NCCL, ensure that the changes are compatible with your specific use case and do not introduce any regressions.
