pytorch - ✅ (Solved) Fix ProcessGroupNCCL scalable communicator init may reuse stale UniqueNCCLID store keys [1 pull request, 1 comment, 2 participants]

pytorch/pytorch#178473 (fetched 2026-04-08 01:30:20)

Root Cause

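ProcessGroupNCCL::allgatherUniqueNCCLIDs() builds its store keys in the scalable init path without scoping them by communicator initialization round, so a later round can read a stale ncclUniqueId left behind by an earlier one. Because this happens during communicator bootstrap rather than in a normal collective, the failure mode can be more severe and harder to diagnose, for example: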

  • communicator initialization hangs
  • timeouts
  • bootstrap failures that look like store/network/rank liveness issues

Fix Action

Fixed

PR fix notes

PR #178493: [c10d][NCCL] avoid reusing scalable init store keys across communicat…

Description (problem / solution / changelog)

Summary

Fix a scalable NCCL initialization bug where allgatherUniqueNCCLIDs() reused the same store key namespace across different communicator initialization rounds.

Under scalable init, later communicator-init rounds may observe stale ncclUniqueId values left behind by earlier rounds, which can lead to mismatched communicator IDs across ranks and eventual bootstrap failures.

This change prefixes scalable-init store keys with a per-communicator counter so that different initialization rounds no longer share the same key namespace.
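
The effect of the prefix can be illustrated with a toy model. This is a minimal sketch, not PyTorch code: `FakeStore`, `make_key`, `simulate`, and the key strings are hypothetical stand-ins for the c10d store and the scalable-init key scheme. It assumes, as the issue describes, that a fast rank may read a key before the skewed root has published the ID for the current round.

```python
class FakeStore:
    """Dict-backed stand-in for a key-value store; values persist across rounds."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


def make_key(root_idx, comm_counter=None):
    # Hypothetical key format. Without a counter, every round shares one key.
    base = f"unique_id_root{root_idx}"
    return base if comm_counter is None else f"{comm_counter}/{base}"


def simulate(scoped):
    """Return what a fast peer sees in round 1 before the root publishes."""
    store = FakeStore()
    # Round 0: the root publishes its ID under the round-0 key.
    store.set(make_key(0, 0 if scoped else None), "id-round0")
    # Round 1: a fast peer reads BEFORE the skewed root has published.
    key = make_key(0, 1 if scoped else None)
    try:
        seen = store.get(key)
    except KeyError:
        seen = None  # a blocking store would wait here until the key appears
    store.set(key, "id-round1")  # the root publishes late
    return seen


stale = simulate(scoped=False)
fresh = simulate(scoped=True)
print(stale, fresh)
```

In this model the unscoped run hands the peer the round-0 ID, while the scoped run finds no value under the round-1 key, which is the situation in which a blocking store read would wait for the root instead of returning old data.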

Fixes #178473

External Repro

I reproduced this on a 4xH800 setup using a small external repro that:

  • enables scalable init with TORCH_NCCL_RANKS_PER_ROOT=1
  • creates multiple communicator-init rounds within the same process group
  • introduces rank skew between rounds
<details> <summary>External repro script</summary>
import argparse
import os
import sys
import time
import traceback
from datetime import timedelta

import torch
import torch.distributed as dist


def log(msg: str) -> None:
    rank = int(os.environ.get("RANK", "-1"))
    print(f"[rank {rank}] {msg}", flush=True)


def check_tensor(t: torch.Tensor, world_size: int, round_idx: int, device_idx: int) -> None:
    expected = float(sum(range(1, world_size + 1)))
    got = float(t.item())
    if got != expected:
        raise RuntimeError(
            f"round={round_idx} device=cuda:{device_idx} got {got}, expected {expected}"
        )


def run_round(rank: int, world_size: int, round_idx: int, skew_rank: int, skew_s: float) -> None:
    device_idx = (rank + round_idx) % world_size
    device = torch.device("cuda", device_idx)

    if rank == skew_rank and round_idx > 0:
        log(f"sleep {skew_s}s before round {round_idx} on cuda:{device_idx}")
        time.sleep(skew_s)

    torch.cuda.set_device(device)
    x = torch.tensor([float(rank + 1)], device=device)

    log(f"enter round {round_idx} on cuda:{device_idx}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize(device)
    check_tensor(x, world_size, round_idx, device_idx)
    log(f"finish round {round_idx} on cuda:{device_idx}, value={x.item()}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--rounds", type=int, default=None)
    parser.add_argument("--sleep-seconds", type=float, default=3.0)
    parser.add_argument("--pg-timeout", type=int, default=60)
    args = parser.parse_args()

    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    os.environ.setdefault("TORCH_NCCL_RANKS_PER_ROOT", "1")
    os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")
    os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=args.pg_timeout),
    )

    try:
        for round_idx in range(args.rounds if args.rounds is not None else world_size):
            skew_rank = round_idx % world_size
            run_round(rank, world_size, round_idx, skew_rank, args.sleep_seconds)
    finally:
        if dist.is_initialized():
            dist.destroy_process_group()


if __name__ == "__main__":
    try:
        main()
    except Exception:
        traceback.print_exc()
        sys.exit(1)
</details>

Observed Failure

In the failing run:

  • round 0 completed successfully on all ranks
  • round 1 triggered another scalable communicator initialization
  • different ranks entered ncclCommInitRankScalable with different commId values
  • the run then failed during NCCL bootstrap with repeated socketPollConnect: connect returned Connection refused
<details> <summary>Relevant log excerpts</summary>
[rank 3] finish round 0 on cuda:3, value=10.0
[rank 0] finish round 0 on cuda:0, value=10.0
[rank 2] finish round 0 on cuda:2, value=10.0
[rank 1] finish round 0 on cuda:1, value=10.0

[rank 1] sleep 3.0s before round 1 on cuda:2
[rank 3] enter round 1 on cuda:0
[rank 0] enter round 1 on cuda:1
[rank 2] enter round 1 on cuda:3
n105-063-105:1046:1121 [1] NCCL INFO ncclCommInitRankScalable comm 0x981bf30 rank 0 nranks 4 cudaDev 1 nvmlDev 1 busId 67020 commId 0x60e7eda7645f7342 - Init START
n105-063-105:1049:1120 [0] NCCL INFO ncclCommInitRankScalable comm 0x91431c0 rank 3 nranks 4 cudaDev 0 nvmlDev 0 busId 65020 commId 0xb0374f7b726e0ae3 - Init START
n105-063-105:1048:1122 [3] NCCL INFO ncclCommInitRankScalable comm 0xa0bde40 rank 2 nranks 4 cudaDev 3 nvmlDev 3 busId 6b020 commId 0x60e7eda7645f7342 - Init START
n105-063-105:1047:1125 [2] NCCL INFO ncclCommInitRankScalable comm 0x97370d0 rank 1 nranks 4 cudaDev 2 nvmlDev 2 busId 69020 commId 0x60e7eda7645f7342 - Init START
NCCL WARN socketPollConnect: connect returned Connection refused, exceeded error retry count (35)
</details>
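
The divergence in the excerpt can be spotted mechanically by grouping ranks by the commId they report. The helper below is a hypothetical diagnostic sketch (`group_by_comm_id` and the regular expression are not part of PyTorch or NCCL); it assumes the `ncclCommInitRankScalable ... Init START` line format shown in the excerpt above.

```python
import re
from collections import defaultdict

# Init START lines copied from the log excerpt above.
LOG = """\
n105-063-105:1046:1121 [1] NCCL INFO ncclCommInitRankScalable comm 0x981bf30 rank 0 nranks 4 cudaDev 1 nvmlDev 1 busId 67020 commId 0x60e7eda7645f7342 - Init START
n105-063-105:1049:1120 [0] NCCL INFO ncclCommInitRankScalable comm 0x91431c0 rank 3 nranks 4 cudaDev 0 nvmlDev 0 busId 65020 commId 0xb0374f7b726e0ae3 - Init START
n105-063-105:1048:1122 [3] NCCL INFO ncclCommInitRankScalable comm 0xa0bde40 rank 2 nranks 4 cudaDev 3 nvmlDev 3 busId 6b020 commId 0x60e7eda7645f7342 - Init START
n105-063-105:1047:1125 [2] NCCL INFO ncclCommInitRankScalable comm 0x97370d0 rank 1 nranks 4 cudaDev 2 nvmlDev 2 busId 69020 commId 0x60e7eda7645f7342 - Init START
"""

# Capture the NCCL rank and the commId from each Init START line.
PATTERN = re.compile(
    r"ncclCommInitRankScalable comm \S+ rank (\d+) nranks \d+ .*? commId (0x[0-9a-f]+)"
)


def group_by_comm_id(text):
    """Map each commId to the sorted list of ranks that reported it."""
    groups = defaultdict(list)
    for match in PATTERN.finditer(text):
        groups[match.group(2)].append(int(match.group(1)))
    return {comm_id: sorted(ranks) for comm_id, ranks in groups.items()}


groups = group_by_comm_id(LOG)
for comm_id, ranks in groups.items():
    print(comm_id, ranks)
```

A healthy initialization round should produce a single group containing every rank; here rank 3 ends up alone under a different commId.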

Test Plan

  • ran the external repro on a 4xH800 worker
  • observed round-0 success and round-1 communicator ID divergence before the fix
  • verified the root cause in ProcessGroupNCCL::allgatherUniqueNCCLIDs()

Changed files

  • torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (modified, +3/-2)

🐛 Describe the bug

ProcessGroupNCCL::allgatherUniqueNCCLIDs() in the scalable communicator initialization path appears to use store keys that are not scoped by communicator initialization round.

In the regular unique ID broadcast path, store keys are already differentiated by communicator counter / sequence. However, in the scalable init path, the keys are derived from root position and do not appear to be isolated across repeated communicator initializations in the same process group lifetime.

This means repeated scalable communicator initialization may potentially reuse the same Store keys for different rounds, which can allow later rounds to observe stale ncclUniqueId values from earlier rounds.

This is important because it happens during communicator bootstrap rather than in a normal collective. If stale IDs are observed here, the failure mode can be more severe and harder to diagnose, for example:

  • communicator initialization hangs
  • timeouts
  • bootstrap failures that look like store/network/rank liveness issues

The issue is also subtle because the retrieved value can still have the correct byte length, so basic size checks may still pass before the failure surfaces later in communicator creation.
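
The point about size checks can be made concrete with a small sketch. The names `looks_valid`, `round0_id`, and `round1_id` are hypothetical, and the byte patterns are placeholders rather than real IDs; the only fact assumed from NCCL is that an ncclUniqueId is a fixed 128-byte blob.

```python
NCCL_UNIQUE_ID_BYTES = 128  # fixed size of an ncclUniqueId


def looks_valid(blob):
    """A length-only check, similar in spirit to a basic size validation."""
    return len(blob) == NCCL_UNIQUE_ID_BYTES


# Placeholder IDs for two initialization rounds (not real NCCL IDs).
round0_id = bytes([0xAA]) * NCCL_UNIQUE_ID_BYTES
round1_id = bytes([0xBB]) * NCCL_UNIQUE_ID_BYTES

# A rank that reads the shared key too early gets the round-0 value.
fetched = round0_id

size_ok = looks_valid(fetched)
matches_current_round = fetched == round1_id
print(size_ok, matches_current_round)
```

The stale value passes the length check but is not the ID the other ranks use for the current round, so the mismatch only surfaces later, inside communicator creation.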

Relevant code path:

  • ProcessGroupNCCL::initNCCLComm()
  • scalable init branch
  • ProcessGroupNCCL::allgatherUniqueNCCLIDs()

Proposed fix direction: Scope the UniqueNCCLID store keys by communicator initialization counter / sequence, similar to the regular unique ID broadcast path, so each communicator initialization round uses a distinct key namespace.

I found this while comparing an internal branch against upstream, and I did not find an existing issue that appears to describe this exact scalable-init store-key reuse problem.

Versions

PyTorch version: main branch source inspection
CUDA used to build PyTorch: N/A
Is CUDA available: N/A
Python version: N/A
OS: Linux

Issue identified from source inspection in ProcessGroupNCCL.cpp, specifically the scalable communicator initialization path.

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan

extent analysis

Fix Plan

To address the issue of store key reuse in the scalable communicator initialization path, we need to scope the UniqueNCCLID store keys by communicator initialization counter/sequence. Here are the steps:

  • Modify the ProcessGroupNCCL::allgatherUniqueNCCLIDs() function to include the communicator initialization counter in the store key.
  • Update the store key generation to use a unique namespace for each communicator initialization round.

Example code changes:

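// Illustrative sketch only: getStoreKey, the key format, and commInitCounter
// are hypothetical names, not the identifiers used in ProcessGroupNCCL.cpp.
#include <string>

std::string getStoreKey(int rank, int commInitCounter) {
  // Prefix the key with the per-communicator counter so that each
  // initialization round gets its own key namespace.
  return std::to_string(commInitCounter) + "/nccl_unique_id_" + std::to_string(rank);
}

// ...

// At the call site in the scalable init path (rank and commInitCounter come
// from the surrounding code), build the key from the current round's counter.
std::string storeKey = getStoreKey(rank, commInitCounter);

  • Ensure that the commInitCounter is incremented for each new communicator initialization round.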

Verification

To verify the fix, run the scalable communicator initialization path through multiple rounds of initialization and check that:

  • The store keys are unique for each round.
  • The communicator initialization completes successfully without hangs or timeouts.
  • The retrieved ncclUniqueId values are correct and not stale.
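
The first check can be automated if the keys written in each round are recorded, for example through added logging. The sketch below assumes such a record exists as `(round_idx, key)` pairs; `keys_reused_across_rounds` and the key strings are hypothetical and only illustrate the uniqueness test.

```python
from collections import defaultdict


def keys_reused_across_rounds(records):
    """Return keys that were written in more than one initialization round."""
    rounds_by_key = defaultdict(set)
    for round_idx, key in records:
        rounds_by_key[key].add(round_idx)
    return {
        key: sorted(rounds)
        for key, rounds in rounds_by_key.items()
        if len(rounds) > 1
    }


# Hypothetical records: unscoped keys repeat across rounds, scoped keys do not.
unscoped = [
    (0, "unique_id_root0"), (0, "unique_id_root1"),
    (1, "unique_id_root0"), (1, "unique_id_root1"),
]
scoped = [
    (0, "0/unique_id_root0"), (0, "0/unique_id_root1"),
    (1, "1/unique_id_root0"), (1, "1/unique_id_root1"),
]

reused_before = keys_reused_across_rounds(unscoped)
reused_after = keys_reused_across_rounds(scoped)
print(reused_before)
print(reused_after)
```

An empty result for the recorded keys indicates that no two rounds share a key namespace.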

Extra Tips

  • Review the ProcessGroupNCCL::initNCCLComm() function to ensure that the commInitCounter is properly incremented and passed to the allgatherUniqueNCCLIDs() function.
  • Consider adding logging or debugging statements to verify the store key generation and ncclUniqueId retrieval.
