pytorch - ✅ (Solved) ProcessGroupNCCL scalable communicator init may reuse stale UniqueNCCLID store keys [1 pull request, 1 comment, 2 participants]

pytorch · 2026-03-26 03:55:31

pytorch/pytorch#178473•Fetched 2026-04-08 01:30:20

Author

Thinkin999

Participants

d4l3k

Thinkin999

Timeline (top)

mentioned ×12, subscribed ×12, labeled ×2, commented ×1

Root Cause

Reusing scalable-init store keys across communicator initialization rounds is serious because it happens during communicator bootstrap rather than in a normal collective. If a rank reads a stale ncclUniqueId here, the failure mode can be more severe and harder to diagnose, for example:

communicator initialization hangs
timeouts
bootstrap failures that look like store/network/rank liveness issues

Fix Action

Fixed

Fixed by PR: [c10d][NCCL] avoid reusing scalable init store keys across communicat… (https://github.com/pytorch/pytorch/pull/178493)

PR fix notes

PR #178493: [c10d][NCCL] avoid reusing scalable init store keys across communicat…

Repository: pytorch/pytorch
Author: Thinkin999
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/178493

Description (problem / solution / changelog)

Summary

Fix a scalable NCCL initialization bug where allgatherUniqueNCCLIDs() reused the same store key namespace across different communicator initialization rounds.

Under scalable init, later communicator-init rounds may observe stale ncclUniqueId values left behind by earlier rounds, which can lead to mismatched communicator IDs across ranks and eventual bootstrap failures.

This change prefixes scalable-init store keys with a per-communicator counter so that different initialization rounds no longer share the same key namespace.
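
The effect of the prefix can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: `FakeStore`, `unscoped_key`, `scoped_key`, and the literal key format are hypothetical stand-ins and are not taken from ProcessGroupNCCL.cpp. It models a fast rank reading the root's key in round 1 before the root has written its new ID.

```python
from typing import Dict, Optional


class FakeStore:
    """Minimal in-memory stand-in for a c10d Store (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def try_get(self, key: str) -> Optional[bytes]:
        # A real store would block until the key exists. Returning None
        # here stands in for "the reader must wait for the writer".
        return self._data.get(key)


def unscoped_key(root_idx: int) -> str:
    # Assumed pre-fix shape: the key depends only on the root position.
    return f"UniqueNCCLID:{root_idx}"


def scoped_key(comm_counter: int, root_idx: int) -> str:
    # Assumed post-fix shape: the key is prefixed with a per-communicator counter.
    return f"{comm_counter}:UniqueNCCLID:{root_idx}"


def early_read(scoped: bool) -> Optional[bytes]:
    """Round 0 publishes an ID, then a fast rank reads in round 1
    before the root has published the round-1 ID."""
    store = FakeStore()
    key_round0 = scoped_key(0, 0) if scoped else unscoped_key(0)
    store.set(key_round0, b"id-from-round-0")
    key_round1 = scoped_key(1, 0) if scoped else unscoped_key(0)
    return store.try_get(key_round1)


if __name__ == "__main__":
    print("unscoped:", early_read(scoped=False))
    print("scoped:", early_read(scoped=True))
```

With unscoped keys the early reader gets the round-0 value back immediately. With scoped keys the round-1 key does not exist yet, so a blocking store would make the reader wait for the fresh ID instead.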

Fixes #178473

External Repro

I reproduced this on a 4xH800 setup using a small external repro that:

enables scalable init with TORCH_NCCL_RANKS_PER_ROOT=1
creates multiple communicator-init rounds within the same process group
introduces rank skew between rounds

<details> <summary>External repro script</summary>

import argparse
import os
import sys
import time
import traceback
from datetime import timedelta

import torch
import torch.distributed as dist


def log(msg: str) -> None:
    rank = int(os.environ.get("RANK", "-1"))
    print(f"[rank {rank}] {msg}", flush=True)


def check_tensor(t: torch.Tensor, world_size: int, round_idx: int, device_idx: int) -> None:
    expected = float(sum(range(1, world_size + 1)))
    got = float(t.item())
    if got != expected:
        raise RuntimeError(
            f"round={round_idx} device=cuda:{device_idx} got {got}, expected {expected}"
        )


def run_round(rank: int, world_size: int, round_idx: int, skew_rank: int, skew_s: float) -> None:
    device_idx = (rank + round_idx) % world_size
    device = torch.device("cuda", device_idx)

    if rank == skew_rank and round_idx > 0:
        log(f"sleep {skew_s}s before round {round_idx} on cuda:{device_idx}")
        time.sleep(skew_s)

    torch.cuda.set_device(device)
    x = torch.tensor([float(rank + 1)], device=device)

    log(f"enter round {round_idx} on cuda:{device_idx}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize(device)
    check_tensor(x, world_size, round_idx, device_idx)
    log(f"finish round {round_idx} on cuda:{device_idx}, value={x.item()}")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--rounds", type=int, default=None)
    parser.add_argument("--sleep-seconds", type=float, default=3.0)
    parser.add_argument("--pg-timeout", type=int, default=60)
    args = parser.parse_args()

    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    os.environ.setdefault("TORCH_NCCL_RANKS_PER_ROOT", "1")
    os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")
    os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(seconds=args.pg_timeout),
    )

    try:
        for round_idx in range(args.rounds if args.rounds is not None else world_size):
            skew_rank = round_idx % world_size
            run_round(rank, world_size, round_idx, skew_rank, args.sleep_seconds)
    finally:
        if dist.is_initialized():
            dist.destroy_process_group()


if __name__ == "__main__":
    try:
        main()
    except Exception:
        traceback.print_exc()
        sys.exit(1)

</details>

Observed Failure

In the failing run:

round 0 completed successfully on all ranks
round 1 triggered another scalable communicator initialization
different ranks entered ncclCommInitRankScalable with different commId values
the run then failed during NCCL bootstrap with repeated socketPollConnect: connect returned Connection refused

<details> <summary>Relevant log excerpts</summary>

[rank 3] finish round 0 on cuda:3, value=10.0
[rank 0] finish round 0 on cuda:0, value=10.0
[rank 2] finish round 0 on cuda:2, value=10.0
[rank 1] finish round 0 on cuda:1, value=10.0

[rank 1] sleep 3.0s before round 1 on cuda:2
[rank 3] enter round 1 on cuda:0
[rank 0] enter round 1 on cuda:1
[rank 2] enter round 1 on cuda:3

n105-063-105:1046:1121 [1] NCCL INFO ncclCommInitRankScalable comm 0x981bf30 rank 0 nranks 4 cudaDev 1 nvmlDev 1 busId 67020 commId 0x60e7eda7645f7342 - Init START
n105-063-105:1049:1120 [0] NCCL INFO ncclCommInitRankScalable comm 0x91431c0 rank 3 nranks 4 cudaDev 0 nvmlDev 0 busId 65020 commId 0xb0374f7b726e0ae3 - Init START
n105-063-105:1048:1122 [3] NCCL INFO ncclCommInitRankScalable comm 0xa0bde40 rank 2 nranks 4 cudaDev 3 nvmlDev 3 busId 6b020 commId 0x60e7eda7645f7342 - Init START
n105-063-105:1047:1125 [2] NCCL INFO ncclCommInitRankScalable comm 0x97370d0 rank 1 nranks 4 cudaDev 2 nvmlDev 2 busId 69020 commId 0x60e7eda7645f7342 - Init START

NCCL WARN socketPollConnect: connect returned Connection refused, exceeded error retry count (35)

</details>
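
The commId divergence in the excerpt above can be spotted mechanically. The following is a small sketch, not part of the PR: the helper name `group_ranks_by_comm_id` and the regular expression are assumptions fitted to the log lines quoted in this report, and other NCCL versions may format these lines differently.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches the "Init START" lines quoted above and captures rank and commId.
INIT_RE = re.compile(
    r"ncclCommInitRankScalable comm \S+ rank (\d+) nranks \d+ "
    r".*?commId (0x[0-9a-fA-F]+)"
)


def group_ranks_by_comm_id(log_text: str) -> Dict[str, List[int]]:
    """Group NCCL ranks by the commId they reported at init start."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in log_text.splitlines():
        match = INIT_RE.search(line)
        if match:
            groups[match.group(2)].append(int(match.group(1)))
    return {comm_id: sorted(ranks) for comm_id, ranks in groups.items()}


SAMPLE_LOG = """\
n105-063-105:1046:1121 [1] NCCL INFO ncclCommInitRankScalable comm 0x981bf30 rank 0 nranks 4 cudaDev 1 nvmlDev 1 busId 67020 commId 0x60e7eda7645f7342 - Init START
n105-063-105:1049:1120 [0] NCCL INFO ncclCommInitRankScalable comm 0x91431c0 rank 3 nranks 4 cudaDev 0 nvmlDev 0 busId 65020 commId 0xb0374f7b726e0ae3 - Init START
n105-063-105:1048:1122 [3] NCCL INFO ncclCommInitRankScalable comm 0xa0bde40 rank 2 nranks 4 cudaDev 3 nvmlDev 3 busId 6b020 commId 0x60e7eda7645f7342 - Init START
n105-063-105:1047:1125 [2] NCCL INFO ncclCommInitRankScalable comm 0x97370d0 rank 1 nranks 4 cudaDev 2 nvmlDev 2 busId 69020 commId 0x60e7eda7645f7342 - Init START
"""

if __name__ == "__main__":
    groups = group_ranks_by_comm_id(SAMPLE_LOG)
    for comm_id, ranks in groups.items():
        print(comm_id, ranks)
    if len(groups) > 1:
        print("commId mismatch across ranks")
```

On the four lines from the failing run, ranks 0, 1, and 2 share one commId while rank 3 reports a different one, which is the mismatch described above.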

Test Plan

ran the external repro on a 4xH800 worker
observed round-0 success and round-1 communicator ID divergence before the fix
verified the root cause in ProcessGroupNCCL::allgatherUniqueNCCLIDs()

Changed files

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (modified, +3/-2)

🐛 Describe the bug

ProcessGroupNCCL::allgatherUniqueNCCLIDs() in the scalable communicator initialization path appears to use store keys that are not scoped by communicator initialization round.

In the regular unique ID broadcast path, store keys are already differentiated by communicator counter / sequence. However, in the scalable init path, the keys are derived from root position and do not appear to be isolated across repeated communicator initializations in the same process group lifetime.

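This means repeated scalable communicator initialization may reuse the same Store keys for different rounds, which can allow later rounds to observe stale ncclUniqueId values from earlier rounds.

This is important because it happens during communicator bootstrap rather than in a normal collective. If stale IDs are observed here, the failure mode can be more severe and harder to diagnose, for example: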

communicator initialization hangs
timeouts
bootstrap failures that look like store/network/rank liveness issues

The issue is also subtle because the retrieved value can still have the correct byte length, so basic size checks may still pass before the failure surfaces later in communicator creation.
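
A short sketch of why a length check cannot catch this. `NCCL_UNIQUE_ID_BYTES` is 128 in NCCL's public header. The helper `passes_size_check` and the two byte strings are hypothetical placeholders for illustration, not real IDs or PyTorch code.

```python
NCCL_UNIQUE_ID_BYTES = 128  # size of ncclUniqueId as defined in nccl.h


def passes_size_check(blob: bytes) -> bool:
    """Length-only validation of a value read back from the store."""
    return len(blob) == NCCL_UNIQUE_ID_BYTES


# Placeholder payloads standing in for the IDs of two different rounds.
stale_id = bytes([0xAA]) * NCCL_UNIQUE_ID_BYTES  # left over from round 0
fresh_id = bytes([0xBB]) * NCCL_UNIQUE_ID_BYTES  # intended for round 1

if __name__ == "__main__":
    print("stale passes:", passes_size_check(stale_id))
    print("fresh passes:", passes_size_check(fresh_id))
    print("same ID:", stale_id == fresh_id)
```

Both values have the expected length, so the check accepts either one even though they identify different communicators. Only the key namespace, not the payload size, can separate one round from another.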

Relevant code path:

ProcessGroupNCCL::initNCCLComm()
scalable init branch
ProcessGroupNCCL::allgatherUniqueNCCLIDs()

Proposed fix direction: Scope the UniqueNCCLID store keys by communicator initialization counter / sequence, similar to the regular unique ID broadcast path, so each communicator initialization round uses a distinct key namespace.

I found this while comparing an internal branch against upstream, and I did not find an existing issue that appears to describe this exact scalable-init store-key reuse problem.

Versions

PyTorch version: main branch source inspection
CUDA used to build PyTorch: N/A
Is CUDA available: N/A
Python version: N/A
OS: Linux

Issue identified from source inspection in ProcessGroupNCCL.cpp, specifically the scalable communicator initialization path.

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan

extent analysis

Fix Plan

To address the issue of store key reuse in the scalable communicator initialization path, we need to scope the UniqueNCCLID store keys by communicator initialization counter/sequence. Here are the steps:

Modify the ProcessGroupNCCL::allgatherUniqueNCCLIDs() function to include the communicator initialization counter in the store key.
Update the store key generation to use a unique namespace for each communicator initialization round.

Example code changes:

// Illustrative sketch only: getStoreKey, commInitCounter, and the key format
// are hypothetical and do not reproduce the actual code in ProcessGroupNCCL.cpp.
#include <string>

std::string getStoreKey(int rank, int commInitCounter) {
  // Include commInitCounter so each initialization round gets its own keys.
  return "nccl_unique_id_" + std::to_string(rank) + "_" +
      std::to_string(commInitCounter);
}

// ...

// In ProcessGroupNCCL::allgatherUniqueNCCLIDs(), build store keys with the helper:
std::string storeKey = getStoreKey(rank, commInitCounter);

Ensure that the commInitCounter is incremented for each new communicator initialization round.

Verification

To verify the fix, test the scalable communicator initialization path with multiple rounds of initialization and verify that:

The store keys are unique for each round.
The communicator initialization completes successfully without hangs or timeouts.
The retrieved ncclUniqueId values are correct and not stale.
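
The first check can be expressed as a small property test on key generation alone, with no GPUs involved. This is a sketch under assumptions: `make_key`, `reused_keys`, and the key format are hypothetical and only mirror the idea of prefixing keys with a per-communicator counter.

```python
from collections import Counter
from typing import List


def make_key(comm_counter: int, root_idx: int, scoped: bool) -> str:
    """Build a store key, optionally prefixed with the communicator counter."""
    base = f"UniqueNCCLID:{root_idx}"
    return f"{comm_counter}:{base}" if scoped else base


def reused_keys(num_rounds: int, num_roots: int, scoped: bool) -> List[str]:
    """Return every key that would be written in more than one round."""
    counts = Counter(
        make_key(counter, root, scoped)
        for counter in range(num_rounds)
        for root in range(num_roots)
    )
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    print("unscoped reuse:", reused_keys(num_rounds=3, num_roots=4, scoped=False))
    print("scoped reuse:", reused_keys(num_rounds=3, num_roots=4, scoped=True))
```

With counter-scoped keys the list of reused keys is empty. Without scoping, every root's key is shared by all rounds.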

Extra Tips

Review the ProcessGroupNCCL::initNCCLComm() function to ensure that the commInitCounter is properly incremented and passed to the allgatherUniqueNCCLIDs() function.
Consider adding logging or debugging statements to verify the store key generation and ncclUniqueId retrieval.
