pytorch - ✅(Solved) Fix `device_id` not honored when creating `process_group` with `$TORCH_DISTRIBUTED_DEBUG=DETAIL` [1 pull requests, 9 comments, 5 participants]

GitHub stats
pytorch/pytorch#178977 · Fetched 2026-04-08 02:22:06
Comments: 9 · Participants: 5 · Timeline: 110 · Reactions: 1
Timeline (top): mentioned ×42, subscribed ×42, unsubscribed ×13, commented ×9

PR fix notes

PR #178779: Implement missing methods in ProcessGroupWrapper

Description (problem / solution / changelog)

Most importantly, shutdown is missing, which in the case of the NCCL process group may lead to hangs on termination.

See #178758

<details><summary>Example reproducer:</summary>
import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2

def init_distributed(port, rank, world_size, init_file):
    try:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = port
        os.environ["LOCAL_RANK"] = str(rank)
        os.environ["RANK"] = str(rank)
        os.environ["LOCAL_SIZE"] = str(world_size)
        os.environ["WORLD_SIZE"] = str(world_size)

        dist.init_process_group(
            backend="nccl",
            init_method=f"file://{init_file}",
            rank=rank,
            world_size=world_size,
            device_id=torch.device('cuda', rank)
        )
        dist.barrier()
        print(f"[Rank {rank}] ready")
        dist.destroy_process_group()
    except Exception as e:
        print(rank, "ERROR", e)


def test_main():
    mp.set_start_method("forkserver", force=True)
    pool = mp.Pool(processes=WORLD_SIZE)
    results = pool.starmap_async(
        init_distributed,
        [('51200', rank, WORLD_SIZE, '/tmp/ptinit.file') for rank in range(WORLD_SIZE)],
    )
    results.wait()
    pool.close()
    pool.join()
    print("Finished")


if __name__ == "__main__":
    test_main()
</details>

Note that this shows a related issue: The device passed to dist.init_process_group is not passed from ProcessGroupWrapper to the underlying process group. Hence it will try guessing and warn about it:

Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()

I don't see an obvious solution to that, so I left it for now.
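
To illustrate why a guessed device can diverge from the intended one, here is a minimal sketch. It assumes the fallback derives the device index as `global_rank % local_gpu_count`; that heuristic is an assumption for illustration and is not stated in the PR. The node layouts and helper names are hypothetical.

```python
def guessed_device(global_rank: int, local_gpu_count: int) -> int:
    """Assumed fallback: derive the device index from the global rank."""
    return global_rank % local_gpu_count


def find_mismatches(nodes: list[int]) -> list[tuple[int, int, int]]:
    """Return (global_rank, intended, guessed) for every rank whose guess is wrong.

    `nodes` lists the GPU count per node. Ranks are assigned node by node,
    and each rank is intended to use its local rank as the device index.
    """
    mismatches = []
    global_rank = 0
    for gpu_count in nodes:
        for local_rank in range(gpu_count):
            guess = guessed_device(global_rank, gpu_count)
            if guess != local_rank:
                mismatches.append((global_rank, local_rank, guess))
            global_rank += 1
    return mismatches


if __name__ == "__main__":
    print(find_mismatches([2, 2]))  # two nodes with 2 GPUs each
    print(find_mismatches([2, 4]))  # 2 GPUs on the first node, 4 on the second
```

Under this assumed heuristic the homogeneous layout produces no mismatches, while in the mixed layout every rank on the second node is guessed onto a different GPU than intended. That is the kind of situation the warning describes.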

Changed files

  • torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (modified, +78/-0)
  • torch/csrc/distributed/c10d/ProcessGroupWrapper.hpp (modified, +47/-13)


🐛 Describe the bug

When $TORCH_DISTRIBUTED_DEBUG=DETAIL is set, the ProcessGroup is wrapped in ProcessGroupWrapper. When dist.init_process_group then sets the device on the process group, it sets it only on the wrapper, not on the real process group.

This can cause hangs as indicated by the warning:

Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
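
The mechanism can be modelled with a small pure-Python sketch. `ToyBackend` and `ToyWrapper` are hypothetical stand-ins, not PyTorch classes, and the `rank % gpu_count` fallback is an assumed heuristic. The `forward_device` flag only contrasts the two behaviours; it does not describe how any actual fix is implemented.

```python
from typing import Optional


class ToyBackend:
    """Hypothetical stand-in for the real (wrapped) process group."""

    def __init__(self, rank: int, gpu_count: int) -> None:
        self.rank = rank
        self.gpu_count = gpu_count
        self.bound_device: Optional[int] = None

    def device_for_barrier(self) -> tuple[int, bool]:
        """Return (device index, whether it had to be guessed)."""
        if self.bound_device is not None:
            return self.bound_device, False
        return self.rank % self.gpu_count, True  # assumed fallback heuristic


class ToyWrapper:
    """Hypothetical stand-in for a debug wrapper around the backend."""

    def __init__(self, backend: ToyBackend, forward_device: bool) -> None:
        self.backend = backend
        self.forward_device = forward_device
        self.bound_device: Optional[int] = None

    def set_bound_device(self, device: int) -> None:
        self.bound_device = device  # always stored on the wrapper
        if self.forward_device:
            self.backend.bound_device = device  # optionally forwarded

    def device_for_barrier(self) -> tuple[int, bool]:
        # Collectives are delegated, so only the backend's state matters here.
        return self.backend.device_for_barrier()


if __name__ == "__main__":
    for forward in (False, True):
        pg = ToyWrapper(ToyBackend(rank=5, gpu_count=4), forward_device=forward)
        pg.set_bound_device(3)
        print(forward, pg.device_for_barrier())
```

When the device is kept only on the wrapper, the backend never sees it and falls back to guessing, even though the caller supplied a device explicitly.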

<details><summary>Example reproducer:</summary>
import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2

def init_distributed(port, rank, world_size, init_file):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = port
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_SIZE"] = str(world_size)
    os.environ["WORLD_SIZE"] = str(world_size)

    dist.init_process_group(
        backend="nccl",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
        device_id=torch.device('cuda', rank)
    )
    dist.barrier()  # CAUSES WARNING
    dist.destroy_process_group()


# The guard keeps forkserver workers from re-running the pool setup on import.
if __name__ == "__main__":
    mp.set_start_method("forkserver", force=True)
    pool = mp.Pool(processes=WORLD_SIZE)
    pool.starmap(
        init_distributed,
        [('51200', rank, WORLD_SIZE, '/tmp/ptinit.file') for rank in range(WORLD_SIZE)],
    )
    pool.close()
    pool.join()
</details>

See related PR https://github.com/pytorch/pytorch/pull/178779

Versions

The issue has existed since the introduction of the "bound device" in #114916, i.e. 2.3.0, and is still present in the current 2.11.0. Of course, it only affects NCCL & CUDA.

extent analysis

TL;DR

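With TORCH_DISTRIBUTED_DEBUG=DETAIL and the NCCL backend, the device_id passed to init_process_group is set only on ProcessGroupWrapper and does not reach the underlying process group. NCCL then guesses the device from the global rank, which can hang when the rank-to-GPU mapping is heterogeneous.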

Guidance

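  • With TORCH_DISTRIBUTED_DEBUG=DETAIL the ProcessGroup is wrapped in ProcessGroupWrapper, and the device bound by init_process_group is set only on the wrapper, not on the underlying NCCL process group.
  • Passing device_id to init_process_group, as the warning suggests, is still good practice but is not sufficient here: the reproducer already passes it and the warning still appears.
  • Since the wrapper is only used at the DETAIL level, running without TORCH_DISTRIBUTED_DEBUG=DETAIL should avoid this code path when detailed debugging is not needed.
  • After any change, verify that the warning message no longer appears and that the job runs without hangs.
  • Check whether a release containing a fix is available. The related PR #178779 adds missing ProcessGroupWrapper methods such as shutdown, but its description says the device forwarding was left unresolved for now.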
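
The verification step can be sketched as a small log check. This assumes each rank's output has already been captured as text (how to capture it depends on the launcher and is not covered here). The function name and the sample logs are hypothetical; the warning text is the one quoted in the issue.

```python
GUESS_WARNING = "Guessing device ID based on global rank"


def ranks_with_guess_warning(logs: dict[int, str]) -> list[int]:
    """Return the ranks whose captured output contains the device-guess warning."""
    return sorted(rank for rank, text in logs.items() if GUESS_WARNING in text)


if __name__ == "__main__":
    # Hypothetical captured output for two ranks.
    sample_logs = {
        0: "[Rank 0] ready\n",
        1: "Guessing device ID based on global rank. This can cause a hang "
           "if rank to GPU mapping is heterogeneous.\n[Rank 1] ready\n",
    }
    affected = ranks_with_guess_warning(sample_logs)
    print("ranks still guessing:", affected)
```

An empty result for every rank is the signal to look for after changing the debug level or upgrading.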

Example

dist.init_process_group(
    backend="nccl",
    init_method=f"file://{init_file}",
    rank=rank,
    world_size=world_size,
    device_id=torch.device("cuda", rank),  # the parameter is device_id, not device_ids
)

Notes

This issue is specific to NCCL backend and CUDA, and has been present since version 2.3.0.

Recommendation

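Keep passing device_id to init_process_group, but do not rely on it being honored while TORCH_DISTRIBUTED_DEBUG=DETAIL is set on affected versions. If the rank-to-GPU mapping is heterogeneous, avoid the DETAIL level until a fix that forwards the device to the underlying process group is available, since updating immediately may not be feasible for all users.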
