pytorch - ✅(Solved) Fix `device_id` not honored when creating `process_group` with `$TORCH_DISTRIBUTED_DEBUG=DETAIL` [1 pull requests, 9 comments, 5 participants]

pytorch • 2026-04-01 07:52:41


GitHub stats

pytorch/pytorch#178977 • Fetched 2026-04-08 02:22:06

Timeline: 110

Timeline (top): mentioned ×42, subscribed ×42, unsubscribed ×13, commented ×9

PR fix notes

PR #178779: Implement missing methods in `ProcessGroupWrapper`

Repository: pytorch/pytorch
Author: Flamefire
State: closed | merged: False
Link: https://github.com/pytorch/pytorch/pull/178779

Description (problem / solution / changelog)

Most importantly, shutdown is missing, which in the case of the NCCL process group may lead to hangs on termination.

See #178758

<details><summary>Example reproducer:</summary>

import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2

def init_distributed(port, rank, world_size, init_file):
    try:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = port
        os.environ["LOCAL_RANK"] = str(rank)
        os.environ["RANK"] = str(rank)
        os.environ["LOCAL_SIZE"] = str(world_size)
        os.environ["WORLD_SIZE"] = str(world_size)

        dist.init_process_group(
            backend="nccl",
            init_method=f"file://{init_file}",
            rank=rank,
            world_size=world_size,
            device_id=torch.device('cuda', rank)
        )
        dist.barrier()
        print(f"[Rank {rank}] ready")
        dist.destroy_process_group()
    except Exception as e:
        print(rank, "ERROR", e)


def test_main():
    mp.set_start_method("forkserver", force=True)
    pool = mp.Pool(processes=WORLD_SIZE)
    results = pool.starmap_async(
        init_distributed,
        [('51200', rank, WORLD_SIZE, '/tmp/ptinit.file') for rank in range(WORLD_SIZE)],
    )
    results.wait()
    pool.close()
    pool.join()
    print("Finished")


if __name__ == "__main__":
    test_main()

</details>

Note that this shows a related issue: the device passed to dist.init_process_group is not passed from ProcessGroupWrapper to the underlying process group. Hence the underlying process group will try guessing the device and warn about it:

Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()

I don't see an obvious solution to that, so I left it for now.
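The two behaviours described in this PR can be illustrated with a small pure-Python model. This is only a sketch: `ToyBackend`, `ToyWrapper`, and `set_bound_device` are hypothetical names invented for illustration and do not correspond to the actual C++ classes or their methods. It shows a wrapper that delegates `shutdown` to the wrapped backend (the intent of the PR) while keeping the bound device to itself (the related issue that was left open).

```python
class ToyBackend:
    """Hypothetical stand-in for a real backend such as the NCCL process group."""

    def __init__(self):
        self.bound_device = None  # what the backend itself knows about its device
        self.shut_down = False

    def shutdown(self):
        self.shut_down = True


class ToyWrapper:
    """Hypothetical stand-in for a debug wrapper around a backend."""

    def __init__(self, backend):
        self.backend = backend
        self.bound_device = None  # stored on the wrapper only

    def set_bound_device(self, device):
        # Models the reported problem: the value is not forwarded to the backend.
        self.bound_device = device

    def shutdown(self):
        # Models the intent of the PR: delegate to the wrapped backend.
        self.backend.shutdown()


backend = ToyBackend()
wrapper = ToyWrapper(backend)
wrapper.set_bound_device("cuda:1")
wrapper.shutdown()
print(wrapper.bound_device, backend.bound_device, backend.shut_down)
# prints: cuda:1 None True
```

In this model the backend is shut down correctly but never learns which device it was bound to, which is the situation in which a real backend would have to guess.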

Changed files

torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (modified, +78/-0)
torch/csrc/distributed/c10d/ProcessGroupWrapper.hpp (modified, +47/-13)



🐛 Describe the bug

When $TORCH_DISTRIBUTED_DEBUG=DETAIL is set, the ProcessGroup will be wrapped in ProcessGroupWrapper. When dist.init_process_group then sets the device in the process group, it sets it only in the wrapper, not in the real process group.

This can cause hangs as indicated by the warning:

Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
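To make the "heterogeneous mapping" part of the warning concrete, here is a minimal sketch of why a rank-based guess can disagree with the device a rank was actually given. The heuristic `global_rank % local_device_count` is an assumption made for illustration, and `guessed_device` and `find_mismatches` are hypothetical helpers, not PyTorch functions.

```python
def guessed_device(global_rank, local_device_count):
    # Assumed heuristic: derive a local GPU index from the global rank.
    return global_rank % local_device_count


def find_mismatches(assigned, local_device_count):
    """Return the ranks whose guessed GPU differs from the GPU they were assigned."""
    return [
        rank
        for rank, device in sorted(assigned.items())
        if guessed_device(rank, local_device_count) != device
    ]


# rank -> GPU index that the user passed as device_id
homogeneous = {0: 0, 1: 1}    # same mapping as in the reproducer
heterogeneous = {0: 1, 1: 0}  # ranks deliberately swapped

print(find_mismatches(homogeneous, 2))    # []
print(find_mismatches(heterogeneous, 2))  # [0, 1]
```

With the mapping used in the reproducer the guess happens to be right, so only the warning appears. With a swapped mapping every rank would pick a GPU other than the one it was bound to, which is the case the warning says can hang.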

<details><summary>Example reproducer:</summary>

import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # wraps the process group in ProcessGroupWrapper

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2

def init_distributed(port, rank, world_size, init_file):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = port
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_SIZE"] = str(world_size)
    os.environ["WORLD_SIZE"] = str(world_size)

    dist.init_process_group(
        backend="nccl",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
        device_id=torch.device('cuda', rank),
    )
    dist.barrier()  # CAUSES WARNING
    dist.destroy_process_group()


if __name__ == "__main__":
    # The guard is needed with the "forkserver" start method because
    # worker processes re-import this module.
    mp.set_start_method("forkserver", force=True)
    pool = mp.Pool(processes=WORLD_SIZE)
    pool.starmap(
        init_distributed,
        [('51200', rank, WORLD_SIZE, '/tmp/ptinit.file') for rank in range(WORLD_SIZE)],
    )
    pool.close()
    pool.join()

</details>

Versions

The issue has existed since the introduction of the "bound device" in #114916, i.e. 2.3.0, and is still present in the current 2.11.0. Of course, it only affects NCCL & CUDA.
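A quick way to check whether an installed version falls inside the range named in the report is sketched below. `parse_version` and `in_reported_range` are hypothetical helpers. The upper bound only reflects the newest version mentioned in the report, so a result of `False` for a later version means "not covered by the report", not "fixed".

```python
import re


def parse_version(version):
    """Extract (major, minor, patch) from strings such as '2.11.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def in_reported_range(version):
    # Range taken from the report: first seen in 2.3.0, still present in 2.11.0.
    # Versions after 2.11.0 are simply not covered by the report.
    return (2, 3, 0) <= parse_version(version) <= (2, 11, 0)


print(in_reported_range("2.2.2"))         # False
print(in_reported_range("2.11.0+cu128"))  # True
```

In a real environment the argument would typically be `torch.__version__`.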

extent analysis

TL;DR

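With TORCH_DISTRIBUTED_DEBUG=DETAIL and the NCCL backend, the device_id passed to init_process_group is set only on the ProcessGroupWrapper and does not reach the underlying process group. The backend then guesses the device from the global rank, which can hang when the rank-to-GPU mapping is heterogeneous.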

Guidance

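The issue arises because, with TORCH_DISTRIBUTED_DEBUG=DETAIL, the ProcessGroup is wrapped in ProcessGroupWrapper, and init_process_group sets the device only on the wrapper, not on the underlying process group.
Passing device_id to init_process_group does not mitigate this on its own: the reproducer already passes it and still triggers the warning at dist.barrier().
If DETAIL-level debugging is not required, running with a lower level (INFO or OFF) avoids the wrapper and should let device_id take effect. Verify that the warning no longer appears.
The related PR #178779 adds missing methods such as shutdown to ProcessGroupWrapper but, per its description, leaves the device forwarding problem open, so do not assume that it resolves the warning.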
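The guidance above can be sketched as a small helper that detects the DETAIL level and, in that case, names the GPU explicitly when calling `dist.barrier()`, which accepts a `device_ids` argument for the NCCL backend. `wrapper_expected` and `barrier_kwargs` are hypothetical helpers, and treating the environment value case-insensitively is an assumption. Whether an explicit `device_ids` actually avoids the guess under the wrapper is not confirmed in the issue thread, so treat this as an experiment rather than a fix.

```python
import os


def wrapper_expected(environ=None):
    """True when TORCH_DISTRIBUTED_DEBUG is DETAIL, the level the report ties to wrapping."""
    environ = os.environ if environ is None else environ
    # Assumption: the value is compared case-insensitively.
    return environ.get("TORCH_DISTRIBUTED_DEBUG", "OFF").strip().upper() == "DETAIL"


def barrier_kwargs(local_device_index, environ=None):
    """Keyword arguments for dist.barrier(); name the GPU explicitly under DETAIL."""
    if wrapper_expected(environ):
        return {"device_ids": [local_device_index]}
    return {}


print(barrier_kwargs(1, {"TORCH_DISTRIBUTED_DEBUG": "DETAIL"}))  # {'device_ids': [1]}
print(barrier_kwargs(1, {}))                                     # {}
```

In the reproducer this would be used as `dist.barrier(**barrier_kwargs(rank))`.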

Example

dist.init_process_group(
    backend="nccl",
    init_method=f"file://{init_file}",
    rank=rank,
    world_size=world_size,
    device_id=torch.device("cuda", rank),  # the keyword is device_id (singular), as in the reproducer
)

Notes

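This issue is specific to the NCCL backend with CUDA. According to the report, it has been present since version 2.3.0, when the "bound device" was introduced in #114916, and still occurs in 2.11.0.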

Recommendation

Keep passing device_id to init_process_group, but do not rely on it being honored while TORCH_DISTRIBUTED_DEBUG=DETAIL is set. Enable DETAIL only when it is needed, and treat the "Guessing device ID" warning as a sign that the bound device did not reach the NCCL process group.
