pytorch - ✅(Solved) Fix dist.new_group(..., use_local_synchronization=True) hangs with Gloo for overlapping groups, while NCCL succeeds on the same participant-consistent creation order [2 pull requests, 1 participant]

tie-pilot-qxw · 2026-03-20T13:19:03Z



GitHub stats

pytorch/pytorch#177959•Fetched 2026-04-08 01:03:52


Author

tie-pilot-qxw

Participants

tie-pilot-qxw

Timeline (top)

mentioned ×12 · subscribed ×12 · labeled ×3 · cross-referenced ×2

Error Message

There is also an important usability issue here: if this pattern is unsupported, it would be much easier to debug if new_group() raised an explicit error rather than hanging indefinitely.

Root Cause

I understand that the documentation warns that use_local_synchronization=True may deadlock when each rank creates multiple overlapping process groups and global creation order diverges. However, this case is still surprising from the user API perspective, because the ranks that belong to each subgroup do appear to invoke that subgroup creation consistently among themselves.
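To make the difference between participant-local order and a shared global creation history concrete, here is a minimal sketch in plain Python (no torch calls). It uses the group layout from the reproducer in this issue; the helper names creation_history, histories and is_subsequence are made up for illustration. It only lists which groups each rank creates when non-members skip the call, and does not model what Gloo or NCCL do internally.

```python
from __future__ import annotations

# Group layout copied from the reproducer in this issue.
GROUPS: tuple[tuple[str, tuple[int, ...]], ...] = (
    ("g0_r0123", (0, 1, 2, 3)),
    ("g1_r2345", (2, 3, 4, 5)),
    ("g2_r4560", (4, 5, 6, 0)),
    ("g3_r6102", (6, 1, 0, 2)),
)
GLOBAL_ORDER = [name for name, _ in GROUPS]


def creation_history(rank: int) -> list[str]:
    """Groups this rank creates, in order, when non-members skip the call."""
    return [name for name, members in GROUPS if rank in members]


def histories(world_size: int) -> dict[int, list[str]]:
    """Creation history of every rank in the world."""
    return {rank: creation_history(rank) for rank in range(world_size)}


def is_subsequence(seq: list[str], order: list[str]) -> bool:
    """True if seq appears inside order with relative order preserved."""
    remaining = iter(order)
    return all(item in remaining for item in seq)


for rank, seq in histories(7).items():
    print(f"rank {rank}: {seq}")
```

Under these assumptions every rank's list is a subsequence of the same g0..g3 order, yet the ranks do not share one identical history: rank 0 creates three groups starting with g0, while rank 5 creates two starting with g1. Which of the two properties the documented warning refers to is the open question in this issue.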

Fix Action

Fix / Workaround

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
Model name: AMD EPYC 9534 64-Core Processor
BIOS Model name: AMD EPYC 9534 64-Core Processor Unknown CPU @ 2.4GHz
BIOS CPU family: 107
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 58%
CPU max MHz: 3719.5830
CPU min MHz: 1500.0000
BogoMIPS: 4900.01
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 128 MiB (128 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsa: Vulnerable: Clear CPU buffers attempted, no microcode
Vulnerability Tsx async abort: Not affected

PR fix notes

PR #3533: MoE aux loss free

Repository: axolotl-ai-cloud/axolotl
Author: winglian
State: open | merged: False
Link: https://github.com/axolotl-ai-cloud/axolotl/pull/3533

Description (problem / solution / changelog)

Description

Supersedes #3259 since we don't have write access to that PR.

Rebases to current main and includes fixes to use Transformers v5 style MoEs.

/cc @lhl

Summary by CodeRabbit

Release Notes

New Features
- Added aux-loss-free (AFB) MoE routing plugin supporting Mixtral, Qwen, Llama4, Ring MoE, and other architectures
- New configuration parameters: moe_balance_type, moe_update_rate, moe_update_momentum, moe_bias_cap, moe_afb_warmup_steps, moe_bias_sync_group, and expert_parallel_size
- AFB bias support in ScatterMoE and SonicMoE kernel routing backends
Documentation
- Added plugin documentation with configuration examples and telemetry logging guidance
Tests
- Added E2E tests for multiple MoE models and loss parity validation
- Added comprehensive unit tests for adapter functionality

Changed files

src/axolotl/integrations/aux_free_router/README.md (added, +50/-0)
src/axolotl/integrations/aux_free_router/__init__.py (added, +9/-0)
src/axolotl/integrations/aux_free_router/adapters.py (added, +393/-0)
src/axolotl/integrations/aux_free_router/args.py (added, +71/-0)
src/axolotl/integrations/aux_free_router/core.py (added, +166/-0)
src/axolotl/integrations/aux_free_router/plugin.py (added, +267/-0)
src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py (modified, +29/-1)
src/axolotl/integrations/kernels/sonicmoe/routing.py (modified, +59/-6)
src/axolotl/utils/config/__init__.py (modified, +1/-0)
src/axolotl/utils/schemas/config.py (modified, +6/-0)
src/axolotl/utils/schemas/validation.py (modified, +8/-0)
tests/e2e/test_llama4_moe_aux_free.py (added, +75/-0)
tests/e2e/test_moe_aux_free.py (added, +79/-0)
tests/e2e/test_moe_aux_parity.py (added, +91/-0)
tests/e2e/test_qwen3_moe_aux_free.py (added, +76/-0)
tests/e2e/test_ring_moe_aux_free.py (added, +74/-0)
tests/e2e/utils.py (modified, +9/-1)
tests/unit/test_aux_free_adapters.py (added, +666/-0)

PR #2869: feat(mimo): Phase 4 - MiMo training, model/provider, data loading, heterogeneous parallelism

Repository: NVIDIA-NeMo/Megatron-Bridge
Author: aroshanghias-nvd
State: closed | merged: True
Link: https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/2869

Description (problem / solution / changelog)

Summary

Adds MiMo (Multi-Input Multi-Output) training support to Megatron-Bridge, enabling heterogeneous multi-modal model training with independent per-module parallelism.

- MimoModelProvider: ModuleSpec-based model construction with heterogeneous LLaVA support (language model + vision encoder with independent configs)
- Training loop: pretrain_mimo / train_mimo / mimo_step entry points for MiMo-aware training with per-module forward/backward orchestration
- Heterogeneous parallelism: Each module (LLM, vision encoder, etc.) can run with its own TP/PP/DP configuration on a disjoint set of ranks (mimo_parallel_utils)
- Data loading: MiMo-aware collation, dataset, and data loader dispatch routing for multi-modal inputs
- DDP wrapping: Per-module distributed data parallel with rank-aware grid assignment
- Megatron-LM submodule pinned to PR #3212 head
- Full unit test coverage (122 tests)

Phase 5 (checkpointing, evaluation, e2e tests) is stacked in a follow-up PR.

Validation

122 unit tests passed (models/mimo, training/mimo, data/mimo)

Stack

PR1 (this): Phase 4 — training, model, data, parallelism
PR2: Phase 5 — checkpoint save/resume, evaluation, e2e tests

Summary by CodeRabbit

New Features
- Added MIMO (Multi-Instance Model Optimization) training framework enabling heterogeneous multi-module parallelism with dedicated pretraining entry point and training pipeline
- Enhanced data loading infrastructure for multi-module models with loss masking support
Refactor
- Reorganized model infrastructure, process group management, and distributed utilities for improved multi-module training efficiency and flexibility

Changed files

src/megatron/bridge/data/loaders.py (modified, +102/-130)
src/megatron/bridge/data/mimo/__init__.py (modified, +4/-2)
src/megatron/bridge/data/mimo/base_provider.py (added, +63/-0)
src/megatron/bridge/data/mimo/collate.py (modified, +4/-0)
src/megatron/bridge/data/mimo/dataset.py (modified, +23/-2)
src/megatron/bridge/data/mimo/dp_utils.py (modified, +13/-24)
src/megatron/bridge/data/mimo/hf_provider.py (modified, +9/-3)
src/megatron/bridge/data/mimo/loaders.py (modified, +17/-17)
src/megatron/bridge/data/mimo/mock_provider.py (modified, +9/-2)
src/megatron/bridge/models/mimo/llava_provider.py (modified, +1/-4)
src/megatron/bridge/models/mimo/mimo_builder.py (modified, +19/-22)
src/megatron/bridge/models/mimo/mimo_config.py (modified, +62/-37)
src/megatron/bridge/models/mimo/mimo_ddp.py (modified, +10/-5)
src/megatron/bridge/models/mimo/mimo_provider.py (modified, +114/-111)
src/megatron/bridge/training/config.py (modified, +42/-0)
src/megatron/bridge/training/mimo_parallel_utils.py (added, +292/-0)
src/megatron/bridge/training/mimo_step.py (added, +208/-0)
src/megatron/bridge/training/pretrain_mimo.py (added, +102/-0)
src/megatron/bridge/training/setup_mimo.py (added, +359/-0)
src/megatron/bridge/training/train_mimo.py (added, +436/-0)
src/megatron/bridge/training/utils/train_utils.py (modified, +24/-18)
tests/functional_tests/launch_scripts/active/L0_Launch_training_mimo.sh (added, +23/-0)
tests/functional_tests/test_groups/data/test_loaders.py (modified, +17/-11)
tests/functional_tests/test_groups/training/test_pretrain_mimo.py (added, +325/-0)
tests/unit_tests/data/mimo/test_collate.py (modified, +3/-0)
tests/unit_tests/data/mimo/test_dataset.py (modified, +1/-1)
tests/unit_tests/data/mimo/test_dp_utils.py (modified, +39/-22)
tests/unit_tests/data/mimo/test_hf_provider.py (modified, +2/-2)
tests/unit_tests/data/mimo/test_loaders.py (modified, +34/-10)
tests/unit_tests/models/mimo/test_llava_provider.py (modified, +2/-2)
tests/unit_tests/models/mimo/test_mimo_builder.py (modified, +11/-101)
tests/unit_tests/models/mimo/test_mimo_ddp.py (modified, +53/-81)
tests/unit_tests/models/mimo/test_mimo_provider.py (modified, +173/-119)
tests/unit_tests/training/mimo/__init__.py (added, +2/-0)
tests/unit_tests/training/mimo/test_mimo_config.py (modified, +2/-2)
tests/unit_tests/training/mimo/test_mimo_parallel_utils.py (added, +283/-0)
tests/unit_tests/training/mimo/test_mimo_step.py (added, +173/-0)
tests/unit_tests/training/mimo/test_pretrain_mimo.py (added, +170/-0)


🐛 Describe the bug

When creating multiple overlapping process groups with torch.distributed.new_group(..., use_local_synchronization=True), I observe a hang with the gloo backend, even though for each individual group, all participating ranks call new_group() in a consistent order.

The same reproducer completes successfully when using nccl for subgroup creation.

So my questions are:

Is this backend-dependent behavior expected?

Is there an internal requirement stronger than “participant-local creation order is consistent”, e.g. all ranks must share identical global process-group creation history?
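One way to probe the second question is an idealized model in which each new_group() call is nothing more than a blocking barrier among the group's members. The sketch below is plain Python with hypothetical names (simulate, MEMBERS, PLANS); it is not how torch.distributed is implemented, and it says nothing about store keys or connection setup in either backend.

```python
from __future__ import annotations

# Group membership copied from the reproducer in this issue.
MEMBERS: dict[str, tuple[int, ...]] = {
    "g0_r0123": (0, 1, 2, 3),
    "g1_r2345": (2, 3, 4, 5),
    "g2_r4560": (4, 5, 6, 0),
    "g3_r6102": (6, 1, 0, 2),
}

# Each rank walks the groups in the same order and skips those it is not in.
PLANS: dict[int, list[str]] = {
    rank: [g for g, ranks in MEMBERS.items() if rank in ranks]
    for rank in range(7)
}


def simulate(
    plans: dict[int, list[str]], members: dict[str, tuple[int, ...]]
) -> list[str]:
    """Return groups in completion order; raise RuntimeError if ranks get stuck.

    A group completes only when every member has it as its next pending call.
    """
    queues = {rank: list(plan) for rank, plan in plans.items()}
    completed: list[str] = []
    while any(queues.values()):
        heads = sorted({q[0] for q in queues.values() if q})
        ready = [
            g
            for g in heads
            if all(queues[r] and queues[r][0] == g for r in members[g])
        ]
        if not ready:
            waiting = {r: q[0] for r, q in queues.items() if q}
            raise RuntimeError(f"stuck, ranks waiting on: {waiting}")
        for g in ready:
            for r in members[g]:
                queues[r].pop(0)
            completed.append(g)
    return completed


print(simulate(PLANS, MEMBERS))
```

In this model the four groups from the reproducer complete one after another, while a two-rank plan in which the ranks create the same two groups in opposite order gets stuck. That suggests the call ordering by itself is not cyclic, so if the Gloo hang is ordering-related it would have to come from state that this model leaves out.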

Minimal reproducer:

#!/usr/bin/env python3
from __future__ import annotations

import argparse
import datetime
import os
import time
from dataclasses import dataclass

import torch
import torch.distributed as dist


@dataclass(frozen=True)
class Case:
    session_id: str
    participants: tuple[int, ...]


DEFAULT_CASES_7: tuple[Case, ...] = (
    Case("g0_r0123", (0, 1, 2, 3)),
    Case("g1_r2345", (2, 3, 4, 5)),
    Case("g2_r4560", (4, 5, 6, 0)),
    Case("g3_r6102", (6, 1, 0, 2)),
)


def _log(rank: int, msg: str) -> None:
    ts = time.strftime("%H:%M:%S")
    print(f"[{ts}] [rank={rank}] {msg}", flush=True)


def _create_group(
    *,
    rank: int,
    group_name: str,
    participants: tuple[int, ...],
    backend: str,
    use_local_sync: bool,
    timeout_s: int,
) -> None:
    if use_local_sync and rank not in participants:
        _log(rank, f"{group_name}: skip non-participant backend={backend}")
        return
    _log(
        rank,
        f"{group_name}: new_group start ranks={list(participants)} "
        f"backend={backend} use_local_sync={use_local_sync}",
    )
    dist.new_group(
        ranks=list(participants),
        backend=backend,
        use_local_synchronization=use_local_sync,
        timeout=datetime.timedelta(seconds=timeout_s),
    )
    _log(rank, f"{group_name}: new_group done backend={backend}")


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--world-backend", choices=("nccl", "gloo"), default="gloo")
    p.add_argument(
        "--mode",
        choices=("single-gloo", "single-nccl"),
        default="single-gloo",
    )
    p.add_argument("--timeout-s", type=int, default=300)
    args = p.parse_args()

    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", rank))

    if args.world_backend == "nccl":
        torch.cuda.set_device(local_rank)

    dist.init_process_group(
        backend=args.world_backend,
        init_method="env://",
        timeout=datetime.timedelta(seconds=args.timeout_s),
    )
    _log(rank, f"world_ready backend={args.world_backend} world_size={world_size}")

    try:
        subgroup_backend = "gloo" if args.mode == "single-gloo" else "nccl"
        for case in DEFAULT_CASES_7:
            _log(rank, f"{case.session_id}: begin participants={case.participants}")
            _create_group(
                rank=rank,
                group_name=case.session_id,
                participants=case.participants,
                backend=subgroup_backend,
                use_local_sync=True,
                timeout_s=args.timeout_s,
            )
            _log(rank, f"{case.session_id}: done")

        _log(rank, "all_done")
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    main()

Command used:

torchrun --nproc_per_node 7 --master_port 29501 \
  temp.py \
  --world-backend gloo \
  --mode single-gloo

Observed result with Gloo subgroup creation: the program hangs during subgroup creation and never reaches all_done.

Relevant log excerpt:

[12:46:50] [rank=5] g0_r0123: skip non-participant backend=gloo
[12:46:50] [rank=2] g0_r0123: new_group start ranks=[0, 1, 2, 3] backend=gloo use_local_sync=True
[12:46:50] [rank=5] g0_r0123: done
[12:46:50] [rank=5] g1_r2345: begin participants=(2, 3, 4, 5)
[12:46:50] [rank=5] g1_r2345: new_group start ranks=[2, 3, 4, 5] backend=gloo use_local_sync=True

[12:46:50] [rank=4] g1_r2345: new_group start ranks=[2, 3, 4, 5] backend=gloo use_local_sync=True
[12:46:50] [rank=6] g1_r2345: skip non-participant backend=gloo
[12:46:50] [rank=6] g2_r4560: begin participants=(4, 5, 6, 0)
[12:46:50] [rank=6] g2_r4560: new_group start ranks=[4, 5, 6, 0] backend=gloo use_local_sync=True

[12:46:50] [rank=3] g1_r2345: new_group start ranks=[2, 3, 4, 5] backend=gloo use_local_sync=True
[12:46:50] [rank=2] g1_r2345: new_group start ranks=[2, 3, 4, 5] backend=gloo use_local_sync=True

At this point, ranks 2/3/4/5 have all entered new_group() for g1_r2345, but the program still hangs.
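The statement above can be checked mechanically against the excerpt. This sketch parses the log lines quoted in this issue (the line format comes from the reproducer's _log helper) and reports, per group, which ranks logged "new_group start" and which logged "new_group done". The function name group_progress is hypothetical.

```python
from __future__ import annotations

import re

# Log excerpt copied from the Gloo run in this issue.
LOG = """\
[12:46:50] [rank=5] g0_r0123: skip non-participant backend=gloo
[12:46:50] [rank=2] g0_r0123: new_group start ranks=[0, 1, 2, 3] backend=gloo use_local_sync=True
[12:46:50] [rank=5] g0_r0123: done
[12:46:50] [rank=5] g1_r2345: begin participants=(2, 3, 4, 5)
[12:46:50] [rank=5] g1_r2345: new_group start ranks=[2, 3, 4, 5] backend=gloo use_local_sync=True
[12:46:50] [rank=4] g1_r2345: new_group start ranks=[2, 3, 4, 5] backend=gloo use_local_sync=True
[12:46:50] [rank=6] g1_r2345: skip non-participant backend=gloo
[12:46:50] [rank=6] g2_r4560: begin participants=(4, 5, 6, 0)
[12:46:50] [rank=6] g2_r4560: new_group start ranks=[4, 5, 6, 0] backend=gloo use_local_sync=True
[12:46:50] [rank=3] g1_r2345: new_group start ranks=[2, 3, 4, 5] backend=gloo use_local_sync=True
[12:46:50] [rank=2] g1_r2345: new_group start ranks=[2, 3, 4, 5] backend=gloo use_local_sync=True
"""

# Matches only the "new_group start" / "new_group done" events.
EVENT = re.compile(r"\[rank=(\d+)\] (\w+): new_group (start|done)\b")


def group_progress(log: str) -> dict[str, dict[str, set[int]]]:
    """Map group name -> {"start": ranks that entered, "done": ranks that returned}."""
    progress: dict[str, dict[str, set[int]]] = {}
    for rank, group, event in EVENT.findall(log):
        state = progress.setdefault(group, {"start": set(), "done": set()})
        state[event].add(int(rank))
    return progress


for group, state in group_progress(LOG).items():
    print(group, "entered:", sorted(state["start"]), "returned:", sorted(state["done"]))
```

For the excerpt, g1_r2345 shows ranks 2, 3, 4 and 5 as entered and none as returned. The excerpt is partial, so missing "done" lines for other groups may simply not have been quoted.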

Observed result with NCCL subgroup creation on the same pattern: the program completes successfully.

Command:

torchrun --nproc_per_node 7 --master_port 29501 \
  temp.py \
  --world-backend nccl \
  --mode single-nccl

This difference between Gloo and NCCL is what makes me unsure whether this is:

expected but undocumented backend-specific behavior,

a limitation of use_local_synchronization=True,

or a bug / missing validation in the Gloo path.

There is also an important usability issue here: if this pattern is unsupported, it would be much easier to debug if new_group() raised an explicit error rather than hanging indefinitely.
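Until the library raises an error itself, a hang can at least be made visible from user code. The sketch below uses the standard-library faulthandler module to dump every thread's traceback, and optionally exit, if a block runs past a deadline. The context-manager name hang_watchdog and the idea of wrapping dist.new_group() in it are assumptions for illustration, not a PyTorch feature.

```python
from __future__ import annotations

import contextlib
import faulthandler
import sys
from collections.abc import Iterator


@contextlib.contextmanager
def hang_watchdog(timeout_s: float, *, exit_on_timeout: bool = True) -> Iterator[None]:
    """Dump all thread tracebacks if the body runs longer than timeout_s seconds.

    faulthandler runs the timer in its own watchdog thread, so the dump still
    happens when the main thread is blocked inside a native call.
    """
    faulthandler.dump_traceback_later(
        timeout_s,
        exit=exit_on_timeout,  # exit with status 1 after dumping
        file=sys.__stderr__,   # the real stderr, which has a file descriptor
    )
    try:
        yield
    finally:
        # Body finished in time: disarm the pending dump.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical use inside the reproducer's _create_group:
#
#     with hang_watchdog(timeout_s + 30):
#         dist.new_group(ranks=list(participants), backend=backend, ...)
```

With exit_on_timeout=True the stuck rank terminates after printing where it was blocked, which gives the launcher a failing process to report instead of a silent stall.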

Versions

Collecting environment information...
PyTorch version: 2.9.1+cu129
Is debug build: False
CUDA used to build PyTorch: 12.9
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.1
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan 8 2026, 11:30:50) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-156-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.9.86
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.274.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.2
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.4.1
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.17.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] torch==2.9.1+cu129
[pip3] torch_memory_saver==0.0.9
[pip3] torchao==0.9.0
[pip3] torchaudio==2.9.1+cu129
[pip3] torchcodec==0.8.0
[pip3] torchvision==0.24.1+cu129
[pip3] triton==3.5.1
[conda] Could not collect

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan

extent analysis

Fix Plan

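To work around the Gloo hang during subgroup creation, create the overlapping process groups with use_local_synchronization=False (the default). In that mode the PyTorch documentation requires every rank in the main group to call new_group() for every group, in the same order, including ranks that are not members of the group being created. This avoids the pattern the documentation warns about; it does not explain why Gloo and NCCL behave differently, and the issue does not report a run of this workaround.

Here are the steps:

Call _create_group with use_local_sync=False in main(), so that dist.new_group receives use_local_synchronization=False.
Keep the non-participant early return guarded by use_local_sync, as in the reproducer. With use_local_sync=False no rank takes that branch, so non-members also call dist.new_group.
Keep the timeout argument to dist.new_group. It does not cause the hang.

Example code:

def _create_group(
    *,
    rank: int,
    group_name: str,
    participants: tuple[int, ...],
    backend: str,
    use_local_sync: bool,
    timeout_s: int,
) -> None:
    # Non-members may skip the call only under local synchronization.
    # With use_local_sync=False this branch is never taken, so every rank
    # reaches dist.new_group for every group.
    if use_local_sync and rank not in participants:
        _log(rank, f"{group_name}: skip non-participant backend={backend}")
        return
    _log(
        rank,
        f"{group_name}: new_group start ranks={list(participants)} "
        f"backend={backend} use_local_sync={use_local_sync}",
    )
    dist.new_group(
        ranks=list(participants),
        backend=backend,
        use_local_synchronization=use_local_sync,
        timeout=datetime.timedelta(seconds=timeout_s),
    )
    _log(rank, f"{group_name}: new_group done backend={backend}")


# In main(), pass use_local_sync=False instead of True:
_create_group(
    rank=rank,
    group_name=case.session_id,
    participants=case.participants,
    backend=subgroup_backend,
    use_local_sync=False,
    timeout_s=args.timeout_s,
)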

Verification

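To verify the workaround, rerun the Gloo command from the report (torchrun --nproc_per_node 7 with --world-backend gloo --mode single-gloo) against the modified code and check that all 7 ranks log all_done instead of stalling during subgroup creation.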
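A minimal sketch for checking a captured run, assuming the output of all ranks was saved into one string and uses the reproducer's "[HH:MM:SS] [rank=N] message" line format. The helper name ranks_missing_all_done is hypothetical.

```python
from __future__ import annotations

import re

# The reproducer logs "all_done" once per rank after the last group.
ALL_DONE = re.compile(r"\[rank=(\d+)\] all_done\b")


def ranks_missing_all_done(log: str, world_size: int) -> list[int]:
    """Return the ranks (0..world_size-1) that never logged all_done."""
    finished = {int(rank) for rank in ALL_DONE.findall(log)}
    return [rank for rank in range(world_size) if rank not in finished]


# Made-up sample log for a 3-rank run in which rank 1 never finished.
sample = (
    "[12:00:01] [rank=0] all_done\n"
    "[12:00:01] [rank=1] g1_r2345: new_group start ranks=[2, 3, 4, 5]\n"
    "[12:00:02] [rank=2] all_done\n"
)
print(ranks_missing_all_done(sample, world_size=3))
```

An empty list means every rank reached the end of the script; any rank listed is a candidate for the one that is still blocked.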

Extra Tips

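When creating multiple overlapping process groups, prefer the default use_local_synchronization=False and have every rank call new_group() for every group in the same order. The PyTorch documentation notes that use_local_synchronization=True can deadlock when ranks create multiple overlapping groups, and that it mainly pays off on large clusters with small groups.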
Make sure to handle any errors that may occur during subgroup creation, such as rank mismatches or invalid backend configurations.
Consider adding additional logging or debugging statements to help diagnose any issues that may arise during subgroup creation.
