vllm - 💡(How to fix) Fix [Bug]: MoE EP allgather_reducescatter — divergent collective sequence between TP peers causes ~T+1h NCCL deadlock under steady-state inference

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Root Cause

Within the same forward pass, one TP-peer takes path B while its three peers take path A. Because they post different collective types on the same TP communicator, neither side can complete. The coordinate_batch_across_dp synchronizer (vllm/v1/worker/dp_utils.py:164) does its job — it pads the per-DP-rank token count to the max — but it does not synchronize the per-TP-rank dispatch decision, so TP-peers inside a single DP rank can still diverge.

Fix Action

Fix / Workaround

--all2all-backend allgather_reducescatter MoE dispatch appears to take data-dependent branches in the per-rank EP path that vary the collective sequence across TP-peer ranks within the same forward pass:

  • Path A (normal load): issue all_gather → expert compute → reduce_scatter → final TP all_reduce SUM [n_tokens, hidden_dim]. Last collective shape = [batch, hidden].
  • Path B (degenerate load — e.g. zero or near-zero tokens routed to this rank's local experts): an optimization path that issues only a small metadata all_gatherv (per-expert token counts) and skips the bulk dispatch. Last collective shape = [~ranks, ~experts].

Within the same forward pass, one TP-peer takes path B while its three peers take path A. Because they post different collective types on the same TP communicator, neither side can complete. The coordinate_batch_across_dp synchronizer (vllm/v1/worker/dp_utils.py:164) does its job — it pads the per-DP-rank token count to the max — but it does not synchronize the per-TP-rank dispatch decision, so TP-peers inside a single DP rank can still diverge.

Code Example

vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --data-parallel-backend ray \
  --data-parallel-size-local 1 \
  --enable-expert-parallel \
  --all2all-backend allgather_reducescatter \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.9 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --trust-remote-code

---

GPU utilization (nvidia-smi):
  all 8 GPUs: 100% SM utilization, 0% memory-bandwidth utilization
  ~90 GB VRAM used per GPU (model loaded, normal)

---

Worker main-thread kernel state (all 8 workers):
  state R (running on CPU)
  wchan 0 (not blocked in any syscall)
  ~115 sibling threads each: state S, wchan futex_wait_queue / ep_poll (idle Ray bookkeeping)
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>vLLM environment</summary>
  • vLLM version: 0.1.dev16577+g041cfa68e (local branch built off recent main; the relevant code paths are unmodified upstream)
  • PyTorch: 2.9.x + CUDA 13
  • NCCL: 2.28.x
  • Hardware: 2 nodes × 4× NVIDIA GH200 96 GB (aarch64)
  • Interconnect: NVLink intra-node, IB cross-node (Jupiter at JSC)
  • Model: cyankiwi/MiniMax-M2.7-AWQ-4bit (AWQ 4-bit quantized, ~229B params)
  • Workload: agent trace generation via Daytona sandboxes (--max-num-seqs 32, real LLM-agent traffic, mixed prompt sizes)
</details>

🐛 Describe the bug

TL;DR: Under steady-state MoE inference (no warmup/CUDA-graph involvement), one rank of a TP group diverges from its peers in NCCL collective sequence — issuing an all_gatherv (small metadata payload) while its TP-peers are blocked on an all_reduce SUM (hidden-state fusion). Both sides spin forever; NCCL has no way to match the calls.

This is reproducible at ~T+1h on every run. Different rank diverges each time → data-dependent race, not a fixed code-path bug. We've reproduced cleanly across 5 runs (v4r, v4s, v4t, v4u, v4y) with identical config, capturing both per-rank pynccl traces and /proc-level state at hang time.

Setup

vllm serve cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --data-parallel-backend ray \
  --data-parallel-size-local 1 \
  --enable-expert-parallel \
  --all2all-backend allgather_reducescatter \
  --max-model-len 32768 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.9 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --trust-remote-code

Each DP rank owns one full node (TP=4 within node, all4 GPUs). 2 DP replicas across 2 nodes. EP is on. 32 concurrent agent trials hit the endpoint.

Symptoms

  1. ~T+10–14 min: weight loading completes normally on all workers.
  2. ~T+10 min onward: trials stream, ~2–3 trials/min, healthy throughput.
  3. ~T+55–80 min: [shm_broadcast] No available shared memory broadcast block found in 600 seconds warnings start firing.
  4. Trial completion stalls. No new trials finish.
  5. After more time, the engine declares EngineDeadError. Slurm wallclock eventually times the job out.

Direct diagnostic evidence

We instrumented vllm/distributed/device_communicators/pynccl.py to maintain a per-process trace buffer recording every NCCL collective: op, world_size, rank, input/output sizes, dtype. The buffer is periodically flushed (300 s) to disk so it survives SIGKILL.

Live /proc state at hang time (run v4y, T+1h09):

GPU utilization (nvidia-smi):
  all 8 GPUs: 100% SM utilization, 0% memory-bandwidth utilization
  ~90 GB VRAM used per GPU (model loaded, normal)

100% SM with 0% mem-bw is the classic NCCL spin-wait signature on a never-completing peer recv.

Worker main-thread kernel state (all 8 workers):
  state R (running on CPU)
  wchan 0 (not blocked in any syscall)
  ~115 sibling threads each: state S, wchan futex_wait_queue / ep_poll (idle Ray bookkeeping)

So host-side every worker is busy-polling cudaStreamSynchronize for a GPU NCCL kernel that will never return. The deadlock is on the GPU side; host CPU is hot, not blocked.

Per-rank pynccl trace at hang (one of five reproductions):

HostPIDTP rankDP ws=2 n_callsTP ws=4 n_callsTP last opTP last input shape
node-1252907028878 (+1)1122RedOpType.SUM[10733, 2] (outlier shape)
node-12529071188771123RedOpType.SUM[10733, 3072]
node-12529072388771123RedOpType.SUM[10733, 3072]
node-12529073088771123RedOpType.SUM[10733, 3072]
node-15340843129593 (−17)407 (+17)all_gatherv[14, 256]
node-15340843039610390RedOpType.SUM[15, 2]
node-15340843209610390RedOpType.SUM[15, 2]
node-15340843319610390RedOpType.SUM[15, 2]

Key facts:

  • pid 3408431 (node-15, TP-rank 2 of DP rank 1) is in all_gatherv while its 3 TP-peers are in RedOpType.SUM. These are different NCCL collectives on the same TP communicator — NCCL cannot match them.
  • pid 3408431 has done exactly 17 fewer DP-PG calls and exactly 17 more TP-PG calls than its TP-peers. The same 17-call swap pattern showed up in v4r and v4u reproductions (rank 7 and rank 3 respectively each lagging exactly 17 DP / leading exactly 17 TP). It is always a clean 17-call swap.
  • The lagging rank is different in every run: v4r=global rank 7, v4s=rank 0 of node-37 (DP-1), v4u=rank 3 of node-17 (DP-0), v4y=rank 2 of node-15 (DP-1). No fixed rank-of-group is always the culprit → race condition, data-dependent.
  • Shape [14, 256] and [15, 2] are far too small to be hidden states — they look like per-expert / per-rank metadata payloads. Shape [10733, 3072] matches the post-MoE TP all-reduce of hidden states for that micro-batch.

Root-cause hypothesis

--all2all-backend allgather_reducescatter MoE dispatch appears to take data-dependent branches in the per-rank EP path that vary the collective sequence across TP-peer ranks within the same forward pass:

  • Path A (normal load): issue all_gather → expert compute → reduce_scatter → final TP all_reduce SUM [n_tokens, hidden_dim]. Last collective shape = [batch, hidden].
  • Path B (degenerate load — e.g. zero or near-zero tokens routed to this rank's local experts): an optimization path that issues only a small metadata all_gatherv (per-expert token counts) and skips the bulk dispatch. Last collective shape = [~ranks, ~experts].

Within the same forward pass, one TP-peer takes path B while its three peers take path A. Because they post different collective types on the same TP communicator, neither side can complete. The coordinate_batch_across_dp synchronizer (vllm/v1/worker/dp_utils.py:164) does its job — it pads the per-DP-rank token count to the max — but it does not synchronize the per-TP-rank dispatch decision, so TP-peers inside a single DP rank can still diverge.

The 17-call swap is consistent with this: the "degenerate path" issues fewer DP-PG collectives and more TP-PG collectives than the normal path by exactly that many, and the swap remains stable until eventually one TP-group of peers winds up waiting on a recv from a rank that's posted a different collective.

Why it's a different rank each run

The race surfaces whenever one TP-rank's local experts happen to get a degenerate token-routing distribution for a step. Routing depends on the actual incoming tokens (agent traffic in our case) and the model's current routing weights. Pure data-dependent; entirely random which rank gets unlucky each ~1h.

Why prior fixes don't help

  • coordinate_batch_across_dp (DP token-padding all-reduce) runs per-step but only synchronizes per-DP-rank batch counts; per-TP-rank dispatch divergence happens inside that boundary.
  • --enforce-eager is set; no CUDA-graph involvement.
  • Not an OOM (90 GB used / 96 GB available).
  • Not a network/IB issue (collectives have been completing for an hour before deadlock).
  • Not Ray-related (ray_executor_v2, all actors alive at hang time).

Proposed direction

Two viable fixes for the allgather_reducescatter MoE backend:

  1. Always run the full dispatch sequence regardless of local load. Burns a small amount of bandwidth on degenerate-load steps but eliminates the divergence. Probably the simplest correct fix.
  2. Pad/sentinel the routing decision so every rank always has ≥ 1 token through the full dispatch path. Avoids the wasted bandwidth in #1 but is more invasive.

The same class of bug likely affects the deepep_high_throughput and deepep_low_latency backends. pplx is now removed (commit eb19955c3). Whichever backend lands as the post-pplx canonical needs the always-issue-all-collectives property.

What we can share

  • Full pynccl trace dumps from 5 hang reproductions (5 × 8 worker dumps × ~860 KB each)
  • Live /proc/<pid>/{stack,status,wchan} snapshots at hang time on multiple node-pairs
  • nvidia-smi snapshots showing the 100% SM / 0% mem-bw signature
  • Our pynccl.py instrumentation patch is small and standalone — happy to upstream it as a separate PR if useful

Workaround we are evaluating

Watchdog-driven engine restart on shm-broadcast warning fire — accept the periodic ~1h-MTBF deadlock as a known property and recover via subprocess restart. Not a fix, but enables real workloads while the proper fix is plumbed.


AI assistance disclosure: this issue was drafted with Claude Code assistance. Diagnostics, reproductions, and root-cause analysis were performed across the runs by the submitting human; the writeup was AI-aided for clarity.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: MoE EP allgather_reducescatter — divergent collective sequence between TP peers causes ~T+1h NCCL deadlock under steady-state inference