vllm [Bug]: RayExecutorV2 multi-node DP hangs on shm_broadcast: cross-node ranks can't share single-host shared memory

RayExecutorV2 (introduced via PR #36836, "[Feat][Executor] Introduce RayExecutorV2") inherits from MultiprocExecutor and uses shm_broadcast for inter-rank communication. shm_broadcast is single-host shared memory and has no native cross-node path. For multi-node data parallelism (DP), vLLM falls back to Gloo TCP for the cross-node traffic, which times out after 30 minutes (1800000 ms):

gloo/transport/tcp/unbound_buffer.cc:78 Timed out waiting 1800000ms for recv operation to complete

The pre-RayExecutorV2 ray_executor.py uses Ray RPC for all collective operations. Ray handles cross-node transport natively through its own RPC layer, with no shm involved.

Error Message

INFO ... [shm_broadcast.py:698] No available shared memory broadcast block found in 600 seconds.
    This typically happens when some processes are hanging or doing some
    time-consuming work (e.g. compilation, weight/kv cache quantization).
[repeated every 10 minutes...]
[W ... socket.cpp:764] [c10d] ... (errno: 97 - Address family not supported by protocol).
ERROR ... [core.py:2178] RuntimeError: ... [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78]
    Timed out waiting 1800000ms for recv operation to complete
ERROR ... RayWorkerProc rank=[3] died unexpectedly, shutting down executor.
ray.exceptions.RayTaskError(RuntimeError)

Root Cause

vllm/v1/executor/ray_executor_v2.py:

```python
from vllm.v1.executor.multiproc_executor import (
    FutureWrapper,
    MultiprocExecutor,
    WorkerProc,
)

...

class RayExecutorV2(MultiprocExecutor):
    ...
```

By inheriting from MultiprocExecutor, RayExecutorV2 picks up the shm_broadcast-based inter-worker comm path. shm_broadcast uses POSIX shared memory, which is fundamentally per-host. For multi-node DP, where workers on different nodes must exchange tensors during collective ops, there is no shm path; the fallback to Gloo TCP has the failure modes shown above.

The legacy vllm/v1/executor/ray_executor.py (which doesn't inherit from MultiprocExecutor) uses Ray RPC for everything. Ray's RPC layer natively handles cross-node transport.
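To make the per-host limitation concrete, here is a minimal sketch using Python's standard `multiprocessing.shared_memory` module. It is not vLLM's shm_broadcast code; the helper names (`write_segment`, `read_segment`, `attach_fails`) are made up for illustration. A segment is addressed only by a name in the local host's namespace, so a process that cannot see that name (simulated here by unlinking the segment) gets `FileNotFoundError`, which is what a rank on another node would run into.

```python
from multiprocessing import shared_memory


def write_segment(payload: bytes) -> shared_memory.SharedMemory:
    """Create a named shared-memory segment on this host and copy payload in."""
    seg = shared_memory.SharedMemory(create=True, size=len(payload))
    seg.buf[: len(payload)] = payload
    return seg


def read_segment(name: str, size: int) -> bytes:
    """Attach to an existing segment by name; the name only resolves locally."""
    seg = shared_memory.SharedMemory(name=name)
    try:
        return bytes(seg.buf[:size])
    finally:
        seg.close()


def attach_fails(name: str) -> bool:
    """Return True if no segment with this name is visible from this process."""
    try:
        seg = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return True
    seg.close()
    return False


if __name__ == "__main__":
    payload = b"rank-0 broadcast"
    seg = write_segment(payload)
    try:
        # Same host: a reader can attach by name and see the writer's bytes.
        print(read_segment(seg.name, len(payload)))
    finally:
        seg.close()
        seg.unlink()
    # Once the name is gone (or was never on this host), attaching fails.
    print(attach_fails(seg.name))
```

There is no network address anywhere in that API, which is why any cross-node exchange has to go through a different transport.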

Fix Action

Workaround

Setting VLLM_USE_V2_MODEL_RUNNER=0 forces the legacy ray_executor.py path. Confirmed in our environment:

  • V2 (VLLM_USE_V2_MODEL_RUNNER=1): job hung at ~12 min, 0 trial progress over 3+ hours, multiple shm_broadcast warnings and a final sample_tokens RPC timeout.
  • V1 (VLLM_USE_V2_MODEL_RUNNER=0): same model + yaml, zero shm_broadcast warnings, zero sample_tokens timeouts, zero Gloo unbound_buffer timeouts, trials flowing healthily.
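One way to apply the workaround from a launcher script is sketched below. The environment variable, model name, and flags are taken from this report; the helper name `build_launch` is hypothetical, and the report does not say whether the variable also has to be exported on the worker nodes, so treat that as something to verify in your own setup.

```python
import os


def build_launch(head_ip: str, use_v2_runner: bool = False) -> tuple[dict[str, str], list[str]]:
    """Build the environment and argv for the repro command from this report."""
    env = dict(os.environ)
    # "0" selects the legacy path according to the report; "1" is the default.
    env["VLLM_USE_V2_MODEL_RUNNER"] = "1" if use_v2_runner else "0"
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "cyankiwi/MiniMax-M2.7-AWQ-4bit",
        "--tensor-parallel-size", "4",
        "--data-parallel-size", "2",
        "--data-parallel-backend", "ray",
        "--data-parallel-size-local", "1",
        "--data-parallel-address", head_ip,
        "--distributed-executor-backend", "ray",
        "--trust-remote-code",
        "--enforce-eager",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch("<head_ip>")
    # To actually start the server: subprocess.run(argv, env=env, check=True)
    print(env["VLLM_USE_V2_MODEL_RUNNER"], " ".join(argv))
```

Keeping the toggle in one place makes it easy to flip back to the default once multi-node DP works on the V2 path.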

Repro

Run any MoE model with data_parallel_size > 1 spanning multiple nodes, leaving VLLM_USE_V2_MODEL_RUNNER=1 (the default). For example, MiniMax-M2.7-AWQ-4bit with TP=4 inside each node and DP=2 across two 4-GPU H100/GH200 nodes:

python -m vllm.entrypoints.openai.api_server \
  --model cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 4 --data-parallel-size 2 \
  --data-parallel-backend ray --data-parallel-size-local 1 \
  --data-parallel-address <head_ip> \
  --distributed-executor-backend ray \
  --trust-remote-code --enforce-eager

The job starts, both DPMoEEngineCoreActor instances are created (one per node), and the model loads (~11 min), but the first batch's sample_tokens RPC hangs. vllm.log accumulates:

INFO ... [shm_broadcast.py:698] No available shared memory broadcast block found in 600 seconds.
    This typically happens when some processes are hanging or doing some
    time-consuming work (e.g. compilation, weight/kv cache quantization).
[repeated every 10 minutes...]
[W ... socket.cpp:764] [c10d] ... (errno: 97 - Address family not supported by protocol).
ERROR ... [core.py:2178] RuntimeError: ... [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78]
    Timed out waiting 1800000ms for recv operation to complete
ERROR ... RayWorkerProc rank=[3] died unexpectedly, shutting down executor.
ray.exceptions.RayTaskError(RuntimeError)

After this, the engine never recovers and trial throughput drops to roughly zero. Tested with enable_expert_parallel set to both true and false: both fail in the same shm_broadcast/Gloo path (turning EP off shifts the bottleneck from the per-token MoE all-to-all to whatever subsequent collective tries to use shm).
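If you want to check whether a hung job matches this report, a small log scan along these lines may help. The three patterns are copied from the log excerpt above; the function names and the idea of counting matches are illustrative assumptions, not part of vLLM.

```python
import re

# Message fragments taken from the log excerpt in this report.
SIGNATURES = {
    "shm_broadcast_stall": re.compile(
        r"No available shared memory broadcast block found in (\d+) seconds"
    ),
    "gloo_recv_timeout": re.compile(
        r"Timed out waiting (\d+)ms for recv operation to complete"
    ),
    "worker_died": re.compile(r"RayWorkerProc rank=\[(\d+)\] died unexpectedly"),
}


def scan_log(lines):
    """Count how many lines match each known failure signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def gloo_timeout_minutes(line):
    """Return the Gloo recv timeout in minutes, or None if the line has none."""
    match = SIGNATURES["gloo_recv_timeout"].search(line)
    return int(match.group(1)) / 60_000 if match else None


if __name__ == "__main__":
    with open("vllm.log", encoding="utf-8", errors="replace") as handle:
        print(scan_log(handle))
```

Repeated `shm_broadcast_stall` hits followed by one `gloo_recv_timeout` and a `worker_died` line is the sequence described in this issue.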

Environment

  • vLLM commit: 041cfa68e (upstream/main 2026-05-13)
  • PyTorch: 2.11.0+cu130
  • aarch64 + CUDA 13 (Jupiter GH200), but the issue is architecturally cross-node, not platform-specific
  • 2 nodes × 4 GPUs each
  • Model: MiniMax-M2.7 AWQ 4-bit (MoE with 256 experts), but reproduces with any DP>1 multi-node MoE

Fix path suggestions

  1. Short term: Document VLLM_USE_V2_MODEL_RUNNER=0 as the workaround for multi-node DP until V2 supports it.
  2. Medium term: Either (a) stop RayExecutorV2 from using shm_broadcast for cross-node DP, falling back to Ray RPC for inter-node collectives while keeping shm for intra-node ones, or (b) gate the V2 selection on a single-node check.
  3. Longer term: Reimplement RayExecutorV2's inter-rank comm using Ray's collective groups / NCCL directly so cross-node DP works without the shm/Gloo dance.
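The medium-term options can be sketched in a few lines, assuming the executor knows which node each rank lives on. Everything here is hypothetical: the rank-to-node mapping, the function names, and the `"shm"`/`"rpc"` labels are placeholders for illustration and do not correspond to existing vLLM APIs.

```python
from collections import defaultdict


def group_ranks_by_node(rank_to_node: dict[int, str]) -> dict[str, list[int]]:
    """Group ranks by the node they run on (node IDs are opaque strings)."""
    groups: dict[str, list[int]] = defaultdict(list)
    for rank, node in sorted(rank_to_node.items()):
        groups[node].append(rank)
    return dict(groups)


def is_single_node(rank_to_node: dict[int, str]) -> bool:
    """Option (b): only allow the shm-based executor when all ranks share a node."""
    return len(set(rank_to_node.values())) <= 1


def pick_transport(rank_to_node: dict[int, str], src: int, dst: int) -> str:
    """Option (a): shared memory within a node, RPC between nodes."""
    return "shm" if rank_to_node[src] == rank_to_node[dst] else "rpc"


if __name__ == "__main__":
    # Layout from this report: 2 nodes with 4 ranks each.
    layout = {rank: ("node-a" if rank < 4 else "node-b") for rank in range(8)}
    print(group_ranks_by_node(layout))
    print(is_single_node(layout))
    print(pick_transport(layout, 0, 3), pick_transport(layout, 3, 4))
```

Option (b) is the smaller change since it only needs the single-node predicate at executor selection time; option (a) needs the per-pair transport decision on every collective.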

AI-assisted disclosure

This issue write-up was drafted with Claude. The diagnosis and workaround were validated end-to-end against our production workload before posting; no theoretical claims are being made about code paths I did not actually trace through both branches.
