vllm - 💡(How to fix) Fix [Bug]: RayExecutorV2 multi-node DP hangs on shm_broadcast — cross-node ranks can't share single-host shared memory

vllm, 2026-05-22 12:01:21

RayExecutorV2 (introduced via PR #36836, "[Feat][Executor] Introduce RayExecutorV2") inherits from MultiprocExecutor and uses shm_broadcast for inter-rank communication. shm_broadcast is single-host shared memory — it has no cross-node path natively. For multi-node DP, vLLM falls back to Gloo TCP for the cross-node bits, which times out after ~30 min:

gloo/transport/tcp/unbound_buffer.cc:78 Timed out waiting 1800000ms for recv operation to complete

The pre-RayExecutorV2 ray_executor.py uses pure Ray RPC for ALL collective operations — Ray handles cross-node natively via its own RPC layer, no shm.

Error Message

INFO ... [shm_broadcast.py:698] No available shared memory broadcast block found in 600 seconds.
    This typically happens when some processes are hanging or doing some
    time-consuming work (e.g. compilation, weight/kv cache quantization).
[repeated every 10 minutes...]
[W ... socket.cpp:764] [c10d] ... (errno: 97 - Address family not supported by protocol).
ERROR ... [core.py:2178] RuntimeError: ... [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78]
    Timed out waiting 1800000ms for recv operation to complete
ERROR ... RayWorkerProc rank=[3] died unexpectedly, shutting down executor.
ray.exceptions.RayTaskError(RuntimeError)

Root Cause

vllm/v1/executor/ray_executor_v2.py:

```python
from vllm.v1.executor.multiproc_executor import (
    FutureWrapper,
    MultiprocExecutor,
    WorkerProc,
)

...

class RayExecutorV2(MultiprocExecutor):
    ...
```

By inheriting from MultiprocExecutor, RayExecutorV2 picks up the shm_broadcast-based inter-worker comm path. shm_broadcast uses POSIX shared memory which is fundamentally per-host. For multi-node DP — where workers on different nodes must exchange tensors during collective ops — there's no shm path; the fallback to Gloo TCP has the failure modes shown above.
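The per-host nature of shared memory can be illustrated with the Python standard library alone. This is a minimal sketch that does not use vLLM's shm_broadcast; it only assumes that attaching to a segment by name is resolved by the local kernel, which is why a process on another node has nothing to attach to.

```python
from multiprocessing import shared_memory

# Create a named shared memory segment on this host.
writer = shared_memory.SharedMemory(create=True, size=16)
try:
    writer.buf[:5] = b"hello"

    # A second handle attaches by name. The name is looked up in the local
    # kernel's namespace, so this only succeeds on the same machine.
    reader = shared_memory.SharedMemory(name=writer.name)
    try:
        payload = bytes(reader.buf[:5])
    finally:
        reader.close()
finally:
    writer.close()
    writer.unlink()  # remove the segment from this host

print(payload)
```

Running the attach step on a second node with the same name would raise FileNotFoundError, since the segment exists only where it was created.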

The legacy vllm/v1/executor/ray_executor.py (which doesn't inherit from MultiprocExecutor) uses Ray RPC for everything. Ray's RPC layer natively handles cross-node transport.

Fix Action

Workaround

Setting VLLM_USE_V2_MODEL_RUNNER=0 forces the legacy ray_executor.py path. Confirmed in our environment:

V2 (VLLM_USE_V2_MODEL_RUNNER=1): job hung at ~12 min, 0 trial progress over 3+ hours, multiple shm_broadcast warnings and a final sample_tokens RPC timeout.
V1 (VLLM_USE_V2_MODEL_RUNNER=0): same model + yaml, zero shm_broadcast warnings, zero sample_tokens timeouts, zero Gloo unbound_buffer timeouts, trials flowing healthily.
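One way to apply the workaround from a launcher script is to set the variable in the child process environment rather than globally. The sketch below reuses the flags from the repro command in this report; the helper name `build_launch` and the address `10.0.0.1` are placeholders for illustration, not part of vLLM.

```python
import os


def build_launch(head_ip):
    """Return (command, environment) for the repro with the workaround applied."""
    env = dict(os.environ)
    # Workaround from this report: opt out of the V2 path.
    env["VLLM_USE_V2_MODEL_RUNNER"] = "0"
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "cyankiwi/MiniMax-M2.7-AWQ-4bit",
        "--tensor-parallel-size", "4",
        "--data-parallel-size", "2",
        "--data-parallel-backend", "ray",
        "--data-parallel-size-local", "1",
        "--data-parallel-address", head_ip,
        "--distributed-executor-backend", "ray",
        "--trust-remote-code",
        "--enforce-eager",
    ]
    return cmd, env


cmd, env = build_launch("10.0.0.1")  # placeholder head-node address
# To actually start the server:
#   import subprocess; subprocess.run(cmd, env=env, check=True)
```

Passing `env` explicitly keeps the setting scoped to this one server process, which makes it easier to remove once multi-node DP works on the V2 path.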

Repro

Run any MoE model with data_parallel_size > 1 spanning multiple nodes, leaving VLLM_USE_V2_MODEL_RUNNER=1 (the default). E.g. MiniMax-M2.7-AWQ-4bit on 2× single-node-TP=4 (DP=2 across two 4×H100/GH200 nodes):

python -m vllm.entrypoints.openai.api_server \
  --model cyankiwi/MiniMax-M2.7-AWQ-4bit \
  --tensor-parallel-size 4 --data-parallel-size 2 \
  --data-parallel-backend ray --data-parallel-size-local 1 \
  --data-parallel-address <head_ip> \
  --distributed-executor-backend ray \
  --trust-remote-code --enforce-eager

The job starts, both DPMoEEngineCoreActor instances are created across both nodes, model loads (~11 min), but the first batch's sample_tokens RPC hangs. vllm.log accumulates:

INFO ... [shm_broadcast.py:698] No available shared memory broadcast block found in 600 seconds.
    This typically happens when some processes are hanging or doing some
    time-consuming work (e.g. compilation, weight/kv cache quantization).
[repeated every 10 minutes...]
[W ... socket.cpp:764] [c10d] ... (errno: 97 - Address family not supported by protocol).
ERROR ... [core.py:2178] RuntimeError: ... [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:78]
    Timed out waiting 1800000ms for recv operation to complete
ERROR ... RayWorkerProc rank=[3] died unexpectedly, shutting down executor.
ray.exceptions.RayTaskError(RuntimeError)

After this, the engine never recovers and trial throughput drops to roughly zero. Tested with enable_expert_parallel set to both true and false: both fail in the same shm_broadcast/Gloo path (with EP off, the bottleneck shifts from the per-token MoE all-to-all to whichever subsequent collective tries to use shm).
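To check whether a given vllm.log shows the same failure, the three signatures above can be counted with a few regular expressions. This is a rough sketch: the patterns are copied from the log excerpt in this report, and the names `SIGNATURES` and `scan` are illustrative.

```python
import re
from collections import Counter

# Patterns taken from the log excerpt in this report.
SIGNATURES = {
    "shm_broadcast_stall": re.compile(
        r"No available shared memory broadcast block found"),
    "gloo_recv_timeout": re.compile(
        r"Timed out waiting (\d+)ms for recv operation"),
    "worker_died": re.compile(
        r"RayWorkerProc rank=\[\d+\] died unexpectedly"),
}


def scan(lines):
    """Count how many lines match each failure signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "INFO ... [shm_broadcast.py:698] No available shared memory "
    "broadcast block found in 600 seconds.",
    "ERROR ... [core.py:2178] RuntimeError: ... unbound_buffer.cc:78]",
    "    Timed out waiting 1800000ms for recv operation to complete",
    "ERROR ... RayWorkerProc rank=[3] died unexpectedly, "
    "shutting down executor.",
]
counts = scan(sample)

# The Gloo timeout in the log is 1800000 ms, i.e. 30 minutes.
timeout_minutes = 1800000 / 60_000
```

For a real log, `scan(open("vllm.log"))` would work the same way. Seeing all three signatures together is consistent with the hang described here, though any one of them alone can have other causes.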

Environment

vLLM commit: 041cfa68e (upstream/main 2026-05-13)
PyTorch: 2.11.0+cu130
aarch64 + CUDA 13 (Jupiter GH200), but the issue is architecturally cross-node, not platform-specific
2 nodes × 4 GPUs each
Model: MiniMax-M2.7 AWQ 4-bit (MoE with 256 experts), but reproduces with any DP>1 multi-node MoE

Fix path suggestions

Short term: Document VLLM_USE_V2_MODEL_RUNNER=0 as the workaround for multi-node DP until V2 supports it.
Medium term: Either (a) make RayExecutorV2 not inherit shm_broadcast for cross-node DP — fall back to Ray RPC for inter-node collectives while keeping shm for intra-node, or (b) gate the V2 selection on a single-node check.
Longer term: Reimplement RayExecutorV2's inter-rank comm using Ray's collective groups / NCCL directly so cross-node DP works without the shm/Gloo dance.
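The medium-term options could look roughly like the following. This is only a sketch of the decision logic under the assumption that the host of each rank is known up front; `choose_executor`, `group_ranks_by_host`, and the returned labels are hypothetical and do not correspond to existing vLLM functions or settings.

```python
def choose_executor(rank_hosts):
    """Option (b): pick the shm-based executor only for single-node layouts.

    rank_hosts is a list where index i holds the hostname or IP of rank i.
    """
    distinct_hosts = set(rank_hosts)
    if not distinct_hosts:
        raise ValueError("rank_hosts must not be empty")
    return "ray_v2" if len(distinct_hosts) == 1 else "ray_legacy"


def group_ranks_by_host(rank_hosts):
    """Option (a): find the rank groups that could keep using shm.

    Ranks in the same group share a host; traffic between groups
    would need a cross-node transport instead.
    """
    groups = {}
    for rank, host in enumerate(rank_hosts):
        groups.setdefault(host, []).append(rank)
    return groups


# Layout from this report: 2 nodes x 4 GPUs each.
layout = ["node-a"] * 4 + ["node-b"] * 4
selected = choose_executor(layout)
groups = group_ranks_by_host(layout)
```

With the two-node layout from the Environment section, the gate selects the legacy path, and the grouping shows two sets of four ranks that could each keep an intra-node shm channel.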

AI-assisted disclosure

This issue write-up was drafted with Claude. The diagnosis and workaround were validated end-to-end against our production workload before posting; no theoretical claims are being made about code paths I did not actually trace through both branches.
