vllm - Recurring CUDA kernel hang on 2x DGX Spark (GB10, sm_12.1) with MiniMax-M2.7-NVFP4, TP=2 across 2 nodes [1 comment, 2 participants]

vllm-project/vllm#41725 (fetched 2026-05-06 06:15:13)

Environment

  • vLLM version: 0.19.1
  • Model: nvidia/MiniMax-M2.7-NVFP4
  • Hardware: 2x NVIDIA DGX Spark, each with 1x GB10 Superchip (sm_12.1 / CUDA capability 12.1)
  • Interconnect: C2C link (169.254.x.x/16), 900 GB/s, RoCE over enp1s0f0np0 / rocep1s0f0
  • CUDA: 13.0
  • NCCL: 2.28.9+cuda13.0
  • PyTorch: 2.9.0+cu130
  • OS: Ubuntu (kernel 6.17.0-nvidia)

Launch configuration

export NCCL_IB_HCA=rocep1s0f0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export NCCL_P2P_DISABLE=1
export NCCL_NET_GDR_LEVEL=5
export NCCL_DEBUG=WARN
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=120
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=1

# Node 0 shown; on the second node use --node-rank 1 and add --headless.
vllm serve nvidia/MiniMax-M2.7-NVFP4 \
  --trust-remote-code \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  -cc.pass_config.fuse_allreduce_rms=False \
  --tensor-parallel-size 2 \
  --nnodes 2 \
  --node-rank 0 \
  --master-addr 169.254.205.76 \
  --tool-call-parser minimax_m2 \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.80 \
  --enforce-eager \
  --moe-backend marlin \
  --port 8888 \
  --reasoning-parser minimax_m2
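
The only difference between the two nodes' commands is the rank and the --headless flag on the second node. A minimal sketch of how one script could serve both nodes, assuming a hypothetical helper named node_flags (not part of vLLM) whose output is spliced into the vllm serve command line:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of vLLM: print the node-specific flags
# for the two-node launch described above.
node_flags() {
  case "$1" in
    0) printf '%s\n' '--node-rank 0' ;;
    1) printf '%s\n' '--node-rank 1 --headless' ;;
    *) echo "node rank must be 0 or 1, got: $1" >&2; return 1 ;;
  esac
}

# Example use inside a launch script (NODE_RANK is an assumed variable):
#   vllm serve nvidia/MiniMax-M2.7-NVFP4 ... $(node_flags "$NODE_RANK") ...
```

The unquoted command substitution is deliberate so the output splits into separate arguments.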

Symptom

After 35–55 minutes of serving inference requests, both nodes simultaneously enter a state where:

  1. GPU utilization on both nodes locks at 96–99% indefinitely
  2. No errors appear in the NCCL debug log (NCCL_DEBUG=WARN); the log only ever contains the version banner
  3. The vLLM shm_broadcast ring times out and logs one error per minute:
    No available shared memory broadcast block found in 60 seconds.
  4. Once TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC (120 seconds) elapses, the PyTorch NCCL watchdog fires and kills the EngineCore
  5. The API server exits cleanly (exit 0), and Restart=always recovers the service in ~10 minutes

The hang is not detected by NCCL itself: the debug log shows nothing beyond the version banner throughout. Both nodes are simultaneously pegged at high GPU utilization, which argues against a simple network stall in which one side sits idle waiting for the other.
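
The log signature described above (repeated shm_broadcast timeouts followed by EngineDeadError) can be checked mechanically. A minimal sketch, assuming the server output has been captured to a file or pipe; the function names are hypothetical and only the message strings come from the report:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the matched message strings come from the report.

# Count shm_broadcast timeout messages read from stdin.
count_shm_timeouts() {
  grep -c 'No available shared memory broadcast block found' || true
}

# Succeed (exit 0) if stdin shows at least one timeout message
# followed later by an EngineDeadError line.
has_hang_signature() {
  awk '
    /No available shared memory broadcast block found/ { seen = 1 }
    /EngineDeadError/ && seen { found = 1 }
    END { if (found) exit 0; exit 1 }
  '
}

# Example (the log path is an assumption):
#   has_hang_signature < /var/log/vllm.log && echo "hang signature present"
```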

What has been tried (none resolved the hang)

Change | Result
NCCL_IB_HCA=rocep1s0f0 (pin to stable C2C RoCE adapter) | No change
NCCL_IB_DISABLE=1 (force TCP-only) | No change; also made restart harder
Remove --enable-expert-parallel | No change
Remove --moe-backend marlin | Hang occurred faster (~5 min instead of ~45 min)
NCCL_P2P_DISABLE=1 + NCCL_NET_GDR_LEVEL=5 | No change
NetworkManager unmanaged on secondary RoCE interface | Fixed a separate startup crash; unrelated

Crash log excerpt

(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(APIServer) ERROR [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
(APIServer) INFO: Shutting down

The NCCL debug log (/tmp/nccl-<hostname>.log) contains only:

# NCCL version 2.28.9+cuda13.0
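
To confirm that an NCCL log really holds nothing beyond the banner, the check can be scripted. A minimal sketch, assuming banner lines start with '#' as in the excerpt above; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if stdin contains only blank lines and
# '#'-prefixed banner lines (that is, no NCCL warnings or errors).
nccl_log_is_quiet() {
  ! grep -qvE '^[[:space:]]*(#.*)?$'
}

# Example (path pattern taken from the report):
#   nccl_log_is_quiet < "/tmp/nccl-$(hostname).log" && echo "banner only"
```

A quiet log at NCCL_DEBUG=WARN only shows that NCCL raised no warning; it does not by itself show where the hang sits.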

Suspected cause

The hang appears to occur inside a CUDA kernel during the TP forward pass (AllReduce). Because both nodes are simultaneously at high GPU utilization with no NCCL-level error, the deadlock appears to be at the CUDA compute layer rather than in the collective transport. The same hang occurs with and without RDMA transport enabled, which also points away from the transport. One possible factor is that GB10 (sm_12.1) is outside the stated supported compute capability range for the PyTorch version in use (PyTorch 2.9.0 states a maximum supported capability of 12.0).

Closest related issues: #40969 (GB10 CUDA hang after N requests), #33041 (TP=2 hang on Blackwell).
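
The capability mismatch mentioned above can be checked on each node. A minimal sketch, assuming a hypothetical helper named cap_exceeds; the 12.0 limit is the figure quoted in the report, not something this script discovers:

```shell
#!/usr/bin/env bash
# Hypothetical helper: cap_exceeds CAP MAX succeeds if compute capability
# CAP (e.g. 12.1) is strictly greater than MAX (e.g. 12.0).
cap_exceeds() {
  local cap_major=${1%%.*} cap_minor=${1#*.}
  local max_major=${2%%.*} max_minor=${2#*.}
  if [ "$cap_major" -ne "$max_major" ]; then
    [ "$cap_major" -gt "$max_major" ]
  else
    [ "$cap_minor" -gt "$max_minor" ]
  fi
}

# Example (needs a GPU and an nvidia-smi that supports the compute_cap field):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)
#   cap_exceeds "$cap" 12.0 && echo "capability $cap is above 12.0"
# The architectures the installed PyTorch build was compiled for can be listed with:
#   python3 -c 'import torch; print(torch.cuda.get_arch_list())'
```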

Reporter

Shawn Edwards (lesserevil)

extent analysis

TL;DR

No confirmed fix or workaround exists yet. The most promising direction is to look for CUDA kernel deadlocks or compatibility gaps tied to the GB10 Superchip's compute capability (sm_12.1).

Guidance

  • Investigate the compatibility of PyTorch 2.9.0 with the GB10 Superchip's compute capability (sm_12.1) and consider upgrading to a version that supports this capability.
  • Review the CUDA kernel code used in the TP forward pass (AllReduce) to identify potential deadlock scenarios.
  • Transport changes alone are unlikely to help: the reporter already forced TCP-only (NCCL_IB_DISABLE=1) and the hang persisted.
  • Monitor GPU utilization and NCCL debug logs to gather more information about the hang.
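
The utilization monitoring suggested above can be turned into a simple check for the "pegged" state from the report. A minimal sketch, assuming one integer utilization sample per line on stdin; the function name and the sampling loop are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical helper: all_pegged THRESHOLD reads one integer GPU utilization
# sample per line and succeeds only if there is at least one sample and
# every sample is >= THRESHOLD.
all_pegged() {
  awk -v t="$1" '
    NF { n++; if ($1 + 0 < t + 0) low = 1 }
    END { if (n > 0 && !low) exit 0; exit 1 }
  '
}

# Example: take 12 samples, 5 seconds apart, and test against the 96% floor
# from the report (needs nvidia-smi on the node):
#   for i in $(seq 12); do
#     nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
#     sleep 5
#   done | all_pegged 96 && echo "GPU pegged for the whole window"
```

High utilization is also normal under heavy load, so this check is only meaningful when combined with the shm_broadcast timeout messages.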

Example

No code snippet is provided as the issue is related to a complex system configuration and CUDA kernel behavior.

Notes

The issue may be related to previously reported hangs (#40969, a GB10 CUDA hang after N requests, and #33041, a TP=2 hang on Blackwell), and those threads may provide more insight. The fact that the hang occurs with and without RDMA transport enabled suggests that the problem lies in the CUDA compute layer rather than the collective transport.

Recommendation

No workaround is confirmed. Disabling RDMA (NCCL_IB_DISABLE=1) was already tried without effect, so the root cause remains unknown; the next step is to investigate the compute-capability and kernel-level leads above.
