vllm - 💡(How to fix) Fix Recurring CUDA kernel hang on 2x DGX Spark (GB10, sm_12.1) with MiniMax-M2.7-NVFP4, TP=2 across 2 nodes [1 comments, 2 participants]

Error Message

(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(APIServer) ERROR [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
(APIServer) INFO: Shutting down

Root Cause

The hang appears to occur inside a CUDA kernel during the tensor-parallel (TP) forward pass (AllReduce). Both nodes sit at high GPU utilization at the same time and NCCL reports no error, which suggests the deadlock is in the CUDA compute layer rather than in the collective transport. One possible factor is that GB10 (sm_12.1) falls outside the compute capability range stated for the PyTorch version in use (PyTorch 2.9.0 states a maximum supported capability of 12.0). The hang occurs both with and without RDMA transport enabled, which also points away from the transport.


Environment

vLLM version: 0.19.1
Model: nvidia/MiniMax-M2.7-NVFP4
Hardware: 2x NVIDIA DGX Spark, each with 1x GB10 Superchip (sm_12.1 / CUDA capability 12.1)
Interconnect: C2C link (169.254.x.x/16), 900 GB/s, RoCE over enp1s0f0np0 / rocep1s0f0
CUDA: 13.0
NCCL: 2.28.9+cuda13.0
PyTorch: 2.9.0+cu130
OS: Ubuntu (kernel 6.17.0-nvidia)

Launch configuration

export NCCL_IB_HCA=rocep1s0f0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export NCCL_P2P_DISABLE=1
export NCCL_NET_GDR_LEVEL=5
export NCCL_DEBUG=WARN
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=120
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=1

# On the second node, use --node-rank 1 and add --headless.
vllm serve nvidia/MiniMax-M2.7-NVFP4 \
  --trust-remote-code \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  -cc.pass_config.fuse_allreduce_rms=False \
  --tensor-parallel-size 2 \
  --nnodes 2 \
  --node-rank 0 \
  --master-addr 169.254.205.76 \
  --tool-call-parser minimax_m2 \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.80 \
  --enforce-eager \
  --moe-backend marlin \
  --port 8888 \
  --reasoning-parser minimax_m2
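
According to the comment in the launch command, the two nodes differ only in `--node-rank` and in `--headless` on the second node. A minimal sketch of deriving those flags from one variable, so both nodes could share a launch script; `node_flags` and `NODE_RANK` are hypothetical names, not part of vLLM:

```shell
# Hypothetical helper: print the per-node flags for the 2-node launch.
# Rank 0 hosts the API server; rank 1 runs headless, per the report.
node_flags() {
  case "$1" in
    0) echo "--node-rank 0" ;;
    1) echo "--node-rank 1 --headless" ;;
    *) echo "unsupported node rank: $1" >&2; return 1 ;;
  esac
}

# Usage (assumes NODE_RANK is exported on each node before launch):
#   vllm serve nvidia/MiniMax-M2.7-NVFP4 --nnodes 2 $(node_flags "$NODE_RANK") ...
```

The remaining flags stay identical on both nodes, which makes accidental configuration drift between them less likely.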

Symptom

After 35–55 minutes of serving inference requests, both nodes simultaneously enter a state where:

- GPU utilization on both nodes locks at 96–99% indefinitely.
- No errors appear in the NCCL debug log (NCCL_DEBUG=WARN); the log only ever contains the version banner.
- The vLLM shm_broadcast ring times out and logs one error per minute:
  ```
  No available shared memory broadcast block found in 60 seconds.
  ```
- Once the TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC window (120 seconds) elapses, the PyTorch NCCL watchdog fires and kills the EngineCore.
- The API server exits cleanly (exit 0), and Restart=always recovers the service in ~10 minutes.

The hang is not detected by NCCL itself — the debug log stays empty throughout. Both nodes are simultaneously pegged at high GPU utilization, which rules out a simple network stall (one side waiting for the other).
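
Because NCCL reports nothing, the repeating `shm_broadcast` timeout is the earliest visible signal of the hang. A minimal sketch of counting those lines in a captured log so that an external monitor could react before the watchdog does; `count_shm_timeouts`, `looks_hung`, and the threshold are hypothetical, and the match string is copied from the log excerpt in this report:

```shell
# Print the number of shm_broadcast timeout lines in the log file $1.
count_shm_timeouts() {
  # grep exits 1 when nothing matches; keep the printed count and ignore that status.
  grep -c 'No available shared memory broadcast block found' "$1" || true
}

# Succeed (exit 0) once at least $2 timeout lines are present in file $1.
looks_hung() {
  [ "$(count_shm_timeouts "$1")" -ge "$2" ]
}
```

For example, `looks_hung /path/to/vllm.log 2` (path hypothetical) would succeed once two timeout lines have been written. The count covers the whole file rather than consecutive lines, so the log would need to be rotated or truncated after each restart.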

What has been tried (none resolved the hang)

| Change | Result |
| --- | --- |
| `NCCL_IB_HCA=rocep1s0f0` (pin to stable C2C RoCE adapter) | No change |
| `NCCL_IB_DISABLE=1` (force TCP-only) | No change; also made restart harder |
| Remove `--enable-expert-parallel` | No change |
| Remove `--moe-backend marlin` | Hang occurred faster (~5 min instead of ~45 min) |
| `NCCL_P2P_DISABLE=1` + `NCCL_NET_GDR_LEVEL=5` | No change |
| NetworkManager unmanaged on secondary RoCE interface | Fixed a separate startup crash; unrelated |

Crash log excerpt

(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(APIServer) ERROR [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
(APIServer) INFO: Shutting down

The NCCL debug log (/tmp/nccl-<hostname>.log) contains only:

# NCCL version 2.28.9+cuda13.0
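
At `NCCL_DEBUG=WARN`, NCCL logs only warnings and errors, so a banner-only file shows that NCCL raised no error but not which collective was in flight. A hedged sketch of a more verbose configuration for the next reproduction; the subsystem selection is an assumption about what would be useful here, and the file pattern assumes the per-host log naming used above (NCCL expands `%h` to the hostname):

```shell
# More verbose NCCL logging for one diagnostic run (expect much larger logs).
export NCCL_DEBUG=INFO
# Limit INFO output to initialization, network, and collective-call messages.
export NCCL_DEBUG_SUBSYS=INIT,NET,COLL
# One log file per host; NCCL expands %h to the hostname.
export NCCL_DEBUG_FILE=/tmp/nccl-%h.log
```

If the last logged collective differs between the two nodes, that would point back toward a rank mismatch; if both nodes stop at the same call, the compute-layer hypothesis becomes more plausible.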

Suspected cause

Closest related issues: #40969 (GB10 CUDA hang after N requests), #33041 (TP=2 hang on Blackwell).

Reporter

Shawn Edwards (lesserevil)

extent analysis

TL;DR

No confirmed fix exists yet. The leading hypothesis is a CUDA kernel deadlock or compatibility problem tied to the GB10 Superchip's compute capability (sm_12.1), so investigation should start there.

Guidance

Check whether PyTorch 2.9.0 fully supports the GB10 compute capability (sm_12.1). It states a maximum supported capability of 12.0, so consider testing a build that lists 12.1.
Review the kernels that run in the TP forward pass around AllReduce for possible deadlock scenarios.
Transport changes have already been tried without effect (forcing TCP with `NCCL_IB_DISABLE=1`, pinning the RoCE adapter), so further transport switching is unlikely to isolate the issue.
Raise NCCL logging above WARN and record GPU utilization during the next reproduction to capture what was in flight when the hang began.
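
The compute capability question in the guidance can be checked mechanically. A minimal sketch comparing a device capability against the maximum a framework build claims to support; `cap_exceeds` is a hypothetical helper that assumes both values are written as `major.minor`, and 12.1 and 12.0 are the values quoted in this report:

```shell
# Succeed (exit 0) if capability $1 (e.g. 12.1) is newer than maximum $2 (e.g. 12.0).
cap_exceeds() {
  dev_major=${1%%.*}; dev_minor=${1#*.}
  max_major=${2%%.*}; max_minor=${2#*.}
  [ "$dev_major" -gt "$max_major" ] ||
    { [ "$dev_major" -eq "$max_major" ] && [ "$dev_minor" -gt "$max_minor" ]; }
}

if cap_exceeds 12.1 12.0; then
  echo "device capability is above the stated maximum"
fi
```

On a live node, the device value could be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` (assuming a driver recent enough to expose that field), and the architectures compiled into the installed PyTorch with `python -c 'import torch; print(torch.cuda.get_arch_list())'`.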

Example

No code fix is provided, because the root cause is unconfirmed and depends on the system configuration and on CUDA kernel behavior.

Notes

The issue may be related to previously reported hangs: #40969 (GB10 CUDA hang after N requests) and #33041 (TP=2 hang on Blackwell). Reviewing those issues may provide more insight. The hang occurs with and without RDMA transport enabled, which suggests the problem lies in the CUDA compute layer rather than in the collective transport. Removing `--moe-backend marlin` shortened the time to hang from ~45 minutes to ~5 minutes, so the choice of MoE kernel path may also influence it.

Recommendation

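The root cause is still unknown, and no workaround has been confirmed. Because the transport-level changes tried so far (forcing TCP, pinning the RoCE adapter) did not help, prioritize compute-side checks: test a PyTorch and vLLM build that lists sm_12.1 support, and compare MoE backends, since removing `--moe-backend marlin` changed the time to hang.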
