vllm - 💡(How to fix) Fix [Bug]: TP Worker hang causes EngineDeadError - RPC call to sample_tokens timed out (DeepSeek-V4-Pro, TP=8, MTP speculative decoding) [1 comments, 2 participants]

GitHub stats
vllm-project/vllm#41530 (fetched 2026-05-04 04:59:02)
Comments: 1 · Participants: 2 · Reactions: 0
Timeline events: 3 (commented ×1, labeled ×1, subscribed ×1)

Error Message

At 00:12:21 the engine logs "TimeoutError: RPC call to sample_tokens timed out.", followed by EngineDeadError, and in-flight requests return 500 Internal Server Error.

  • The worker processes appear to hang silently: there is no crash or exception on the worker side, just unresponsiveness on the shared memory channel.

Root Cause

The issue body does not identify a root cause; it describes the failure mode. TP worker processes hang for ~5 minutes, the sample_tokens RPC call times out because no shared memory broadcast block becomes available (the workers are unresponsive), and the vLLM engine then crashes with EngineDeadError.

🐛 Describe the bug

Bug Description

vLLM engine crashes with EngineDeadError after TP worker processes hang for ~5 minutes. The sample_tokens RPC call times out because shared memory broadcast blocks are unavailable — workers are unresponsive.

Environment

  • vLLM version: v0.1.dev15830+g8d599d76a
  • Python version: 3.12
  • CUDA version: 13.0 (driver), NCCL 2.28.9
  • OS: Linux
  • GPU: 8x NVIDIA GPU (Tensor Parallel=8)
  • Model: deepseek-ai/DeepSeek-V4-Pro/
  • Speculative decoding: MTP (method='mtp', num_speculative_tokens=1)
  • Quantization: deepseek_v4_fp8
  • KV cache dtype: fp8
  • Tokenizer mode: deepseek_v4
  • max_seq_len: 800,000
  • GPU memory utilization: 0.95
  • enable_expert_parallel: True
  • CUDAGraph mode: FULL_DECODE_ONLY

Reproduction

The issue occurs intermittently during normal inference serving. The engine runs fine for a period (hours),
then a worker process hangs, causing a cascading failure:

  1. Engine is idle or serving low traffic (0-2 concurrent requests)
  2. Worker processes stop responding to shared memory broadcast
  3. After ~5 minutes of repeated warnings, sample_tokens RPC times out
  4. Engine crashes with EngineDeadError, all in-flight requests return 500
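Because the failure only surfaces to clients in step 4, an external watchdog is one way to notice it and react. The following is a minimal sketch, not part of vLLM: `http_probe`, `wait_until_unhealthy`, and the URL in the usage comment are hypothetical names chosen for illustration, and the thresholds are arbitrary.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_unhealthy(
    probe: Callable[[], bool],
    max_failures: int = 3,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until `probe` fails max_failures times in a row.

    Returns the total number of probes made. A single success resets
    the failure counter, so isolated slow responses are tolerated.
    """
    failures = 0
    probes = 0
    while failures < max_failures:
        probes += 1
        failures = 0 if probe() else failures + 1
        if failures < max_failures:
            sleep(interval_s)
    return probes


# Hypothetical usage (the URL is an assumption about the deployment):
#   wait_until_unhealthy(lambda: http_probe("http://localhost:8000/health"))
#   ...then restart the server or page an operator.
```

A probe that exercises the model, such as a very small completion request with a short timeout, is more likely to notice a stalled worker early than a simple liveness endpoint, which may keep answering until the engine has already died.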

Error Logs

Timeline:
00:07:34 - Engine idle (0 reqs running, 0 waiting)
00:08:22 - "No available shared memory broadcast block found in 60 seconds"
00:09:22 - Same warning repeated
00:10:22 - Same warning repeated
00:11:22 - Same warning repeated
00:12:21 - Fatal: TimeoutError: RPC call to sample_tokens timed out
00:12:21 - EngineDeadError, 500 Internal Server Error
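
The timeline implies how long the engine waited before giving up. A minimal sketch of that arithmetic, assuming the wait began 60 seconds before the first warning (the warning text reports a 60-second wait; the actual start is not logged in the issue):

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"


def parse(ts: str) -> datetime:
    """Parse an HH:MM:SS timestamp from the log timeline."""
    return datetime.strptime(ts, FMT)


# Timestamps copied from the timeline in the issue.
warnings = ["00:08:22", "00:09:22", "00:10:22", "00:11:22"]
fatal = "00:12:21"

# Assumption: the first warning fires after 60 seconds of waiting.
wait_start = parse(warnings[0]) - timedelta(seconds=60)
stall_seconds = (parse(fatal) - wait_start).total_seconds()

# Gaps between consecutive warnings, in seconds.
warning_gaps = [
    (parse(b) - parse(a)).total_seconds()
    for a, b in zip(warnings, warnings[1:])
]

print(wait_start.strftime(FMT), stall_seconds, warning_gaps)
# → 00:07:22 299.0 [60.0, 60.0, 60.0]
```

Under that assumption the stall lasted 299 seconds, which matches the "~5 minutes" in the description and would be consistent with an RPC timeout of roughly 300 seconds. The issue does not state the configured timeout value.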

Stack trace:
Traceback (most recent call last):
  File "vllm/v1/executor/multiproc_executor.py", line 395, in get_response
    status, result = mq.dequeue(timeout=dequeue_timeout)
  File "vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
    with self.acquire_read(timeout, indefinite) as buf:
  File "vllm/distributed/device_communicators/shm_broadcast.py", line 674, in acquire_read
    self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
  File "vllm/distributed/device_communicators/shm_broadcast.py", line 631, in timeout_ms
    raise TimeoutError
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "vllm/v1/engine/core.py", line 1103, in run_engine_core
    engine_core.run_busy_loop()
  File "vllm/v1/engine/core.py", line 1144, in run_busy_loop
    self._process_engine_step()
  File "vllm/v1/engine/core.py", line 1183, in _process_engine_step
    outputs, model_executed = self.step_fn()
  File "vllm/v1/engine/core.py", line 501, in step_with_batch_queue
    model_output = future.result()
  File "vllm/v1/executor/multiproc_executor.py", line 93, in _wait_for_response
    response = self.aggregate(self.get_response())
  File "vllm/v1/executor/multiproc_executor.py", line 397, in get_response
    raise TimeoutError(f"RPC call to {method} timed out.") from e
TimeoutError: RPC call to sample_tokens timed out.
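
The two-part traceback comes from Python exception chaining: the low-level timeout is caught and re-raised with `raise ... from e`. The sketch below reproduces that pattern with the standard library only. It is not vLLM's implementation: `queue.Queue` stands in for the shared memory message queue, and this `get_response` is a simplified, hypothetical stand-in for the method named in the trace.

```python
import queue


def get_response(mq: queue.Queue, method: str, timeout_s: float):
    """Wait for one worker reply, or raise a chained TimeoutError."""
    try:
        return mq.get(timeout=timeout_s)
    except queue.Empty as e:
        # 'from e' stores the low-level error in __cause__, which is what
        # makes Python print "The above exception was the direct cause of
        # the following exception" between the two tracebacks.
        raise TimeoutError(f"RPC call to {method} timed out.") from e


# Stand-in for a hung worker: nothing is ever put on the queue.
replies: queue.Queue = queue.Queue()
try:
    get_response(replies, "sample_tokens", timeout_s=0.05)
except TimeoutError as err:
    print(err)
    print(type(err.__cause__).__name__)
```

Read this way, the second traceback adds no new failure: it only names the RPC (sample_tokens) that was waiting when the read on the queue timed out.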

Scheduler Stats at Crash Time

num_running_reqs=1, num_waiting_reqs=0
kv_cache_usage=0.44%
prefix_cache_stats: queries=11645, hits=11264 (hit rate ~96.7%)
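
The reported hit rate can be checked directly from the two counters. A short sketch using only the numbers quoted in the stats line:

```python
# Counters copied from prefix_cache_stats in the issue.
queries = 11645
hits = 11264

misses = queries - hits
hit_rate_pct = round(100 * hits / queries, 1)

print(misses, hit_rate_pct)
# → 381 96.7
```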

Additional Context

  • The server was serving traffic normally before the hang. GPU KV cache usage was consistently low (0-3.5%),
    so this is not an OOM issue.
  • The issue appeared after the engine had been running for ~1.5 hours.
  • No CUDA errors, NCCL errors, or GPU Xid errors were observed in the logs before the hang.
  • The worker processes appear to hang silently — there is no crash or exception on the worker side, just
    unresponsiveness on the shared memory channel.

logs_1777776402350_qianghuaxuexi_sv-38fc26e4-6031-428a-b740-f5ba18ad5496-0.txt

vllm server script:

# Environment variables set by the reporter before launching the server
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0
export NCCL_TIMEOUT=1800000
export NCCL_HEART_BEAT_POLL_MS=5000
export NCCL_DEBUG=WARN

MODEL_PATH="xxx/deepseek-ai/DeepSeek-V4-Pro/"
SERVED_MODEL_NAME="deepseek-v4-pro"

# Launch command: TP=8 with expert parallelism and MTP speculative decoding
vllm serve "$MODEL_PATH" \
  --trust-remote-code \
  --served-model-name "${SERVED_MODEL_NAME}" \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enable-expert-parallel \
  --tensor-parallel-size 8 \
  --max-model-len 800000 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 512 \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --gpu-memory-utilization 0.95 \
  --no-enable-flashinfer-autotune \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --speculative_config '{"method":"mtp","num_speculative_tokens":1}'

extent analysis

TL;DR

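Adjusting the NCCL timeout settings is a possible workaround for the EngineDeadError crash, but it is unlikely to be a fix on its own. The reporter's script already sets NCCL_TIMEOUT=1800000 and NCCL_HEART_BEAT_POLL_MS=5000, and the TimeoutError in the stack trace is raised from vLLM's shared memory message queue (shm_broadcast.py), not from NCCL. The cause of the worker hang is still unknown.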

Guidance

  • Review the NCCL timeout settings, specifically NCCL_TIMEOUT and NCCL_HEART_BEAT_POLL_MS. The launch script already sets both (1800000 and 5000), so changing them alone may not alter the behavior.
  • Keep in mind what a longer timeout can and cannot do: it gives a slow worker more time to answer, but it does not unblock a worker that is truly hung. In that case it only delays the EngineDeadError.
  • Investigate the system's resource utilization and inter-GPU communication to identify what the workers are blocked on when they stop responding on the shared memory channel.
  • Verify that GPU memory is not the limiting factor. The issue reports KV cache usage of 0-3.5% and rules out OOM, but the server runs with --gpu-memory-utilization 0.95, so it is still worth confirming.

Example

No code snippet is provided as the issue seems to be related to configuration and system settings rather than code-specific problems.

Notes

The provided information suggests that the issue is intermittent and occurs after a period of normal operation, which may indicate a complex interaction between system resources, network conditions, and the vLLM engine's configuration. Further investigation and monitoring of system metrics may be necessary to fully diagnose and resolve the issue.

Recommendation

Treat timeout tuning as a stopgap rather than a fix. Adjusting NCCL_TIMEOUT or NCCL_HEART_BEAT_POLL_MS may change how long the engine waits before failing, but the issue shows the hang occurring with both already set, so the priority is to find out what the hung workers are waiting on and to report that on the upstream issue.
