vllm - 💡(How to fix) Fix [Bug]: TP Worker hang causes EngineDeadError - RPC call to sample_tokens timed out (DeepSeek-V4-Pro, TP=8, MTP speculative decoding) [1 comments, 2 participants]

Error Message

00:12:21 - TimeoutError: RPC call to sample_tokens timed out.
00:12:21 - EngineDeadError, 500 Internal Server Error

The worker processes appear to hang silently: there is no crash or exception on the worker side, just unresponsiveness on the shared memory channel.

Code Example

# Startup timeout and memory-profiler settings, as set by the reporter.
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0
MODEL_PATH="xxx/deepseek-ai/DeepSeek-V4-Pro/"
SERVED_MODEL_NAME="deepseek-v4-pro"
# NCCL-related settings, as set by the reporter.
export NCCL_TIMEOUT=1800000
export NCCL_HEART_BEAT_POLL_MS=5000
export NCCL_DEBUG=WARN
# TP=8 with expert parallelism, FP8 KV cache, and MTP speculative decoding.
vllm serve "${MODEL_PATH}" \
  --trust-remote-code \
  --served-model-name "${SERVED_MODEL_NAME}" \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enable-expert-parallel \
  --tensor-parallel-size 8 \
  --max-model-len 800000 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 512 \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --gpu-memory-utilization 0.95 \
  --no-enable-flashinfer-autotune \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --speculative_config '{"method":"mtp","num_speculative_tokens":1}'

🐛 Describe the bug

Body:

Bug Description

vLLM engine crashes with EngineDeadError after TP worker processes hang for ~5 minutes. The sample_tokens RPC call times out because shared memory broadcast blocks are unavailable — workers are unresponsive.

Environment

vLLM version: v0.1.dev15830+g8d599d76a
Python version: 3.12
CUDA version: 13.0 (driver), NCCL 2.28.9
OS: Linux
GPU: 8x NVIDIA GPU (Tensor Parallel=8)
Model: deepseek-ai/DeepSeek-V4-Pro/
Speculative decoding: MTP (method='mtp', num_speculative_tokens=1)
Quantization: deepseek_v4_fp8
KV cache dtype: fp8
Tokenizer mode: deepseek_v4
max_seq_len: 800,000
GPU memory utilization: 0.95
enable_expert_parallel: True
CUDAGraph mode: FULL_DECODE_ONLY

Reproduction

The issue occurs intermittently during normal inference serving. The engine runs fine for a period (hours), then a worker process hangs, causing a cascading failure:

1. Engine is idle or serving low traffic (0-2 concurrent requests)
2. Worker processes stop responding to shared memory broadcast
3. After ~5 minutes of repeated warnings, sample_tokens RPC times out
4. Engine crashes with EngineDeadError, all in-flight requests return 500
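
Because the failure ends with the engine dead and requests returning 500, an external probe can make the moment of failure easier to spot. The following is a minimal sketch, assuming the server exposes a GET /health endpoint; the base URL passed in is hypothetical and not taken from the report.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True only if GET {base_url}/health answers with HTTP 200.

    Any connection error, timeout, or non-200 status counts as unhealthy.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (e.g. 500/503) is a subclass of URLError and lands here too.
        return False
```

A supervisor could call is_healthy("http://127.0.0.1:8000") (hypothetical address) on a fixed interval and record the first failing timestamp next to the server log. This only detects the crash; it does not explain or prevent the worker hang.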

Error Logs

Timeline:
00:07:34 - Engine idle (0 reqs running, 0 waiting)
00:08:22 - "No available shared memory broadcast block found in 60 seconds"
00:09:22 - Same warning repeated
00:10:22 - Same warning repeated
00:11:22 - Same warning repeated
00:12:21 - Fatal: TimeoutError: RPC call to sample_tokens timed out
00:12:21 - EngineDeadError, 500 Internal Server Error
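
The intervals in this timeline can be checked with a short calculation. This is a sketch that only uses the timestamps listed above and assumes, based on the warning text, that the first warning fires after 60 seconds of waiting.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all on the same day).
LAST_IDLE_LOG = "00:07:34"
FIRST_WARNING = "00:08:22"
FATAL_TIMEOUT = "00:12:21"


def seconds_between(start: str, end: str) -> int:
    """Elapsed seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


warning_to_fatal = seconds_between(FIRST_WARNING, FATAL_TIMEOUT)
# Assumption: the first warning is emitted after 60 s without a free block,
# so the stall would have started about 60 s before that warning.
estimated_stall_to_fatal = warning_to_fatal + 60
idle_log_to_fatal = seconds_between(LAST_IDLE_LOG, FATAL_TIMEOUT)

print(f"first warning -> fatal timeout: {warning_to_fatal} s")
print(f"estimated stall start -> fatal timeout: {estimated_stall_to_fatal} s")
print(f"last idle log -> fatal timeout: {idle_log_to_fatal} s")
```

Under that assumption the stall lasts about 299 seconds before the fatal error, which matches the "~5 minutes" described in the report.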

Stack trace:
Traceback (most recent call last):
  File "vllm/v1/executor/multiproc_executor.py", line 395, in get_response
    status, result = mq.dequeue(timeout=dequeue_timeout)
  File "vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
    with self.acquire_read(timeout, indefinite) as buf:
  File "vllm/distributed/device_communicators/shm_broadcast.py", line 674, in acquire_read
    self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
  File "vllm/distributed/device_communicators/shm_broadcast.py", line 631, in timeout_ms
    raise TimeoutError
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "vllm/v1/engine/core.py", line 1103, in run_engine_core
    engine_core.run_busy_loop()
  File "vllm/v1/engine/core.py", line 1144, in run_busy_loop
    self._process_engine_step()
  File "vllm/v1/engine/core.py", line 1183, in _process_engine_step
    outputs, model_executed = self.step_fn()
  File "vllm/v1/engine/core.py", line 501, in step_with_batch_queue
    model_output = future.result()
  File "vllm/v1/executor/multiproc_executor.py", line 93, in _wait_for_response
    response = self.aggregate(self.get_response())
  File "vllm/v1/executor/multiproc_executor.py", line 397, in get_response
    raise TimeoutError(f"RPC call to {method} timed out.") from e
TimeoutError: RPC call to sample_tokens timed out.
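
The two tracebacks are linked by Python exception chaining: a bare TimeoutError raised while waiting on the message queue is re-raised with the RPC name attached. The sketch below reproduces only that pattern; dequeue and get_response here are simplified stand-ins written for illustration, not the vLLM implementations.

```python
def dequeue(timeout: float):
    """Stand-in for the message-queue read; always times out in this sketch."""
    raise TimeoutError


def get_response(method: str, dequeue_timeout: float = 300.0):
    """Stand-in for the executor side: wrap the low-level timeout with context."""
    try:
        return dequeue(timeout=dequeue_timeout)
    except TimeoutError as e:
        # "from e" is what produces the line
        # "The above exception was the direct cause of the following exception".
        raise TimeoutError(f"RPC call to {method} timed out.") from e


try:
    get_response("sample_tokens")
except TimeoutError as err:
    print(err)                  # message of the outer exception
    print(repr(err.__cause__))  # the original, message-less TimeoutError
```

Reading the trace this way, the first traceback shows where the wait expired (the shared memory queue) and the second shows which RPC was waiting. Neither shows what the workers were doing, which is why worker-side stacks are needed.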

Scheduler Stats at Crash Time

num_running_reqs=1, num_waiting_reqs=0
kv_cache_usage=0.44%
prefix_cache_stats: queries=11645, hits=11264 (hit rate ~96.7%)
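
The quoted hit rate follows directly from the two counters, as this small check shows.

```python
# Counters copied from prefix_cache_stats above.
queries = 11645
hits = 11264

hit_rate = hits / queries
print(f"prefix cache hit rate: {hit_rate:.1%}")
```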

Additional Context

The server was serving traffic normally before the hang. GPU KV cache usage was consistently low (0-3.5%), so this is not an OOM issue.
The issue appeared after the engine had been running for ~1.5 hours.
No CUDA errors, NCCL errors, or GPU Xid errors were observed in the logs before the hang.
The worker processes appear to hang silently: there is no crash or exception on the worker side, just unresponsiveness on the shared memory channel.
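
Since the workers raise no exception, their Python stacks have to be captured from outside while they are hung. A minimal sketch, assuming the third-party py-spy tool is installed separately and that the caller has permission to inspect the worker processes; the PIDs are hypothetical and must be looked up on the affected host.

```python
import shutil
import subprocess


def py_spy_dump_cmd(pid: int) -> list[str]:
    """Build the py-spy command that prints the Python stack of one process."""
    return ["py-spy", "dump", "--pid", str(pid)]


def dump_stacks(pids: list[int]) -> dict[int, str]:
    """Run py-spy against each PID and return its text output keyed by PID."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found on PATH (assumed to be installed)")
    results: dict[int, str] = {}
    for pid in pids:
        proc = subprocess.run(py_spy_dump_cmd(pid), capture_output=True, text=True)
        # Keep stderr when the dump fails (e.g. insufficient permissions).
        results[pid] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results
```

Calling dump_stacks on the eight worker PIDs during the 5-minute warning window, before the engine is torn down, would show whether every rank is blocked at the same call or one rank has diverged. That information is missing from the report as it stands.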

logs_1777776402350_qianghuaxuexi_sv-38fc26e4-6031-428a-b740-f5ba18ad5496-0.txt



extent analysis

TL;DR

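The EngineDeadError is a symptom: TP workers stop responding on the shared memory broadcast channel, and after about 5 minutes the sample_tokens RPC times out. Adjusting NCCL timeout settings may help if the workers are blocked in a collective operation, but the launch script already sets NCCL_TIMEOUT and NCCL_HEART_BEAT_POLL_MS, so the hang itself still needs to be diagnosed.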

Guidance

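Review the NCCL-related settings in the launch script, specifically NCCL_TIMEOUT and NCCL_HEART_BEAT_POLL_MS, and confirm that the installed NCCL and PyTorch versions actually honor these variable names; a setting that is silently ignored cannot affect the hang.
Consider increasing timeouts only as a way to buy diagnostic time. The TimeoutError in the traceback is raised in vLLM's shared memory message queue (shm_broadcast.py), not in NCCL, so a larger NCCL_TIMEOUT may not change the roughly 5-minute limit. Recent vLLM versions appear to control this RPC timeout with VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS; verify the name against your build before relying on it.
Investigate resource utilization and interconnect health on the host while the workers are unresponsive (for example, per-GPU utilization and whether any worker process is spinning on CPU) to identify what the workers are blocked on.
Check overall GPU memory headroom. The script sets --gpu-memory-utilization 0.95, and although KV cache usage was low (0-3.5%), memory pressure outside the KV cache has not been ruled out.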

Example

No code change is proposed, because the issue appears to be related to configuration and system behavior rather than a specific code defect. The reporter's launch script under Code Example is the relevant configuration.

Notes

The provided information suggests that the issue is intermittent and occurs after a period of normal operation, which may indicate a complex interaction between system resources, network conditions, and the vLLM engine's configuration. Further investigation and monitoring of system metrics may be necessary to fully diagnose and resolve the issue.

Recommendation

As a workaround, tune the timeout settings so that a transient worker stall does not immediately kill the engine, for example by increasing NCCL_TIMEOUT or adjusting NCCL_HEART_BEAT_POLL_MS. Treat this as mitigation only: if the workers are genuinely hung, longer timeouts delay the EngineDeadError without preventing it, so capture worker-side stack traces during the hang and attach them to the issue.
