vllm - 💡(How to fix) Fix [Bug]: V1 engine core deadlocks under concurrent load (fp8 + prefix caching + Qwen3.5) [15 comments, 7 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#37729Fetched 2026-04-08 01:08:33
View on GitHub
Comments
15
Participants
7
Timeline
23
Reactions
0
Author
Timeline (top)
commented ×15mentioned ×3subscribed ×3labeled ×1

Error Message

  1. No specific error triggers it — no OOM, no CUDA error, no validation error

Fix Action

Fix / Workaround

Current workaround: --enforce-eager prevents the deadlock but with ~8x throughput regression, making it unviable for production. Restarting affected replicas is the only practical mitigation.

Code Example

CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled

  GPU Topology:

        GPU0    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity    GPU NUMA ID
  GPU0     X     SYS    SYS    SYS    SYS    SYS    PHB    PHB    PHB    PHB    88-175    1        N/A
  NIC0    SYS     X     NODE    NODE    NODE    NODE    SYS    SYS    SYS    SYS
  NIC1    SYS    NODE     X     PHB    PHB    PHB    SYS    SYS    SYS    SYS
  NIC2    SYS    NODE    PHB     X     PHB    PHB    SYS    SYS    SYS    SYS
  NIC3    SYS    NODE    PHB    PHB     X     PHB    SYS    SYS    SYS    SYS
  NIC4    SYS    NODE    PHB    PHB    PHB     X     SYS    SYS    SYS    SYS
  NIC5    PHB    SYS    SYS    SYS    SYS    SYS     X     PHB    PHB    PHB
  NIC6    PHB    SYS    SYS    SYS    SYS    SYS    PHB     X     PHB    PHB
  NIC7    PHB    SYS    SYS    SYS    SYS    SYS    PHB    PHB     X     PHB
  NIC8    PHB    SYS    SYS    SYS    SYS    SYS    PHB    PHB    PHB     X

  Legend:
    X    = Self
    SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
    NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
    PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
    PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
    PIX  = Connection traversing at most a single PCIe bridge
    NV#  = Connection traversing a bonded set of # NVLinks

  NIC Legend:
    NIC0: mlx5_0
    NIC1: mlx5_1
    NIC2: mlx5_2
    NIC3: mlx5_3
    NIC4: mlx5_4
    NIC5: mlx5_5
    NIC6: mlx5_6
    NIC7: mlx5_7
    NIC8: mlx5_8

  ==============================
       Environment Variables
  ==============================
  VLLM_ENABLE_CUDA_COMPATIBILITY=0
  LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
  CUDA_VERSION=12.9.1
  TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
  NVIDIA_DRIVER_CAPABILITIES=compute,utility
  VLLM_USAGE_SOURCE=production-docker-image
  NVIDIA_VISIBLE_DEVICES=GPU-819acdb6-5324-2224-7d7a-c365f47d479f
  PYTORCH_NVML_BASED_CUDA_CHECK=1
  TORCHINDUCTOR_COMPILE_THREADS=1
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

---

vllm serve "$MODEL_DIR" \
      --tokenizer Qwen/Qwen3.5-4B \
      --dtype bfloat16 \
      --gpu-memory-utilization 0.95 \
      --max-num-batched-tokens 16384 \
      --max-model-len 16384 \
      --max-num-seqs 128 \
      --quantization fp8 \
      --trust-remote-code \
      --enable-prefix-caching \
      --async-scheduling \
      --default-chat-template-kwargs '{"enable_thinking": false}'

---

13:57:45  Avg prompt throughput: 2598.6 tokens/s, Avg generation throughput: 1684.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 96.9%
  13:57:55  Avg prompt throughput: 83.6 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 96.9%
  13:58:05  Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 96.9%

---

┌──────────────────────────────────┬────────────────────────────────────┐
ConfigResult  ├──────────────────────────────────┼────────────────────────────────────┤
Default V1 + async-scheduling    │ DEADLOCK  ├──────────────────────────────────┼────────────────────────────────────┤
Default V1 + no async-scheduling │ DEADLOCK  ├──────────────────────────────────┼────────────────────────────────────┤
Default V1 + --enforce-eager     │ Didn't run enough but ~8x slower (not viable)  └──────────────────────────────────┴────────────────────────────────────┘
RAW_BUFFERClick to expand / collapse

Your current environment

  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled

  GPU Topology:

        GPU0    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity    GPU NUMA ID
  GPU0     X     SYS    SYS    SYS    SYS    SYS    PHB    PHB    PHB    PHB    88-175    1        N/A
  NIC0    SYS     X     NODE    NODE    NODE    NODE    SYS    SYS    SYS    SYS
  NIC1    SYS    NODE     X     PHB    PHB    PHB    SYS    SYS    SYS    SYS
  NIC2    SYS    NODE    PHB     X     PHB    PHB    SYS    SYS    SYS    SYS
  NIC3    SYS    NODE    PHB    PHB     X     PHB    SYS    SYS    SYS    SYS
  NIC4    SYS    NODE    PHB    PHB    PHB     X     SYS    SYS    SYS    SYS
  NIC5    PHB    SYS    SYS    SYS    SYS    SYS     X     PHB    PHB    PHB
  NIC6    PHB    SYS    SYS    SYS    SYS    SYS    PHB     X     PHB    PHB
  NIC7    PHB    SYS    SYS    SYS    SYS    SYS    PHB    PHB     X     PHB
  NIC8    PHB    SYS    SYS    SYS    SYS    SYS    PHB    PHB    PHB     X

  Legend:
    X    = Self
    SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
    NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
    PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
    PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
    PIX  = Connection traversing at most a single PCIe bridge
    NV#  = Connection traversing a bonded set of # NVLinks

  NIC Legend:
    NIC0: mlx5_0
    NIC1: mlx5_1
    NIC2: mlx5_2
    NIC3: mlx5_3
    NIC4: mlx5_4
    NIC5: mlx5_5
    NIC6: mlx5_6
    NIC7: mlx5_7
    NIC8: mlx5_8

  ==============================
       Environment Variables
  ==============================
  VLLM_ENABLE_CUDA_COMPATIBILITY=0
  LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
  CUDA_VERSION=12.9.1
  TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
  NVIDIA_DRIVER_CAPABILITIES=compute,utility
  VLLM_USAGE_SOURCE=production-docker-image
  NVIDIA_VISIBLE_DEVICES=GPU-819acdb6-5324-2224-7d7a-c365f47d479f
  PYTORCH_NVML_BASED_CUDA_CHECK=1
  TORCHINDUCTOR_COMPILE_THREADS=1
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

🐛 Describe the bug

Describe the bug

The V1 engine core silently deadlocks under concurrent load using default configuration. The API server remains healthy (/health returns 200, /metrics responds), but zero tokens are generated. Requests stay in "running" state indefinitely and the engine never recovers without a pod restart.

vLLM serve command:

  vllm serve "$MODEL_DIR" \
      --tokenizer Qwen/Qwen3.5-4B \
      --dtype bfloat16 \
      --gpu-memory-utilization 0.95 \
      --max-num-batched-tokens 16384 \
      --max-model-len 16384 \
      --max-num-seqs 128 \
      --quantization fp8 \
      --trust-remote-code \
      --enable-prefix-caching \
      --async-scheduling \
      --default-chat-template-kwargs '{"enable_thinking": false}'

How to reproduce:

  1. Deploy vLLM with the configuration above
  2. Send concurrent requests (8-64 concurrency) — works fine initially
  3. Under sustained load or after a brief spike, the engine freezes
  4. No specific error triggers it — no OOM, no CUDA error, no validation error

Engine log sequence showing the deadlock:

  13:57:45  Avg prompt throughput: 2598.6 tokens/s, Avg generation throughput: 1684.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 96.9%
  13:57:55  Avg prompt throughput: 83.6 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 96.9%
  13:58:05  Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 96.9%

After 13:58:05, throughput stays at 0.0 permanently. Running count stays at 11. KV cache stays at 1.3%. The engine never recovers.

Key observations:

  • Not resource exhaustion: Only 11/128 request slots used, 1.3% KV cache
  • API server alive: /health returns 200, /metrics responds
  • Engine core frozen: Zero tokens generated, forward pass never executes
  • Affects arbitrary request counts: Seen with 11, 124, and other counts
  • Does not self-recover: Requires pod restart
  • --request-timeout does not help: Engine core is frozen, can't enforce timeouts

Investigation — deadlock tied to CUDA graph code paths:

Systematic testing to isolate the trigger:

  ┌──────────────────────────────────┬────────────────────────────────────┐
  │              Config              │               Result               │
  ├──────────────────────────────────┼────────────────────────────────────┤
  │ Default V1 + async-scheduling    │ DEADLOCK                           │
  ├──────────────────────────────────┼────────────────────────────────────┤
  │ Default V1 + no async-scheduling │ DEADLOCK                           │
  ├──────────────────────────────────┼────────────────────────────────────┤
  │ Default V1 + --enforce-eager     │ Didn't run enough but ~8x slower (not viable) │
  └──────────────────────────────────┴────────────────────────────────────┘

Incident 1 (nightly build): During a production spike, autoscaler scaled from 6→20→6 replicas. Two surviving replicas entered deadlock with 123-124 running requests each, stayed frozen 4+ hours until manual restart.

Incident 2 (v0.17.1 stable): Same deadlock reproduced within minutes of a load test. Single replica, low concurrency (8→16 ramp). Froze at 11 running requests. No autoscaling involved.

Current workaround: --enforce-eager prevents the deadlock but with ~8x throughput regression, making it unviable for production. Restarting affected replicas is the only practical mitigation.

Before submitting a new issue...

  • I have searched for relevant issues
  • I have asked the chatbot living at docs.vllm.ai

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the deadlock issue in the V1 engine core, we will focus on modifying the CUDA graph code paths to prevent deadlocks. Here are the steps:

  • Disable async-scheduling: Temporarily disable async-scheduling to prevent concurrent requests from causing deadlocks.
  • Implement a timeout mechanism: Introduce a timeout mechanism to detect and recover from deadlocks. This can be done by setting a timeout for each request and restarting the engine if the timeout is exceeded.
  • Modify CUDA graph code paths: Modify the CUDA graph code paths to prevent deadlocks. This can be done by introducing synchronization mechanisms, such as locks or semaphores, to ensure that concurrent requests do not cause deadlocks.

Example code snippet to implement a timeout mechanism:

import threading
import time

class TimeoutException(Exception):
    pass

class Engine:
    def __init__(self, timeout):
        self.timeout = timeout
        self.lock = threading.Lock()

    def process_request(self, request):
        with self.lock:
            start_time = time.time()
            try:
                # Process the request
                # ...
                if time.time() - start_time > self.timeout:
                    raise TimeoutException("Request timed out")
            except TimeoutException:
                # Restart the engine
                self.restart()

    def restart(self):
        # Restart the engine
        # ...

Verification

To verify that the fix worked, you can test the engine with concurrent requests and check if the deadlock occurs. You can also monitor the engine's performance and check if the timeout mechanism is triggered.

  • Test the engine with concurrent requests (8-64 concurrency)
  • Monitor the engine's performance and check if the timeout mechanism is triggered
  • Verify that the engine recovers from deadlocks and continues to process requests

Extra Tips

To prevent regressions, it's essential to:

  • Thoroughly test the engine with concurrent requests
  • Monitor the engine's performance and adjust the timeout mechanism as needed
  • Implement synchronization mechanisms to prevent deadlocks in the CUDA graph code paths
  • Continuously monitor the engine's performance and adjust the configuration as needed to prevent deadlocks.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING