vllm - 💡(How to fix) Fix [Bug]: V1 engine core deadlocks under concurrent load (fp8 + prefix caching + Qwen3.5) [15 comments, 7 participants]

vllm · 2026-03-21 01:35:36

vllm-project/vllm#37729 • Fetched 2026-04-08 01:08:33

Error Message

No specific error triggers it — no OOM, no CUDA error, no validation error

Fix Action

Fix / Workaround

Current workaround: --enforce-eager prevents the deadlock but with ~8x throughput regression, making it unviable for production. Restarting affected replicas is the only practical mitigation.

Your current environment

  CUDA Archs: 7.0 7.5 8.0 8.9 9.0 10.0 12.0; ROCm: Disabled

  GPU Topology:

        GPU0    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity    GPU NUMA ID
  GPU0     X     SYS    SYS    SYS    SYS    SYS    PHB    PHB    PHB    PHB    88-175    1        N/A
  NIC0    SYS     X     NODE    NODE    NODE    NODE    SYS    SYS    SYS    SYS
  NIC1    SYS    NODE     X     PHB    PHB    PHB    SYS    SYS    SYS    SYS
  NIC2    SYS    NODE    PHB     X     PHB    PHB    SYS    SYS    SYS    SYS
  NIC3    SYS    NODE    PHB    PHB     X     PHB    SYS    SYS    SYS    SYS
  NIC4    SYS    NODE    PHB    PHB    PHB     X     SYS    SYS    SYS    SYS
  NIC5    PHB    SYS    SYS    SYS    SYS    SYS     X     PHB    PHB    PHB
  NIC6    PHB    SYS    SYS    SYS    SYS    SYS    PHB     X     PHB    PHB
  NIC7    PHB    SYS    SYS    SYS    SYS    SYS    PHB    PHB     X     PHB
  NIC8    PHB    SYS    SYS    SYS    SYS    SYS    PHB    PHB    PHB     X

  Legend:
    X    = Self
    SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
    NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
    PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
    PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
    PIX  = Connection traversing at most a single PCIe bridge
    NV#  = Connection traversing a bonded set of # NVLinks

  NIC Legend:
    NIC0: mlx5_0
    NIC1: mlx5_1
    NIC2: mlx5_2
    NIC3: mlx5_3
    NIC4: mlx5_4
    NIC5: mlx5_5
    NIC6: mlx5_6
    NIC7: mlx5_7
    NIC8: mlx5_8

  ==============================
       Environment Variables
  ==============================
  VLLM_ENABLE_CUDA_COMPATIBILITY=0
  LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64
  CUDA_VERSION=12.9.1
  TORCH_CUDA_ARCH_LIST=7.0 7.5 8.0 8.9 9.0 10.0 12.0
  NVIDIA_DRIVER_CAPABILITIES=compute,utility
  VLLM_USAGE_SOURCE=production-docker-image
  NVIDIA_VISIBLE_DEVICES=GPU-819acdb6-5324-2224-7d7a-c365f47d479f
  PYTORCH_NVML_BASED_CUDA_CHECK=1
  TORCHINDUCTOR_COMPILE_THREADS=1
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

🐛 Describe the bug

The V1 engine core silently deadlocks under concurrent load using default configuration. The API server remains healthy (/health returns 200, /metrics responds), but zero tokens are generated. Requests stay in "running" state indefinitely and the engine never recovers without a pod restart.

vLLM serve command:

  vllm serve "$MODEL_DIR" \
      --tokenizer Qwen/Qwen3.5-4B \
      --dtype bfloat16 \
      --gpu-memory-utilization 0.95 \
      --max-num-batched-tokens 16384 \
      --max-model-len 16384 \
      --max-num-seqs 128 \
      --quantization fp8 \
      --trust-remote-code \
      --enable-prefix-caching \
      --async-scheduling \
      --default-chat-template-kwargs '{"enable_thinking": false}'

How to reproduce:

1. Deploy vLLM with the configuration above.
2. Send concurrent requests (8-64 concurrency); this works fine initially.
3. Under sustained load or after a brief spike, the engine freezes.
4. No specific error triggers it: no OOM, no CUDA error, no validation error.
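The reproduction steps above can be sketched as a stepped concurrency ramp. This is a minimal sketch, not the reporter's load tool: `ramp`, `run_stage`, `stage_timeout`, and `fake_send` are hypothetical names, and `fake_send` stands in for whatever HTTP call you make to the server.

```python
import asyncio


async def run_stage(send, concurrency, requests_per_worker):
    """Run `concurrency` workers, each issuing requests back to back."""
    async def worker():
        done = 0
        for _ in range(requests_per_worker):
            await send()
            done += 1
        return done

    results = await asyncio.gather(*(worker() for _ in range(concurrency)))
    return sum(results)


async def ramp(send, stages=(8, 16, 32, 64), requests_per_worker=4,
               stage_timeout=60.0):
    """Step through concurrency levels and report the first stage that hangs."""
    for concurrency in stages:
        try:
            completed = await asyncio.wait_for(
                run_stage(send, concurrency, requests_per_worker),
                stage_timeout,
            )
        except asyncio.TimeoutError:
            # Requests stopped completing: the symptom described in this issue.
            return {"stalled_at": concurrency}
        print(f"concurrency={concurrency} completed={completed}")
    return {"stalled_at": None}


async def fake_send():
    """Hypothetical stand-in for one request to the server under test."""
    await asyncio.sleep(0)


if __name__ == "__main__":
    print(asyncio.run(ramp(fake_send)))
```

Replacing `fake_send` with a real client call turns this into a load test. The client-side `stage_timeout` matters because, per the report, the server itself never returns an error when it freezes.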

Engine log sequence showing the deadlock:

  13:57:45  Avg prompt throughput: 2598.6 tokens/s, Avg generation throughput: 1684.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 96.9%
  13:57:55  Avg prompt throughput: 83.6 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 96.9%
  13:58:05  Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 96.9%

After 13:58:05, throughput stays at 0.0 permanently. Running count stays at 11. KV cache stays at 1.3%. The engine never recovers.
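The stall signature in the log excerpt (running requests with zero generation throughput) can be checked mechanically. This is a minimal sketch, assuming the stats line keeps the format shown above; `parse_stats`, `is_stalled`, and `min_intervals` are illustrative names, not part of vLLM.

```python
import re

# Matches the generation-throughput and running-request fields of the
# periodic stats line quoted in the log excerpt.
STATS_RE = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs"
)


def parse_stats(line):
    """Return (generation_tokens_per_s, running_requests), or None."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    return float(match.group("gen")), int(match.group("running"))


def is_stalled(lines, min_intervals=1):
    """True if the last `min_intervals` stats lines all show running
    requests together with zero generation throughput."""
    stats = [s for s in map(parse_stats, lines) if s is not None]
    if len(stats) < min_intervals:
        return False
    recent = stats[-min_intervals:]
    return all(gen == 0.0 and running > 0 for gen, running in recent)


LOG = [
    "13:57:45  Avg prompt throughput: 2598.6 tokens/s, Avg generation throughput: 1684.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 96.9%",
    "13:57:55  Avg prompt throughput: 83.6 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 96.9%",
    "13:58:05  Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 96.9%",
]

if __name__ == "__main__":
    print(is_stalled(LOG))
```

In practice a larger `min_intervals` is probably safer, so that a single idle interval between requests is not mistaken for a deadlock.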

Key observations:

Not resource exhaustion: Only 11/128 request slots used, 1.3% KV cache
API server alive: /health returns 200, /metrics responds
Engine core frozen: Zero tokens generated, forward pass never executes
Affects arbitrary request counts: Seen with 11, 124, and other counts
Does not self-recover: Requires pod restart
--request-timeout does not help: Engine core is frozen, can't enforce timeouts
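Because /health and /metrics keep responding during the freeze, a liveness check has to look at progress rather than reachability. The sketch below compares two scrapes of the /metrics text. The metric names `vllm:generation_tokens_total` and `vllm:num_requests_running` are assumptions here and should be checked against the /metrics output of your vLLM version; `read_metric` and `engine_is_stuck` are hypothetical helpers.

```python
def read_metric(text, name):
    """Sum all samples of `name` in Prometheus text exposition format.

    Simplified parser: assumes no trailing timestamps on sample lines.
    Returns None if the metric is absent.
    """
    total = 0.0
    found = False
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        sample, _, value = line.rpartition(" ")
        metric = sample.split("{", 1)[0]
        if metric == name:
            total += float(value)
            found = True
    return total if found else None


def engine_is_stuck(before, after,
                    tokens_metric="vllm:generation_tokens_total",  # assumed name
                    running_metric="vllm:num_requests_running"):   # assumed name
    """True if requests are running but no tokens were generated
    between the two scrapes."""
    tokens_before = read_metric(before, tokens_metric)
    tokens_after = read_metric(after, tokens_metric)
    running = read_metric(after, running_metric)
    if None in (tokens_before, tokens_after, running):
        return False  # metrics missing: cannot decide
    return running > 0 and tokens_after <= tokens_before


SCRAPE_1 = (
    'vllm:generation_tokens_total{model_name="m"} 100.0\n'
    'vllm:num_requests_running{model_name="m"} 11.0\n'
)
SCRAPE_2_STUCK = SCRAPE_1
SCRAPE_2_OK = (
    'vllm:generation_tokens_total{model_name="m"} 250.0\n'
    'vllm:num_requests_running{model_name="m"} 11.0\n'
)

if __name__ == "__main__":
    print(engine_is_stuck(SCRAPE_1, SCRAPE_2_STUCK))
    print(engine_is_stuck(SCRAPE_1, SCRAPE_2_OK))
```

A probe like this could feed a restart decision, which matches the only mitigation the report describes (restarting affected replicas).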

Investigation — deadlock tied to CUDA graph code paths:

Systematic testing to isolate the trigger:

  ┌──────────────────────────────────┬───────────────────────────────────────────────┐
  │              Config              │                    Result                     │
  ├──────────────────────────────────┼───────────────────────────────────────────────┤
  │ Default V1 + async-scheduling    │ DEADLOCK                                      │
  ├──────────────────────────────────┼───────────────────────────────────────────────┤
  │ Default V1 + no async-scheduling │ DEADLOCK                                      │
  ├──────────────────────────────────┼───────────────────────────────────────────────┤
  │ Default V1 + --enforce-eager     │ Didn't run enough but ~8x slower (not viable) │
  └──────────────────────────────────┴───────────────────────────────────────────────┘

Incident 1 (nightly build): During a production spike, autoscaler scaled from 6→20→6 replicas. Two surviving replicas entered deadlock with 123-124 running requests each, stayed frozen 4+ hours until manual restart.

Incident 2 (v0.17.1 stable): Same deadlock reproduced within minutes of a load test. Single replica, low concurrency (8→16 ramp). Froze at 11 running requests. No autoscaling involved.

Current workaround: --enforce-eager prevents the deadlock but with ~8x throughput regression, making it unviable for production. Restarting affected replicas is the only practical mitigation.

Before submitting a new issue...

I have searched for relevant issues
I have asked the chatbot living at docs.vllm.ai

extent analysis

Fix Plan

The issue does not identify a root cause. The reporter's testing points at the CUDA graph code paths, because --enforce-eager is the only configuration in which the deadlock was not observed. With that in mind:

Do not rely on disabling async-scheduling: In the reporter's test matrix the deadlock occurred both with and without --async-scheduling.
Detect the hang from outside the engine: --request-timeout does not help because the frozen engine core cannot enforce it, and /health keeps returning 200. A watchdog should instead check that tokens are being generated while requests are running, and restart the replica when they are not.
Investigate the CUDA graph code paths: Compare runs with and without --enforce-eager to narrow down which path hangs. Any change to vLLM internals, such as added synchronization, is a matter for the upstream maintainers and is not confirmed as a fix here.

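Example sketch of a timeout mechanism (the Engine class is illustrative and is not vLLM's API):

import concurrent.futures


class TimeoutException(Exception):
    """Raised when a request does not finish within the timeout."""


class Engine:
    def __init__(self, timeout):
        self.timeout = timeout  # seconds to wait before declaring a hang
        self.restarts = 0
        # Work runs in a worker thread so the caller can stop waiting on it.
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def _run(self, request):
        # Placeholder for the real work; here a request is any callable.
        return request()

    def process_request(self, request):
        future = self.pool.submit(self._run, request)
        try:
            # Waiting happens here, so a hung worker cannot block the check.
            return future.result(timeout=self.timeout)
        except concurrent.futures.TimeoutError:
            self.restart()
            raise TimeoutException("Request timed out")

    def restart(self):
        # Placeholder: a real deployment would restart the process or pod.
        # A hung thread cannot be killed, so the old pool is abandoned.
        self.pool.shutdown(wait=False)
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        self.restarts += 1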

Verification

To verify a candidate fix, repeat the load pattern that triggered the deadlock and confirm that token generation never stops while requests are running.

Run concurrent requests at 8-64 concurrency, including a sustained run and a brief spike
Watch the engine log: generation throughput must not stay at 0.0 tokens/s while Running is above 0
If a watchdog or timeout is in place, check whether it fired and whether the replica resumed serving after the restart

Extra Tips

To guard against regressions:

Include a concurrent load test (8-64 concurrency) in release testing, since the deadlock appeared on both a nightly build and v0.17.1
Alert on zero generation throughput with running requests rather than on /health, which stays at 200 during the deadlock
Re-test with CUDA graphs enabled after every vLLM upgrade before removing any workaround
