[Bug]: SM 7.5 extreme slowness hangs indefinitely on T4 (vllm 0.17.0 with Qwen3.5-27B)

vllm-project/vllm#36589 (fetched 2026-04-08 00:36:08)
Comments: 5 · Participants: 2 · Timeline events: 11 · Reactions: 2
Timeline (top): commented ×5, cross-referenced ×2, subscribed ×2, labeled ×1

Code Example

qwen3.5-27b  | (APIServer pid=1) INFO:     Started server process [1]
qwen3.5-27b  | (APIServer pid=1) INFO:     Waiting for application startup.
qwen3.5-27b  | (APIServer pid=1) INFO:     Application startup complete.
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:55138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:33:38 [loggers.py:259] Engine 000: Avg prompt throughput: 1.7 tokens/s, Avg generation throughput: 5.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:33:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:33:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:34:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:44334 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:34:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.8%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:44334 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:34:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:34:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:50810 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:35:08 [loggers.py:259] Engine 000: Avg prompt throughput: 2.0 tokens/s, Avg generation throughput: 20.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:35:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:35:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:35:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.8%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:35:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.8%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:35:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.8%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:36:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:36:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:36:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:36:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 15.4%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:54580 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:36:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.9%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:36:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.9%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:45392 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:45396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:45412 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:37:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.2%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:45412 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:42562 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:37:18 [loggers.py:259] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.2%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:57394 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:37:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.2%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:57402 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:57402 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:37:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.2%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:37:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.2%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:57054 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:57054 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (EngineCore_DP0 pid=407) INFO 03-10 02:38:31 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:39:08 [loggers.py:259] Engine 000: Avg prompt throughput: 24.4 tokens/s, Avg generation throughput: 37.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.8%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:39:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.4 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 30.8%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:39:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.4 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 33.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:42842 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:39:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 33.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:39:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 35.9%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:39:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 35.9%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:48524 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:48524 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:48530 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:40:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.5%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:54128 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:40:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.0%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:40:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.0%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:40:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:40:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:40:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 46.2%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:41:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 46.2%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:41:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 48.7%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:45396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:55166 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:41:28 [loggers.py:259] Engine 000: Avg prompt throughput: 32.8 tokens/s, Avg generation throughput: 40.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.5%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:41:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 41.0%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:41:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:56598 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:41:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:55942 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:55942 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:45396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:42:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:56996 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:42:18 [loggers.py:259] Engine 000: Avg prompt throughput: 20.1 tokens/s, Avg generation throughput: 42.2 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:42:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:45396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:42:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 33.3%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:42:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 35.9%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:42:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 35.9%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:43:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 35.9%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:53258 - "GET /v1/models HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:43:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 38.5%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:41274 - "POST /v1/chat/completions HTTP/1.1" 200 OK
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:43:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:43:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
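
The periodic `loggers.py` metrics lines in this log can be scanned mechanically to find where generation stalls. The following is a minimal sketch, not part of vLLM: the function names `parse_metrics` and `find_stalls`, the thresholds, and the placeholder year are all assumptions made for illustration, and the regular expression only targets the line format seen in this particular log.

```python
import re
import sys
from datetime import datetime

# Matches the periodic metrics lines as they appear in this log.
# The format may differ in other vLLM versions (assumption).
METRIC_RE = re.compile(
    r"INFO (\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[loggers\.py:\d+\] .*?"
    r"Avg generation throughput: ([\d.]+) tokens/s, "
    r"Running: (\d+) reqs, Waiting: (\d+) reqs, "
    r"GPU KV cache usage: ([\d.]+)%"
)


def parse_metrics(lines):
    """Extract timestamp, throughput, request counts and KV usage per line."""
    rows = []
    for line in lines:
        m = METRIC_RE.search(line)
        if not m:
            continue
        # The log has no year; 2000 is an arbitrary placeholder so that
        # timestamps can be subtracted from each other.
        ts = datetime.strptime("2000-" + m.group(1), "%Y-%m-%d %H:%M:%S")
        rows.append({
            "time": ts,
            "gen_tps": float(m.group(2)),
            "running": int(m.group(3)),
            "waiting": int(m.group(4)),
            "kv_pct": float(m.group(5)),
        })
    return rows


def find_stalls(rows, min_tps=2.0, max_gap_s=15):
    """Flag intervals that look stalled (hypothetical thresholds)."""
    stalls = []
    prev = None
    for row in rows:
        # Requests are running but almost no tokens are being generated.
        if row["running"] > 0 and row["gen_tps"] < min_tps:
            stalls.append((row["time"], "low throughput with active requests"))
        # Metrics normally arrive every 10 s here; a longer gap while
        # requests were running suggests the engine was blocked.
        if prev is not None and prev["running"] > 0:
            gap = (row["time"] - prev["time"]).total_seconds()
            if gap > max_gap_s:
                stalls.append((row["time"], f"{gap:.0f}s gap in metrics"))
        prev = row
    return stalls


if __name__ == "__main__":
    for when, reason in find_stalls(parse_metrics(sys.stdin)):
        print(when.strftime("%m-%d %H:%M:%S"), reason)
```

Piping the container log into the script (for example from `docker logs`) prints one line per suspicious interval. Gaps are only flagged when the previous sample had running requests, so idle periods with no metrics output are ignored.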

Your current environment

vLLM 0.17.0, Qwen3.5-27B, 4 × Tesla T4

🐛 Describe the bug

After applying the suggested fix from #36357 (falling back to TORCH_SDPA for the multimodal encoder on SM < 8.0), I still get the exact same shm_broadcast timeout warning. The server starts and can serve requests, but the warning appears at runtime, generation throughput drops dramatically, and KV cache usage keeps climbing even after requests finish. This matches the "hanging or time-consuming work (compilation/weight/kv cache quantization)" message from the original issue.
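
To confirm that the GPUs actually fall under the SM < 8.0 condition mentioned above, the compute capability can be read from PyTorch. This is a rough sketch under stated assumptions: `compute_capability_below` and `describe_gpus` are hypothetical helper names, and the code does not reproduce how vLLM or the fix in #36357 selects an attention backend. It only reports what `torch.cuda.get_device_capability` returns (a Tesla T4 reports SM 7.5).

```python
def compute_capability_below(capability, threshold=(8, 0)):
    """Return True if a (major, minor) capability is below the threshold."""
    return tuple(capability) < tuple(threshold)


def describe_gpus():
    """List (index, name, capability) for visible CUDA devices.

    Returns an empty list when PyTorch or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for index, name, cap in describe_gpus():
        note = "SM < 8.0" if compute_capability_below(cap) else "SM >= 8.0"
        print(f"GPU {index}: {name} sm_{cap[0]}{cap[1]} ({note})")
```

Running this inside the same container as the server shows whether every device is below SM 8.0.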

Responses take a long time to load, and requests hang indefinitely until they time out. <img width="1344" height="495" alt="Image" src="https://github.com/user-attachments/assets/b18ee1f5-fbbb-4de8-ad12-52687757bbcf" />

qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:43:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The fix involves adjusting the configuration to prevent the GPU KV cache from growing indefinitely and to handle the shm_broadcast timeout warning.

Here are the steps:

  • Increase the shm_broadcast timeout so that slow workers have more time to respond before the warning is raised.
  • Limit GPU KV cache growth, for example by setting a maximum cache size or applying a cache eviction policy.
  • Optimize model and data loading to reduce the time spent on compilation and on weight/KV cache quantization.

Example code to increase the shm_broadcast_timeout value:

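import os

# Assumption: 'SHM_BROADCAST_TIMEOUT' comes from the original suggestion and is
# not a confirmed vLLM setting. Check vllm/envs.py for your version; the
# shm_broadcast warning interval may instead be VLLM_RINGBUFFER_WARNING_INTERVAL.
# Set the variable before vLLM is imported or the server process starts.
os.environ['SHM_BROADCAST_TIMEOUT'] = '120'  # in seconds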

Example code to limit the GPU KV cache growth:

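import sys
from collections import OrderedDict

# Maximum size of the illustrative cache
max_cache_size = 1024 * 1024 * 1024  # 1 GiB

# Least recently used (LRU) eviction policy.
# This is a standalone sketch and is not wired into vLLM's own KV cache.
class CacheEvictionPolicy:
    def __init__(self, max_cache_size):
        self.max_cache_size = max_cache_size
        self.cache_size = 0
        self.cache = OrderedDict()  # least recently used entry first

    def get(self, key):
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def add_to_cache(self, key, value):
        size = sys.getsizeof(value)  # shallow size only
        if key in self.cache:
            # Replacing an entry: release its old size first
            self.cache_size -= sys.getsizeof(self.cache.pop(key))
        # Evict least recently used entries until the new value fits
        while self.cache and self.cache_size + size > self.max_cache_size:
            _, evicted = self.cache.popitem(last=False)
            self.cache_size -= sys.getsizeof(evicted)
        self.cache[key] = value
        self.cache_size += size

# Create a cache with the eviction policy
cache = CacheEvictionPolicy(max_cache_size)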
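
The class above is a generic illustration. In vLLM itself, the KV cache footprint is usually bounded through engine arguments rather than a custom eviction class. The sketch below builds a `vllm serve` command line using the `--gpu-memory-utilization`, `--max-model-len`, and `--max-num-seqs` flags. The helper name, the model identifier, and the numeric values are assumptions for illustration, not settings confirmed in the issue.

```python
import shlex


def build_vllm_serve_args(
    model: str,
    gpu_memory_utilization: float = 0.90,  # assumed value, tune per GPU
    max_model_len: int = 8192,             # assumed value, caps tokens per request
    max_num_seqs: int = 4,                 # assumed value, caps concurrent requests
) -> list[str]:
    """Build an argv list for `vllm serve` that bounds KV cache memory.

    Hypothetical helper: it only assembles arguments and does not start vLLM.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if max_model_len <= 0 or max_num_seqs <= 0:
        raise ValueError("max_model_len and max_num_seqs must be positive")
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
    ]


# "your-org/your-model" is a placeholder, not the model path from the issue.
args = build_vllm_serve_args("your-org/your-model", 0.85, 4096, 2)
command_line = shlex.join(args)  # shell-safe string for logging or copy-paste
```

Lowering `--max-model-len` or `--max-num-seqs` reduces how many KV cache blocks can be in use at once, which is usually a more direct lever than an application-level cache.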

Verification

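To verify that the fix worked, monitor the "GPU KV cache usage" field in the periodic engine log lines and watch for shm_broadcast timeout warnings. Cache usage should rise while requests are running and fall back once they complete, as in the log excerpt above, where it peaks at 43.6% with two running requests and returns to 0.0% when none remain. The timeout warnings should become less frequent or disappear.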
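
One way to run this check is to parse the periodic engine stats lines shown in the log above. The following is a minimal sketch: the regular expression assumes the exact log format in this issue, and `EngineStats`, `parse_stats`, and `kv_cache_released` are hypothetical helpers, not part of vLLM.

```python
import re
from typing import Iterable, NamedTuple

# Matches the periodic stats lines emitted by loggers.py in the log above.
STATS_RE = re.compile(
    r"INFO (?P<ts>\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)


class EngineStats(NamedTuple):
    timestamp: str
    gen_tokens_per_s: float
    running: int
    waiting: int
    kv_cache_pct: float


def parse_stats(lines: Iterable[str]) -> list[EngineStats]:
    """Extract engine stats; lines that do not match (e.g. HTTP access logs) are skipped."""
    stats = []
    for line in lines:
        m = STATS_RE.search(line)
        if m:
            stats.append(EngineStats(
                m["ts"], float(m["gen"]), int(m["running"]),
                int(m["waiting"]), float(m["kv"]),
            ))
    return stats


def kv_cache_released(stats: list[EngineStats]) -> bool:
    """True if every idle sample (no running or waiting requests) shows 0% KV usage."""
    idle = [s for s in stats if s.running == 0 and s.waiting == 0]
    return bool(idle) and all(s.kv_cache_pct == 0.0 for s in idle)


# Sample lines copied from the log excerpt in this issue.
SAMPLE = [
    'qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:41:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 43.6%, Prefix cache hit rate: 0.0%',
    'qwen3.5-27b  | (APIServer pid=1) INFO:     172.21.0.1:56598 - "GET /v1/models HTTP/1.1" 200 OK',
    'qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:43:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 35.9%, Prefix cache hit rate: 0.0%',
    'qwen3.5-27b  | (APIServer pid=1) INFO 03-10 02:43:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%',
]

stats = parse_stats(SAMPLE)
peak_kv_pct = max(s.kv_cache_pct for s in stats)
```

Feeding the full container log into `parse_stats` gives a time series that can be checked for a rising idle baseline, which would be the sign of cache usage that is not released.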

Extra Tips

  • Regularly monitor the system resources and adjust the configuration as needed to prevent similar issues.
  • Consider implementing a more robust cache eviction policy, such as a least recently used (LRU) policy.
