vllm - 💡(How to fix) Fix [Bug]: SimpleCPUOffloadConnector: requests stuck in Waiting (Running=0) under sustained load / CPU offload

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

Willing to capture full EngineCore + Worker traceback if this escalates to a hang vs. intentional throttling.

Code Example

Your output of `python collect_env.py` here
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

Summary When serving Qwen3_5-35B-A3B with SimpleCPUOffloadConnector, hybrid KV cache manager enabled, prefix caching, and chunked prefill, the engine initially processes concurrent /v1/chat/completions successfully, then transitions to a state where Running: 0 and Waiting: N (e.g. 50) with zero prompt/generation throughput, while HTTP requests may still return 200 OK. GPU KV cache usage drops to 0%. This appears correlated with CPU KV offload / cache pressure (large cpu_bytes_to_use).

Environment vLLM: 0.20.2rc1.dev129+g1acd67a79 (or your exact commit / wheel) Model: Qwen3_5-35B-A3B (local path), --language-model-only Hardware: 2× GPU (e.g. H20), tensor_parallel_size=2 OS / CUDA / PyTorch: (fill in) Configuration (minimal repro sketch) Approximate CLI (align with your deployment):

vllm serve <Qwen3_5-35B-A3B>
--tensor-parallel-size 2
--language-model-only
--enable-prefix-caching
--gpu-memory-utilization 0.85
--kv-transfer-config '{"kv_connector":"SimpleCPUOffloadConnector","kv_role":"kv_both","kv_connector_extra_config":{"cpu_bytes_to_use":137438953472}}'
--no-disable-hybrid-kv-cache-manager
--block-size 16
--max-model-len 128000
--enable-chunked-prefill
--max-num-batched-tokens 16384
... Workload: sustained high concurrency (e.g. 50 parallel clients) hitting POST /v1/chat/completions until GPU prefix cache and/or CPU offload path is stressed.

Observed behavior For a while, engine metrics look healthy, e.g. Running: 49 reqs, non-zero prompt/generation throughput, GPU KV cache usage ~6%, Prefix cache hit rate non-trivial (e.g. ~40%). Then metrics flip to something like: Running: 0 reqs, Waiting: 50 reqs, Avg prompt throughput: 0, Avg generation throughput: 0, GPU KV cache usage: 0.0%, prefix / external prefix hit rates near 0%. Uvicorn may still log many 200 OK for POST /v1/chat/completions around the same window, while the periodic engine line shows no running work — from a user perspective this looks like stall / no forward progress despite accepted connections. Example log snippets:

Engine 000: ... Running: 49 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.3%, Prefix cache hit rate: 40.0%, External prefix cache hit rate: 0.0% ... Engine 000: ... Running: 0 reqs, Waiting: 50 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.3%, ... Engine 000: ... Running: 0 reqs, Waiting: 50 reqs, ... Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s Expected behavior Under CPU offload and prefix caching, the scheduler should continue to admit and run requests within capacity (or fail fast with explicit errors / HTTP 5xx / queue depth signals), rather than leaving a full batch Waiting with Running=0 and zero throughput for extended periods.

Hypothesis (for maintainers) Possible interaction between SimpleCPUOffloadConnector, hybrid KV cache manager, and scheduler / block pool under pressure: e.g. deadlock, missed wake-up, or overly aggressive backpressure so no request is marked runnable while the wait queue stays saturated.

Additional context External prefix cache hit rate: 0.0% throughout — clarify if unrelated. Willing to capture full EngineCore + Worker traceback if this escalates to a hang vs. intentional throttling.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: SimpleCPUOffloadConnector: requests stuck in Waiting (Running=0) under sustained load / CPU offload