vllm - 💡(How to fix) Fix [Bug]: `_check_enough_kv_cache_memory` does not account for TP/PP sharding, making KV offloading impossible in multi-GPU distributed deployments [1 pull requests]

Fix Action

Fixed

Your current environment

<details> <summary>When KV offloading is enabled (`--kv_offloading_backend native` or `lmcache`) in a multi-GPU deployment with Tensor Parallelism (TP) and/or Pipeline Parallelism (PP), `_check_enough_kv_cache_memory` incorrectly compares the **per-GPU** available KV cache memory against the **total undistributed** KV cache requirement for `max_model_len`. This causes a spurious `ValueError` that prevents the engine from starting, even though the aggregated KV capacity across all GPUs is sufficient to serve the requested context length.</summary>

Environment

  • vLLM version: v0.21.0
  • Model: DeepSeek-V4-Pro
  • Hardware: 2 nodes × 8 × H100 80GB (16 GPUs total)
  • Parallelism: --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2
  • Docker image: vllm/vllm-openai:v0.21.0
</details>

🐛 Describe the bug

Steps to Reproduce

Launch vLLM with KV offloading enabled on a multi-node, multi-GPU setup:

# Node 0
docker run --rm --gpus all --network host \
  vllm/vllm-openai:v0.21.0 /model/DeepSeek-V4-Pro \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --nnodes 2 \
  --node-rank 0 \
  --master-addr "192.168.180.132" \
  --max-model-len 1000000 \
  --gpu-memory-utilization 0.95 \
  --kv_offloading_backend native \
  --kv_offloading_size 200 \
  --disable-hybrid-kv-cache-manager \
  ...

# Node 1
docker run --rm --gpus all --network host \
  vllm/vllm-openai:v0.21.0 /model/DeepSeek-V4-Pro \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --nnodes 2 \
  --node-rank 1 \
  --master-addr "192.168.180.132" \
  --headless \
  --max-model-len 1000000 \
  --gpu-memory-utilization 0.95 \
  --kv_offloading_backend native \
  --kv_offloading_size 200 \
  --disable-hybrid-kv-cache-manager \
  ...

Actual Behavior

The engine fails immediately at startup with:

(EngineCore pid=401) ERROR [core.py:1140] ValueError: To serve at least one request with the models's max seq len (1000000), (223.68 GiB KV cache is needed, which is larger than the available KV cache memory (15.57 GiB). Based on the available memory, the estimated maximum model length is 69376. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.


Expected Behavior

The engine should start successfully. The check should account for KV cache distribution across all GPUs in the TP and PP groups.

Proof that the hardware is capable: The identical deployment without KV offloading flags starts successfully and allocates ~4 million KV cache tokens, confirming that vLLM's normal distributed KV allocation works correctly across all 16 GPUs. The only change that triggers the failure is adding --kv_offloading_backend.


Root Cause Analysis

The bug is in vllm/v1/core/kv_cache_utils.py, function _check_enough_kv_cache_memory (around line 714).

The check compares:

  • Available: per-GPU KV cache memory reported by a single worker → 15.57 GiB
  • Required: total KV cache for max_model_len across all model layers → 223.68 GiB

This is incorrect for distributed deployments. With tp=8 and pp=2, the KV cache for any single request is sharded across all 16 GPUs:

  Total available KV memory = 15.57 GiB × 16 GPUs ≈ 249.1 GiB
  Required for max_model_len = 223.68 GiB
  249.1 GiB > 223.68 GiB → should PASS, but the check FAILS

The correct comparison should factor in the TP and PP group sizes so that the effective per-GPU requirement is:

  required_per_gpu = total_required / (tp_size × pp_size) = 223.68 GiB / (8 × 2) = 13.98 GiB
  13.98 GiB < 15.57 GiB → PASS ✓

The non-offloading code path does not perform this check and correctly uses all 16 GPUs' combined capacity, which is why the baseline deployment works.
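
The arithmetic in this analysis can be reproduced with a few lines of Python. This is a minimal sketch that uses only the figures quoted in this report. The variable names are illustrative, and it assumes, as the analysis does, that the KV cache for a request is split evenly across all tp × pp GPUs.

```python
# Figures quoted in the report, in GiB.
available_per_gpu_gib = 15.57
required_total_gib = 223.68
tp_size, pp_size = 8, 2

num_gpus = tp_size * pp_size

# Assumption: the KV cache is sharded evenly across every GPU
# in the TP and PP groups.
aggregate_available_gib = available_per_gpu_gib * num_gpus
required_per_gpu_gib = required_total_gib / num_gpus

# The comparison described as the current behaviour: per-GPU vs. total.
current_check_passes = available_per_gpu_gib >= required_total_gib
# The comparison proposed in this report: per-GPU vs. per-GPU share.
shard_aware_check_passes = available_per_gpu_gib >= required_per_gpu_gib

print(f"aggregate available: {aggregate_available_gib:.2f} GiB")
print(f"required per GPU:    {required_per_gpu_gib:.2f} GiB")
print(f"current check passes:     {current_check_passes}")
print(f"shard-aware check passes: {shard_aware_check_passes}")
```

If the even-sharding assumption does not hold for a given model or attention layout, the per-GPU share would need to be computed differently.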


Suggested Fix

In _check_enough_kv_cache_memory, divide the required KV memory by tensor_parallel_size × pipeline_parallel_size before comparing against the per-GPU available memory, or equivalently, multiply the per-GPU available memory by the total number of GPUs in the KV-sharing group.

# Current (incorrect for TP/PP > 1):
if available_kv_cache_memory < required_kv_cache_memory:
    raise ValueError(...)

# Suggested fix:
num_kv_shards = tensor_parallel_size * pipeline_parallel_size
if available_kv_cache_memory * num_kv_shards < required_kv_cache_memory:
    raise ValueError(...)
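
For readers who want to try the proposed comparison in isolation, here is a self-contained sketch. The function name, its parameters, and the error text are hypothetical stand-ins written for this example. They are not vLLM's actual signature or message, and the sketch again assumes the KV cache is sharded evenly across tp × pp GPUs.

```python
GIB = 1 << 30


def check_enough_kv_cache_memory(
    available_per_gpu_bytes: int,
    required_total_bytes: int,
    tensor_parallel_size: int = 1,
    pipeline_parallel_size: int = 1,
) -> None:
    """Raise ValueError if the aggregate KV capacity cannot hold one
    max-length request.

    Hypothetical stand-in for the check discussed in this issue.
    """
    if tensor_parallel_size < 1 or pipeline_parallel_size < 1:
        raise ValueError("parallel sizes must be >= 1")

    # Assumption: every GPU in the TP and PP groups holds an equal KV shard.
    num_kv_shards = tensor_parallel_size * pipeline_parallel_size
    aggregate_available = available_per_gpu_bytes * num_kv_shards

    if aggregate_available < required_total_bytes:
        raise ValueError(
            f"{required_total_bytes / GIB:.2f} GiB KV cache is needed, but "
            f"only {aggregate_available / GIB:.2f} GiB is available across "
            f"{num_kv_shards} GPU(s)."
        )


available = int(15.57 * GIB)
required = int(223.68 * GIB)

# Treating one GPU as the whole pool reproduces the reported failure.
try:
    check_enough_kv_cache_memory(available, required)
    single_gpu_ok = True
except ValueError:
    single_gpu_ok = False

# With tp=8 and pp=2 the same numbers pass without raising.
check_enough_kv_cache_memory(available, required, 8, 2)
print("single GPU passes:", single_gpu_ok)
```

Multiplying the available side, rather than dividing the required side, keeps the comparison in integer bytes and avoids rounding the per-GPU share.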

Impact

This bug makes KV offloading completely unusable for any multi-GPU distributed deployment with TP > 1 or PP > 1 and a large max_model_len. Since KV offloading is specifically motivated by scenarios where GPU memory is limited relative to context length, this affects the most common real-world use cases for the feature.


Additional Context

  • Without any --kv_offloading_backend flag, the same setup serves --max-model-len 1000000 with ~4 million KV cache tokens successfully.
  • The failure occurs for both --kv_offloading_backend native and --kv_offloading_backend lmcache.
  • The --disable-hybrid-kv-cache-manager flag is required by the offloading feature and does not affect whether this bug triggers.
  • Log confirming available per-GPU KV memory: (Worker_PP0_TP0_EP0 pid=600) INFO [gpu_worker.py:462] Available KV cache memory: 15.57 GiB
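
To confirm that every worker reports a similar figure, the per-worker values can be pulled out of the logs. The sketch below is an illustration only: the regular expression is modeled on the single log line quoted above, other vLLM versions may format the line differently, and the helper name is made up for this example.

```python
import re

# Pattern modeled on the one log line quoted in this report.
LOG_PATTERN = re.compile(
    r"\((?P<worker>Worker_\S+) pid=\d+\).*Available KV cache memory: "
    r"(?P<gib>\d+(?:\.\d+)?) GiB"
)


def kv_memory_by_worker(log_text: str) -> dict[str, float]:
    """Map each worker name to the available KV cache memory (GiB) it logged."""
    return {m["worker"]: float(m["gib"]) for m in LOG_PATTERN.finditer(log_text)}


sample_log = (
    "(Worker_PP0_TP0_EP0 pid=600) INFO [gpu_worker.py:462] "
    "Available KV cache memory: 15.57 GiB\n"
)
per_worker = kv_memory_by_worker(sample_log)
print(per_worker)
```

Summing the values across all workers gives the aggregate capacity that the expected behavior above relies on.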
