vllm - 💡(How to fix) Fix [Bug]: `_check_enough_kv_cache_memory` does not account for TP/PP sharding, making KV offloading impossible in multi-GPU distributed deployments [1 pull requests]

Root Cause

The bug is in vllm/v1/core/kv_cache_utils.py, function _check_enough_kv_cache_memory (around line 714).

The check compares:

Available: per-GPU KV cache memory reported by a single worker → 15.57 GiB
Required: total KV cache for max_model_len across all model layers → 223.68 GiB

This is incorrect for distributed deployments. With tp=8 and pp=2, the KV cache for any single request is sharded across all 16 GPUs:

Total available KV memory = 15.57 GiB × 16 GPUs ≈ 249.1 GiB
Required for max_model_len = 223.68 GiB
249.1 GiB > 223.68 GiB → should PASS, but the check FAILS

The correct comparison should factor in the TP and PP group sizes so that the effective per-GPU requirement is:

required_per_gpu = total_required / (tp_size × pp_size) = 223.68 GiB / (8 × 2) = 13.98 GiB
13.98 GiB < 15.57 GiB → PASS ✓

The non-offloading code path does not perform this check and correctly uses all 16 GPUs' combined capacity, which is why the baseline deployment works.
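
The arithmetic above can be checked with a short sketch. The figures are the ones from this report; the helper name `required_per_gpu_gib` is hypothetical, and the even split across TP × PP ranks is the assumption made in this analysis, not something confirmed against vLLM's code.

```python
# Figures from the report (GiB). The helper below is hypothetical, for illustration only.
AVAILABLE_PER_GPU_GIB = 15.57
REQUIRED_TOTAL_GIB = 223.68
TP_SIZE = 8
PP_SIZE = 2


def required_per_gpu_gib(total_required_gib: float, tp_size: int, pp_size: int) -> float:
    """Per-GPU share, ASSUMING the KV cache is split evenly across TP x PP ranks."""
    return total_required_gib / (tp_size * pp_size)


num_gpus = TP_SIZE * PP_SIZE
aggregate_available_gib = AVAILABLE_PER_GPU_GIB * num_gpus
per_gpu_required_gib = required_per_gpu_gib(REQUIRED_TOTAL_GIB, TP_SIZE, PP_SIZE)

# Comparison as described in the report: per-GPU available vs. undivided total.
naive_check_passes = AVAILABLE_PER_GPU_GIB >= REQUIRED_TOTAL_GIB
# Comparison proposed in the report: per-GPU available vs. per-GPU share.
sharded_check_passes = AVAILABLE_PER_GPU_GIB >= per_gpu_required_gib

print(f"GPUs: {num_gpus}")
print(f"aggregate available: {aggregate_available_gib:.2f} GiB")
print(f"required per GPU:    {per_gpu_required_gib:.2f} GiB")
print(f"naive check passes:   {naive_check_passes}")
print(f"sharded check passes: {sharded_check_passes}")
```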

Your current environment

<details> <summary>When KV offloading is enabled (`--kv_offloading_backend native` or `lmcache`) in a multi-GPU deployment with Tensor Parallelism (TP) and/or Pipeline Parallelism (PP), `_check_enough_kv_cache_memory` incorrectly compares the **per-GPU** available KV cache memory against the **total undistributed** KV cache requirement for `max_model_len`. This causes a spurious `ValueError` that prevents the engine from starting, even though the aggregated KV capacity across all GPUs is sufficient to serve the requested context length.</summary>

Environment

vLLM version: v0.21.0
Model: DeepSeek-V4-Pro
Hardware: 2 nodes × 8 × H100 80GB (16 GPUs total)
Parallelism: --tensor-parallel-size 8 --pipeline-parallel-size 2 --nnodes 2
Docker image: vllm/vllm-openai:v0.21.0

</details>

🐛 Describe the bug

Steps to Reproduce

Launch vLLM with KV offloading enabled on a multi-node, multi-GPU setup:

# Node 0
docker run --rm --gpus all --network host \
  vllm/vllm-openai:v0.21.0 /model/DeepSeek-V4-Pro \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --nnodes 2 \
  --node-rank 0 \
  --master-addr "192.168.180.132" \
  --max-model-len 1000000 \
  --gpu-memory-utilization 0.95 \
  --kv_offloading_backend native \
  --kv_offloading_size 200 \
  --disable-hybrid-kv-cache-manager \
  ...

# Node 1
docker run --rm --gpus all --network host \
  vllm/vllm-openai:v0.21.0 /model/DeepSeek-V4-Pro \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --nnodes 2 \
  --node-rank 1 \
  --master-addr "192.168.180.132" \
  --headless \
  --max-model-len 1000000 \
  --gpu-memory-utilization 0.95 \
  --kv_offloading_backend native \
  --kv_offloading_size 200 \
  --disable-hybrid-kv-cache-manager \
  ...

Actual Behavior

The engine fails immediately at startup with:

(EngineCore pid=401) ERROR [core.py:1140] ValueError: To serve at least one request with the models's max seq len (1000000), (223.68 GiB KV cache is needed, which is larger than the available KV cache memory (15.57 GiB). Based on the available memory, the estimated maximum model length is 69376. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
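
For reference, the reported limit is roughly what a linear scaling of the numbers in the error message gives. The sketch below assumes KV memory grows proportionally with sequence length and ignores block-size rounding, so it lands near the reported 69376 rather than exactly on it. The function name is hypothetical, and the aggregated figure only holds if the KV cache really is split evenly across all 16 GPUs.

```python
def estimated_max_tokens(available_gib: float, required_gib: float, max_model_len: int) -> int:
    """Linear estimate; ASSUMES KV memory is proportional to sequence length."""
    return int(max_model_len * available_gib / required_gib)


MAX_MODEL_LEN = 1_000_000
REQUIRED_GIB = 223.68      # from the error message
AVAILABLE_GIB = 15.57      # per-GPU figure from the error message
NUM_GPUS = 8 * 2           # tp=8, pp=2

# Using only one GPU's memory, as the failing check appears to do.
per_gpu_estimate = estimated_max_tokens(AVAILABLE_GIB, REQUIRED_GIB, MAX_MODEL_LEN)

# Using the combined memory of all GPUs, as the report argues it should.
aggregate_estimate = estimated_max_tokens(AVAILABLE_GIB * NUM_GPUS, REQUIRED_GIB, MAX_MODEL_LEN)

print(f"per-GPU estimate:   {per_gpu_estimate} tokens")
print(f"aggregate estimate: {aggregate_estimate} tokens")
```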

Expected Behavior

The engine should start successfully. The check should account for KV cache distribution across all GPUs in the TP and PP groups.

Proof that the hardware is capable: The identical deployment without KV offloading flags starts successfully and allocates ~4 million KV cache tokens, confirming that vLLM's normal distributed KV allocation works correctly across all 16 GPUs. The only change that triggers the failure is adding --kv_offloading_backend.

Root Cause Analysis

The bug is in vllm/v1/core/kv_cache_utils.py, function _check_enough_kv_cache_memory (around line 714).

The check compares:

Available: per-GPU KV cache memory reported by a single worker → 15.57 GiB
Required: total KV cache for max_model_len across all model layers → 223.68 GiB

The non-offloading code path does not perform this check and correctly uses all 16 GPUs' combined capacity, which is why the baseline deployment works.

Suggested Fix

In _check_enough_kv_cache_memory, divide the required KV memory by tensor_parallel_size × pipeline_parallel_size before comparing against the per-GPU available memory, or equivalently, multiply the per-GPU available memory by the total number of GPUs in the KV-sharing group.

# Current (incorrect for TP/PP > 1):
if available_kv_cache_memory < required_kv_cache_memory:
    raise ValueError(...)

# Suggested fix:
num_kv_shards = tensor_parallel_size * pipeline_parallel_size
if available_kv_cache_memory * num_kv_shards < required_kv_cache_memory:
    raise ValueError(...)
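
As a self-contained illustration of the suggested fix, here is a minimal sketch of a sharding-aware check. It is not vLLM's actual function: the name, signature, and error text are hypothetical, and it assumes the required figure is the undivided total while the available figure is per GPU, as described in this report.

```python
GIB = 1024 ** 3


def check_enough_kv_cache_memory(
    available_per_gpu_bytes: int,
    required_total_bytes: int,
    tensor_parallel_size: int = 1,
    pipeline_parallel_size: int = 1,
) -> None:
    """Hypothetical stand-in for the check, scaled by the number of KV shards."""
    # ASSUMPTION: every TP x PP rank holds an equal share of the KV cache.
    num_kv_shards = tensor_parallel_size * pipeline_parallel_size
    aggregate_available = available_per_gpu_bytes * num_kv_shards
    if aggregate_available < required_total_bytes:
        raise ValueError(
            f"{required_total_bytes / GIB:.2f} GiB KV cache is needed, but only "
            f"{aggregate_available / GIB:.2f} GiB is available across "
            f"{num_kv_shards} GPU(s)."
        )


available = int(15.57 * GIB)   # per-GPU figure from the report
required = int(223.68 * GIB)   # total figure from the report

# With tp=8, pp=2 the aggregated capacity is sufficient, so no error is raised.
check_enough_kv_cache_memory(available, required, 8, 2)

# With the default of a single shard, the same numbers reproduce the failure.
try:
    check_enough_kv_cache_memory(available, required)
    single_gpu_failed = False
except ValueError as exc:
    single_gpu_failed = True
    print(f"single-GPU check failed as expected: {exc}")
```

Defaulting both sizes to 1 keeps the single-GPU behavior unchanged, so the scaling only takes effect in distributed deployments.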

Impact

This bug makes KV offloading completely unusable for any multi-GPU distributed deployment with TP > 1 or PP > 1 and a large max_model_len. Since KV offloading is specifically motivated by scenarios where GPU memory is limited relative to context length, this affects the most common real-world use cases for the feature.

Additional Context

Without any --kv_offloading_backend flag, the same setup serves --max-model-len 1000000 with ~4 million KV cache tokens successfully.
The failure occurs for both --kv_offloading_backend native and --kv_offloading_backend lmcache.
The --disable-hybrid-kv-cache-manager flag is required by the offloading feature and does not affect whether this bug triggers.
Log confirming available per-GPU KV memory: (Worker_PP0_TP0_EP0 pid=600) INFO [gpu_worker.py:462] Available KV cache memory: 15.57 GiB
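
To compare the figure across workers, the per-GPU value can be pulled out of the logs with a small parser. This is a sketch that assumes every worker emits a line in the same shape as the one quoted above; only that single line is confirmed by this report, and the function name is hypothetical.

```python
import re

# Matches lines shaped like the one quoted in this report, e.g.
# "(Worker_PP0_TP0_EP0 pid=600) INFO [gpu_worker.py:462] Available KV cache memory: 15.57 GiB"
LOG_PATTERN = re.compile(
    r"\((?P<worker>Worker_\w+) pid=\d+\).*"
    r"Available KV cache memory: (?P<gib>\d+(?:\.\d+)?) GiB"
)


def parse_available_kv_gib(lines):
    """Return {worker_name: available_gib} for every matching log line."""
    result = {}
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            result[match.group("worker")] = float(match.group("gib"))
    return result


sample_log = [
    "(Worker_PP0_TP0_EP0 pid=600) INFO [gpu_worker.py:462] "
    "Available KV cache memory: 15.57 GiB",
    "an unrelated log line that should be ignored",
]
per_worker = parse_available_kv_gib(sample_log)
print(per_worker)
print(f"sum across parsed workers: {sum(per_worker.values()):.2f} GiB")
```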
