vllm - ✅(Solved) Fix [Bug]: [xPyD]Potential OOM when using v1 P2pNcclConnector as KV cache transport: KV cache accumulation on decode instance. [1 pull requests, 1 comments, 2 participants]

GitHub stats
vllm-project/vllm#38472 · Fetched 2026-04-08 01:49:09
Comments: 1 · Participants: 2 · Timeline: 7 · Reactions: 0
Timeline (top): referenced ×4, commented ×1, cross-referenced ×1, labeled ×1

Fix Action

Fixed

PR fix notes

PR #38475: fix(p2p_nccl): free KV recv_store entries immediately to prevent OOM (#38472)

Description (problem / solution / changelog)

Summary

Fixes #38472.

On the decode (consumer) instance each received KV-cache tensor was kept in recv_store until the request finished generation. Under high QPS with long outputs this meant every in-flight request held its full KV payload (all layers, VRAM or pinned TensorMemoryPool RAM) simultaneously, filling memory and causing OOM.

Root Causes

1. recv_tensor read without popping

# Before
tensor = self.recv_store[tensor_id]   # entry stays until get_finished()
# After
tensor = self.recv_store.pop(tensor_id)  # freed as soon as injected into KV blocks

2. pool.free(addr) deferred to get_finished

For pool-backed (pinned-RAM) entries, the address was only freed at request completion. Fix: pool.free(addr) is now called immediately after pool.load_tensor(), inside a try/finally so that the address is released even if loading raises.
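The first two root causes can be illustrated with a minimal sketch. ToyPool and ToyEngine below are hypothetical stand-ins, not the vLLM classes. Only the names recv_store, buffer_size, pool.load_tensor and pool.free come from the PR description; their real signatures and the ("pool", addr) entry shape are assumptions made for illustration.

```python
class ToyPool:
    """Hypothetical stand-in for the pinned-RAM tensor memory pool."""

    def __init__(self):
        self.blocks = {}  # addr -> payload

    def store(self, addr, payload):
        self.blocks[addr] = payload

    def load_tensor(self, addr):
        return self.blocks[addr]  # raises KeyError for an unknown address

    def free(self, addr):
        self.blocks.pop(addr, None)


class ToyEngine:
    """Hypothetical receive side: entries are raw payloads or ("pool", addr)."""

    def __init__(self):
        self.pool = ToyPool()
        self.recv_store = {}
        self.buffer_size = 0  # bytes held by non-pool entries

    def put(self, tensor_id, payload):
        self.recv_store[tensor_id] = payload
        self.buffer_size += len(payload)

    def put_in_pool(self, tensor_id, addr, payload):
        self.pool.store(addr, payload)
        self.recv_store[tensor_id] = ("pool", addr)

    def recv_tensor(self, tensor_id):
        # pop, not index: the entry leaves recv_store as soon as it is consumed
        entry = self.recv_store.pop(tensor_id, None)
        if entry is None:
            return None
        if isinstance(entry, tuple):
            _, addr = entry
            try:
                return self.pool.load_tensor(addr)
            finally:
                # runs on success and on exception, so the address never leaks
                self.pool.free(addr)
        self.buffer_size -= len(entry)
        return entry
```

With this shape, memory held by the receive store is bounded by what has arrived but not yet been injected, rather than by the number of requests still generating.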

3. get_finished straggler cleanup iterated no_compile_layers

The old loop iterated all layer names from no_compile_layers and checked if tensor_id in recv_store, which (a) was slow, (b) missed entries if no_compile_layers was empty, and (c) never decremented buffer_size for non-pool straggler tensors. Replaced with recv_request_id_to_tensor_ids (already tracking exactly which tensors were received per request). The unused no_compile_layers parameter is removed from the engine method and its call-site.
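A rough sketch of the tracking-dict approach follows. The function name cleanup_finished, the free_from_pool callback and the ("pool", addr) entry shape are hypothetical; only recv_store, recv_request_id_to_tensor_ids and the buffer_size accounting are named in the PR description.

```python
def cleanup_finished(finished_req_ids, recv_store,
                     recv_request_id_to_tensor_ids, free_from_pool):
    """Drop leftover entries for finished requests.

    Returns the number of bytes released from non-pool entries, which the
    caller would subtract from its buffer_size counter.
    """
    released = 0
    for req_id in finished_req_ids:
        # Only visit tensors actually received for this request; no scan
        # over every layer name is needed.
        for tensor_id in recv_request_id_to_tensor_ids.pop(req_id, ()):
            entry = recv_store.pop(tensor_id, None)
            if entry is None:
                continue  # already consumed, nothing to free
            if isinstance(entry, tuple):
                free_from_pool(entry[1])  # pool-backed straggler
            else:
                released += len(entry)    # non-pool straggler
    return released
```

Because entries are popped before being freed, calling the function twice for the same request cannot free an address twice, and requests that are still in flight are left untouched.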

Files Changed

  • vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py
  • vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py
  • tests/v1/kv_connector/unit/test_p2p_nccl_engine_recv_store.py (new, 18 unit tests)

Test Coverage

18 unit tests cover: pop-on-recv, buffer_size accounting, pool.free timing, exception safety (try/finally), straggler cleanup, double-free prevention, isolation of in-flight requests, and tracking-dict cleanup. Tests run without CUDA/NCCL/ZMQ using object.__new__ to bypass the engine constructor.
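The constructor-bypass technique mentioned above can be sketched as follows. The Engine class here is a hypothetical placeholder whose __init__ stands in for the CUDA/NCCL/ZMQ setup; it is not the actual engine, and the real tests may set up more attributes than this.

```python
class Engine:
    """Hypothetical engine whose constructor needs unavailable hardware."""

    def __init__(self):
        raise RuntimeError("needs CUDA/NCCL/ZMQ")  # stands in for heavy setup

    def recv_tensor(self, tensor_id):
        return self.recv_store.pop(tensor_id, None)


def make_bare_engine():
    # object.__new__ allocates the instance without running __init__,
    # so only the attributes the method under test needs are set by hand.
    engine = object.__new__(Engine)
    engine.recv_store = {}
    return engine
```

A test can then populate recv_store directly and assert on the method's behavior without any GPU or network dependency.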

Bugs confirmed reproduced against original code before fix:

  • CONFIRMED BUG 1: entry still in recv_store after recv_tensor (old code)
  • CONFIRMED BUG 2: pool.free NOT called immediately in old recv_tensor

Why not a duplicate

No open PR addresses this. The issue was opened a few hours ago.

AI Assistance

This PR was developed with GitHub Copilot assistance. Every changed line has been reviewed and the fix logic verified manually.

Changed files

  • tests/v1/kv_connector/unit/test_p2p_nccl_engine_recv_store.py (added, +534/-0)
  • vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py (modified, +1/-2)
  • vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_engine.py (modified, +44/-13)


Your current environment

My environment: GPU: A800 80GB × 2; vllm: 0.11.0; model: meta-llama/Meta-Llama-3-8B-Instruct; P/D: 1P1D

Prefill Instance

CUDA_VISIBLE_DEVICES=0 VLLM_USE_V1=1 vllm serve $MODEL \
        --enforce-eager \
        --host 127.0.0.1 \
        --port 20003 \
        --tensor-parallel-size 1 \
        --seed 1024 \
        --dtype float16 \
        --max-model-len 8192 \
        --max-num-batched-tokens 32768 \
        --max-num-seqs 1024 \
        --trust-remote-code \
        --gpu-memory-utilization 0.8 \
        --kv-transfer-config \
        "{\"kv_connector\":\"P2pNcclConnector\",\"kv_role\":\"kv_producer\",\"kv_buffer_size\":\"1e1\",\"kv_port\":\"21001\",\"kv_connector_extra_config\":{\"proxy_ip\":\"127.0.0.1\",\"proxy_port\":\"30001\",\"http_port\":\"20003\",\"send_type\":\"PUT_ASYNC\",\"nccl_num_channels\":\"16\",\"mem_pool_size_gb\":\"16\"}}"

Decode Instance

CUDA_VISIBLE_DEVICES=1 VLLM_USE_V1=1 vllm serve $MODEL \
        --enforce-eager \
        --host 127.0.0.1 \
        --port 20005 \
        --tensor-parallel-size 1 \
        --seed 1024 \
        --dtype float16 \
        --max-model-len 8192 \
        --max-num-batched-tokens 10000 \
        --max-num-seqs 1024 \
        --trust-remote-code \
        --gpu-memory-utilization 0.8 \
        --kv-transfer-config \
        "{\"kv_connector\":\"P2pNcclConnector\",\"kv_role\":\"kv_consumer\",\"kv_buffer_size\":\"8e9\",\"kv_port\":\"22001\",\"kv_connector_extra_config\":{\"proxy_ip\":\"127.0.0.1\",\"proxy_port\":\"30001\",\"http_port\":\"20005\",\"send_type\":\"PUT_ASYNC\",\"nccl_num_channels\":\"16\",\"mem_pool_size_gb\":\"16\"}}"

Benchmark

vllm bench serve --host 127.0.0.1 --port 10001 --seed 42 \
        --model $MODEL \
        --backend openai-chat \
        --endpoint /v1/chat/completions \
        --dataset-name random --random-input-len 4000 --random-output-len 4000 \
        --num-prompts 1000 --burstiness 1 --request-rate 4 --ignore-eos

Proxy

python3 disagg_proxy_p2p_nccl_xpyd.py

🐛 Describe the bug

When using P2pNcclConnector as the KV cache transport connector (in PUT_ASYNC mode) with disagg_proxy_p2p_nccl_xpyd.py as the proxy, I monitored the GPU memory usage of the decode instance with nvitop. Memory usage was approximately 81% after vLLM initialization, but it rose continuously once the benchmark started until an OOM (out of memory) error occurred, causing the entire test run to fail.

Logs indicate that the kv_buffer and tensor_memory_pool on the decode instance became completely saturated with requested KV caches, causing storage failures for newly requested KV caches.

prefill log

<img width="1106" height="312" alt="Image" src="https://github.com/user-attachments/assets/95c48516-e3cd-4ebf-b60d-8711c5edb010" />

decode log

<img width="1046" height="348" alt="Image" src="https://github.com/user-attachments/assets/08d9c770-194e-48e3-82fc-fae4999357a2" /> <img width="1007" height="830" alt="Image" src="https://github.com/user-attachments/assets/6619f26c-0a42-4428-8140-dd809ec5d2c0" />

Based on my analysis of the source code, the KV cache is only deleted from the kv_buffer or tensor_memory_pool after the request has finished execution. Under scenarios with high QPS and long output sequences, the KV cache sent from prefill instances remains in the decode instance's buffers for an extended period, significantly increasing the risk of OOM on the decode instance.
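A back-of-the-envelope estimate shows how quickly this can add up for the setup above. The model dimensions used here (32 layers, 8 KV heads, head dimension 128) are taken from the published Meta-Llama-3-8B config and should be treated as assumptions, as should the simplification that each request's full prefill KV payload sits in the buffer until the request completes.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def requests_that_fit(buffer_bytes, prompt_tokens, bytes_per_token):
    """How many whole prefill payloads fit in a buffer of the given size."""
    return int(buffer_bytes // (prompt_tokens * bytes_per_token))


# Assumed Llama-3-8B dimensions, float16 (2 bytes per element)
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
per_request = 4000 * per_token           # --random-input-len 4000
fit = requests_that_fit(8e9, 4000, per_token)  # kv_buffer_size 8e9
print(f"{per_token} B/token, {per_request / 1e6:.0f} MB/request, "
      f"{fit} requests fill the buffer")
```

Under these assumptions each request carries roughly half a gigabyte of KV data, so at a request rate of 4 per second with 4000-token outputs the 8e9-byte kv_buffer would be exhausted within a few seconds if nothing is released before requests finish.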

Furthermore, when the kv_buffer and tensor_memory_pool are full, newly arriving KV cache requests are evicted on the decode side and encounter errors. There appears to be no retry mechanism currently in place to handle these failures.


extent analysis

Fix Plan

To address the OOM issue caused by the KV cache accumulation on the decode instance, we need to implement a mechanism to limit the KV cache size and handle evictions more efficiently. Here are the steps:

  • Implement a retry mechanism: Add a retry mechanism for KV cache requests that encounter errors due to full buffers. This can be achieved by modifying the disagg_proxy_p2p_nccl_xpyd.py script to retry failed requests after a short delay.
  • Limit KV cache size: Introduce a maximum size limit for the KV cache on the decode instance. When the limit is reached, evict the oldest KV cache entries to make room for new ones.
  • Optimize KV cache eviction: Modify the KV cache eviction policy to prioritize evicting entries that are least recently used (LRU) or have been idle for an extended period.

Example code snippet to implement a retry mechanism in disagg_proxy_p2p_nccl_xpyd.py:

import time

# ...

def send_kv_cache_request(kv_cache, send, max_retries=3, retry_delay=0.1):
    """Call send(kv_cache), retrying on failure.

    `send` is a placeholder for whatever transport call the proxy makes;
    it is not an existing function in disagg_proxy_p2p_nccl_xpyd.py.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return send(kv_cache)
        except Exception as exc:
            if attempt == max_retries:
                print(f"KV cache request failed after {max_retries} attempts ({exc!r}). Giving up.")
                raise
            print(f"KV cache request failed (attempt {attempt}/{max_retries}): {exc!r}. "
                  f"Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)

# ...

Example code snippet to implement a simple LRU cache eviction policy:

from collections import OrderedDict

class LRUCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            value = self.cache.pop(key)
            self.cache[key] = value  # Move to end to mark as recently used
            return value
        return None

    def set(self, key, value):
        if key in self.cache:
            self.cache.pop(key)
        elif len(self.cache) >= self.max_size:
            self.cache.popitem(last=False)  # Evict oldest entry
        self.cache[key] = value

# ...

Verification

To verify that the fix worked, monitor the GPU memory usage of the decode instance during testing and check for any OOM errors. Additionally, verify that the retry mechanism is working correctly by intentionally introducing failures and checking that the requests are retried successfully.
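One way to automate the memory check is to sample nvidia-smi periodically and look for sustained growth. This is only a sketch: the helper names and the growth heuristic are hypothetical, and it assumes nvidia-smi is on the PATH of the decode host.

```python
import subprocess

# Query used/total memory in MiB, one CSV line per GPU, no header or units
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Return [(used_mib, total_mib), ...], one tuple per GPU line."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def sample_gpu_memory():
    """Run nvidia-smi once and return the parsed rows."""
    out = subprocess.run(QUERY, capture_output=True, text=True,
                         check=True).stdout
    return parse_memory_csv(out)


def keeps_growing(used_samples, min_step_mib=1):
    """True if every sample exceeds the previous one by at least min_step_mib."""
    return len(used_samples) >= 2 and all(
        later - earlier >= min_step_mib
        for earlier, later in zip(used_samples, used_samples[1:])
    )
```

Collecting sample_gpu_memory() for the decode GPU every few seconds during the benchmark and passing the used values to keeps_growing gives a simple signal: before the fix the series would be expected to climb until OOM, while after the fix it should level off once the instance reaches steady state.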

Extra Tips

  • Consider implementing a more sophisticated cache eviction policy, such as a time-to-live (TTL) based policy, to further optimize KV cache management.
  • Monitor
