vllm - ✅(Solved) Fix [RFC]: Multi-tier KV offloading via the vLLM offloading connector [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#38260 · Fetched 2026-04-08 01:36:53
Comments: 0 · Participants: 1 · Timeline: 22 · Reactions: 5
Timeline (top): subscribed ×13, mentioned ×8, labeled ×1

PR fix notes

PR #40020: [kv_offload] Add multi-tier KV cache offloading framework

Description (problem / solution / changelog)

Purpose

Adds a hierarchical (tiered) KV cache offloading framework under vllm/v1/kv_offload/tiering/, extending the existing single-tier CPU offloading with support for chained secondary tiers (e.g., storage, network).

Implements the design proposed in #38260 [RFC]: Multi-tier KV offloading via the vLLM offloading connector.

Key additions:

  • SecondaryTierManager ABC (abstract.py) — interface for secondary tier backends, defining async store/load/lookup primitives and a JobResult protocol for polling completions
  • CPUPrimaryTierOffloadingManager (tiering/manager.py) — wraps CPUOffloadingManager and exposes a secondary-facing read/write alias API (prepare_read/complete_read, prepare_write/complete_write) to clarify directionality when called from the cascade/promotion paths
  • TieringOffloadingManager (tiering/manager.py) — orchestrates GPU↔CPU (primary) and CPU→secondary tier transfers:
    • Cascade on store: blocks written by GPU are fanned out to all secondary tiers
    • Staged promotion on load: blocks missing from primary are fetched from secondary → primary before the GPU can access them; lookup() returns None while promotion is in flight to signal "retry later"
    • ref_cnt protection: prepare_read() increments ref_cnt to protect blocks from eviction during async transfers
  • TieringOffloadingSpec (tiering/spec.py) — entry point for the tiered stack; a CPUOffloadingSpec subclass that reads secondary_tiers from kv_connector_extra_config and assembles the TieringOffloadingManager
  • DummySecondaryTier (secondary_tiers/dummy.py) — in-memory secondary tier for testing, with optional async simulation
  • SharedOffloadRegion integration — CPUPrimaryTierOffloadingManager accepts the existing SharedOffloadRegion so secondary tiers can memoryview primary tier buffers zero-copy

Test Plan

.venv/bin/python -m pytest tests/v1/kv_offload/test_tiering_offloading.py -v

Test Result


tests/v1/kv_offload/test_tiering_offloading.py::TestDummySecondaryTier::test_basic_store_and_lookup PASSED                                           [  6%]
tests/v1/kv_offload/test_tiering_offloading.py::TestDummySecondaryTier::test_in_flight_blocks_return_none PASSED                                     [ 12%]
tests/v1/kv_offload/test_tiering_offloading.py::TestDummySecondaryTier::test_lru_eviction PASSED                                                     [ 18%]
tests/v1/kv_offload/test_tiering_offloading.py::TestDummySecondaryTier::test_async_simulation PASSED                                                 [ 25%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_basic_store_to_primary PASSED                                     [ 31%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_cascade_to_all_secondary_tiers PASSED                             [ 37%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_ref_cnt_protection_during_cascade PASSED                          [ 43%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_lookup_from_primary PASSED                                        [ 50%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_promotion_from_secondary PASSED                                   [ 56%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_partial_lookup PASSED                                             [ 62%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_eviction_in_primary_tier PASSED                                   [ 68%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_touch_propagates_to_all_tiers PASSED                              [ 75%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_failed_store_no_cascade PASSED                                    [ 81%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_multiple_secondary_tiers_independent_eviction PASSED              [ 87%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_prepare_store_processes_finished_jobs_first PASSED                [ 93%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingWithoutSecondaryTiers::test_works_without_secondary_tiers PASSED                [100%]


Changed files

  • tests/v1/kv_offload/test_tiering_offloading.py (added, +423/-0)
  • vllm/v1/kv_offload/abstract.py (modified, +154/-1)
  • vllm/v1/kv_offload/cpu/spec.py (modified, +31/-20)
  • vllm/v1/kv_offload/factory.py (modified, +5/-0)
  • vllm/v1/kv_offload/secondary_tiers/__init__.py (added, +18/-0)
  • vllm/v1/kv_offload/secondary_tiers/dummy.py (added, +299/-0)
  • vllm/v1/kv_offload/tiering/__init__.py (added, +0/-0)
  • vllm/v1/kv_offload/tiering/manager.py (added, +455/-0)
  • vllm/v1/kv_offload/tiering/spec.py (added, +231/-0)
  • vllm/v1/kv_offload/worker/cpu_gpu.py (modified, +7/-0)


Motivation.

To date, vLLM offers native KV offloading to CPU memory but does not support further offloading from CPU memory to other tiers such as storage. Storage offload implementations must therefore either work directly with storage or implement their own CPU offloading as an additional tier. This document describes a high-level design to natively support multi-tier KV offloading in vLLM.

The architecture supports a single primary tier, in most cases CPU DRAM, and multiple secondary tiers.

Goals

  • Allow simple and native integration of the current vLLM CPU offloading with secondary tiers such as storage, or connection with other vLLM nodes for PD disaggregation settings or P2P communication of KV data.

  • Utilize async GPU<->CPU transfers for the primary tier and async lookup for secondary tier loads

  • Support for HMA models out of the box

  • Seamless support for cross TP offloading and transfers

  • Simple, clean and performant implementation of PD communication via the CPU tier

What we don't intend to support

  • Direct GPU access (neither GPU-storage nor GPU-GPU communication)

  • Full flexibility in block size. While the vLLM block size may vary, the CPU block size must be constant across all vLLM nodes (and a multiple of the underlying vLLM block size).
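
The block size constraint above can be checked up front. The following is a minimal sketch, not code from vLLM: the function name `validate_cpu_block_size` and the idea of passing every node's vLLM block size in a list are assumptions made for illustration. Block sizes are counted in tokens.

```python
def validate_cpu_block_size(cpu_block_size: int, vllm_block_sizes: list[int]) -> dict[int, int]:
    """Map each vLLM block size to the number of vLLM blocks per CPU block.

    Hypothetical helper: raises ValueError when the shared CPU block size is
    not a multiple of some node's vLLM block size.
    """
    blocks_per_cpu_block = {}
    for vllm_block_size in vllm_block_sizes:
        if vllm_block_size <= 0 or cpu_block_size % vllm_block_size != 0:
            raise ValueError(
                f"CPU block size {cpu_block_size} is not a multiple of "
                f"vLLM block size {vllm_block_size}"
            )
        blocks_per_cpu_block[vllm_block_size] = cpu_block_size // vllm_block_size
    return blocks_per_cpu_block


# Two nodes with different vLLM block sizes sharing one CPU block size.
factors = validate_cpu_block_size(64, [16, 32])
```

Here a node with 16-token blocks packs four of them into each CPU block and a node with 32-token blocks packs two, so both nodes agree on what one CPU block contains.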

High level Design

The design starting point is the current offloading connector design, in which the offloading connector uses the vLLM V1 connector API and translates it to a simpler, more abstract API for a backend. Multi-tier offloading makes several key changes to this framework:

  • It designates a single (CPU DRAM) backend as the primary tier but also allows multiple backends to serve as a secondary tier. However, these backends differ from the current backend definition in that the source for offloading (and target for loading) is the CPU DRAM tier rather than the GPU HBM.

  • Introduce a new component -- the TieringManager that serves as an orchestrator for the communication with the Primary and Secondary tiers. A more detailed description of this appears in the TieringManager section below.

  • The GPU-CPU offloading is managed by the scheduler side offloading manager yet executed by the worker side backend. Namely, operations are scheduled and invoked by the scheduler thread, whereas the actual data migration is executed by the worker thread (as it is today in the offloading connector). In contrast, all secondary tier operations are both managed and executed by the scheduler side only.

  • Regardless of the TP size (tensor-parallel degree) of a vLLM instance, the KV data in CPU DRAM is stored in canonical form, that of TP size 1 (TP1). This is the foundation for cross-TP KV offloading. For example, if the secondary tier is storage, then all KV values are stored in a single TP1 canonical form, whatever the TP size of the vLLM instance.

Proposed Change.

In order to facilitate the design changes above we are implementing the changes described below to the current offloading connector.

The TieringManager

The native KV offloading in vLLM v1 currently supports offloading from GPU memory to an external location (like CPU memory). This RFC extends the design to allow offloading from CPU memory to additional tiers such as local storage, object storage, and remote nodes (P/D disaggregation).

<img width="1328" height="800" alt="Image" src="https://github.com/user-attachments/assets/98ef7014-2071-4517-8eb4-6bf282f40599" />

The OffloadingConnector interface is unchanged: it still holds a single OffloadingManager. The new TieringManager implements that interface and orchestrates the tier hierarchy internally.

Two tier types

Primary Tier: a single tier with exclusive access to GPU KV memory. The existing CPU Manager serves as the primary tier.

Secondary Tier(s): one or more tiers with read/write access to the Primary Tier's CPU memory. No direct GPU access. Each secondary tier implements the SecondaryTierManager interface.

SecondaryTierManager Interface

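# BlockHash, JobResult and CPUMemoryViewLoadStoreSpec are vLLM types; the
# postponed annotations below let this listing load without importing them.
from __future__ import annotations
from abc import ABC, abstractmethod
from collections.abc import Iterable
from dataclasses import dataclass

class SecondaryTierManager(ABC):
    @abstractmethod
    def lookup(self, block_hashes) -> int | None: ...
    @abstractmethod
    def submit_store(self, job_metadata: JobMetadata) -> None: ...
    @abstractmethod
    def submit_load(self, job_metadata: JobMetadata) -> None: ...
    @abstractmethod
    def get_finished(self) -> Iterable[JobResult]: ...
    @abstractmethod
    def touch(self, block_hashes) -> None: ...

@dataclass
class JobMetadata:
    job_id: int                         # unique job identifier
    block_hashes: list[BlockHash]       # which blocks are being transferred
    spec: CPUMemoryViewLoadStoreSpec    # memory views into CPU tensors + block IDs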

spec is a zero-copy memory view into the primary tier's CPU tensors. For submit_store it is read-only (secondary tier reads from CPU); for submit_load it is writable (secondary tier writes into CPU).
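
The read-only versus writable distinction can be illustrated with Python's built-in `memoryview`. This is a sketch under assumptions, not the RFC's `CPUMemoryViewLoadStoreSpec`: a `bytearray` stands in for a primary tier CPU tensor, and `block_view` and `BLOCK_BYTES` are hypothetical names.

```python
BLOCK_BYTES = 8
# Stand-in for one of the primary tier's CPU tensors: 4 blocks of 8 bytes each.
cpu_buffer = bytearray(4 * BLOCK_BYTES)


def block_view(buffer: bytearray, block_id: int, writable: bool) -> memoryview:
    """Return a zero-copy view of one block; read-only unless writable is set."""
    start = block_id * BLOCK_BYTES
    view = memoryview(buffer)[start:start + BLOCK_BYTES]
    return view if writable else view.toreadonly()


# Store path: the secondary tier only reads CPU memory.
store_view = block_view(cpu_buffer, block_id=1, writable=False)
cpu_buffer[8] = 0x4B                 # a later write is visible through the view (no copy was made)
stored_payload = bytes(store_view)   # the secondary tier copies the block out

# Load path: the secondary tier writes the payload into another CPU block.
load_view = block_view(cpu_buffer, block_id=2, writable=True)
load_view[:] = stored_payload
```

Writing through `store_view` raises `TypeError`, which gives a store job a cheap guard against corrupting primary tier data it is only supposed to read.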

CPU Manager changes

Extend the CPU Manager to expose its worker's cpu_tensors so the TieringManager can pass zero-copy memory views to secondary tiers for direct reads and writes.

Key Design Principles

  1. Always cascade to all tiers: When a block is confirmed in the primary tier, it is asynchronously pushed to every secondary tier.
  2. Primary tier is the gateway: Only the primary tier accesses GPU memory. Secondary tiers read/write CPU memory via memory views.
  3. Staged promotion: Blocks in secondary tiers must be promoted to the primary tier before the GPU can access them. lookup() returns None while promotion is in progress (scheduler retries).
  4. Non-blocking scheduler methods: All SecondaryTierManager methods run in the Scheduler process. submit_store() / submit_load() submit async jobs; get_finished() polls for completion.
  5. Secondary tiers own their evictions: Each secondary tier manages its own eviction policy independently.
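
Principles 4 and 5 can be sketched with a small in-memory tier, similar in spirit to a test double. Everything below is an assumption made for illustration: the class `InMemorySecondaryTier`, the simplified `Job` record (which carries raw bytes instead of memory views), and the `(job_id, success)` result tuples are not taken from vLLM. `lookup()` is assumed to return the number of leading blocks found.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical, simplified stand-in for JobMetadata."""
    job_id: int
    block_hashes: list
    data: list  # one bytes payload per block, standing in for memory views


class InMemorySecondaryTier:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block hash -> payload, least recently used first
        self.pending = []            # submitted jobs that are not finalized yet

    def lookup(self, block_hashes):
        """Count how many leading blocks this tier currently holds."""
        hits = 0
        for block_hash in block_hashes:
            if block_hash not in self.blocks:
                break
            hits += 1
        return hits

    def submit_store(self, job: Job) -> None:
        self.pending.append(job)     # non-blocking: only queue the job

    def get_finished(self):
        """Finalize queued jobs, evicting LRU blocks when over capacity."""
        finished, self.pending = self.pending, []
        for job in finished:
            for block_hash, payload in zip(job.block_hashes, job.data):
                self.blocks[block_hash] = payload
                self.blocks.move_to_end(block_hash)
                while len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)  # this tier's own eviction policy
        return [(job.job_id, True) for job in finished]

    def touch(self, block_hashes) -> None:
        for block_hash in block_hashes:
            if block_hash in self.blocks:
                self.blocks.move_to_end(block_hash)


tier = InMemorySecondaryTier(capacity=2)
tier.submit_store(Job(0, ["a", "b"], [b"A", b"B"]))
hits_before_poll = tier.lookup(["a", "b"])   # job submitted but not finalized
first_results = tier.get_finished()
tier.touch(["a"])                            # "b" becomes the LRU block
tier.submit_store(Job(1, ["c"], [b"C"]))
tier.get_finished()                          # storing "c" evicts "b"
```

Because `submit_store()` only queues work and `get_finished()` does the bookkeeping, the scheduler never blocks on the tier, and the eviction of "b" is decided entirely inside the tier.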

Store Flow (Cascade)

GPU → Primary Tier (CPU) → [all] Secondary Tiers

When TieringManager.complete_store() is called, the KV data is confirmed in CPU memory. The TieringManager calls submit_store() on every secondary tier to cascade the data asynchronously.
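
While a cascade is in flight, the source blocks must not be evicted from CPU memory. The PR notes describe `prepare_read()` incrementing a `ref_cnt` for this purpose; the sketch below illustrates that idea only. The class `PrimaryTierRefCounts`, the `cascade` function, the use of plain lists as job queues, and the choice of one reference per secondary tier are all assumptions.

```python
class PrimaryTierRefCounts:
    """Hypothetical helper: per-block reference counts for the primary (CPU) tier."""

    def __init__(self, block_hashes):
        self.ref_cnt = {block_hash: 0 for block_hash in block_hashes}

    def prepare_read(self, block_hashes):
        for block_hash in block_hashes:
            self.ref_cnt[block_hash] += 1   # pin: a transfer is reading this block

    def complete_read(self, block_hashes):
        for block_hash in block_hashes:
            self.ref_cnt[block_hash] -= 1   # unpin: that transfer has finished

    def evict(self, max_blocks):
        """Evict up to max_blocks blocks that no transfer is reading."""
        victims = [h for h, cnt in self.ref_cnt.items() if cnt == 0][:max_blocks]
        for block_hash in victims:
            del self.ref_cnt[block_hash]
        return victims


def cascade(primary, secondary_queues, block_hashes):
    """Fan a confirmed store out to every secondary tier (queues stand in for submit_store)."""
    for queue in secondary_queues:
        primary.prepare_read(block_hashes)  # assumed: one reference per tier
        queue.append(list(block_hashes))


primary = PrimaryTierRefCounts(["a", "b"])
storage_queue, remote_queue = [], []
cascade(primary, [storage_queue, remote_queue], ["a", "b"])
evicted_during = primary.evict(2)           # nothing: both transfers still read the blocks
for queue in (storage_queue, remote_queue):
    primary.complete_read(queue.pop())      # each transfer reports completion
evicted_after = primary.evict(2)            # now the blocks may be evicted
```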

Load Flow (Promotion)

GPU ← Primary Tier (CPU) ← Secondary Tier

When TieringManager.lookup() is invoked:

  1. Check primary tier first.
  2. For remaining blocks, check each secondary tier in order.
  3. On a hit, call submit_load() to initiate async promotion to the primary tier, then return None (retry later).

The TieringManager calls get_finished() on all secondary tiers each scheduling cycle to finalize completed jobs.
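
The retry behaviour of the load flow can be sketched as a small state machine. This is an illustration under assumptions, not the RFC's implementation: `PromotingManager` is a hypothetical class that models tiers as sets of block hashes, uses a single secondary tier, and treats `poll()` as a stand-in for finalizing the results of `get_finished()`.

```python
class PromotingManager:
    """Hypothetical model of staged promotion with one secondary tier."""

    def __init__(self, primary, secondary):
        self.primary = set(primary)      # block hashes resident in CPU memory
        self.secondary = set(secondary)  # block hashes held by the secondary tier
        self.in_flight = set()           # promotions submitted but not finalized

    def lookup(self, block_hashes):
        """Return the number of leading primary hits, or None to signal 'retry later'."""
        hits = 0
        for block_hash in block_hashes:
            if block_hash not in self.primary:
                break
            hits += 1
        remaining = block_hashes[hits:]
        if any(block_hash in self.in_flight for block_hash in remaining):
            return None                  # a promotion is still running
        to_promote = []
        for block_hash in remaining:
            if block_hash not in self.secondary:
                break
            to_promote.append(block_hash)
        if to_promote:
            self.in_flight.update(to_promote)  # stand-in for submit_load()
            return None
        return hits

    def poll(self):
        """Stand-in for processing get_finished(): finalize all in-flight promotions."""
        self.primary.update(self.in_flight)
        self.in_flight.clear()


manager = PromotingManager(primary={"a"}, secondary={"b", "c"})
request = ["a", "b", "c", "d"]
results = []
for _cycle in range(3):                  # three scheduling cycles
    manager.poll()
    results.append(manager.lookup(request))
```

In the first cycle the lookup starts promoting "b" and "c" and returns None; from the second cycle on it reports three hits, because "d" is in neither tier.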

Canonical CPU layout

The following diagram depicts the general idea: the CPU memory layout is in canonical form, where each block holds all of its KV data contiguously in memory. Loading from and offloading to the secondary tier always operates on a single memory buffer and is handled by a scheduler thread (rather than by the GPU-dedicated workers).

<img width="781" height="409" alt="Image" src="https://github.com/user-attachments/assets/0951d9b1-7eb3-4e89-90f9-cf88ffbec334" />

The main changes required to facilitate this are:

  • During initialization one of the worker side connectors allocates all CPU memory for the KV cache and assigns the relevant parts of this memory to each of the other workers. This is done in shared memory between the scheduler and workers.

  • Each worker is responsible for copying its relevant slice of the KV cache to the appropriate CPU memory locations.

  • We mandate that the layout in CPU memory be in canonical form. In a TP setting the division between workers follows the head count: each worker gets an equal share of the attention heads. As in the drawing above, we require that the heads divide the KV CPU memory into contiguous regions, which can then be mapped easily to various TP ranks. There is no constraint on the GPU memory layout, but performance is better and the code simpler if the GPU uses the same layout. This is explained in RFC https://github.com/vllm-project/vllm/issues/27742 and implemented in PR https://github.com/vllm-project/vllm/pull/27743. This layout change enables coalescing copies between GPU and CPU into fewer, larger contiguous memory copies and thus enables efficient use of DMA (see the discussion in the blog post https://vllm.ai/blog/kv-offloading-connector ).
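
The mapping from heads to contiguous per-rank regions can be sketched as simple offset arithmetic. The function `head_slices` is hypothetical, and it assumes a head-major block in which all bytes belonging to one KV head are adjacent; the actual layout is defined by the RFC and PR linked above.

```python
def head_slices(num_kv_heads: int, bytes_per_head: int, tp_size: int) -> list[tuple[int, int]]:
    """Return (offset, length) in bytes of each TP rank's region inside one canonical block.

    Assumes a head-major layout: all data for head 0, then head 1, and so on.
    """
    if tp_size <= 0 or num_kv_heads % tp_size != 0:
        raise ValueError("the head count must be divisible by the TP size")
    region_bytes = (num_kv_heads // tp_size) * bytes_per_head
    return [(rank * region_bytes, region_bytes) for rank in range(tp_size)]


# The same canonical block viewed by a TP=2 writer and a TP=4 reader.
tp2_regions = head_slices(num_kv_heads=8, bytes_per_head=1024, tp_size=2)
tp4_regions = head_slices(num_kv_heads=8, bytes_per_head=1024, tp_size=4)
```

With 8 heads of 1024 bytes each, rank 0 of a TP=2 instance covers bytes 0 to 4095, which is exactly the union of the regions of ranks 0 and 1 of a TP=4 instance. Each worker therefore copies one contiguous range, whatever TP size wrote the block.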

Potential Secondary Tiers

  • Storage -- This is the obvious secondary tier. Can include:

    • File System API

    • Object Storage

    • Key-Value Store

      • Can be either shared (remote) or local storage
  • PD disaggregation -- The current PD connector ("NIXL connector") in vLLM uses GPU-to-GPU communication. An alternative CPU-to-CPU implementation introduces more hops, and therefore more latency, but has several benefits:

    • Quicker offloading on the P node: once the KV data has moved to local CPU memory, the GPU memory can be released.

    • GPU memory is held for a shorter time on the D node: GPU buffers only need to be allocated after the KV data arrives in the D node's CPU memory.

    • Simple and clean cross TP handling. No need for complex grid of GPU to GPU chatter.

    • Potentially faster communication on the P to D leg as we will only need to communicate a single unified buffer (relevant to a TP setting).

  • P2P -- a generalization of the PD setting is a general P2P communication between nodes.

Feedback Period.

No response

CC List.

@ronensc @orozery @robertgshaw2-redhat @tlrmchlsmth @WoosukKwon @njhill @NickLucche @omerpaz95

Any Other Things.

No response


extent analysis

Fix Plan

To implement the proposed multi-tier KV offloading in vLLM, follow these steps:

  1. Introduce the TieringManager: Create a new component, TieringManager, that orchestrates communication with primary and secondary tiers.
  2. Implement SecondaryTierManager interface: Define the SecondaryTierManager interface with methods for lookup, submit_store, submit_load, get_finished, and touch.
  3. Extend CPU Manager: Modify the CPU Manager to expose its worker's cpu_tensors and provide zero-copy memory views to secondary tiers.
  4. Implement secondary tier logic: Develop the logic for secondary tiers, including storage, PD disaggregation, and P2P communication.
  5. Implement canonical CPU layout: Allocate CPU memory for the KV cache and assign relevant parts to each worker, ensuring a canonical layout.

Example code for the TieringManager and SecondaryTierManager interface:

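from abc import ABC, abstractmethod
from dataclasses import dataclass
from itertools import count

@dataclass
class JobMetadata:
    job_id: int          # unique job identifier
    block_hashes: list   # which blocks are being transferred
    spec: object = None  # stand-in for the RFC's CPUMemoryViewLoadStoreSpec

class SecondaryTierManager(ABC):
    @abstractmethod
    def lookup(self, block_hashes):
        """Return the number of leading blocks held by this tier, or None to retry later."""

    @abstractmethod
    def submit_store(self, job_metadata):
        """Start an async copy from primary tier CPU memory into this tier."""

    @abstractmethod
    def submit_load(self, job_metadata):
        """Start an async copy from this tier into primary tier CPU memory."""

    @abstractmethod
    def get_finished(self):
        """Return the jobs that completed since the last call."""

    @abstractmethod
    def touch(self, block_hashes):
        """Mark blocks as recently used for this tier's eviction policy."""

class TieringManager:
    def __init__(self, primary_tier, secondary_tiers):
        self.primary_tier = primary_tier
        self.secondary_tiers = secondary_tiers
        self._job_ids = count()  # source of unique job IDs

    def complete_store(self, block_hashes, spec=None):
        # The data is now confirmed in CPU memory: cascade to all secondary tiers
        for tier in self.secondary_tiers:
            tier.submit_store(JobMetadata(next(self._job_ids), list(block_hashes), spec))

    def lookup(self, block_hashes, spec=None):
        block_hashes = list(block_hashes)
        # Check primary tier first (assumed to return the number of leading hits)
        hits = self.primary_tier.lookup(block_hashes)
        if hits is None or hits == len(block_hashes):
            return hits
        # Check each secondary tier in order for the remaining blocks
        remaining = block_hashes[hits:]
        for tier in self.secondary_tiers:
            found = tier.lookup(remaining)
            if found is None:
                return None  # this tier has no answer yet; the scheduler retries
            if found:
                # Initiate async promotion to the primary tier, then retry later.
                # A real implementation would also track in-flight promotions
                # so that a retry does not resubmit the same load.
                tier.submit_load(JobMetadata(next(self._job_ids), remaining[:found], spec))
                return None
        return hits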

Verification

To verify the implementation, test the following scenarios:

  • Store and load operations with a single secondary tier
  • Store and load operations with multiple secondary tiers
  • Promotion of blocks from secondary tier to primary tier
  • Canonical CPU layout and memory allocation

Extra Tips

  • Ensure that the TieringManager and SecondaryTierManager implementations are thread-safe and handle errors properly.
  • Optimize the implementation for performance, considering factors like memory allocation, data transfer, and job scheduling.
  • Consider adding logging and monitoring mechanisms to track the performance and health of the multi-tier KV offloading system.
