vllm - ✅(Solved) Fix [RFC]: Add Mooncake Store Connector for Shared KV Cache Reuse [1 pull request, 9 comments, 5 participants]

GitHub stats
vllm-project/vllm#38474 (fetched 2026-04-08 01:49:07)
Comments: 9
Participants: 5
Timeline events: 50
Reactions: 9
Timeline (top): subscribed ×19, mentioned ×17, commented ×9, unsubscribed ×4

PR fix notes

PR #40036: fix: skip spec decode draft rejection after scheduler rewind

Description (problem / solution / changelog)

Summary

When kv_load_failure_policy=recompute is enabled, the scheduler can rewind request.num_computed_tokens to an earlier block boundary via _update_requests_with_invalid_blocks (scheduler.py). However, the worker-side _update_states in GPUModelRunner still applies the spec decode draft rejection subtraction (num_computed_tokens -= num_rejected) on the already-rewound value.

This can drive num_computed_tokens negative when prev_num_draft_len exceeds the rewound token count, corrupting subsequent forward passes.

Scenario:

  1. Step N: request has num_computed_tokens=102 (100 prompt + 2 spec tokens), prev_num_draft_len=2
  2. KV load failure detected — scheduler rewinds to num_computed_tokens=0 (first block failed)
  3. Worker receives rewound value 0, but prev_num_draft_len is still 2 from step N
  4. Worker computes num_rejected=2, then num_computed_tokens = 0 - 2 = -2

Fix: detect a scheduler rewind by checking num_computed_tokens < req_state.num_computed_tokens. When this happens, clear prev_num_draft_len to skip the stale rejection adjustment — the rewound tokens already cover the previous draft.
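
The rewind check described above can be illustrated with a small standalone sketch. It is not the code from gpu_model_runner.py: ReqState, apply_scheduler_update, and the num_accepted parameter are hypothetical names chosen for this example, and the rejection arithmetic is simplified to the subtraction named in the summary.

```python
from dataclasses import dataclass


@dataclass
class ReqState:
    """Hypothetical stand-in for the worker's cached per-request state."""
    num_computed_tokens: int
    prev_num_draft_len: int = 0


def apply_scheduler_update(state: ReqState, scheduled_computed: int,
                           num_accepted: int) -> int:
    """Return the worker-side num_computed_tokens after one update."""
    # A scheduler value below the cached one is treated as a rewind.
    if scheduled_computed < state.num_computed_tokens:
        # The rewound range already covers the previous draft tokens,
        # so the stale draft length must not be subtracted again.
        state.prev_num_draft_len = 0

    num_computed = scheduled_computed
    if state.prev_num_draft_len:
        num_rejected = state.prev_num_draft_len - num_accepted
        num_computed -= num_rejected

    state.num_computed_tokens = num_computed
    return num_computed
```

With the scenario above (cached value 102, draft length 2, scheduler value 0), the guard clears the draft length and the result stays at 0 instead of dropping to -2.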

Ref: discussion in #38474 (comment by @Pz1116 about spec decoding + recompute compatibility)

Test plan

  • Verify num_computed_tokens never goes negative with async scheduling + spec decoding + KV load failure recompute
  • Existing spec decoding tests continue to pass
  • Existing KV load failure recovery tests continue to pass

Changed files

  • vllm/v1/worker/gpu_model_runner.py (modified, +7/-0)

Motivation

In production LLM serving, many requests share common prompt prefixes — system prompts, few-shot examples, RAG context, multi-turn conversation history, etc. vLLM's local prefix caching effectively reuses KV cache within a single instance, but it cannot help in the following scenarios:

  • Cross-instance reuse: Multiple vLLM instances serving similar traffic recompute the same KV blocks independently.
  • Post-eviction recomputation: After KV cache eviction due to memory pressure, the same blocks must be recomputed from scratch.
  • Cold-start warm-up: A newly launched instance has an empty KV cache and must compute all blocks, even those that have been computed elsewhere.

A Mooncake Store Connector addresses these gaps by using Mooncake's distributed store as a shared KV cache pool. Computed KV cache blocks are stored with content-addressable keys (block hashes). Before computing a prefill, the engine queries the store — if the blocks already exist, they are loaded directly, skipping redundant prefill computation.

This capability is orthogonal to PD disaggregation and benefits any vLLM deployment topology.

Current Implementation in vllm-ascend

We have implemented and validated this approach in the vllm-ascend project (the Ascend NPU hardware plugin for vLLM). The implementation is located at vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/. Below we describe its architecture.

Overall Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                     AscendStoreConnector                             │
│                   (KVConnectorBase_V1)                               │
│                                                                      │
│   Scheduler Process                    Worker Process(es)            │
│  ┌────────────────────┐              ┌──────────────────────────┐    │
│  │  KVPoolScheduler   │    ZMQ RPC   │     KVPoolWorker         │    │
│  │                    │◄────────────►│                          │    │
│  │  - Lookup cache hit│              │  - Register KV caches    │    │
│  │  - Build metadata  │              │  - LookupKeyServer       │    │
│  │  - Track requests  │              │    (rank 0 only)         │    │
│  └────────────────────┘              │  - Transfer threads      │    │
│                                      └────────────┬─────────────┘    │
│                                                   │                  │
│                                      ┌────────────▼─────────────┐    │
│                                      │   SendingThread          │    │
│                                      │   RecvingThread          │    │
│                                      │   (+ layerwise variants) │    │
│                                      └────────────┬─────────────┘    │
│                                                   │                  │
│                                      ┌────────────▼─────────────┐    │
│                                      │  MooncakeDistributedStore │    │
│                                      │  + TransferEngine         │    │
│                                      │  (zero-copy RDMA/DMA)    │    │
│                                      └──────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────┘

Component Details

1. Connector Entry Point

AscendStoreConnector implements KVConnectorBase_V1 and splits behavior by role:

  • KVConnectorRole.SCHEDULER: Creates KVPoolScheduler to handle cache hit queries and metadata construction.
  • KVConnectorRole.WORKER: Creates KVPoolWorker to handle KV cache registration, backend interaction, and background transfer threads. Worker at rank 0 additionally starts a LookupKeyServer — a ZMQ REP server that handles cache existence queries from the scheduler.

Key configuration options:

  • use_layerwise: Enable per-layer transfer (pipelined with forward pass) vs. whole-request transfer.
  • consumer_is_to_put: Allow consumer-role instances to also write to the store.
  • load_async: Enable async loading between scheduler steps.

2. Scheduler Side — KVPoolScheduler

The scheduler orchestrates KV cache reuse decisions without direct access to the distributed store.

Cache Hit Query (get_num_new_matched_tokens):

The scheduler sends the request's block hashes and token length to worker rank 0 via ZMQ (LookupKeyClient → LookupKeyServer). The worker queries the Mooncake Store for block existence and returns the number of prefix tokens that are cached. The scheduler then tells vLLM to allocate blocks for these tokens without scheduling them for compute.
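
As a rough illustration of the lookup result, the sketch below counts how many leading tokens are covered by blocks that exist in the store. It assumes a hit must be a contiguous prefix and leaves out the ZMQ transport; count_cached_prefix_tokens and the exists callback are hypothetical names, not part of vllm-ascend.

```python
from typing import Callable, Sequence


def count_cached_prefix_tokens(block_hashes: Sequence[str],
                               num_tokens: int,
                               block_size: int,
                               exists: Callable[[str], bool]) -> int:
    """Count prefix tokens whose blocks are all present in the store."""
    hit_blocks = 0
    for block_hash in block_hashes:
        # Stop at the first miss: later blocks depend on the full prefix.
        if not exists(block_hash):
            break
        hit_blocks += 1
    # Never report more tokens than the request actually has.
    return min(hit_blocks * block_size, num_tokens)
```

For example, with a block size of 16 and the first two of three blocks present, the function reports 32 cached tokens.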

Metadata Construction (build_connector_meta):

For each scheduled request, the scheduler produces a ReqMeta object containing:

  • load_spec: How many tokens to load from the store (if cache hit).
  • can_save: Whether to store the computed blocks after prefill.
  • block_ids: Local block IDs for memory address calculation.
  • block_hashes: Content hashes for key generation.

These are bundled into AscendConnectorMetadata and passed to workers via SchedulerOutput.
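
A minimal sketch of how such a metadata object could be shaped is shown below. Only the four field names listed above come from the RFC; req_id, the LoadSpec field, and build_req_meta are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadSpec:
    # Hypothetical field: number of prefix tokens to fetch from the store.
    num_tokens_to_load: int


@dataclass
class ReqMeta:
    req_id: str                          # hypothetical identifier
    block_ids: list[int]                 # local block IDs for address calculation
    block_hashes: list[str]              # content hashes for key generation
    load_spec: Optional[LoadSpec] = None # set only on a cache hit
    can_save: bool = False               # store computed blocks after prefill


def build_req_meta(req_id: str, block_ids: list[int], block_hashes: list[str],
                   num_hit_tokens: int, can_save: bool) -> ReqMeta:
    """Attach a LoadSpec only when the store reported a cache hit."""
    load_spec = LoadSpec(num_hit_tokens) if num_hit_tokens > 0 else None
    return ReqMeta(req_id, block_ids, block_hashes, load_spec, can_save)
```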

Request Lifecycle Tracking:

The scheduler maintains RequestTracker objects that track per-request state: how many tokens have been saved so far, allocated block IDs, and chunk boundaries. This enables correct handling of chunked prefill, preemption, and resumed requests.
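
One way to picture this tracking under chunked prefill is sketched below. The class reuses the RequestTracker name from the text, but its fields, the blocks_to_save method, and the rule that only full blocks are saved are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RequestTracker:
    """Hypothetical per-request tracker for incremental saves."""
    block_size: int
    num_saved_tokens: int = 0
    block_ids: list[int] = field(default_factory=list)

    def blocks_to_save(self, num_computed_tokens: int,
                       new_block_ids: list[int]) -> list[int]:
        """Return block IDs that became complete since the last call."""
        self.block_ids.extend(new_block_ids)
        first = self.num_saved_tokens // self.block_size
        # Assumption: only fully populated blocks are eligible for saving.
        last = num_computed_tokens // self.block_size
        pending = self.block_ids[first:last]
        self.num_saved_tokens = last * self.block_size
        return pending
```

Each scheduler step passes the running token count and any newly allocated block IDs, so a later chunk never re-saves blocks that an earlier chunk already handed off.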

3. Worker Side — KVPoolWorker

The worker manages the actual data movement between local KV cache and the distributed store.

KV Cache Registration (register_kv_caches):

When vLLM allocates KV cache tensors, the worker:

  1. Computes base_addr and block_len for each layer's KV cache tensor.
  2. Registers all memory regions with the Mooncake Transfer Engine (register_buffer), making them accessible for zero-copy RDMA transfer.
  3. Stores the address mapping in ChunkedTokenDatabase for later address calculation.

Backend Selection:

The worker dynamically loads the storage backend based on configuration (backend field in kv_connector_extra_config). The vllm-ascend implementation supports mooncake, memcache, and yuanrong backends.

Transfer Thread Management:

Based on the role and configuration, the worker spawns daemon threads:

  • Producer role (kv_producer / kv_both): starts a SendingThread.
  • Consumer role with async load: starts a RecvingThread.
  • Layerwise mode: uses LayerSendingThread / LayerRecvingThread instead.

4. Content-Addressable Key Design

KV cache blocks are addressed by a composite key derived from vLLM's native BlockHash:

{model_name}@pcp{pcp_rank}@dcp{dcp_rank}@head_or_tp_rank:{rank}@pp_rank:{rank}@{block_hash_hex}

  • block_hash_hex: vLLM's BlockHash converted to hex string. This is a content hash of the token sequence within the block, ensuring identical prompt prefixes produce identical keys regardless of which instance computed them.
  • Parallelism-aware encoding: TP/PP/PCP/DCP rank fields ensure each parallel rank stores and retrieves its own KV head partition correctly.
  • Layerwise key variant: For per-layer transfer, a @{layer_id} suffix is appended.
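
The key layout above can be reproduced with a short formatting helper. This is a sketch: make_pool_key is a hypothetical name, and passing the block hash as raw bytes is an assumption about how BlockHash is represented.

```python
from typing import Optional


def make_pool_key(model_name: str, pcp_rank: int, dcp_rank: int,
                  head_or_tp_rank: int, pp_rank: int,
                  block_hash: bytes,
                  layer_id: Optional[int] = None) -> str:
    """Build a store key following the format given in the RFC."""
    key = (f"{model_name}@pcp{pcp_rank}@dcp{dcp_rank}"
           f"@head_or_tp_rank:{head_or_tp_rank}@pp_rank:{pp_rank}"
           f"@{block_hash.hex()}")
    # Layerwise variant: append the layer ID as a suffix.
    if layer_id is not None:
        key += f"@{layer_id}"
    return key
```

Because every field except the block hash is fixed per rank, two instances with the same model and parallel layout produce the same key for the same prompt prefix.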

ChunkedTokenDatabase.process_tokens() iterates over a request's block hashes, producing (start_idx, end_idx, PoolKey) tuples. prepare_value() then maps each block to physical memory addresses:

addr = kv_cache_base_addr + block_id * block_len
size = block_len / block_size * (end - start)

This enables the store to read from / write to exact memory locations in vLLM's paged KV cache, achieving zero-copy transfer.
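
A worked instance of the two formulas follows. The function name is hypothetical, and integer division assumes block_len is a multiple of block_size (bytes per block divided by tokens per block).

```python
def block_addr_and_size(kv_cache_base_addr: int, block_id: int,
                        block_len: int, block_size: int,
                        start: int, end: int) -> tuple[int, int]:
    """Map one block's token range [start, end) to (address, byte size)."""
    # Blocks are laid out back to back from the tensor's base address.
    addr = kv_cache_base_addr + block_id * block_len
    # Bytes per token times the number of tokens covered by this range.
    size = block_len // block_size * (end - start)
    return addr, size
```

For a base address of 4096, block_id 3, block_len 8192 bytes, and block_size 16 tokens, a full block (tokens 32 to 48) maps to address 28672 with size 8192, while a half-filled block (tokens 48 to 56) has size 4096.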

5. Async Transfer Threads

Transfer threads run as daemon threads, consuming requests from a queue:

SendingThread (KVCacheStoreSendingThread):

  1. Dequeue a ReqMeta from the request queue.
  2. Generate keys via process_tokens().
  3. Call exists() to skip already-stored blocks (deduplication).
  4. Compute memory addresses via prepare_value().
  5. Wait for GPU/NPU compute event synchronization (ensuring the KV data is ready).
  6. Call put(keys, addrs, sizes) to write blocks to the store.
  7. Optionally generate BlockStored KV cache events for external event consumers.
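
The queue-driven loop with deduplication can be sketched with the standard library alone. InMemoryStore is a toy stand-in, not the Mooncake API; it stores byte values instead of addresses and sizes, and a threading.Event stands in for the GPU/NPU compute event.

```python
import queue
import threading


class InMemoryStore:
    """Toy stand-in for the distributed store (not the Mooncake API)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self.put_calls = 0  # number of blocks actually written

    def exists(self, keys):
        with self._lock:
            return [k in self._data for k in keys]

    def put(self, keys, values):
        with self._lock:
            self.put_calls += len(keys)
            self._data.update(zip(keys, values))


def sending_loop(store, requests):
    """Consume (keys, values, ready_event) items until a None sentinel."""
    while True:
        item = requests.get()
        if item is None:
            break
        keys, values, ready = item
        present = store.exists(keys)  # deduplication check
        todo = [(k, v) for k, v, p in zip(keys, values, present) if not p]
        ready.wait()  # stand-in for device event synchronization
        if todo:
            store.put([k for k, _ in todo], [v for _, v in todo])


def run_demo() -> int:
    store = InMemoryStore()
    requests = queue.Queue()
    worker = threading.Thread(target=sending_loop, args=(store, requests),
                              daemon=True)
    worker.start()
    ready = threading.Event()
    ready.set()
    requests.put((["k1", "k2"], [b"a", b"b"], ready))
    requests.put((["k2", "k3"], [b"b", b"c"], ready))  # k2 is a duplicate
    requests.put(None)
    worker.join()
    return store.put_calls
```

In run_demo the second request repeats key k2, so only three blocks are written in total rather than four.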

RecvingThread (KVCacheStoreRecvingThread):

  1. Dequeue a ReqMeta with load_spec.
  2. Generate keys and compute target memory addresses.
  3. Call get(keys, addrs, sizes) to read blocks directly into local KV cache memory.
  4. Mark the request as finished.
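
The load path writes fetched bytes straight into preallocated cache memory. The sketch below imitates that with a bytearray and a plain dict as the store; load_blocks and its offset-based addressing are assumptions for illustration and involve no RDMA.

```python
def load_blocks(store: dict, keys: list, offsets: list, sizes: list,
                kv_cache: bytearray) -> int:
    """Copy each stored block into kv_cache at its target offset."""
    view = memoryview(kv_cache)
    loaded = 0
    for key, offset, size in zip(keys, offsets, sizes):
        data = store[key]
        if len(data) != size:
            raise ValueError(f"size mismatch for key {key!r}")
        # Write in place: no new buffer is allocated for the block.
        view[offset:offset + size] = data
        loaded += 1
    return loaded
```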

Layerwise Variants:

LayerSendingThread and LayerRecvingThread operate per-layer, enabling pipelined transfer during the forward pass. This is compatible with vLLM's save_kv_layer() / wait_for_layer_load() interfaces.

6. Mooncake Store Integration

The Mooncake Store backend uses two Mooncake components:

  • MooncakeDistributedStore: A distributed key-value store providing batch_put_from_multi_buffers, batch_get_into_multi_buffers, and batch_is_exist APIs.
  • TransferEngine: Manages RDMA connections and memory registration for zero-copy transfer.

Initialization flow:

  1. Load config from JSON file (MOONCAKE_CONFIG_PATH): metadata server address, segment sizes, protocol, device name.
  2. Create TransferEngine (global singleton with double-checked locking) and initialize RDMA connections.
  3. Call MooncakeDistributedStore.setup() with the transfer engine, connecting to the metadata server.
  4. During register_buffer(), register KV cache memory regions with the transfer engine.
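
Step 2 mentions a global singleton guarded by double-checked locking. A generic Python sketch of that pattern is shown below; FakeTransferEngine is a placeholder class and does not reflect Mooncake's TransferEngine constructor or setup calls.

```python
import threading


class FakeTransferEngine:
    """Placeholder for an expensive-to-create engine object."""
    instances_created = 0

    def __init__(self):
        FakeTransferEngine.instances_created += 1


_engine = None
_engine_lock = threading.Lock()


def get_transfer_engine() -> FakeTransferEngine:
    """Return the process-wide engine, creating it at most once."""
    global _engine
    # First check without the lock keeps the common path cheap.
    if _engine is None:
        with _engine_lock:
            # Second check: another thread may have created it meanwhile.
            if _engine is None:
                _engine = FakeTransferEngine()
    return _engine
```

Every caller receives the same object, so memory registration and connection setup happen against a single engine even when several worker threads initialize concurrently.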

Data transfer operations (put / get) pass memory addresses and sizes directly to the Mooncake store, which handles RDMA-based zero-copy transfer between devices through the distributed store.

End-to-End Data Flow

Request Arrives
┌─────────────────────────────────────────────────────────────┐
│ Scheduler: get_num_new_matched_tokens()                     │
│   1. Send block_hashes to LookupKeyServer via ZMQ           │
│   2. Worker queries Mooncake Store: exists(keys)             │
│   3. Return: N tokens are cached externally                  │
│   4. Scheduler allocates blocks, marks N tokens as "load"    │
└─────────────────────────────┬───────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Scheduler: build_connector_meta()                           │
│   - For cached tokens: ReqMeta(load_spec=LoadSpec(...))     │
│   - For new tokens: ReqMeta(can_save=True)                  │
│   → Passed to Worker via SchedulerOutput                    │
└─────────────────────────────┬───────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Worker: start_load_kv()                                     │
│   - For requests with load_spec:                            │
│     → RecvingThread.get(keys, addrs, sizes)                 │
│     → Blocks loaded from Mooncake Store into local KV cache │
└─────────────────────────────┬───────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ vLLM Forward Pass                                           │
│   - Loaded blocks: KV cache is already populated            │
│   - New blocks: Computed normally by the model              │
└─────────────────────────────┬───────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Worker: wait_for_save()                                     │
│   - For newly computed blocks:                              │
│     → Record GPU/NPU event                                  │
│     → SendingThread.put(keys, addrs, sizes)                 │
│     → Blocks stored to Mooncake Store for future reuse      │
└─────────────────────────────────────────────────────────────┘

Proposed Change for vLLM Implementation

Based on the validated architecture in vllm-ascend, we propose adding a MooncakeStoreConnector to vLLM. Below is an initial design suggestion. We welcome community discussion on the design details.

Suggested File Structure

vllm/distributed/kv_transfer/kv_connector/v1/mooncake_store/
├── __init__.py
├── mooncake_store_connector.py   # MooncakeStoreConnector(KVConnectorBase_V1)
├── store_scheduler.py             # Scheduler-side: lookup + metadata
├── store_worker.py                # Worker-side: threads + store interaction
├── kv_transfer.py                 # Background send/recv threads
└── config_data.py                 # Key generation, metadata structures

Key Design Points

1. Connector Interface

MooncakeStoreConnector implements KVConnectorBase_V1 with the standard scheduler/worker role split. Scheduler-side handles cache hit lookup and metadata construction; worker-side handles KV cache registration, transfer thread management, and Mooncake Store interaction.

2. Content-Addressable Keys

Reuse vLLM's native BlockHash as the core of the cache key. The key format encodes model name, parallelism ranks (TP/PP), and block hash to ensure correctness under distributed parallelism.

3. Zero-Copy Transfer

Use Mooncake's TransferEngine for RDMA memory registration and MooncakeDistributedStore for distributed put/get operations. Memory addresses point directly into vLLM's paged KV cache tensors, avoiding intermediate copies.

4. Async Transfer

Background daemon threads handle put/get operations asynchronously, decoupled from the forward pass. Layerwise variants enable pipelined transfer during model execution. GPU event synchronization ensures data consistency.

5. Deduplication

Before storing, the sending thread calls exists() to check which blocks are already in the store, skipping redundant writes. This is a natural consequence of content-addressable keys.

6. Configuration

# Example configuration
serve_args = {
    "kv_connector": "MooncakeStoreConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "use_layerwise": False,
        "load_async": True,
    }
}

# Mooncake Store configuration
export MOONCAKE_CONFIG_PATH=/path/to/config.json
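
As a sketch of how these settings might be passed at launch, the helper below serializes the dictionary for vLLM's --kv-transfer-config option and sets the environment variable on the same command line. It assumes MooncakeStoreConnector has been registered under that name, which is what this RFC proposes and may not hold for an installed vLLM; build_serve_command and the model argument are hypothetical.

```python
import json
import shlex

serve_args = {
    "kv_connector": "MooncakeStoreConnector",  # proposed name, may not be registered
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "use_layerwise": False,
        "load_async": True,
    },
}


def build_serve_command(model: str, config_path: str) -> str:
    """Compose a shell command line from the settings above."""
    return (
        f"MOONCAKE_CONFIG_PATH={shlex.quote(config_path)} "
        f"vllm serve {shlex.quote(model)} "
        f"--kv-transfer-config {shlex.quote(json.dumps(serve_args))}"
    )
```

Quoting the JSON with shlex.quote keeps the nested braces and double quotes intact when the command is pasted into a shell.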

Open Discussion

We'd like to discuss the following aspects with the community:

  • Naming and placement: Is mooncake_store/ the right location? Should this be a submodule of the existing mooncake/ directory?
  • Generalization: Should the store interaction be abstracted to support other distributed stores in the future, or should we keep it Mooncake-specific for now?
  • Integration with existing prefix caching: How should the store connector interact with vLLM's local prefix caching (KVCacheManager)? Should there be a unified hierarchy (local cache → shared store)?
  • Eviction policy: Should the store connector participate in cache eviction decisions, or purely act as a passive store?

CC List

@ivanium @stmatengss @dtcccc @Pz1116

extent analysis

Fix Plan

To implement the Mooncake Store Connector in vLLM, follow these steps:

  • Create a new directory vllm/distributed/kv_transfer/kv_connector/v1/mooncake_store/ with the necessary files:
    • __init__.py
    • mooncake_store_connector.py
    • store_scheduler.py
    • store_worker.py
    • kv_transfer.py
    • config_data.py
  • Implement the MooncakeStoreConnector class in mooncake_store_connector.py, inheriting from KVConnectorBase_V1.
  • Define the content-addressable key format in config_data.py, reusing vLLM's native BlockHash.
  • Implement the zero-copy transfer using Mooncake's TransferEngine and MooncakeDistributedStore in kv_transfer.py.
  • Add configuration options for the Mooncake Store Connector in serve_args, including kv_connector, kv_role, and kv_connector_extra_config.

Example configuration:

serve_args = {
    "kv_connector": "MooncakeStoreConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "use_layerwise": False,
        "load_async": True,
    }
}

Mooncake Store configuration:

export MOONCAKE_CONFIG_PATH=/path/to/config.json

Verification

To verify the implementation, test the Mooncake Store Connector with the following scenarios:

  • Cache hit: Test that the connector correctly loads cached blocks from the Mooncake Store.
  • Cache miss: Test that the connector correctly stores newly computed blocks in the Mooncake Store.
  • Async transfer: Test that the connector correctly handles asynchronous transfer of blocks.
  • Layerwise transfer: Test that the connector correctly handles layerwise transfer of blocks.

Extra Tips

  • Ensure that the Mooncake Store is properly configured and running before testing the connector.
  • Use the MOONCAKE_CONFIG_PATH environment variable to point to the Mooncake Store configuration file.
  • Consider adding logging and debugging statements to the connector implementation to facilitate troubleshooting.
  • Review the vLLM documentation and Mooncake project for additional information on implementing and configuring the Mooncake Store Connector.
