vllm - [RFC]: Hotness-aware multi-level KV cache management to accelerate dynamic sparse attention

GitHub stats
vllm-project/vllm#37263 · Fetched 2026-04-08 00:48:27
Comments: 0 · Participants: 1 · Timeline: 6 · Reactions: 0
Timeline (top): subscribed ×3, mentioned ×2, labeled ×1


Motivation.

Background: Long-context reasoning scenarios face a dual surge in computational overhead and memory consumption, resulting in low inference efficiency and high inference costs. Dynamic sparse attention mitigates the computational surge in long sequences, but the memory bottleneck remains. Under sparse attention, accesses to the KV cache exhibit a hot-cold distribution, presenting an opportunity for heterogeneous KV cache management.

  • Heterogeneous KV cache leverages cheap, scalable main memory to replace expensive, limited device memory, promising significant reductions in inference costs.

  • By retaining frequently accessed KV cache in device memory and swapping out infrequently accessed KV cache to main memory, the memory footprint of long sequences can be reduced.
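
To put the memory bottleneck in numbers, here is a back-of-the-envelope sketch of the KV cache size for one sequence. The model shape used below (48 layers, 8 KV heads, head dimension 128, 2-byte elements) is an assumption for a 14B-class grouped-query-attention model and is not stated in this RFC; substitute the values from the actual model config.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to hold keys and values for one sequence."""
    # The factor 2 accounts for storing both K and V.
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_tokens * per_token


# Assumed model shape (hypothetical values; check the real config).
total = kv_cache_bytes(num_tokens=1_000_000, num_layers=48,
                       num_kv_heads=8, head_dim=128)
print(f"{total / 2**30:.1f} GiB")
```

Under these assumptions a 1M-token sequence needs roughly 183 GiB of KV cache, which is why keeping only the hot portion in device memory is attractive.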

Goals:

  • To accelerate long-sequence decoding while preserving accuracy, we plan to implement a hotness-aware multi-level KV cache management mechanism that leaves the sparse attention computation itself unchanged.

  • To preserve precision, and unlike KV compression and KV dropping methods (issue 5751, issue 12254, and PR 11938), this implementation keeps the entire KV cache and loads only the important parts into HBM.

Proposed Change.

In V1, the KV_offload mechanism will be extended with sparse optimization capabilities.

Architecture Overview

Scheduler Side:

  • SparseOffloadingManager extends the standard OffloadingManager to integrate sparse selection logic

  • Maintains block representations for intelligent block scoring

  • Implements query-aware sparse selection to identify hot blocks

  • Reuses the original OffloadingManager logic for block lifecycle management
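
The scheduler-side bullets above can be illustrated with a minimal sketch of block representations and query-aware top-k selection. The RFC does not specify how representations or scores are computed; mean pooling of key vectors and dot-product scoring are assumptions chosen for simplicity, and the function names `mean_pool` and `select_topk_blocks` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_pool(keys: Sequence[Vector]) -> List[float]:
    """Compact block representation: element-wise mean of the block's key vectors."""
    n = len(keys)
    return [sum(col) / n for col in zip(*keys)]


def select_topk_blocks(query: Vector, block_reprs: Dict[int, Vector],
                       topk: int) -> Tuple[List[int], List[float]]:
    """Score each block by dot(query, repr) and keep the top-k block ids."""
    scored = [
        (sum(q * r for q, r in zip(query, rep)), block_id)
        for block_id, rep in block_reprs.items()
    ]
    # Highest score first; ties broken by lower block id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    top = scored[:topk]
    return [b for _, b in top], [s for s, _ in top]


# Three blocks of two 2-dimensional key vectors each (toy data).
block_reprs = {
    0: mean_pool([[1.0, 0.0], [0.8, 0.2]]),
    1: mean_pool([[0.0, 1.0], [0.2, 0.8]]),
    2: mean_pool([[0.5, 0.5], [0.5, 0.5]]),
}
selected, scores = select_topk_blocks([1.0, 0.0], block_reprs, topk=2)
print(selected)
```

With this toy data the query aligns best with block 0, then block 2, so those two would be treated as hot and loaded.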

Worker Side:

  • SparseOffloadingHandler extends the standard OffloadingHandler with optimized transfer mechanisms

  • Implements contiguous block merging to reduce transfer overhead

  • Supports batch transfer operations for improved efficiency

  • BlockReprManager maintains and updates block representations
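
Contiguous block merging can be sketched as follows. This is an illustration under assumptions, not the RFC's implementation: it treats a transfer as a list of (source block id, destination block id) pairs and merges neighbours into runs that could each be issued as a single copy. The function name `merge_contiguous` is hypothetical.

```python
from typing import List, Tuple


def merge_contiguous(block_pairs: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Merge (src, dst) block pairs into (src_start, dst_start, length) runs.

    Two pairs are merged only when both the source and the destination ids
    advance by one, so each run maps to one contiguous copy.
    """
    runs: List[Tuple[int, int, int]] = []
    for src, dst in sorted(block_pairs):
        if runs:
            s0, d0, n = runs[-1]
            if src == s0 + n and dst == d0 + n:
                runs[-1] = (s0, d0, n + 1)
                continue
        runs.append((src, dst, 1))
    return runs


pairs = [(4, 10), (5, 11), (6, 12), (9, 20), (10, 30)]
print(merge_contiguous(pairs))
```

Here five block copies collapse into three transfers: one run of length 3 and two single-block copies, since the last pair is contiguous on the source side only.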

Integration Flow

The sparse-optimized KV_offload process works as follows:

  1. Scheduler Side: Sparse Selection
selected_blocks, scores = sparse_manager.lookup_with_sparse(
    block_hashes, query, layer_idx
)
  2. Prepare Load (only selected blocks)
load_spec = sparse_manager.prepare_sparse_load(selected_blocks)
  3. Worker Side: Layer-wise Execution
for layer_idx in range(num_layers):
    # Sparse selection and allocation
    new_attn_metadata, swap_in_spec, swap_out_spec = \
        sparse_worker.handle_layer(attn_metadata, layer_idx, query)
    
    # Swap in (only selected hot blocks)
    optimized_handler.transfer_async(job_id, swap_in_spec)
    
    # Wait for swap-in completion
    optimized_handler.wait({job_id})
    
    # Attention Kernel with sparse blocks
    self.impl.forward(
        self, query, key, value, kv_cache,
        new_attn_metadata, output=output
    )
    
    # Swap out new complete blocks
    optimized_handler.transfer_async(job_id, swap_out_spec)
    
    # Update block representations
    block_repr_manager.update(layer_idx, new_blocks)
    
    # Complete load/store operations
    sparse_manager.complete_load(selected_blocks)
    sparse_manager.complete_store(new_blocks, success=True)

Key Improvements

  1. Sparse Selection: Only loads the most relevant KV blocks based on query-aware scoring
  2. Block Representation: Maintains compact representations for efficient similarity computation
  3. Optimized Transfer: Merges contiguous blocks and supports batch operations
  4. Standard Interface: Complies with the standard KV_offload interfaces for compatibility
  5. Adaptive Strategy: Dynamically adjusts selection based on access patterns
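
One way to track hotness for the adaptive strategy is an exponentially decayed access counter per block. The RFC does not name a specific policy, so the decay scheme, the `HotnessTracker` class, and its methods are assumptions for illustration.

```python
from typing import Dict, Iterable, List


class HotnessTracker:
    """Tracks per-block access frequency with exponential decay."""

    def __init__(self, decay: float = 0.5):
        self.decay = decay
        self.scores: Dict[int, float] = {}

    def record_step(self, accessed: Iterable[int]) -> None:
        # Age every known block, then credit the blocks touched this step.
        for block_id in self.scores:
            self.scores[block_id] *= self.decay
        for block_id in accessed:
            self.scores[block_id] = self.scores.get(block_id, 0.0) + 1.0

    def coldest(self, n: int) -> List[int]:
        """Return the n blocks with the lowest hotness (swap-out candidates)."""
        return sorted(self.scores, key=lambda b: (self.scores[b], b))[:n]


tracker = HotnessTracker(decay=0.5)
tracker.record_step([0, 1, 2])
tracker.record_step([0, 1])
tracker.record_step([0])
print(tracker.coldest(1))
```

Blocks that stop being selected lose score at each step, so they surface first as candidates to move from device memory to main memory.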

Integration Points

  • Scheduler: SparseOffloadingManager replaces/enhances standard OffloadingManager
  • Worker: OptimizedSwapHandler replaces/enhances standard OffloadingHandler
  • BlockReprManager is integrated into the worker pipeline
  • Configuration through SparseConfig with parameters like sparse_topk, copy_method, cache_policy
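
A possible shape for SparseConfig is sketched below. Only the field names sparse_topk, copy_method, and cache_policy come from the RFC; the types, default values, and validation are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparseConfig:
    # Field names come from the RFC; types and defaults are hypothetical.
    sparse_topk: int = 64          # number of blocks kept per selection
    copy_method: str = "batched"   # assumed label for the transfer strategy
    cache_policy: str = "lru"      # assumed label for the eviction policy

    def __post_init__(self) -> None:
        if self.sparse_topk <= 0:
            raise ValueError("sparse_topk must be positive")


config = SparseConfig(sparse_topk=32)
print(config)
```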

This approach maintains full compatibility with the KV_offload framework while introducing significant performance optimizations through intelligent sparse block selection and optimized data transfer mechanisms.

Implementation Steps

Phase 1: Basic Adaptation
├─ Step 1.1: Create SparseOffloadingManager
├─ Step 1.2: Create OptimizedSwapHandler
├─ Step 1.3: Implement basic sparse selection
└─ Step 1.4: Integrate to scheduler and worker
│
▼
Phase 2: Performance Optimization
├─ Step 2.1: Implement block representation
├─ Step 2.2: Optimize swap transfer
├─ Step 2.3: Implement batch transfer
└─ Step 2.4: Optimize synchronization
│
▼
Phase 3: Advanced Features
├─ Step 3.1: Implement query-aware selection
├─ Step 3.2: Implement adaptive top-k
├─ Step 3.3: Implement multi-strategy fusion
└─ Step 3.4: Performance monitoring
│
▼
Phase 4: Testing and Validation
├─ Step 4.1: Unit tests
├─ Step 4.2: Integration tests
├─ Step 4.3: Performance benchmarks
└─ Step 4.4: Correctness validation

Test Case

  • Ultra-long-context verification: run Qwen2.5-14b-1m on a single A100-96G with support for 1M-length input

  • Measure TTFT and TPOT with 1M-length input
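
For reference, TTFT (time to first token) and TPOT (time per output token) can be derived from per-token timestamps as in the sketch below. The helper name `ttft_and_tpot` and the timestamp list are hypothetical; a real benchmark would collect the timestamps from the serving client.

```python
from typing import List, Tuple


def ttft_and_tpot(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Compute time-to-first-token and mean time-per-output-token in seconds."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_start
    # TPOT averages the gaps between consecutive tokens after the first.
    decode_tokens = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / decode_tokens if decode_tokens else 0.0
    return ttft, tpot


# Toy timestamps: first token after 2.0 s, then one token every 0.5 s.
ttft, tpot = ttft_and_tpot(0.0, [2.0, 2.5, 3.0, 3.5, 4.0])
print(ttft, tpot)
```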

Feedback Period.

No response

CC List.

No response

Any Other Things.

Co-Author: @CheYulin @amy-why-3459


extent analysis

Fix Plan

To implement the proposed hotness-aware multi-level KV cache management mechanism, follow these steps:

  1. Create SparseOffloadingManager:
    • Extend the standard OffloadingManager to integrate sparse selection logic.
    • Implement query-aware sparse selection to identify hot blocks.
    • Example:

```python
class SparseOffloadingManager(OffloadingManager):
    def __init__(self, config):
        super().__init__(config)
        self.sparse_selection_logic = SparseSelectionLogic()

    def lookup_with_sparse(self, block_hashes, query, layer_idx):
        # Delegate scoring and selection to the sparse selection logic
        selected_blocks, scores = self.sparse_selection_logic.select_blocks(
            block_hashes, query, layer_idx
        )
        return selected_blocks, scores
```

  2. Implement OptimizedSwapHandler:
    • Extend the standard OffloadingHandler with optimized transfer mechanisms.
    • Implement contiguous block merging to reduce transfer overhead.
    • Example:

```python
class OptimizedSwapHandler(OffloadingHandler):
    def __init__(self, config):
        super().__init__(config)
        self.block_merger = BlockMerger()

    def transfer_async(self, job_id, swap_spec):
        # Merge contiguous blocks before transfer
        merged_blocks = self.block_merger.merge_blocks(swap_spec)
        # Propagate the base handler's result to the caller
        return super().transfer_async(job_id, merged_blocks)
```

  3. Integrate into Scheduler and Worker:
    • Replace/enhance standard OffloadingManager with SparseOffloadingManager.
    • Replace/enhance standard OffloadingHandler with OptimizedSwapHandler.
    • Integrate BlockReprManager into the worker pipeline.
    • Example:

```python
scheduler = SparseOffloadingManager(config)
worker = OptimizedSwapHandler(config)
block_repr_manager = BlockReprManager()
```

  4. Implement Block Representation and Optimized Transfer:
    • Implement compact representations for efficient similarity computation.
    • Optimize swap transfer by merging contiguous blocks and supporting batch operations.
    • Example:

```python
class BlockReprManager:
    def __init__(self):
        # Maps layer index -> latest block representations for that layer
        self.block_reprs = {}

    def update(self, layer_idx, new_blocks):
        # Update block representations here
        self.block_reprs[layer_idx] = new_blocks
```

  5. Implement Query-Aware Selection and Adaptive Strategy:
    • Implement query-aware selection to identify hot blocks.
    • Dynamically adjust selection based on access patterns.
    • Example:

```python
class SparseSelectionLogic:
    def __init__(self):
        # Maps block hash -> access count
        self.access_patterns = {}

    def select_blocks(self, block_hashes, query, layer_idx):
        # Placeholder: keep blocks that have been accessed before.
        # A query-aware implementation would score blocks against `query`.
        selected_blocks = []
        scores = []
        for block_hash in block_hashes:
            count = self.access_patterns.get(block_hash, 0)
            if count > 0:
                selected_blocks.append(block_hash)
                scores.append(count)
        return selected_blocks, scores
```
