vllm - 💡(How to fix) Fix [RFC]: Hotness-aware multi-level KV cache management to accelerate dynamic sparse attention [1 participants]

vllm · 2026-03-17 07:32:11


vllm-project/vllm#37263•Fetched 2026-04-08 00:48:27


Author

zengchuang-hw



Motivation.

Background: Long-context reasoning scenarios face a dual surge in computational overhead and memory consumption, resulting in low inference efficiency and high inference costs. Dynamic sparse attention mitigates the computational surge in long sequences, but the memory bottleneck remains. Under sparse attention, accesses to the KV cache exhibit a hot-cold distribution, presenting an opportunity for heterogeneous KV cache management.

Heterogeneous KV cache leverages cheap, scalable main memory to replace expensive, limited device memory, promising significant reductions in inference costs.
By retaining frequently accessed KV cache in device memory and swapping out infrequently accessed KV cache to main memory, the memory footprint of long sequences can be reduced.
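
To put the memory bottleneck in numbers, here is a rough back-of-the-envelope sketch. The model shape (48 layers, 8 KV heads, head dimension 128, FP16) and the 5% hot fraction are illustrative assumptions, not values stated in this RFC.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for `num_tokens` tokens."""
    # The factor 2 covers one key vector and one value vector
    # per token, per KV head, per layer.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed shape, for illustration only: 48 layers, 8 KV heads, head_dim 128, FP16.
full = kv_cache_bytes(1_000_000, num_layers=48, num_kv_heads=8, head_dim=128)

# Hypothetical hot set: 5% of the tokens' KV kept resident in device memory.
hot = kv_cache_bytes(50_000, num_layers=48, num_kv_heads=8, head_dim=128)

print(f"full KV cache: {full / 1e9:.1f} GB")
print(f"hot subset:    {hot / 1e9:.1f} GB")
```

Under these assumptions the full cache for a 1M-token sequence is roughly 197 GB, which exceeds a single accelerator's memory, while the hot subset is under 10 GB.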

Goals:

To accelerate long-sequence decoding while preserving accuracy, we plan to implement a hotness-aware multi-level KV cache management mechanism that leaves the sparse attention computation itself unchanged.
To preserve precision, and unlike KV compression and KV dropping methods (issue 5751, issue 12254, and PR 11938), this implementation keeps the whole KV cache and loads only the important parts into HBM.

Proposed Change.

In V1, the KV_offload mechanism will be extended with sparse optimization capabilities.

Architecture Overview

Scheduler Side:

- SparseOffloadingManager extends the standard OffloadingManager to integrate sparse selection logic
- Maintains block representations for intelligent block scoring
- Implements query-aware sparse selection to identify hot blocks
- Reuses the original OffloadingManager logic for block lifecycle management
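
As a minimal sketch of what query-aware selection over block representations could look like: the RFC does not specify the representation or the scoring function, so the mean-key representation, the dot-product score, and the names `mean_key` and `select_hot_blocks` are all assumptions made for illustration.

```python
import heapq


def mean_key(block_keys):
    """Compact block representation: element-wise mean of the block's key vectors."""
    n = len(block_keys)
    return [sum(column) / n for column in zip(*block_keys)]


def select_hot_blocks(query, block_reprs, topk):
    """Return (selected_block_ids, scores) for the `topk` highest-scoring blocks."""
    # Score each block by the dot product of the query with its representation.
    scores = {
        block_id: sum(q * k for q, k in zip(query, rep))
        for block_id, rep in block_reprs.items()
    }
    selected = heapq.nlargest(topk, scores, key=scores.get)
    return selected, [scores[block_id] for block_id in selected]


# Three toy blocks, each holding two 2-dimensional key vectors.
block_reprs = {
    "blk0": mean_key([[1.0, 0.0], [1.0, 0.0]]),
    "blk1": mean_key([[0.0, 1.0], [0.0, 3.0]]),
    "blk2": mean_key([[-1.0, 0.0], [-1.0, 0.0]]),
}
selected, scores = select_hot_blocks([0.5, 1.0], block_reprs, topk=2)
print(selected, scores)
```

Only the selected blocks would then need to be resident in device memory for the attention step; the remaining blocks can stay in main memory.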

Worker Side:

- SparseOffloadingHandler extends the standard OffloadingHandler with optimized transfer mechanisms
- Implements contiguous block merging to reduce transfer overhead
- Supports batch transfer operations for improved efficiency
- BlockReprManager maintains and updates block representations
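
Contiguous block merging can be sketched as follows. The `(src, dst)` pair format and the function name `merge_contiguous` are hypothetical; the idea is only that neighbouring blocks collapse into a single larger copy.

```python
def merge_contiguous(block_pairs):
    """Merge (src, dst) block-ID pairs into (src_start, dst_start, length) runs.

    A pair extends the current run only if both its source and destination
    IDs directly follow the previous pair, so each run maps to one copy.
    """
    runs = []
    for src, dst in sorted(block_pairs):
        if runs:
            start_src, start_dst, length = runs[-1]
            if src == start_src + length and dst == start_dst + length:
                runs[-1] = (start_src, start_dst, length + 1)
                continue
        runs.append((src, dst, 1))
    return runs


pairs = [(10, 0), (11, 1), (12, 2), (20, 3), (21, 4), (30, 9)]
runs = merge_contiguous(pairs)
print(runs)
```

Here six per-block copies become three transfers, which reduces per-transfer launch overhead when many selected blocks are adjacent.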

Integration Flow

The sparse-optimized KV_offload process works as follows:

Scheduler Side: Sparse Selection

selected_blocks, scores = sparse_manager.lookup_with_sparse(
    block_hashes, query, layer_idx
)

Prepare Load (only selected blocks)

load_spec = sparse_manager.prepare_sparse_load(selected_blocks)

Worker Side: Layer-wise Execution

for layer_idx in range(num_layers):
    # Sparse selection and allocation
    new_attn_metadata, swap_in_spec, swap_out_spec = \
        sparse_worker.handle_layer(attn_metadata, layer_idx, query)
    
    # Swap in (only selected hot blocks)
    optimized_handler.transfer_async(job_id, swap_in_spec)
    
    # Wait for swap-in completion
    optimized_handler.wait({job_id})
    
    # Attention Kernel with sparse blocks
    self.impl.forward(
        self, query, key, value, kv_cache,
        new_attn_metadata, output=output
    )
    
    # Swap out new complete blocks
    optimized_handler.transfer_async(job_id, swap_out_spec)
    
    # Update block representations
    block_repr_manager.update(layer_idx, new_blocks)
    
    # Complete load/store operations
    sparse_manager.complete_load(selected_blocks)
    sparse_manager.complete_store(new_blocks, success=True)
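
The ordering constraints in this loop can be tried out with a toy simulation. `ToyTransferHandler` is a hypothetical stand-in that copies dictionary entries on a thread pool; it is not the vLLM handler API. One choice made in this sketch, which the RFC does not state, is to give swap-in and swap-out distinct job IDs so that waiting on one never blocks on the other.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyTransferHandler:
    """Hypothetical stand-in for an offloading handler; copies run on a thread pool."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}

    def transfer_async(self, job_id, src, dst, keys):
        def copy():
            for key in keys:
                dst[key] = src[key]

        self._jobs[job_id] = self._pool.submit(copy)

    def wait(self, job_ids):
        for job_id in job_ids:
            self._jobs.pop(job_id).result()

    def wait_all(self):
        self.wait(list(self._jobs))


# "Host" holds every block; "device" starts empty.
host = {(layer, blk): f"kv-{layer}-{blk}" for layer in range(2) for blk in range(4)}
device = {}
handler = ToyTransferHandler()
attended = []

for layer in range(2):
    hot = [(layer, 1), (layer, 3)]  # pretend result of sparse selection
    handler.transfer_async(("in", layer), host, device, hot)
    handler.wait([("in", layer)])  # attention must not start before swap-in completes
    attended.append(sorted(device[key] for key in hot))  # stand-in for the attention kernel

    new_block = (layer, 4)  # block completed during this step
    device[new_block] = f"kv-{layer}-4"
    # Swap-out is not waited on here, so it can overlap with the next layer.
    handler.transfer_async(("out", layer), device, host, [new_block])

handler.wait_all()  # drain outstanding swap-outs before finishing the step
print(attended)
```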

Key Improvements

- Sparse Selection: Loads only the most relevant KV blocks, chosen by query-aware scoring
- Block Representation: Maintains compact representations for efficient similarity computation
- Optimized Transfer: Merges contiguous blocks and supports batch operations
- Standard Interface: Complies with the standard KV_offload interfaces for compatibility
- Adaptive Strategy: Dynamically adjusts selection based on access patterns
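
One way to track access patterns is an exponentially decayed access count per block. This is a sketch under that assumption; the RFC does not say how hotness is measured, and `HotnessTracker` and its `decay` parameter are hypothetical.

```python
class HotnessTracker:
    """Exponentially decayed access counter per block (assumed hotness metric)."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.hotness = {}

    def record_step(self, accessed_blocks):
        # Age every known block, then credit the blocks touched in this step.
        for block_id in self.hotness:
            self.hotness[block_id] *= self.decay
        for block_id in accessed_blocks:
            self.hotness[block_id] = self.hotness.get(block_id, 0.0) + 1.0

    def coldest(self, n):
        # Sort by (hotness, block id) so that ties resolve deterministically.
        return sorted(self.hotness, key=lambda b: (self.hotness[b], b))[:n]


tracker = HotnessTracker(decay=0.5)
tracker.record_step(["a", "b"])
tracker.record_step(["a"])
tracker.record_step(["a", "c"])
print(tracker.hotness)
print(tracker.coldest(1))
```

Blocks returned by `coldest` would be the natural candidates to swap out to main memory when device memory is under pressure.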

Integration Points

- Scheduler: SparseOffloadingManager replaces/enhances standard OffloadingManager
- Worker: OptimizedSwapHandler replaces/enhances standard OffloadingHandler
- BlockReprManager is integrated into the worker pipeline
- Configuration through SparseConfig with parameters like sparse_topk, copy_method, cache_policy
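
A possible shape for `SparseConfig` is sketched below. Only the three field names come from the RFC; the types, default values, allowed option strings, and validation are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparseConfig:
    # Field names are from the RFC; types and defaults are assumed.
    sparse_topk: int = 64          # number of KV blocks kept per query (assumed default)
    copy_method: str = "batched"   # hypothetical option name
    cache_policy: str = "lru"      # hypothetical option name

    def __post_init__(self):
        if self.sparse_topk <= 0:
            raise ValueError("sparse_topk must be positive")


cfg = SparseConfig(sparse_topk=128)
print(cfg)
```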

This approach maintains full compatibility with the KV_offload framework while introducing significant performance optimizations through intelligent sparse block selection and optimized data transfer mechanisms.

Implementation Steps

Phase 1: Basic Adaptation
├─ Step 1.1: Create SparseOffloadingManager
├─ Step 1.2: Create OptimizedSwapHandler
├─ Step 1.3: Implement basic sparse selection
└─ Step 1.4: Integrate to scheduler and worker
│
▼
Phase 2: Performance Optimization
├─ Step 2.1: Implement block representation
├─ Step 2.2: Optimize swap transfer
├─ Step 2.3: Implement batch transfer
└─ Step 2.4: Optimize synchronization
│
▼
Phase 3: Advanced Features
├─ Step 3.1: Implement query-aware selection
├─ Step 3.2: Implement adaptive top-k
├─ Step 3.3: Implement multi-strategy fusion
└─ Step 3.4: Performance monitoring
│
▼
Phase 4: Testing and Validation
├─ Step 4.1: Unit tests
├─ Step 4.2: Integration tests
├─ Step 4.3: Performance benchmarks
└─ Step 4.4: Correctness validation

Test Case

Ultra-long-context validation: Qwen2.5-14b-1m on a single A100-96G (A100-96G*1), with support for 1M-length input
Metrics: TTFT (time to first token) and TPOT (time per output token) with 1M-length input
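
For reference, the two metrics can be computed from per-token timestamps as in the sketch below. The function name and the timestamp format are assumptions; the RFC does not describe its benchmarking harness.

```python
def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    TTFT is the delay until the first generated token.
    TPOT is the mean gap between consecutive tokens after the first.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Request sent at t=0.0 s; four tokens arrive at the listed times.
ttft, tpot = ttft_and_tpot(0.0, [2.0, 2.5, 3.0, 3.5])
print(ttft, tpot)
```

With a 1M-length input, TTFT is dominated by prefill, while TPOT reflects the per-step cost that sparse selection and block swapping are meant to reduce.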

Feedback Period.

No response

CC List.

No response

Any Other Things.

Co-Author: @CheYulin @amy-why-3459


extent analysis

Fix Plan

To implement the proposed hotness-aware multi-level KV cache management mechanism, follow these steps:

1. **Create SparseOffloadingManager**:
   - Extend the standard OffloadingManager to integrate sparse selection logic.
   - Implement query-aware sparse selection to identify hot blocks.
   - Example:
     ```python
     # OffloadingManager is the existing KV_offload base class; its import path
     # and constructor signature are not given here. SparseSelectionLogic is
     # defined in step 5.
     class SparseOffloadingManager(OffloadingManager):
         def __init__(self, config):
             super().__init__(config)
             self.sparse_selection_logic = SparseSelectionLogic()

         def lookup_with_sparse(self, block_hashes, query, layer_idx):
             # Delegate scoring and selection to the pluggable selection logic.
             selected_blocks, scores = self.sparse_selection_logic.select_blocks(
                 block_hashes, query, layer_idx
             )
             return selected_blocks, scores
     ```


2. **Implement OptimizedSwapHandler**:
   - Extend the standard OffloadingHandler with optimized transfer mechanisms.
   - Implement contiguous block merging to reduce transfer overhead.
   - Example:
     ```python
     class OptimizedSwapHandler(OffloadingHandler):
         def __init__(self, config):
             super().__init__(config)
             self.block_merger = BlockMerger()

         def transfer_async(self, job_id, swap_spec):
             # Merge contiguous blocks before transfer
             merged_blocks = self.block_merger.merge_blocks(swap_spec)
             super().transfer_async(job_id, merged_blocks)
     ```

3. **Integrate into Scheduler and Worker**:
   - Replace/enhance standard OffloadingManager with SparseOffloadingManager.
   - Replace/enhance standard OffloadingHandler with OptimizedSwapHandler.
   - Integrate BlockReprManager into the worker pipeline.
   - Example:
     ```python
     scheduler = SparseOffloadingManager(config)
     worker = OptimizedSwapHandler(config)
     block_repr_manager = BlockReprManager()
     ```


4. **Implement Block Representation and Optimized Transfer**:
   - Implement compact representations for efficient similarity computation.
   - Optimize swap transfer by merging contiguous blocks and supporting batch operations.
   - Example:
     ```python
     class BlockReprManager:
         def __init__(self):
             self.block_reprs = {}

         def update(self, layer_idx, new_blocks):
             # Update block representations here
             self.block_reprs[layer_idx] = new_blocks
     ```

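5. **Implement Query-Aware Selection and Adaptive Strategy**:
   - Implement query-aware selection to identify hot blocks.
   - Dynamically adjust selection based on access patterns.
   - Example:
     ```python
     class SparseSelectionLogic:
         def __init__(self):
             # Access count per block hash; higher means hotter.
             self.access_patterns = {}

         def select_blocks(self, block_hashes, query, layer_idx):
             # Placeholder policy: keep blocks that have been accessed before.
             # A query-aware scorer would also use `query` and `layer_idx`.
             selected_blocks, scores = [], []
             for block_hash in block_hashes:
                 hotness = self.access_patterns.get(block_hash, 0)
                 if hotness > 0:
                     selected_blocks.append(block_hash)
                     scores.append(hotness)
             # Scores are returned in the same order as the selected blocks.
             return selected_blocks, scores
     ```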

