vllm - [Feature]: Fast KV Compaction via Attention Matching (50x compression)

vllm · 2026-03-11 01:34:34


GitHub stats

vllm-project/vllm#36729 • Fetched 2026-04-08 00:35:14


Author

markg85

Participants

markg85

Timeline (top)

subscribed ×3, labeled ×1


🚀 The feature, motivation and pitch

I'm just forwarding a research paper I stumbled upon: https://arxiv.org/abs/2602.16284

This is the abstract:

Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
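To make the abstract's idea concrete, here is a minimal NumPy sketch for a single KV head. It is not the paper's algorithm: the key-selection rule (keep the keys with the most attention mass over some reference queries) and the function name `compact_kv_head` are assumptions made for illustration. It only shows one subproblem with a closed-form solution, namely refitting the compact values by least squares so that attention outputs are reproduced, and it does not address the attention-mass term the abstract mentions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compact_kv_head(queries, keys, values, budget):
    """Toy compaction of one KV head (hypothetical helper, not from the paper).

    queries: (n_q, d) reference queries, assumed to be available
    keys, values: (n_kv, d) full cache for this head
    budget: number of cache entries to keep
    """
    n_kv, d = keys.shape
    if not 0 < budget <= n_kv:
        raise ValueError("budget must be in 1..n_kv")

    # Attention outputs of the full cache: the target to reproduce
    full_weights = softmax(queries @ keys.T / np.sqrt(d))      # (n_q, n_kv)
    target = full_weights @ values                              # (n_q, d)

    # Assumption: keep the keys that receive the most total attention mass
    mass = full_weights.sum(axis=0)
    keep = np.sort(np.argsort(mass)[-budget:])
    compact_keys = keys[keep]

    # Closed-form step: least-squares fit of the compact values so that
    # softmax(Q K_c^T) V_c approximates the full attention outputs
    compact_weights = softmax(queries @ compact_keys.T / np.sqrt(d))
    compact_values, *_ = np.linalg.lstsq(compact_weights, target, rcond=None)
    return compact_keys, compact_values

# Usage with random data standing in for a real cache
rng = np.random.default_rng(0)
q = rng.normal(size=(64, 8))
k = rng.normal(size=(40, 8))
v = rng.normal(size=(40, 8))
ck, cv = compact_kv_head(q, k, v, budget=10)
```

Because the values are refit rather than copied, the compact cache can partly compensate for the keys that were dropped, which is the intuition behind matching attention outputs instead of simply evicting tokens.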

It seemed nice to bring this to vLLM's attention (no pun intended), as I'd assume this is the place where it should be implemented.

This is their GitHub repo: https://github.com/adamzweiger/compaction


extent analysis

Fix Plan

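The steps below sketch how one might start experimenting with Attention Matching for fast context compaction in latent space. The code is an illustrative PyTorch toy: it is not the paper's algorithm and does not touch vLLM's actual attention or KV-cache code.

Clone the provided GitHub repository: git clone https://github.com/adamzweiger/compaction
Install its dependencies as described in the repository (for example pip install -r requirements.txt, if it ships a requirements.txt).
Prototype an attention module over transformed keys and values. The version below uses learned linear projections as a simplified stand-in for constructing compact keys and values: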

import math

import torch
import torch.nn as nn

class AttentionMatching(nn.Module):
    """Toy attention over linearly projected keys and values.

    Note: the projections keep the sequence length unchanged, so this
    illustrates the attention computation only, not actual KV compaction.
    """
    def __init__(self, num_heads, hidden_size):
        super().__init__()
        self.num_heads = num_heads
        self.hidden_size = hidden_size
        self.compact_keys = nn.Linear(hidden_size, hidden_size)
        self.compact_values = nn.Linear(hidden_size, hidden_size)

    def forward(self, query, key, value):
        # Project keys and values; shapes stay (..., seq_len, hidden_size)
        compact_key = self.compact_keys(key)
        compact_value = self.compact_values(value)

        # Scaled dot-product scores; transpose only the last two dims so that
        # batched inputs work (.T would reverse every dimension)
        scores = torch.matmul(query, compact_key.transpose(-2, -1)) / math.sqrt(self.hidden_size)
        attention_weights = scores.softmax(dim=-1)

        # Weighted sum of the projected values
        return torch.matmul(attention_weights, compact_value)

Wrap the module in a small model for experimentation. This is a toy wrapper; a real integration would go through vLLM's own attention and KV-cache code:

class vLLM(nn.Module):
    def __init__(self, num_heads, hidden_size):
        super().__init__()
        self.attention = AttentionMatching(num_heads, hidden_size)

    def forward(self, hidden_states):
        # Self-attention: the same hidden states serve as query, key and value.
        # hidden_states must be float embeddings of shape
        # (..., seq_len, hidden_size), not integer token ids.
        return self.attention(hidden_states, hidden_states, hidden_states)

Verification

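To verify the implementation, run the model on a sample dataset with and without compaction. Check how closely the compacted cache reproduces the attention outputs of the full cache, and compare downstream quality using task metrics such as accuracy, F1-score, and ROUGE score.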
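One low-level check that complements task metrics is to compare attention outputs directly. The sketch below is a pure-Python illustration for a single query and a single head; the helper names `attention_output` and `relative_error` are made up for this example, and the tiny caches are toy data.

```python
import math

def attention_output(query, keys, values):
    """Scaled dot-product attention for one query over lists of key/value rows."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def relative_error(reference, candidate):
    """L2 error of candidate relative to the norm of reference."""
    num = math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, candidate)))
    den = math.sqrt(sum(r * r for r in reference))
    return num / den if den > 0 else num

# Toy data: a 3-entry "full" cache and a 2-entry "compacted" cache
query = [1.0, 0.0]
full_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
full_values = [[1.0, 2.0], [3.0, 4.0], [1.1, 2.1]]
compact_keys = [[1.0, 0.0], [0.0, 1.0]]
compact_values = [[1.05, 2.05], [3.0, 4.0]]

reference = attention_output(query, full_keys, full_values)
candidate = attention_output(query, compact_keys, compact_values)
error = relative_error(reference, candidate)
```

In practice the same comparison would be averaged over many queries, heads, and layers; a small error here is a necessary sanity check, not a substitute for end-task evaluation.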

Extra Tips

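Treat the number of heads and the hidden size as fixed by the model being served. The knob to tune is how aggressively the cache is compacted: the paper reports up to 50x on some datasets with little quality loss, so validate the ratio on your own task.
Start from a pre-trained model. According to the abstract, Attention Matching is meant to avoid slow end-to-end optimization, so fine-tuning is an optional follow-up rather than a prerequisite.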
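To gauge what a given compaction ratio buys in memory, a back-of-the-envelope estimate helps. The sketch below assumes a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) and assumes compaction simply reduces the number of cached entries per head; any extra per-entry data a real method stores is ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_element=2):
    """Approximate KV cache size; the leading 2 counts keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element

# Hypothetical model shape and context length, chosen only for illustration
context_tokens = 100_000
compaction_ratio = 50

full = kv_cache_bytes(32, 8, 128, context_tokens)
compact = kv_cache_bytes(32, 8, 128, context_tokens // compaction_ratio)

print(f"full cache:    {full / 1e9:.2f} GB")
print(f"compact cache: {compact / 1e9:.2f} GB")
```

With these illustrative numbers the full cache is about 13.1 GB per sequence and the 50x compacted cache about 0.26 GB.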
