vllm - [Feature]: Fast KV Compaction via Attention Matching (50x compression)

vllm-project/vllm#36729 · Fetched 2026-04-08 00:35:14 · Comments: 0 · Participants: 1 · Timeline events: 4 · Reactions: 2

🚀 The feature, motivation and pitch

I'm just forwarding a research paper I stumbled upon: https://arxiv.org/abs/2602.16284

This is the abstract:

Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
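To make the idea in the abstract concrete, here is a minimal NumPy sketch of one of the "simple subproblems" it alludes to: given a set of reference queries, fit compact values in closed form so that attention over a shortened cache reproduces the full attention outputs. Everything here is an assumption for illustration, not the paper's algorithm: the function name `compact_head`, the heuristic of keeping the keys with the most total attention, and the random test data are all hypothetical, and the sketch does not attempt to preserve attention mass.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


def compact_head(Q, K, V, t):
    """Toy compaction of one KV head from T entries down to t entries."""
    d = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # (n, T) full attention weights
    target = weights @ V                     # (n, d) full attention outputs
    # Heuristic key choice (an assumption): keep the t keys that receive
    # the most total attention across the reference queries.
    keep = np.sort(np.argsort(weights.sum(axis=0))[-t:])
    Kc = K[keep]
    # Closed-form value fit: least-squares Vc such that attention over the
    # compact keys reproduces the full outputs as closely as possible.
    Wc = softmax(Q @ Kc.T / np.sqrt(d))      # (n, t) compact attention weights
    Vc, *_ = np.linalg.lstsq(Wc, target, rcond=None)
    return Kc, Vc, keep


rng = np.random.default_rng(0)
n, T, d, t = 64, 128, 16, 16
Q, K, V = (rng.standard_normal(s) for s in [(n, d), (T, d), (T, d)])

Kc, Vc, keep = compact_head(Q, K, V, t)
full = attention(Q, K, V)
fitted_err = np.linalg.norm(attention(Q, Kc, Vc) - full)
subset_err = np.linalg.norm(attention(Q, Kc, V[keep]) - full)
print(f"kept {t}/{T} entries; error with fitted values {fitted_err:.3f}, "
      f"with original values {subset_err:.3f}")
```

Because the original values of the kept entries are one candidate solution of the least-squares problem, the fitted values can never do worse than simply dropping entries, at least on the reference queries used for the fit.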

It seemed worth bringing this to vLLM's attention (no pun intended), as I'd assume this is the place where it should be implemented.

This is their GitHub repo: https://github.com/adamzweiger/compaction


extent analysis

Fix Plan

To implement the Attention Matching approach for fast context compaction in latent space, follow these steps:

  • Clone the provided GitHub repository: git clone https://github.com/adamzweiger/compaction
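  • Install the dependencies as described in the repository's README (the issue does not state the exact install command; pip install -r requirements.txt applies only if the repository ships that file)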
  • As a starting point for experimentation, define a toy attention module in PyTorch. This is only a stand-in: it projects keys and values but does not shorten the cache, whereas the paper constructs compact keys and values per KV head. See the repository above for the authors' code:
import math

import torch
import torch.nn as nn


class AttentionMatching(nn.Module):
    """Toy single-head attention over linearly projected keys and values."""

    def __init__(self, num_heads, hidden_size):
        super().__init__()
        self.num_heads = num_heads  # stored but unused: the attention below is single-head
        self.hidden_size = hidden_size
        self.compact_keys = nn.Linear(hidden_size, hidden_size)
        self.compact_values = nn.Linear(hidden_size, hidden_size)

    def forward(self, query, key, value):
        # Project keys and values (the sequence length is unchanged)
        compact_key = self.compact_keys(key)
        compact_value = self.compact_values(value)

        # Scaled dot-product attention weights over the projected keys;
        # transpose the last two dims so batched inputs also work
        scores = torch.matmul(query, compact_key.transpose(-2, -1)) / math.sqrt(self.hidden_size)
        attention_weights = scores.softmax(dim=-1)

        # Weighted sum of the projected values
        return torch.matmul(attention_weights, compact_value)
  • Wrap the module in a toy model to exercise it (this class is a hypothetical stand-in and not part of the vLLM codebase):
class vLLM(nn.Module):
    # Hypothetical toy wrapper; not a class from the vLLM library.
    def __init__(self, num_heads, hidden_size):
        super().__init__()
        self.attention = AttentionMatching(num_heads, hidden_size)

    def forward(self, hidden_states):
        # Self-attention: hidden_states of shape (batch, seq_len, hidden_size)
        # serves as query, key, and value
        return self.attention(hidden_states, hidden_states, hidden_states)

Verification

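To verify the implementation, run the same long-context prompts with the full KV cache and with the compacted cache, then compare downstream quality using task-appropriate metrics such as accuracy, F1, or ROUGE. Record the compaction ratio and the compaction time alongside quality, since the paper's claims concern the trade-off between them.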
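Before running task-level metrics, a cheaper sanity check is to compare model outputs (attention outputs or logits) with and without compaction and to record the compaction ratio. The helper below is a minimal sketch of such a report; the function name `compaction_report` and the example arrays are hypothetical and stand in for real model outputs.

```python
import numpy as np


def compaction_report(full_out, compact_out, full_len, compact_len):
    """Summarize how much was compacted and how far the outputs drifted."""
    rel_err = np.linalg.norm(full_out - compact_out) / np.linalg.norm(full_out)
    return {
        "compaction_ratio": full_len / compact_len,
        "relative_error": float(rel_err),
    }


# Placeholder data: pretend the compacted outputs are uniformly 10% too small.
full_out = np.ones((4, 8))
compact_out = 0.9 * full_out
report = compaction_report(full_out, compact_out, full_len=1000, compact_len=20)
print(report)
```

A low relative error here does not guarantee good downstream quality, so it complements task metrics rather than replacing them.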

Extra Tips

  • The number of heads and the hidden size are fixed by the model being served and are not free hyperparameters. The setting to tune is the compaction ratio, which trades cache size against quality; the abstract reports up to 50x on some datasets with little quality loss.
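  • The method operates on the KV cache of an existing pre-trained model. According to the abstract, it avoids the slow end-to-end optimization that Cartridges requires, so fine-tuning should not be necessary.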
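
To get a feel for what a given compaction ratio buys, the KV cache footprint can be estimated from the model shape: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length. The configuration below is a hypothetical example, not taken from the issue or the paper, and the estimate ignores any extra per-entry metadata a compaction scheme might store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size for one sequence (default 2 bytes per element, e.g. fp16)."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 128k-token context.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)
compacted = full / 50  # the "up to 50x" figure from the abstract

print(f"full: {full / 2**30:.2f} GiB, after 50x compaction: {compacted / 2**30:.2f} GiB")
# prints "full: 16.00 GiB, after 50x compaction: 0.32 GiB"
```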
