vllm - 💡(How to fix) Fix [RFC]: O(1) KV Cache for vLLM: 4.8x Speedup & 22x More Accurate than TurboQuant on Qwen2.5-7B [1 comments, 2 participants]

GitHub stats
vllm-project/vllm#38694 · Fetched 2026-04-08 02:23:27
Comments: 1 · Participants: 2 · Timeline: 7 · Reactions: 2
Timeline (top): subscribed ×3, commented ×1, labeled ×1, mentioned ×1


🚀 The feature, motivation and pitch

Hi vLLM team and community,

First, thank you for building and maintaining PagedAttention—it’s the absolute gold standard for LLM serving.

I’m opening this RFC to discuss a potential architectural pathway for extreme long-context serving (128k+). Currently, as context length $T$ grows, the memory bandwidth required during the generation phase scales at $O(T)$. While quantization (FP8, AWQ, or methods like TurboQuant/QJL) reduces the footprint, it often introduces significant variance explosion (degrading model quality) and still fundamentally scales linearly.
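
To make the $O(T)$ scaling concrete, here is a rough back-of-the-envelope sketch of how much K/V data a standard full-attention decode step has to read for one sequence. The model shape (28 layers, 4 KV heads, head dimension 128) is assumed to match the published Qwen2.5-7B config, and unquantized FP16 storage is assumed as well; neither figure comes from this issue, so check them against the actual config.

```python
def kv_cache_bytes(num_tokens, num_layers=28, num_kv_heads=4,
                   head_dim=128, bytes_per_elem=2):
    """Bytes of K and V held for one sequence (no quantization, no paging overhead).

    The default shape values are assumptions about Qwen2.5-7B, not taken from the RFC.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return num_tokens * per_token


for t in (1024, 32 * 1024, 128 * 1024):
    print(f"{t:>7} tokens -> {kv_cache_bytes(t) / 2**20:8.1f} MiB")
```

Under these assumptions the cache grows by 56 KiB per token, so the per-step read volume is linear in context length, which is the cost the RFC is targeting.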

Our research team (Singularity Principle) has developed a different mathematical approach based on trace-class admissibility, resulting in the Nuclear ZFC (NZFC) architecture. By maintaining a bounded prototype memory and using an online mass-bias direct readout, we can execute attention without fully materializing the past KV cache, freezing the query readout complexity at strictly $O(1)$.
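
The issue does not spell out the NZFC/FixC algorithm, so the following is only a generic illustration of the idea of a bounded memory with a log-mass bias, not the authors' method. It assumes a fixed set of prototype keys, nearest-prototype routing, and per-slot running value sums and counts; the class and function names are hypothetical. The sketch shows that reading from the fixed-size memory with a `log(count)` bias gives exactly the same output as full attention in which every key has been replaced by its prototype, while the read cost depends only on the number of slots.

```python
import math
import random


def attend(q, keys, values, log_bias=None):
    """Single-query softmax attention in plain Python; log_bias is added to the scores."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if log_bias is not None:
        scores = [s + b for s, b in zip(scores, log_bias)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


class PrototypeMemory:
    """Hypothetical fixed-size memory: one value sum and one count ("mass") per slot."""

    def __init__(self, prototypes, value_dim):
        self.prototypes = prototypes
        self.value_sum = [[0.0] * value_dim for _ in prototypes]
        self.mass = [0] * len(prototypes)

    def assign(self, k):
        # Assumed routing rule: nearest prototype by squared Euclidean distance.
        dists = [sum((a - b) ** 2 for a, b in zip(k, p)) for p in self.prototypes]
        return dists.index(min(dists))

    def update(self, k, v):
        # In-place accumulation into one slot (the role scatter_add_ plays on tensors).
        j = self.assign(k)
        for d, x in enumerate(v):
            self.value_sum[j][d] += x
        self.mass[j] += 1
        return j

    def read(self, q):
        # Cost depends on the number of slots, not on how many tokens were seen.
        used = [j for j, n in enumerate(self.mass) if n > 0]
        keys = [self.prototypes[j] for j in used]
        means = [[x / self.mass[j] for x in self.value_sum[j]] for j in used]
        bias = [math.log(self.mass[j]) for j in used]
        return attend(q, keys, means, bias)


rng = random.Random(0)
dim, num_slots, num_tokens = 4, 8, 256


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


memory = PrototypeMemory([rand_vec() for _ in range(num_slots)], dim)
keys = [rand_vec() for _ in range(num_tokens)]
values = [rand_vec() for _ in range(num_tokens)]
assigned = [memory.update(k, v) for k, v in zip(keys, values)]

q = rand_vec()
out = memory.read(q)
# Reference: full attention over all tokens, each key snapped to its prototype.
ref = attend(q, [memory.prototypes[j] for j in assigned], values)
max_diff = max(abs(a - b) for a, b in zip(out, ref))
print(f"max |memory readout - snapped-key attention| = {max_diff:.2e}")
```

In a scheme like this, the approximation error relative to exact attention comes entirely from replacing each key with its prototype, so quality depends on how the prototypes are chosen and updated, which this sketch leaves out.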

We recently completed a PyTorch-level PoC on Qwen2.5-7B (GQA, 1024-token context). The benchmark directly compares our scatter_add_-based FixC engine against the TurboQuant baseline.

Hard Metrics:

  • Query Latency: Flat at 1.04 ms ($O(1)$), compared to the baseline, which grows to 16.22 ms at 1K tokens.
  • End-to-End Latency: 3.66 ms vs 17.65 ms (4.8x faster).
  • Quality (rel-L2 Error): 0.020, roughly 22x lower than the 1-bit QJL baseline (0.437).
<img width="1589" height="495" alt="Image" src="https://github.com/user-attachments/assets/2cb1fa94-a8db-4977-9835-716e0fbde6e3" /> <img width="1390" height="397" alt="Image" src="https://github.com/user-attachments/assets/1dd95c60-afc3-46a2-90ea-bfade753204b" />
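
As a quick sanity check, the headline ratios can be recomputed from the raw numbers quoted above. The variable names are illustrative, and "rel-L2" is assumed to mean relative L2 error of the attention output.

```python
# Figures as quoted in the metrics list (milliseconds and relative L2 error).
baseline_query_ms, fixc_query_ms = 16.22, 1.04
baseline_e2e_ms, fixc_e2e_ms = 17.65, 3.66
qjl_rel_l2, fixc_rel_l2 = 0.437, 0.020

query_speedup = baseline_query_ms / fixc_query_ms
e2e_speedup = baseline_e2e_ms / fixc_e2e_ms
error_ratio = qjl_rel_l2 / fixc_rel_l2

print(f"query speedup at 1K tokens: {query_speedup:.1f}x")
print(f"end-to-end speedup:         {e2e_speedup:.1f}x")
print(f"error ratio (QJL / FixC):   {error_ratio:.1f}x")
```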

You can reproduce these results using our Colab PoC here: 👉 https://colab.research.google.com/drive/1tISt1MWcti8oubURkDhTlwS7rf_BG4wB?usp=sharing (Note: Core IP is patent-pending (PCT/KR2026/002215) but fully open for open-source evaluation and academic review).

Integration Question for vLLM Maintainers: Our next milestone is writing custom Triton/CUDA kernels for the FixC engine (scatter_add_ updates and mass-log bias readout).

Given vLLM's PagedAttention architecture, what would be the most non-intrusive way to introduce a custom $O(1)$ attention backend? Specifically, since NZFC modifies both the cache update (in-place accumulation) and the readout mechanics, should we look into adding a new AttentionBackend subclass, or would this require deeper modifications to the Block Manager?
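
One way to frame the Block Manager part of this question is in terms of allocation growth. The sketch below is a toy accounting model, not vLLM code: it assumes a paged cache hands out fixed-size token blocks (a block size of 16 is assumed here) and contrasts that with a hypothetical constant-size state, where `state_blocks` is a made-up placeholder.

```python
import math


def paged_blocks(num_tokens, block_size=16):
    """Blocks a paged KV cache needs for one sequence: grows with context length."""
    return math.ceil(num_tokens / block_size)


def fixed_state_blocks(num_tokens, state_blocks=4):
    """Hypothetical constant-size state: the allocation ignores context length."""
    return state_blocks


for t in (1024, 16 * 1024, 128 * 1024):
    print(f"{t:>7} tokens: paged={paged_blocks(t):>5} blocks, "
          f"fixed-state={fixed_state_blocks(t)} blocks")
```

If the custom backend's state really is constant-size per sequence, the allocator's per-token growth logic would not apply to it, which is why the choice between a backend-only change and a Block Manager change matters.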

We would deeply appreciate any architectural feedback, code roasting, or pointers on how to best integrate this with vLLM's ecosystem.

Thanks!

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Introduce a custom $O(1)$ attention backend by creating a new AttentionBackend subclass to integrate the Nuclear ZFC (NZFC) architecture with vLLM's PagedAttention.

Guidance

  • Review vLLM's PagedAttention architecture to understand the current attention backend implementation and identify potential integration points for the NZFC architecture.
  • Investigate the feasibility of adding a new AttentionBackend subclass to support the NZFC architecture, considering the required modifications to the cache update and readout mechanics.
  • Consult with vLLM maintainers to determine the best approach for integrating the custom attention backend, ensuring minimal disruption to the existing architecture.
  • Explore the possibility of leveraging Triton/CUDA kernels for the FixC engine to optimize performance.

Example

No code snippet is provided as the issue focuses on architectural discussion and integration.

Notes

The integration of the NZFC architecture with vLLM's PagedAttention may require significant modifications to the Block Manager or other components, and careful consideration of the trade-offs between performance, accuracy, and complexity.

Recommendation

Prototype the NZFC path as a new AttentionBackend subclass first, since that is the more modular and less intrusive of the two options raised in the RFC. This lets the custom attention backend be evaluated and refined without disturbing the existing PagedAttention path. Whether Block Manager changes are also needed remains an open question for the maintainers.
