vllm: [RFC]: O(1) KV Cache for vLLM: 4.8x Speedup & 22x More Accurate than TurboQuant on Qwen2.5-7B [1 comment, 2 participants]

JEWONMOON · 2026-04-01T05:47:56Z




vllm-project/vllm#38694 • Fetched 2026-04-08 02:23:27


Author: JEWONMOON

Participants: gaby, JEWONMOON

Timeline (top): subscribed ×3, commented ×1, labeled ×1, mentioned ×1


🚀 The feature, motivation and pitch

Hi vLLM team and community,

First, thank you for building and maintaining PagedAttention—it’s the absolute gold standard for LLM serving.

I’m opening this RFC to discuss a potential architectural pathway for extreme long-context serving (128k+ tokens). Currently, as context length $T$ grows, the memory bandwidth required per decode step during generation scales as $O(T)$, because each new token attends over the entire cached history. While quantization (FP8, AWQ, or methods like TurboQuant/QJL) reduces the footprint by a constant factor, it often introduces significant variance explosion (degrading model quality), and the cost still scales linearly in $T$.
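As a rough illustration of the $O(T)$ scaling described above, the following back-of-the-envelope sketch estimates the KV cache size per sequence. The configuration values (28 layers, 4 KV heads, head dimension 128, 2 bytes per element) are assumptions believed to match Qwen2.5-7B in FP16/BF16 and should be checked against the model's `config.json`; the function names are illustrative only.

```python
# Back-of-the-envelope KV cache size for a GQA model.
# Assumed config (verify against the model's config.json):
# 28 layers, 4 KV heads, head_dim 128, 2 bytes per element (FP16/BF16).
def kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128, bytes_per_elem=2):
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(context_len, **cfg):
    # Standard attention reads the whole cache on every decode step,
    # so this is also the per-step memory traffic for one sequence.
    return context_len * kv_bytes_per_token(**cfg)


for T in (1024, 32 * 1024, 128 * 1024):
    print(f"T={T:>7}: {kv_cache_bytes(T) / 2**20:8.1f} MiB")
```

Under these assumptions each token costs 56 KiB of cache, so a 1024-token context is 56 MiB while a 128k-token context is 7 GiB per sequence. Quantizing the cache shrinks the constant but not the linear dependence on $T$.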

Our research team (Singularity Principle) has developed a different mathematical approach based on trace-class admissibility, resulting in the Nuclear ZFC (NZFC) architecture. By maintaining a bounded prototype memory and using an online mass-bias direct readout, we can execute attention without fully materializing the past KV cache, freezing the query readout complexity at strictly $O(1)$.
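The post does not spell out the NZFC/FixC algorithm, so the following is only a generic sketch of the idea of a bounded memory with a log-mass bias, not the authors' method. It assumes each token has already been assigned to one of a fixed number of slots (the assignment policy, the class name `PrototypeMemory`, and the exact form of the bias are all hypothetical). Each slot accumulates key sums, value sums, and a token count; readout runs softmax over the slots with `log(mass)` added to each logit. It is written in pure Python for portability; in PyTorch the accumulation step would map naturally onto `scatter_add_` or `index_add_`.

```python
import math


class PrototypeMemory:
    """Fixed-size summary of past (key, value) pairs. Hypothetical sketch, not NZFC."""

    def __init__(self, num_slots, dim):
        self.dim = dim
        self.key_sum = [[0.0] * dim for _ in range(num_slots)]
        self.val_sum = [[0.0] * dim for _ in range(num_slots)]
        self.mass = [0.0] * num_slots  # number of tokens folded into each slot

    def update(self, slot, key, value):
        # In-place accumulation: the state size never depends on tokens seen.
        for i in range(self.dim):
            self.key_sum[slot][i] += key[i]
            self.val_sum[slot][i] += value[i]
        self.mass[slot] += 1.0

    def readout(self, query):
        # Softmax over occupied slots with logit = q . mean_key / sqrt(d) + log(mass).
        scale = 1.0 / math.sqrt(self.dim)
        logits, values = [], []
        for ks, vs, m in zip(self.key_sum, self.val_sum, self.mass):
            if m == 0.0:
                continue  # skip empty slots
            dot = sum(q * k / m for q, k in zip(query, ks))
            logits.append(dot * scale + math.log(m))
            values.append([v / m for v in vs])
        if not logits:
            raise ValueError("memory is empty")
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        return [sum(w * v[i] for w, v in zip(weights, values)) / total
                for i in range(self.dim)]


# Three tokens folded into two slots; readout cost depends only on num_slots.
mem = PrototypeMemory(num_slots=2, dim=2)
mem.update(0, [1.0, 0.0], [1.0, 0.0])
mem.update(0, [1.0, 0.0], [3.0, 0.0])
mem.update(1, [0.0, 1.0], [0.0, 2.0])
print(mem.readout([1.0, 1.0]))
```

In this toy case the keys within each slot are identical, so the log-mass bias makes the readout agree with exact softmax attention over the three original tokens. With non-identical keys per slot the result is an approximation, and its error depends on the slot assignment, which is the part this sketch leaves open.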

We recently completed a PyTorch-level PoC on Qwen2.5-7B (GQA, 1024-token context). The benchmark directly compares our `scatter_add_`-based FixC engine against the TurboQuant baseline.

**Hard Metrics:**

* **Query Latency:** Flat at **1.04ms** ($O(1)$), compared to the baseline scaling to 16.22ms at 1K tokens.
* **End-to-End Speedup:** 3.66ms vs 17.65ms (**4.8x faster**).
* **Quality (rel-L2 Error):** **0.020**, which is 22x more accurate than the 1-bit QJL baseline (0.437).
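The headline multipliers can be checked directly from the reported figures. The short sketch below only recomputes ratios from the numbers quoted in the post; the dictionary keys and the `ratio` helper are illustrative names, and nothing here re-runs the benchmark.

```python
# Figures as reported in the post (milliseconds and relative L2 error).
query_ms = {"fixc": 1.04, "baseline": 16.22}
e2e_ms = {"fixc": 3.66, "baseline": 17.65}
rel_l2 = {"fixc": 0.020, "baseline_qjl_1bit": 0.437}


def ratio(baseline, candidate):
    # How many times larger the baseline figure is than the candidate's.
    return baseline / candidate


print(f"query latency ratio: {ratio(query_ms['baseline'], query_ms['fixc']):.2f}x")
print(f"end-to-end ratio:    {ratio(e2e_ms['baseline'], e2e_ms['fixc']):.2f}x")
print(f"rel-L2 error ratio:  {ratio(rel_l2['baseline_qjl_1bit'], rel_l2['fixc']):.2f}x")
```

The end-to-end ratio comes out at about 4.82 and the error ratio at about 21.85, consistent with the quoted 4.8x and 22x. The query-latency ratio is roughly 15.6x at 1K tokens.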

You can reproduce these results using our Colab PoC here: 👉 https://colab.research.google.com/drive/1tISt1MWcti8oubURkDhTlwS7rf_BG4wB?usp=sharing

*Note: Core IP is patent-pending (PCT/KR2026/002215) but fully open for open-source evaluation and academic review.*

**Integration Question for vLLM Maintainers:** Our next milestone is writing custom Triton/CUDA kernels for the FixC engine (`scatter_add_` updates and mass-log bias readout).

Given vLLM's PagedAttention architecture, what would be the most non-intrusive way to introduce a custom $O(1)$ attention backend? Specifically, since NZFC modifies both the cache update (in-place accumulation) and the readout mechanics, should we look into adding a new `AttentionBackend` subclass, or would this require deeper modifications to the Block Manager?

We would deeply appreciate any architectural feedback, code roasting, or pointers on how to best integrate this with vLLM's ecosystem.

Thanks!

Alternatives

No response

Additional context

No response

Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

extent analysis

TL;DR

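The issue asks how to integrate the Nuclear ZFC (NZFC) architecture, a proposed $O(1)$ attention readout, into vLLM. The least intrusive candidate is a new AttentionBackend subclass; whether that is sufficient, or whether the Block Manager must also change, is the open question put to the maintainers.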

Guidance

- Review vLLM's PagedAttention architecture to understand the current attention backend implementation and identify integration points for NZFC.
- Investigate whether a new AttentionBackend subclass can accommodate NZFC, given that it changes both the cache update and the readout mechanics.
- Consult the vLLM maintainers on the preferred integration approach, aiming for minimal disruption to the existing architecture.
- Explore Triton/CUDA kernels for the FixC engine to optimize performance.

Example

No code snippet is provided as the issue focuses on architectural discussion and integration.

Notes

The integration of the NZFC architecture with vLLM's PagedAttention may require significant modifications to the Block Manager or other components, and careful consideration of the trade-offs between performance, accuracy, and complexity.
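To make the Block Manager concern concrete, the sketch below contrasts the allocation pattern of a paged cache with that of a bounded-state cache. It models the arithmetic only and does not use any vLLM code; the function names, the block size of 16, and the slot count of 64 are illustrative assumptions.

```python
import math


def paged_blocks_needed(num_tokens, block_size=16):
    # A paged KV cache allocates another block each time a sequence
    # crosses a block boundary, so the block count grows with num_tokens.
    return math.ceil(num_tokens / block_size)


def fixed_state_slots_needed(num_tokens, num_slots=64):
    # A bounded-memory backend would reserve its whole state up front;
    # the requirement is independent of num_tokens.
    return num_slots


for T in (16, 1024, 131072):
    print(T, paged_blocks_needed(T), fixed_state_slots_needed(T))
```

A component that assumes per-sequence memory grows with token count would need some way to represent a sequence whose footprint is constant, which is one reason the integration may reach beyond the attention backend itself.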

Recommendation

Start by prototyping NZFC as a new AttentionBackend subclass, since this is the more modular and less intrusive option. It allows the custom backend to be evaluated and refined without disrupting the existing architecture. Fall back to Block Manager changes only if the prototype shows that the backend interface cannot express the in-place cache update.
