vllm - 💡(How to fix) Fix RFC: Memory-backend interface for KV cache hot paths and attention decode [1 participants]

vllm • 2026-05-03 04:54:47

GitHub stats

vllm-project/vllm#41535•Fetched 2026-05-04 04:59:01

Author

ixu2486

Participants

ixu2486

Timeline (top)

labeled ×1

Motivation.

Hi vLLM team,

I am experimenting with a TurboQuant-compatible memory backend evaluation path, focused on KV-cache residency, kv_dot, and attention decode hot-path pressure.

The goal is not to replace vLLM’s scheduler or execution model, but to explore whether selected KV-cache / attention-decode operations can be routed through an external memory-centric backend while keeping fallback-safe behavior.

Current evaluation repository:

https://github.com/ixu2486/tq_compat_eval

The current direction includes:

KV-cache hot-path analysis
TurboQuant-compatible evaluation layout
memory-backend routing experiments
PIM-compatible execution path exploration
fallback-safe CPU/backend comparison
future compatibility with larger-context inference and speculative decoding pressure points

If the vLLM team or maintainers are interested in testing this direction, please reply here. I can add the relevant personnel to this project repository for evaluation access.

I would also appreciate guidance on whether this kind of work would fit better as:

an external backend experiment
a plugin-style integration
a future RFC
or a benchmark/evaluation-only companion project

Thanks.

Proposed Change.

Introduce an optional external memory-backend interface for selected KV-cache hot paths, especially kv_dot and attention decode.

The proposed change is not to replace vLLM’s scheduler, executor, or existing attention backends. Instead, it would allow experimental memory-centric backends to evaluate selected KV-cache / attention-decode operations through a fallback-safe interface.

The initial scope could be limited to an external evaluation path or plugin-style backend, with CPU/reference fallback and benchmark-only validation before any deeper integration.

This would make it possible to test TurboQuant-compatible KV residency layouts, memory-backend routing, and PIM-compatible execution paths without requiring disruptive changes to vLLM core.
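As a rough illustration of what "fallback-safe" could mean here, the following is a minimal sketch in plain Python. Everything in it is hypothetical: `KVBackend`, `ReferenceCPUBackend`, `UnavailableBackend`, `FallbackRouter`, and `attention_decode` are names invented for this example and are not part of vLLM or of the evaluation repository. It assumes `kv_dot` means the dot product of one query vector against each cached key, and it models attention decode as a single-token softmax-weighted sum over cached values.

```python
import math
from typing import Protocol, Sequence

Vector = Sequence[float]


class KVBackend(Protocol):
    """Hypothetical backend interface; not an existing vLLM API."""

    def kv_dot(self, query: Vector, keys: Sequence[Vector]) -> list[float]: ...


class ReferenceCPUBackend:
    """Plain-Python reference path, used as the fallback and as ground truth."""

    def kv_dot(self, query: Vector, keys: Sequence[Vector]) -> list[float]:
        return [sum(q * k for q, k in zip(query, key)) for key in keys]


class UnavailableBackend:
    """Stand-in for an external memory-centric backend that is not present."""

    def kv_dot(self, query: Vector, keys: Sequence[Vector]) -> list[float]:
        raise RuntimeError("external memory backend unavailable")


class FallbackRouter:
    """Try the experimental backend first; on any error use the reference path."""

    def __init__(self, primary: KVBackend, fallback: KVBackend) -> None:
        self.primary = primary
        self.fallback = fallback
        self.fallback_count = 0  # how often the reference path had to be used

    def kv_dot(self, query: Vector, keys: Sequence[Vector]) -> list[float]:
        try:
            return self.primary.kv_dot(query, keys)
        except Exception:
            self.fallback_count += 1
            return self.fallback.kv_dot(query, keys)


def attention_decode(
    backend: KVBackend,
    query: Vector,
    keys: Sequence[Vector],
    values: Sequence[Vector],
) -> list[float]:
    """Single-token decode: softmax(q . K / sqrt(d)) applied to V."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [s * scale for s in backend.kv_dot(query, keys)]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    router = FallbackRouter(UnavailableBackend(), ReferenceCPUBackend())
    q = [1.0, 0.0]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 0.0], [0.0, 1.0]]
    print(attention_decode(router, q, k, v), router.fallback_count)
```

In a benchmark-only setup along these lines, the same inputs could be run through both the experimental backend and the reference backend, with the outputs compared within a tolerance and `fallback_count` reported alongside the timings.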

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Explore implementing an optional external memory-backend interface for selected KV-cache hot paths as a plugin-style integration or external backend experiment.

Guidance

Review the proposed change and evaluate its compatibility with the existing vLLM architecture, focusing on the fallback-safe interface and CPU/reference fallback.
Investigate the potential benefits of using a plugin-style integration versus an external backend experiment for the memory-centric backend evaluation.
Consider the trade-offs between deeper integration and the initial scope of an external evaluation path or plugin-style backend.
Examine the TurboQuant-compatible KV residency layouts, memory-backend routing, and PIM-compatible execution paths to determine their feasibility and potential impact.

Example

No code snippet is provided as the issue does not contain specific code-related details.

Notes

The solution may vary depending on the specific requirements and constraints of the vLLM project, and further discussion with the vLLM team or maintainers may be necessary to determine the best approach.

Recommendation

Explore the proposed change as a plugin-style integration or an external backend experiment. Either route allows fallback-safe evaluation of selected KV-cache hot paths without disrupting vLLM core.
