vllm - RFC: Memory-backend interface for KV cache hot paths and attention decode

GitHub stats
vllm-project/vllm#41535 · Fetched 2026-05-04 04:59:01
Comments: 0 · Participants: 1 · Timeline: 1 (labeled ×1) · Reactions: 0

Motivation.

Hi vLLM team,

I am experimenting with a TurboQuant-compatible memory backend evaluation path, focused on KV-cache residency, kv_dot, and attention decode hot-path pressure.

The goal is not to replace vLLM’s scheduler or execution model, but to explore whether selected KV-cache / attention-decode operations can be routed through an external memory-centric backend while keeping fallback-safe behavior.

Current evaluation repository:

https://github.com/ixu2486/tq_compat_eval

The current direction includes:

  • KV-cache hot-path analysis
  • TurboQuant-compatible evaluation layout
  • memory-backend routing experiments
  • PIM-compatible execution path exploration
  • fallback-safe CPU/backend comparison
  • future compatibility with larger-context inference and speculative decoding pressure points

If the vLLM team or maintainers are interested in testing this direction, please reply here. I can add the relevant personnel to this project repository for evaluation access.

I would also appreciate guidance on whether this kind of work would fit better as:

  • an external backend experiment
  • a plugin-style integration
  • a future RFC
  • or a benchmark/evaluation-only companion project

Thanks.

Proposed Change.

Introduce an optional external memory-backend interface for selected KV-cache hot paths, especially kv_dot and attention decode.

The proposed change is not to replace vLLM’s scheduler, executor, or existing attention backends. Instead, it would allow experimental memory-centric backends to evaluate selected KV-cache / attention-decode operations through a fallback-safe interface.

The initial scope could be limited to an external evaluation path or plugin-style backend, with CPU/reference fallback and benchmark-only validation before any deeper integration.

This would make it possible to test TurboQuant-compatible KV residency layouts, memory-backend routing, and PIM-compatible execution paths without requiring disruptive changes to vLLM core.

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Explore implementing an optional external memory-backend interface for selected KV-cache hot paths as a plugin-style integration or external backend experiment.

Guidance

  • Review the proposed change and evaluate its compatibility with the existing vLLM architecture, focusing on the fallback-safe interface and CPU/reference fallback.
  • Investigate the potential benefits of using a plugin-style integration versus an external backend experiment for the memory-centric backend evaluation.
  • Consider the trade-offs between deeper integration and the initial scope of an external evaluation path or plugin-style backend.
  • Examine the TurboQuant-compatible KV residency layouts, memory-backend routing, and PIM-compatible execution paths to determine their feasibility and potential impact.

Example

The issue itself contains no code or interface definition.
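
As an illustration only, the fallback-safe routing described in the proposal could be sketched roughly as follows. Everything here is an assumption made for the example: the names `KVBackend`, `ReferenceBackend`, `FallbackSafeBackend`, and `decode_step` are hypothetical, none of them is an existing vLLM API, and nothing is taken from the linked evaluation repository. The sketch treats kv_dot as the dot product of one query vector with each cached key, and it always computes the CPU reference so the experimental result can be checked, which only suits the benchmark-only validation stage mentioned in the proposal.

```python
import math
from typing import Protocol, Sequence

Vector = Sequence[float]


class KVBackend(Protocol):
    """Hypothetical interface for this sketch; not an existing vLLM API."""

    def kv_dot(self, query: Vector, keys: Sequence[Vector]) -> list[float]:
        """Return the dot product of one query with each cached key."""
        ...


class ReferenceBackend:
    """CPU reference path: plain Python dot products."""

    def kv_dot(self, query: Vector, keys: Sequence[Vector]) -> list[float]:
        return [sum(q * k for q, k in zip(query, key)) for key in keys]


class FallbackSafeBackend:
    """Check an experimental backend against the reference on every call.

    The reference result is returned whenever the experimental backend
    raises or disagrees beyond the tolerance.
    """

    def __init__(self, experimental: KVBackend, reference: KVBackend,
                 tol: float = 1e-5) -> None:
        self.experimental = experimental
        self.reference = reference
        self.tol = tol
        self.fallbacks = 0  # number of calls served by the reference path

    def kv_dot(self, query: Vector, keys: Sequence[Vector]) -> list[float]:
        expected = self.reference.kv_dot(query, keys)
        try:
            got = self.experimental.kv_dot(query, keys)
        except Exception:
            self.fallbacks += 1
            return expected
        same = len(got) == len(expected) and all(
            math.isclose(g, e, rel_tol=self.tol, abs_tol=self.tol)
            for g, e in zip(got, expected)
        )
        if not same:
            self.fallbacks += 1
            return expected
        return got


def decode_step(backend: KVBackend, query: Vector,
                keys: Sequence[Vector], values: Sequence[Vector]) -> list[float]:
    """Single-query attention decode: softmax(q.K / sqrt(d)) applied to V."""
    scores = backend.kv_dot(query, keys)
    scale = 1.0 / math.sqrt(len(query))
    peak = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp((s - peak) * scale) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * value[i] for e, value in zip(exps, values))
        for i in range(len(values[0]))
    ]
```

In a benchmark harness, the `fallbacks` counter would give a rough measure of how often the experimental path failed or diverged from the reference, which is one way to compare backends before considering any deeper integration.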

Notes

The right integration route depends on the requirements and constraints of the vLLM project. The issue has no replies yet, so further discussion with the vLLM team or maintainers is needed to settle the approach.

Recommendation

Start with a plugin-style integration or an external backend experiment, so that selected KV-cache hot paths can be evaluated with a safe fallback and without disrupting vLLM core.
