vllm - ✅(Solved) Fix [RFC]: Selective KV Cache offload [1 pull requests, 1 participants]

vllm · 2026-04-08 13:38:44


GitHub stats

vllm-project/vllm#39305 • Fetched 2026-04-09 07:52:00


Author

ruocco

Participants

ruocco

Timeline (top)

mentioned ×2, subscribed ×2, labeled ×1

PR fix notes

PR #39983: Add prompt-percentage base selective offload in OffloadConnector

Repository: vllm-project/vllm
Author: ruocco
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39983

Description (problem / solution / changelog)

Purpose

See RFC #39305

Test Plan

A simple test: two instances, each serving the same sequence of three prompts.

Prompt A
Prompt B
Prompt A

In the first instance, everything is left at its default, i.e. the OffloadConnector offloads everything to the cache. In the second instance, Prompt A is 100% offloaded and Prompt B is not offloaded at all. The cache capacity was chosen so that all prompts are guaranteed to fill it.

(If necessary) Future Test Plan

The plan is to use PR #39795 and add selective offload on realistic traces.

Test Result

Without selective offload, the test takes ~36s to complete, or ~12s per prompt. With selective offload caching only Prompt A, the test completes in ~25s, or ~(12 + 12 + 1)s, showing that the final Prompt A request reused the cached values and skipped most of the computation time.
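The two totals can be reproduced with a toy model. The sketch below is not vLLM code: it assumes a cache that holds exactly one prompt's KV data and evicts the oldest entry when full, and it plugs in the approximate ~12s (full prefill) and ~1s (cache hit) figures from the result above. All names are hypothetical.

```python
from collections import OrderedDict

FULL_PREFILL_S = 12  # approximate per-prompt time reported in the test result
CACHE_HIT_S = 1      # approximate time when the prompt's cached KV data is reused


def run(prompts, offload_fraction, capacity=1):
    """Toy model: return the total seconds needed to serve `prompts` in order.

    `offload_fraction` maps a prompt name to the fraction offloaded (0.0 or 1.0 here).
    `capacity` is how many fully offloaded prompts fit in the cache (an assumption).
    """
    cache = OrderedDict()  # insertion order doubles as eviction order
    total = 0
    for name in prompts:
        if name in cache:
            total += CACHE_HIT_S
            continue
        total += FULL_PREFILL_S
        if offload_fraction.get(name, 1.0) > 0:
            while len(cache) >= capacity:
                cache.popitem(last=False)  # evict the oldest entry
            cache[name] = True
    return total


prompts = ["A", "B", "A"]
default_total = run(prompts, {"A": 1.0, "B": 1.0})    # everything is offloaded
selective_total = run(prompts, {"A": 1.0, "B": 0.0})  # only Prompt A is offloaded
print(default_total, selective_total)
```

In this model the default run loses Prompt A's entry when Prompt B is offloaded, so the third request pays for a full prefill again. Skipping the offload of Prompt B keeps Prompt A in the cache.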

CC: @animeshtrivedi @tdoublep @vMaroon @orozery


Changed files

vllm/distributed/kv_transfer/kv_connector/v1/offloading/scheduler.py (modified, +18/-1)
vllm/entrypoints/openai/chat_completion/protocol.py (modified, +2/-0)
vllm/entrypoints/openai/completion/protocol.py (modified, +2/-0)
vllm/sampling_params.py (modified, +14/-0)


Motivation.

Implement a "selective offloading" directive in the llm-d/vLLM offload path, allowing the CPU and FS offloading connectors (and possibly other connectors) to offload only parts of the prompt out of GPU memory.

At the moment, there is no way to control KV Cache offloading. It is all (all generated KV Cache is offloaded during the forward pass) or nothing (when the connector is disabled).

Public traces such as Mooncake or Alibaba show that 40-60% of the KV Cache is stored but never reused. A mechanism that stores only some tokens and discards others would let vLLM orchestrators (e.g. llm-d) implement policies that improve performance or reduce cost in certain scenarios. A few non-exhaustive examples:

Long one-off requests in between more common prompts;
KV Cache stored on some remote server with limited bandwidth.

Proposed Change.

I have implemented a draft in this fork. It adds a per-prompt API that allows storing only a configurable percentage of the prompt, starting from the beginning of the prompt and stopping once that percentage is reached. This is not the final design; it was implemented only to test the hypothesis (results coming soon).
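The prefix rule described above can be sketched as a small helper. This is an illustration only, not the fork's code: the function name, the `block_size` default, and the choice to round down to whole KV blocks are all assumptions made for the example.

```python
def tokens_to_offload(prompt_len: int, percent: float, block_size: int = 16) -> int:
    """Return how many leading prompt tokens would be offloaded.

    Hypothetical helper: it rounds down to a whole number of KV blocks, on the
    assumption that offloading happens at block granularity.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    target = int(prompt_len * percent / 100.0)  # tokens covered by the percentage
    return (target // block_size) * block_size  # keep only complete blocks


print(tokens_to_offload(1000, 50))
```

Under these assumptions a trailing partial block is never offloaded, so even 100% of a 1000-token prompt with 16-token blocks covers 992 tokens rather than 1000.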

Implementation discussion.

As of right now, the aforementioned fork adds a standalone directive in the call to instruct the OffloadConnector. Another option would be to extend the APIs discussed in RFC #37003.
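To make the "standalone directive" option concrete for the API-shape discussion, here is a minimal sketch of what a per-request directive could look like. Every name in it (`OffloadDirective`, `prompt_percent`, `kv_offload`, `build_request`) is hypothetical and chosen for illustration; none of them is taken from the fork or from vLLM.

```python
from dataclasses import dataclass


@dataclass
class OffloadDirective:
    """Hypothetical per-request directive; not the fork's actual API."""

    # Defaulting to 100 mirrors the current offload-everything behaviour.
    prompt_percent: float = 100.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.prompt_percent <= 100.0:
            raise ValueError("prompt_percent must be between 0 and 100")


def build_request(prompt, directive=None):
    """Attach the directive to a request body under a made-up `kv_offload` key."""
    body = {"prompt": prompt}
    if directive is not None:
        body["kv_offload"] = {"prompt_percent": directive.prompt_percent}
    return body


print(build_request("Prompt B", OffloadDirective(prompt_percent=0.0)))
```

Keeping the directive optional means requests that omit it behave as they do today, which is one way a standalone field could stay backward compatible.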

Feedback Period.

1-2 Weeks.

Feedback welcome, especially regarding:

API shape and abstractions;
Implementation logic and location, e.g. kv_connector vs OffloadConnector;
Alignment with the vLLM and llm-d project goals.

CC List.

@vMaroon @animeshtrivedi

Any Other Things.

No response


extent analysis

TL;DR

Implement a selective offloading directive in the llm-d/vLLM offload path to control KV Cache offloading, allowing for partial offloading of prompts.

Guidance

Review the proposed implementation in the provided fork and test the hypothesis to determine the effectiveness of partial offloading.
Consider extending the APIs discussed in RFC #37003 as an alternative to the standalone directive.
Evaluate the API shape and abstractions to ensure alignment with the vLLM and llm-d project goals.
Assess the implementation logic and location, such as whether it should be part of the kv_connector or OffloadConnector.

Example

No code snippet is provided as the issue does not contain explicit code examples.

Notes

The proposed change is still in the draft stage, and the implementation is not final. Feedback is welcome, especially regarding the API shape, implementation logic, and alignment with project goals.

Recommendation

Apply a workaround by implementing the selective offloading directive as proposed in the fork, while continuing to discuss and refine the implementation to ensure alignment with project goals.
