vllm: [RFC]: Add optional `before_update_states` hook to `KVConnectorBase_V1` for external KV cache integrations [8 comments, 2 participants]

GitHub stats
vllm-project/vllm#39696 · Fetched 2026-04-15 06:20:57
Comments: 8 · Participants: 2 · Timeline events: 31 · Reactions: 0
Timeline (top): mentioned ×10, subscribed ×10, commented ×8, closed ×1

Add an optional, default no-op hook before_update_states to KVConnectorBase_V1. Call it from GPUModelRunner.execute_model (or equivalent location in the v1 worker path) immediately before self._update_states(scheduler_output).


Motivation.

External KV cache projects (AIBrix, LMCache, Mooncake, NixlConnector and similar) need to inject custom lookup logic into the request scheduling path — specifically, just before GPUModelRunner._update_states() runs, so that externally-cached prefix tokens can be reported back to the scheduler and skipped from recomputation.

Today there is no public extension point for this. The only way to plug into that location is to patch gpu_model_runner.py directly. Each connector project carries one .patch file per supported vLLM release (AIBrix has 4 patches today: v0.8.5, v0.9.1, v0.10.2, v0.14.0), and each patch is several thousand lines because the surrounding context drifts between releases.

Concrete numbers from the AIBrix v0.14 patch:

  • 4,057 lines, 9 modified files.
  • The actual hook insertion in gpu_model_runner.py is ~5 hunks of <20 lines each.
  • The remaining ~4,000 lines are connectors and helpers that could live outside the vLLM tree, except they need this hook.

Between vLLM v0.14 and v0.19, gpu_model_runner.py has a 4,209-line diff across 154 hunks. Re-locating those 5 small hunks is the single largest source of friction in supporting a new vLLM release for these projects.

A minimal, optional hook on KVConnectorBase_V1 would:

  1. Eliminate the need to patch gpu_model_runner.py for the most common external-cache use case.
  2. Let connector projects ship as plain pip install packages that register themselves via KVConnectorFactory.register_connector() — same pattern already used by Mooncake and LMCache.
  3. Have zero behavior change for users not using external connectors (default implementation is a no-op returning {}).
  4. Reduce the per-release maintenance burden across the whole external connector ecosystem, not just one project.
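
Point 2 depends on connectors registering themselves by name at import time. The sketch below illustrates that lazy-registration pattern with a stand-in registry. `ConnectorRegistry` and the name `"ExampleConnector"` are hypothetical and only mimic the general shape of `KVConnectorFactory.register_connector()`; the real signature should be checked against the vLLM source.

```python
import importlib
from typing import Callable


class ConnectorRegistry:
    """Stand-in for a connector factory (hypothetical, not vLLM's class)."""

    _loaders: dict[str, Callable[[], type]] = {}

    @classmethod
    def register_connector(cls, name: str, module_path: str, class_name: str) -> None:
        if name in cls._loaders:
            raise ValueError(f"Connector {name!r} is already registered")

        def loader() -> type:
            # Import lazily so that registering a connector stays cheap.
            module = importlib.import_module(module_path)
            return getattr(module, class_name)

        cls._loaders[name] = loader

    @classmethod
    def get_connector_class(cls, name: str) -> type:
        return cls._loaders[name]()


# An external package would run a call like this when it is imported.
# A stdlib class stands in for the connector class so the sketch runs anywhere.
ConnectorRegistry.register_connector("ExampleConnector", "collections", "OrderedDict")
```

With this shape, the connector code lives entirely in the external package and nothing in the host tree has to name it ahead of time.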

This RFC is filed in coordination with a companion RFC in vllm-project/aibrix proposing the corresponding refactor on the AIBrix side. That refactor reduces the AIBrix patch from 4,057 lines to ~125 lines today, and to zero once this hook lands upstream.

Proposed Change.

Summary

Add an optional, default no-op hook before_update_states to KVConnectorBase_V1. Call it from GPUModelRunner.execute_model (or equivalent location in the v1 worker path) immediately before self._update_states(scheduler_output).

API

# vllm/distributed/kv_transfer/kv_connector/v1/base.py

class KVConnectorBase_V1:
    ...

    def before_update_states(
        self,
        scheduler_output: "SchedulerOutput",
    ) -> dict[str, int]:
        """Optional hook called before the model runner updates its
        per-request state for the current step.

        Connectors can use this to look up externally-cached KV blocks
        and report how many tokens of each request are already available
        in the external cache, so the scheduler can skip recomputation.

        Returns:
            A mapping of ``request_id -> num_external_tokens``. Requests
            not present in the dict (or with value 0) are treated as
            having no external cache hit. Default implementation returns
            an empty dict.
        """
        return {}
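
As an illustration of how a connector might override the hook, here is a minimal sketch. Everything in it is an assumption made for the example: `SchedulerOutput` and `KVConnectorBase_V1` are simplified stand-ins rather than the vLLM classes, and `PrefixIndexConnector`, `store`, `prompt_token_ids`, and `block_size` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerOutput:
    """Stand-in for vLLM's SchedulerOutput (hypothetical, heavily simplified)."""
    # request_id -> prompt token IDs scheduled in this step
    prompt_token_ids: dict[str, list[int]] = field(default_factory=dict)


class KVConnectorBase_V1:
    """Stand-in base class carrying only the proposed hook."""

    def before_update_states(self, scheduler_output: SchedulerOutput) -> dict[str, int]:
        return {}


class PrefixIndexConnector(KVConnectorBase_V1):
    """Hypothetical connector backed by an in-memory set of cached prefixes."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.cached_prefixes: set[tuple[int, ...]] = set()

    def store(self, token_ids: list[int]) -> None:
        # Record every block-aligned prefix as available in the external cache.
        for end in range(self.block_size, len(token_ids) + 1, self.block_size):
            self.cached_prefixes.add(tuple(token_ids[:end]))

    def before_update_states(self, scheduler_output: SchedulerOutput) -> dict[str, int]:
        hits: dict[str, int] = {}
        for req_id, token_ids in scheduler_output.prompt_token_ids.items():
            matched = 0
            # Extend the match one block at a time until a prefix is missing.
            for end in range(self.block_size, len(token_ids) + 1, self.block_size):
                if tuple(token_ids[:end]) not in self.cached_prefixes:
                    break
                matched = end
            if matched:
                hits[req_id] = matched
        return hits


connector = PrefixIndexConnector(block_size=4)
connector.store([1, 2, 3, 4, 5, 6, 7, 8])
output = SchedulerOutput(prompt_token_ids={
    "req-a": [1, 2, 3, 4, 5, 6, 7, 8, 9],  # two full blocks are cached
    "req-b": [9, 9, 9, 9],                 # no cached prefix
})
hits = connector.before_update_states(output)
```

Requests without a hit are simply left out of the returned dict, which matches the contract in the docstring above.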

Call site

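# vllm/v1/worker/gpu_model_runner.py (sketch, not the actual upstream code)

def execute_model(self, scheduler_output: "SchedulerOutput", *args, **kwargs):
    ...
    # Let the connector report external cache hits before state is updated.
    if self.kv_connector is not None:
        external_hits = self.kv_connector.before_update_states(scheduler_output)
        if external_hits:
            # apply_external_hits is a placeholder name; the real integration
            # point is the part open to maintainer feedback.
            scheduler_output.apply_external_hits(external_hits)  # or equivalent

    self._update_states(scheduler_output)
    ...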

The exact integration with _update_states / scheduler accounting is the part that benefits most from maintainer feedback — we want to land a shape that vLLM is comfortable maintaining long-term, not a shape that fits one specific external project.
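
To make the intended ordering and the "missing or zero means no hit" rule concrete, here is a toy simulation of the call site. It is a sketch under stated assumptions: `ModelRunnerSketch` and `StubConnector` are hypothetical stand-ins, the scheduler output is a plain dict, and the filtered hits are returned rather than applied, since the real accounting is still undecided.

```python
class StubConnector:
    """Hypothetical connector that reports a fixed set of hits."""

    def before_update_states(self, scheduler_output: dict) -> dict[str, int]:
        return {"req-a": 8, "req-b": 0}


class ModelRunnerSketch:
    """Toy stand-in for GPUModelRunner; only the call ordering is modeled."""

    def __init__(self, kv_connector=None) -> None:
        self.kv_connector = kv_connector
        self.calls: list[str] = []

    def _update_states(self, scheduler_output: dict) -> None:
        self.calls.append("update_states")

    def execute_model(self, scheduler_output: dict) -> dict[str, int]:
        external_hits: dict[str, int] = {}
        if self.kv_connector is not None:
            self.calls.append("before_update_states")
            reported = self.kv_connector.before_update_states(scheduler_output)
            # Missing entries and zero values both mean "no external hit".
            external_hits = {rid: n for rid, n in reported.items() if n > 0}
        self._update_states(scheduler_output)
        return external_hits


runner = ModelRunnerSketch(StubConnector())
hits = runner.execute_model({})
plain_runner = ModelRunnerSketch()  # no connector configured
plain_hits = plain_runner.execute_model({})
```

When no connector is configured, the hook is never invoked and `_update_states` runs exactly as before.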

Backwards compatibility

  • KVConnectorBase_V1 ships a default implementation returning {}.
  • Existing connectors that don't override the hook see no behavior change.
  • The call site is gated on self.kv_connector is not None, identical to how other connector hooks are dispatched today.
  • No public class is renamed or removed.
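
The compatibility claim for existing connectors can be checked with a few lines. This is a minimal sketch with a stand-in base class; `LegacyConnector` and its `load` method are hypothetical and represent any connector written before the hook existed.

```python
class KVConnectorBase_V1:
    """Stand-in base class with the proposed default hook."""

    def before_update_states(self, scheduler_output: object) -> dict[str, int]:
        return {}


class LegacyConnector(KVConnectorBase_V1):
    """Hypothetical pre-existing connector that never mentions the hook."""

    def load(self) -> str:
        return "loading"


legacy = LegacyConnector()
# The inherited default reports no hits, so an `if external_hits:` guard
# at the call site is never taken for this connector.
default_hits = legacy.before_update_states(scheduler_output=None)
inherits_default = (
    LegacyConnector.before_update_states is KVConnectorBase_V1.before_update_states
)
```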

Out of scope for this RFC

  • Changes to scheduler internals beyond consuming the returned dict.
  • Multi-tier cache policy (handled inside each connector).
  • Changes to KVConnectorBase (V0). This RFC only touches V1.

Implementation plan

If maintainers approve, we propose to land this in three small PRs:

  1. Add the hook with default no-op + unit tests covering the default path. Zero behavior change.
  2. Wire the call site in gpu_model_runner.py and add an integration test using a stub connector that returns a fixed dict.
  3. Migrate at least one in-tree connector (or a doc example) to use the hook, to lock in the contract.

We are happy to implement all three. We can also split or combine PRs as maintainers prefer.

Why a hook instead of a richer abstraction

We considered exposing a full pluggable scheduler-stage interface, but:

  • It would be a much larger surface area for vLLM to commit to.
  • Most external KV projects only need this one insertion point today.
  • A small hook can always be extended later; a large interface is hard to shrink.

We'd rather start minimal and let usage reveal whether a richer abstraction is justified.

Feedback Period.

2 weeks

CC List.

@NickLucche @MatthewBonanni @markmc @orozery @LucasWilkinson @KuntaiDu

Any Other Things.

  • Companion RFC filed in vllm-project/aibrix: https://github.com/vllm-project/aibrix/issues/2104. It describes how AIBrix would consume this hook to drop the remaining ~125-line patch entirely.
  • We are willing to do the implementation work and shepherd the PR through review.
  • We have validated the call-site location against vLLM main (v0.19 line) and against the v0.14, v0.10.2, and v0.9.1 patches. The same insertion point exists in all of them, just at different line numbers.

Related prior work

We searched existing issues before filing. Closest neighbours:

  • #19854 [RFC]: KV cache offloading (open). Proposes a new OffloadingConnector + OffloadingManager with pluggable backends and asynchronous worker threads. It targets the same problem space (CPU offloading) but takes a different approach: a self-contained connector rather than an extension hook on KVConnectorBase_V1. The proposals are complementary rather than overlapping — the hook proposed here would also benefit OffloadingConnector if it ever needs to report external hits back to the scheduler. Happy to coordinate if the maintainers prefer to unify both efforts.

  • #21772 [RFC]: Support KV Cache sparse with common framework (open). Proposes SparseKVBase / SparseKVManager for sparse algorithms. Touches the scheduler and layer.py, not KVConnectorBase_V1. No conceptual overlap with this RFC; mentioned only for completeness.

  • #16669 [RFC]: KVBlocks and Metrics Publishing In Inference Frameworks (closed/completed). Useful precedent for extending the KVConnector API with observation hooks (KVEventSink was discussed there). Confirms that small, optional hooks on the connector base class are an accepted pattern in vLLM.

If we missed a closer prior discussion, please point us to it.


extent analysis

TL;DR

Add an optional before_update_states hook to KVConnectorBase_V1 to allow external KV cache projects to inject custom lookup logic into the request scheduling path.

Guidance

  • Review the proposed before_update_states hook and its call site in gpu_model_runner.py to ensure it meets the requirements of external KV cache projects.
  • Consider the backwards compatibility implications of adding this hook, including the default no-op implementation and the gated call site.
  • Evaluate the potential benefits of this hook, including reduced maintenance burden and improved support for external cache use cases.
  • Discuss the proposed implementation plan, including the three small PRs and the migration of at least one in-tree connector to use the hook.

Example

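# Hypothetical stub connector; other required connector methods are omitted.
class FixedHitConnector(KVConnectorBase_V1):
    ...

    def before_update_states(
        self,
        scheduler_output: "SchedulerOutput",
    ) -> dict[str, int]:
        # Example override that returns a fixed dict: the request whose ID
        # is "request_id" has 10 tokens available in the external cache.
        return {"request_id": 10}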

Notes

  • The proposed hook is designed to be minimal and extensible, allowing for future growth and modification as needed.
  • The implementation plan is designed to be incremental and testable, with clear milestones and deliverables.
  • The hook's default no-op implementation and gated call site ensure backwards compatibility and minimize the risk of breaking changes.

Recommendation

Adopt the proposed change: add the before_update_states hook to KVConnectorBase_V1 and wire the call site in gpu_model_runner.py. This gives external KV cache projects a supported extension point in place of per-release patches.
