vllm: KVConnector V1 external hit lookup is consumed as a reservation, but has no plan/abort lifecycle



Summary

In the current V1 KVConnector scheduler flow, get_num_new_matched_tokens() looks like a query-style API, but a positive result is later consumed as a commitment by update_state_after_alloc().

That creates an ambiguous contract for external KV backends that need eviction protection, remote leases, prefetch state, pinning, or any other reservation-like behavior.

Current flow

The scheduler calls:

ext_tokens, load_kv_async = connector.get_num_new_matched_tokens(
    request,
    num_new_local_computed_tokens,
)

If ext_tokens is positive, the scheduler uses that value to allocate cache blocks:

new_blocks = kv_cache_manager.allocate_slots(
    request,
    num_new_tokens,
    num_external_computed_tokens=ext_tokens,
    delay_cache_blocks=load_kv_async,
    ...
)

Then it calls:

connector.update_state_after_alloc(
    request,
    kv_cache_manager.get_blocks(request_id),
    ext_tokens,
)

update_state_after_alloc() returns nothing, so the connector has no supported way to report that the earlier lookup returned 100 external tokens but only 50 are still valid now.

So once get_num_new_matched_tokens() returns a positive token count, that value is effectively a reservation-level commitment. It is not merely advisory.

Why this matters

For a pure in-process lookup, this may be fine. For external KV backends, it is much harder:

  • external blocks may be evicted between lookup and allocation update;
  • remote availability may change;
  • an async prefetch may need to be started and then polled;
  • hit blocks may need to be pinned or leased so the later load can consume them;
  • repeated scheduler probes for the same request/hash range must not double-acquire resources.

To be compatible with the current two-hook API, reservation-based connectors need to implement extra idempotency and lifecycle logic themselves:

  • memoize repeated get_num_new_matched_tokens() probes for the same request and hash slice;
  • avoid double pinning / double leasing when the scheduler asks again;
  • track whether a positive match was actually consumed by update_state_after_alloc();
  • release reservations if the request changes, allocation does not proceed, or the request is cancelled.

That is a lot of connector-side machinery caused by the API shape, not by backend-specific business logic.
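
To give a sense of how much bookkeeping that implies, here is a minimal sketch of the four responsibilities listed above. Everything in it is hypothetical: ReservationTracker, its method names, and the pin_calls / release_calls counters are illustrative stand-ins for backend pin and unpin calls, not part of vLLM.

```python
from dataclasses import dataclass, field


@dataclass
class ReservationTracker:
    """Hypothetical connector-side bookkeeping; not part of vLLM."""
    # Maps request_id -> (hash_slice, num_tokens, consumed flag).
    _held: dict = field(default_factory=dict)
    pin_calls: int = 0      # stand-in counter for backend pin/lease calls
    release_calls: int = 0  # stand-in counter for backend unpin calls

    def probe(self, request_id: str, hash_slice: tuple, num_tokens: int) -> int:
        held = self._held.get(request_id)
        if held is not None and held[0] == hash_slice:
            # Repeated probe for the same slice: reuse the reservation
            # instead of pinning a second time.
            return held[1]
        if held is not None:
            # The request changed: drop the stale reservation first.
            self.release(request_id)
        if num_tokens > 0:
            self.pin_calls += 1
            self._held[request_id] = (hash_slice, num_tokens, False)
        return num_tokens

    def consume(self, request_id: str) -> None:
        # Would be called from update_state_after_alloc(): mark the match used.
        hash_slice, num_tokens, _ = self._held[request_id]
        self._held[request_id] = (hash_slice, num_tokens, True)

    def release(self, request_id: str) -> None:
        # Would be called on cancellation or when allocation does not proceed.
        # Consumed reservations are assumed to be released by the load path.
        held = self._held.pop(request_id, None)
        if held is not None and not held[2]:
            self.release_calls += 1


tracker = ReservationTracker()
tracker.probe("req-1", ("h0", "h1"), 100)
tracker.probe("req-1", ("h0", "h1"), 100)  # second probe does not pin again
tracker.release("req-1")                   # allocation never happened
```

Every reservation-based connector would need some variant of this, which is the duplication the issue is pointing at.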

Example

Suppose an external backend observes 100 matching tokens:

get_num_new_matched_tokens(...) -> (100, True)

The scheduler allocates blocks for 100 external tokens and later calls:

update_state_after_alloc(..., num_external_tokens=100)

If the earlier lookup was side-effect-free (nothing was pinned or leased), only 50 tokens may still be safely loadable by this point. The current API gives the connector no clean way to downgrade the accepted token count to 50, and the scheduler has already made allocation and scheduling decisions based on 100.
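
The race can be reproduced with a toy model. The sketch below assumes a hypothetical EvictingBackend whose lookup pins nothing; the class and the step ordering are illustrative only and do not call any vLLM code.

```python
class EvictingBackend:
    """Hypothetical external store whose contents can shrink at any time."""

    def __init__(self, num_tokens: int) -> None:
        self.num_tokens = num_tokens

    def lookup(self) -> int:
        # Side-effect-free: nothing is pinned or leased.
        return self.num_tokens

    def evict(self, num_tokens: int) -> None:
        self.num_tokens = max(0, self.num_tokens - num_tokens)


backend = EvictingBackend(100)

# Step 1: the lookup hook reports 100 matching tokens.
ext_tokens = backend.lookup()

# Step 2: the scheduler sizes its allocation from that number.
allocated_for = ext_tokens

# Step 3: the backend evicts half of the range before the second hook runs.
backend.evict(50)

# Step 4: when update_state_after_alloc() would run, the scheduler still
# assumes 100 tokens, but only 50 can actually be loaded.
still_loadable = backend.lookup()
shortfall = allocated_for - still_loadable
print(f"allocated for {allocated_for}, loadable {still_loadable}, shortfall {shortfall}")
```

The shortfall of 50 tokens is known only to the connector, and the second hook has no channel for reporting it.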

Suggested direction

I think vLLM should make this lifecycle explicit. A few possible options:

  1. Document the current contract more strongly

    If get_num_new_matched_tokens() returns a positive token count, that count must remain consumable by update_state_after_alloc(). Connectors should be allowed to reserve, pin, lease, or start prefetch work during get_num_new_matched_tokens().

  2. Add an abort/release hook

    If the scheduler receives a positive external match but allocation does not proceed, the connector should be notified so it can release any reservation created by the query step.

    For example:

    connector.abort_external_match(request)

  3. Pass an optional query result / plan object between the two hooks

    Instead of passing only num_external_tokens, the scheduler could preserve an opaque connector-owned result from the lookup phase and pass it into update_state_after_alloc().

    Sketch:

    @dataclass
    class ExternalKVMatch:
        num_tokens: int
        load_kv_async: bool
        query_result: object | None = None

    Then:

    match = connector.get_num_new_matched_tokens(...)
    connector.update_state_after_alloc(request, blocks, match.query_result)

    Or, more explicitly:

    plan = connector.prepare_external_match(request, num_computed_tokens)
    connector.commit_external_match(request, blocks, plan)
    connector.abort_external_match(request, plan)

    This would let vLLM own the lifecycle shape while still keeping backend-specific reservation details opaque.
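
As a rough illustration of the third option, the sketch below walks one request through prepare, then either commit or abort. It reuses the hook names proposed above, but the bodies, ExternalKVPlan, LeasingConnector, the lease counter, and the schedule() helper are all assumptions made for the example. It also passes a plain request id where the proposal passes a request object.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalKVPlan:
    """Opaque, connector-owned result of the lookup phase (hypothetical)."""
    request_id: str
    num_tokens: int
    load_kv_async: bool
    lease_id: int


@dataclass
class LeasingConnector:
    """Hypothetical connector using the prepare/commit/abort names above."""
    matchable_tokens: int
    active_leases: set = field(default_factory=set)
    _next_lease: int = 0

    def prepare_external_match(self, request_id, num_computed_tokens):
        # Take a lease so the matched range cannot be evicted before commit.
        num_tokens = max(0, self.matchable_tokens - num_computed_tokens)
        self._next_lease += 1
        self.active_leases.add(self._next_lease)
        return ExternalKVPlan(request_id, num_tokens, True, self._next_lease)

    def commit_external_match(self, request_id, blocks, plan):
        # Allocation succeeded. This toy simply drops the lease; a real
        # backend would presumably hand it over to the load path.
        self.active_leases.discard(plan.lease_id)
        return plan.num_tokens

    def abort_external_match(self, request_id, plan):
        # Allocation did not proceed: release the reservation.
        self.active_leases.discard(plan.lease_id)


def schedule(connector, request_id, free_blocks):
    """Toy scheduler step: one block per token, abort when space runs out."""
    plan = connector.prepare_external_match(request_id, 0)
    if plan.num_tokens > free_blocks:
        connector.abort_external_match(request_id, plan)
        return 0
    blocks = list(range(plan.num_tokens))
    return connector.commit_external_match(request_id, blocks, plan)
```

Because the plan object travels with the request, the connector does not have to rediscover which reservation a later call refers to.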

Compatibility

This could be introduced as an optional V1 extension first, while preserving the existing tuple-returning API for current connectors.
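
One way such an optional extension might stay backward compatible is for the scheduler to call the new hook only when a connector defines it. The sketch below is an assumption about how that could look: LegacyConnector, ReservingConnector, and abort_if_supported() are hypothetical names, and only abort_external_match comes from the proposal above.

```python
class LegacyConnector:
    """Hypothetical existing connector that returns the current tuple."""

    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        return 100, True

    def update_state_after_alloc(self, request, blocks, num_external_tokens):
        self.last_update = (request, len(blocks), num_external_tokens)


class ReservingConnector(LegacyConnector):
    """Hypothetical connector that opts in to the abort hook."""

    def __init__(self):
        self.aborted = []

    def abort_external_match(self, request):
        # Release whatever the lookup step reserved for this request.
        self.aborted.append(request)


def abort_if_supported(connector, request) -> bool:
    """Call the optional hook only when the connector defines it."""
    hook = getattr(connector, "abort_external_match", None)
    if hook is None:
        return False  # legacy connector: nothing to release
    hook(request)
    return True
```

Connectors that never reserve anything keep working unchanged, while reservation-based ones get a release notification.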

The important part is to make the implicit contract explicit: a positive external match is consumed later as a reservation. Treating it as a side-effect-free query makes repeated scheduler probes and backend eviction semantics unnecessarily fragile.
