vllm - KVConnector V1 external hit lookup is consumed as a reservation, but has no plan/abort lifecycle

Summary

In the current V1 KVConnector scheduler flow, get_num_new_matched_tokens() looks like a query-style API, but a positive result is later consumed as a commitment by update_state_after_alloc().

That creates an ambiguous contract for external KV backends that need eviction protection, remote leases, prefetch state, pinning, or any other reservation-like behavior.

Current flow

The scheduler calls:

ext_tokens, load_kv_async = connector.get_num_new_matched_tokens(
    request,
    num_new_local_computed_tokens,
)

If ext_tokens is positive, the scheduler uses that value to allocate cache blocks:

new_blocks = kv_cache_manager.allocate_slots(
    request,
    num_new_tokens,
    num_external_computed_tokens=ext_tokens,
    delay_cache_blocks=load_kv_async,
    ...
)

Then it calls:

connector.update_state_after_alloc(
    request,
    kv_cache_manager.get_blocks(request_id),
    ext_tokens,
)

update_state_after_alloc() returns nothing, so the connector has no supported way to report that the earlier lookup returned 100 external tokens but only 50 are still valid.

So once get_num_new_matched_tokens() returns a positive token count, that value is effectively a reservation-level commitment. It is not merely advisory.
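To make the shape of this contract concrete, here is a minimal toy model of the two-hook flow. It is a sketch, not vLLM code: `ToyConnector`, `schedule_once`, the string request ids, and the block-size arithmetic are all assumptions made for illustration, and the real hooks take a request object rather than an id.

```python
class ToyConnector:
    """Hypothetical stand-in for a V1 KV connector; not vLLM code."""

    def __init__(self, matched_tokens: int) -> None:
        self.matched_tokens = matched_tokens
        self.committed: dict[str, int] = {}

    def get_num_new_matched_tokens(
        self, request_id: str, num_computed_tokens: int
    ) -> tuple[int, bool]:
        # Looks like a query: (external token count, load asynchronously?).
        return max(self.matched_tokens - num_computed_tokens, 0), True

    def update_state_after_alloc(
        self, request_id: str, blocks: list[int], num_external_tokens: int
    ) -> None:
        # Returns None, so a smaller "still valid" count cannot be reported.
        self.committed[request_id] = num_external_tokens


def schedule_once(
    connector: ToyConnector,
    request_id: str,
    num_computed_tokens: int,
    block_size: int = 16,
) -> tuple[int, list[int]]:
    """Toy scheduler step: lookup, allocate, then notify the connector."""
    ext_tokens, _load_kv_async = connector.get_num_new_matched_tokens(
        request_id, num_computed_tokens
    )
    # Toy allocation: one block id per block_size tokens (ceiling division).
    blocks = list(range(-(-ext_tokens // block_size)))
    connector.update_state_after_alloc(request_id, blocks, ext_tokens)
    return ext_tokens, blocks


connector = ToyConnector(matched_tokens=100)
ext_tokens, blocks = schedule_once(connector, "req-0", num_computed_tokens=0)
```

The value returned by the lookup flows unchanged into allocation and then into `update_state_after_alloc()`, which is why it behaves like a commitment rather than a hint.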

Why this matters

For a pure in-process lookup, this may be fine. For external KV backends, it is much harder:

external blocks may be evicted between lookup and allocation update;
remote availability may change;
an async prefetch may need to be started and then polled;
hit blocks may need to be pinned or leased so the later load can consume them;
repeated scheduler probes for the same request/hash range must not double-acquire resources.

To be compatible with the current two-hook API, reservation-based connectors need to implement extra idempotency and lifecycle logic themselves:

memoize repeated get_num_new_matched_tokens() probes for the same request and hash slice;
avoid double pinning / double leasing when the scheduler asks again;
track whether a positive match was actually consumed by update_state_after_alloc();
release reservations if the request changes, allocation does not proceed, or the request is cancelled.

That is a lot of connector-side machinery caused by the API shape, not by backend-specific business logic.
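The bookkeeping listed above might look roughly like the following sketch. Everything in it is hypothetical: `LeaseTrackingConnector`, `Reservation`, the `(request id, hash slice)` key, and the lease counters stand in for real backend calls, and the hook signatures are simplified relative to vLLM's.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    num_tokens: int
    consumed: bool = False


class LeaseTrackingConnector:
    """Hypothetical connector-side bookkeeping; backend calls are counters."""

    def __init__(self, backend_match_tokens: int) -> None:
        self.backend_match_tokens = backend_match_tokens
        self.leases_acquired = 0
        self.leases_released = 0
        # Keyed by (request id, hash slice) so repeated probes are idempotent.
        self._reservations: dict[tuple[str, tuple[int, ...]], Reservation] = {}

    def get_num_new_matched_tokens(self, request_id, block_hashes):
        key = (request_id, tuple(block_hashes))
        if key not in self._reservations:
            # The request changed its hash slice: drop any stale reservation.
            self.release(request_id)
            self.leases_acquired += 1  # stand-in for a pin/lease on the backend
            self._reservations[key] = Reservation(self.backend_match_tokens)
        return self._reservations[key].num_tokens, True

    def update_state_after_alloc(self, request_id, num_external_tokens):
        # Mark the match as consumed so release() does not undo it.
        for (rid, _), reservation in self._reservations.items():
            if rid == request_id:
                reservation.consumed = True

    def release(self, request_id):
        stale = [key for key in self._reservations if key[0] == request_id]
        for key in stale:
            reservation = self._reservations.pop(key)
            if not reservation.consumed:
                self.leases_released += 1  # stand-in for unpin/unlease


conn = LeaseTrackingConnector(backend_match_tokens=100)
conn.get_num_new_matched_tokens("r", [1, 2, 3])
conn.get_num_new_matched_tokens("r", [1, 2, 3])     # repeated probe, no new lease
conn.get_num_new_matched_tokens("r", [1, 2, 3, 4])  # slice changed, old lease released
conn.update_state_after_alloc("r", 100)
conn.release("r")  # consumed, so nothing is released here
```

None of this logic depends on what the backend actually stores, which is the point: every reservation-based connector would need some version of it.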

Example

Suppose an external backend observes 100 matching tokens:

get_num_new_matched_tokens(...) -> (100, True)

The scheduler allocates blocks for 100 external tokens and later calls:

update_state_after_alloc(..., num_external_tokens=100)

If the earlier lookup was side-effect-free (nothing was pinned or leased), the backend may have evicted part of the match in the meantime, so only 50 tokens may still be safely loadable by this point. The current API gives the connector no clean way to downgrade the accepted token count to 50, and the scheduler has already made allocation and scheduling decisions based on 100.
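The same scenario as a small simulation. `EvictingBackend` is a hypothetical external store invented for this sketch; it only models a token count that can shrink between the two hooks.

```python
class EvictingBackend:
    """Hypothetical external KV store whose contents can shrink at any time."""

    def __init__(self, cached_tokens: int) -> None:
        self.cached_tokens = cached_tokens

    def lookup(self) -> int:
        # Side-effect-free query: nothing is pinned or leased.
        return self.cached_tokens

    def evict(self, num_tokens: int) -> None:
        self.cached_tokens = max(self.cached_tokens - num_tokens, 0)


backend = EvictingBackend(cached_tokens=100)

# What get_num_new_matched_tokens() would report to the scheduler.
promised = backend.lookup()

# Eviction happens while the scheduler is allocating blocks.
backend.evict(50)

# What is still valid when update_state_after_alloc() runs.
loadable = min(promised, backend.lookup())
shortfall = promised - loadable

print(promised, loadable, shortfall)  # prints "100 50 50"
```

The shortfall of 50 tokens has nowhere to go: the blocks are already allocated, and the hook that observes the problem cannot return a value.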

Suggested direction

I think vLLM should make this lifecycle explicit. A few possible options:

Document the current contract more strongly

If get_num_new_matched_tokens() returns a positive token count, that count must remain consumable by update_state_after_alloc(). Connectors should be allowed to reserve, pin, lease, or start prefetch work during get_num_new_matched_tokens().

Add an abort/release hook

If the scheduler receives a positive external match but allocation does not proceed, the connector should be notified so it can release any reservation created by the query step.

For example:
```
connector.abort_external_match(request)
```
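A sketch of where the scheduler could call such a hook. `ReservingConnector`, `try_schedule`, and the `free_blocks` check are hypothetical simplifications; the real scheduler has more reasons than "not enough blocks" for allocation not to proceed.

```python
from typing import Optional


class ReservingConnector:
    """Hypothetical connector that reserves on lookup; not vLLM code."""

    def __init__(self) -> None:
        self.reserved: set[str] = set()
        self.aborted: list[str] = []

    def get_num_new_matched_tokens(
        self, request_id: str, num_computed_tokens: int
    ) -> tuple[int, bool]:
        self.reserved.add(request_id)  # stand-in for a pin or lease
        return 100, True

    def update_state_after_alloc(
        self, request_id: str, blocks: list[int], num_external_tokens: int
    ) -> None:
        self.reserved.discard(request_id)  # reservation handed to the load path

    def abort_external_match(self, request_id: str) -> None:
        self.reserved.discard(request_id)
        self.aborted.append(request_id)


def try_schedule(
    connector: ReservingConnector,
    request_id: str,
    free_blocks: int,
    block_size: int = 16,
) -> Optional[list[int]]:
    ext_tokens, _load_kv_async = connector.get_num_new_matched_tokens(request_id, 0)
    needed = -(-ext_tokens // block_size)  # ceiling division
    if needed > free_blocks:
        # Allocation does not proceed: tell the connector to release.
        if ext_tokens > 0:
            connector.abort_external_match(request_id)
        return None
    blocks = list(range(needed))
    connector.update_state_after_alloc(request_id, blocks, ext_tokens)
    return blocks


conn = ReservingConnector()
failed = try_schedule(conn, "a", free_blocks=2)
succeeded = try_schedule(conn, "b", free_blocks=10)
```

With the hook in place, every positive lookup ends in exactly one of two calls, so the connector never has to guess whether a reservation is still wanted.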

Pass an optional query result / plan object between the two hooks

Instead of passing only num_external_tokens, the scheduler could preserve an opaque connector-owned result from the lookup phase and pass it into update_state_after_alloc().

Sketch:

from dataclasses import dataclass

@dataclass
class ExternalKVMatch:
    num_tokens: int
    load_kv_async: bool
    # Opaque, connector-owned state carried from lookup to update.
    query_result: object | None = None

Then:

match = connector.get_num_new_matched_tokens(...)
connector.update_state_after_alloc(request, blocks, match.query_result)

Or, more explicitly:

plan = connector.prepare_external_match(request, num_computed_tokens)
connector.commit_external_match(request, blocks, plan)
connector.abort_external_match(request, plan)

This would let vLLM own the lifecycle shape while still keeping backend-specific reservation details opaque.
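One possible shape for the explicit three-call variant, as a sketch only. `ExternalMatchPlan`, `PlanningConnector`, the integer plan ids, and the choice to make abort idempotent are assumptions of this example, not a proposed final API.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ExternalMatchPlan:
    """Hypothetical opaque plan; only the connector interprets plan_id."""

    plan_id: int
    num_tokens: int
    load_kv_async: bool


class PlanningConnector:
    """Hypothetical connector with an explicit prepare/commit/abort lifecycle."""

    def __init__(self, matched_tokens: int) -> None:
        self.matched_tokens = matched_tokens
        self._ids = count()
        self.open_plans: dict[int, ExternalMatchPlan] = {}
        self.committed: dict[int, list[int]] = {}

    def prepare_external_match(
        self, request_id: str, num_computed_tokens: int
    ) -> ExternalMatchPlan:
        # Reservation work (pin, lease, prefetch) would start here.
        num_tokens = max(self.matched_tokens - num_computed_tokens, 0)
        plan = ExternalMatchPlan(next(self._ids), num_tokens, True)
        self.open_plans[plan.plan_id] = plan
        return plan

    def commit_external_match(
        self, request_id: str, blocks: list[int], plan: ExternalMatchPlan
    ) -> None:
        # Raises KeyError if the plan was already committed or aborted.
        self.open_plans.pop(plan.plan_id)
        self.committed[plan.plan_id] = list(blocks)

    def abort_external_match(self, request_id: str, plan: ExternalMatchPlan) -> None:
        # Idempotent: aborting twice, or after commit, is a no-op.
        self.open_plans.pop(plan.plan_id, None)


conn = PlanningConnector(matched_tokens=100)

first = conn.prepare_external_match("r", num_computed_tokens=0)
conn.commit_external_match("r", [0, 1], first)

second = conn.prepare_external_match("r", num_computed_tokens=40)
conn.abort_external_match("r", second)
conn.abort_external_match("r", second)  # safe to repeat
```

Because each plan has its own identity, repeated probes for the same request produce distinct plans that are resolved independently, which removes the need for connector-side memoization.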

Compatibility

This could be introduced as an optional V1 extension first, while preserving the existing tuple-returning API for current connectors.
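A rough sketch of how an optional extension could coexist with the tuple-returning API: the scheduler side checks whether the connector implements the new hook and otherwise falls back to the existing one. The helper names `prepare`, `commit`, and `abort`, the dict-shaped plan, and `LegacyConnector` are all hypothetical.

```python
from typing import Any


class LegacyConnector:
    """Hypothetical connector that only implements the existing two hooks."""

    def __init__(self) -> None:
        self.allocated = None

    def get_num_new_matched_tokens(self, request_id, num_computed_tokens):
        return 64, False

    def update_state_after_alloc(self, request_id, blocks, num_external_tokens):
        self.allocated = (request_id, list(blocks), num_external_tokens)


def prepare(connector: Any, request_id: str, num_computed_tokens: int) -> dict:
    hook = getattr(connector, "prepare_external_match", None)
    if hook is not None:
        # New-style connector: keep its opaque plan untouched.
        return {"plan": hook(request_id, num_computed_tokens), "legacy": False}
    # Legacy connector: wrap the tuple so the caller sees one shape.
    num_tokens, load_kv_async = connector.get_num_new_matched_tokens(
        request_id, num_computed_tokens
    )
    return {"num_tokens": num_tokens, "load_kv_async": load_kv_async, "legacy": True}


def commit(connector: Any, request_id: str, blocks: list, plan: dict) -> None:
    if plan["legacy"]:
        connector.update_state_after_alloc(request_id, blocks, plan["num_tokens"])
    else:
        connector.commit_external_match(request_id, blocks, plan["plan"])


def abort(connector: Any, request_id: str, plan: dict) -> None:
    # Legacy connectors have no abort hook, so there is nothing to notify.
    if not plan["legacy"]:
        connector.abort_external_match(request_id, plan["plan"])


legacy = LegacyConnector()
plan = prepare(legacy, "r", num_computed_tokens=0)
commit(legacy, "r", [0, 1, 2, 3], plan)
abort(legacy, "r", plan)  # no-op for a legacy connector
```

Existing connectors keep their current behavior under this scheme, while reservation-based connectors can opt in to the explicit lifecycle.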

The important part is to make the implicit contract explicit: a positive external match is consumed later as a reservation. Treating it as a side-effect-free query makes repeated scheduler probes and backend eviction semantics unnecessarily fragile.
