vllm: [RFC]: Hybrid checkpoint ABI for non-KV prefix resume

GitHub stats
vllm-project/vllm#40533 (fetched 2026-04-22 07:44:00)
Comments: 0
Participants: 1
Timeline: 1
Reactions: 0
Timeline (top): labeled ×1

Root Cause

Prefix caching is comparatively clean for transformer KV state because the cache object and the reuse boundary align closely. Hybrid models complicate that with non-KV state such as recurrent state, conv buffers, and local-window or hybrid side state. That state is often updated in place, materialized differently across phases, backend-specific in storage layout, and not safely validated by raw backing-buffer equality alone.

Motivation.

A recent Metal/MLX proof point showed that storing live runtime-backed arrays in a hybrid checkpoint cache can be unsafe, and that a correct restore contract must distinguish payload immutability from logical state equality.

Public vLLM docs already describe hybrid cache coordination and still note that Mamba-style prefix caching is a work in progress. A checkpoint ABI would give hybrid recurrent-state resume the same kind of explicit correctness contract that APC already gives KV prefix reuse.

Numbers from the Metal/MLX proof point:

  • first TTFT: 10.93s → repeated replay TTFT: 1.90s (~5.7x)
  • 8-request soak: warm-tail TTFT stable at 1.88–1.90s after request 1
  • tool calls, content, and finish reason identical across replays
  • semantic drift fully eliminated on the tested Hermes workload

Before the fix, repeated replay was fast but wrong: cache hits were real and TTFT dropped sharply, but a request that had previously emitted a web_search tool call instead terminated on replay with finish reason stop and no tool call. After the fix, that drift is gone.
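
The hazard described above can be illustrated with a small sketch. It is not the Metal/MLX code: a bytearray stands in for a live runtime-backed state buffer, and the cache layout and key name are assumptions made for illustration. It shows why a cache entry that aliases a live, in-place-updated buffer silently changes, while an owned immutable snapshot with a payload hash does not.

```python
import hashlib

# Hypothetical stand-in for a backend's live recurrent-state buffer.
live_state = bytearray(b"\x01\x02\x03\x04")

# Unsafe: the cache entry aliases the live buffer that decoding updates in place.
unsafe_cache = {"boundary-B": live_state}

# Safer: the cache entry owns an immutable snapshot plus a hash of its bytes.
snapshot = bytes(live_state)
safe_cache = {"boundary-B": (snapshot, hashlib.sha256(snapshot).hexdigest())}

# The request keeps running and mutates its state in place.
live_state[0] = 0xFF

# The aliased entry now reflects post-boundary state, not the state at B.
unsafe_changed = bytes(unsafe_cache["boundary-B"]) != snapshot

# The owned snapshot is unchanged, and its hash can prove that.
payload, digest = safe_cache["boundary-B"]
safe_intact = (
    hashlib.sha256(payload).hexdigest() == digest
    and payload == b"\x01\x02\x03\x04"
)

print(unsafe_changed, safe_intact)
```

In this toy, a cache hit on the aliased entry would still look like a hit, which matches the "fast but wrong" behavior described above.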

This RFC proposes making that distinction explicit in the architecture rather than rediscovering it per-backend.

Proposed Change.

Summary

This RFC proposes a backend-agnostic correctness contract for hybrid prefix resume.

The central idea is that hybrid cross-request resume is a two-key system:

  1. APC / KV logic proves reusable prefix identity at a logical boundary.
  2. Checkpoint ABI proves logically equivalent non-KV state at that same boundary.

This RFC does not propose replacing the existing KV cache managers. It proposes adding an explicit checkpoint ABI for the non-KV side of hybrid resume.

Proposed model

Two-key resume

Hybrid resume is legal only when both hold at the same boundary B:

  • Prefix identity: the paged KV side proves that boundary B is reusable.
  • State identity: the checkpoint side proves that non-KV hybrid state at B is logically equivalent.

This means hybrid resume should not be treated as "KV hit plus some side state." It should be treated as one coherent restart boundary where both proofs meet.
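
A minimal sketch of the two-key rule, assuming boundaries are token indices. The PrefixProof and StateProof types and the resume_boundary function are hypothetical names for illustration and are not vLLM APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrefixProof:
    """Hypothetical result from the KV side: prefix is reusable up to `boundary`."""
    boundary: int


@dataclass(frozen=True)
class StateProof:
    """Hypothetical result from the checkpoint side for non-KV state."""
    boundary: int
    value_hash_ok: bool  # logical equality was established at `boundary`


def resume_boundary(prefix: Optional[PrefixProof],
                    state: Optional[StateProof]) -> Optional[int]:
    """Return boundary B only when both proofs hold at the same B, else None."""
    if prefix is None or state is None:
        return None  # one key is missing: no hybrid resume
    if prefix.boundary != state.boundary:
        return None  # the two proofs refer to different boundaries
    if not state.value_hash_ok:
        return None  # non-KV state could not be proven equivalent
    return prefix.boundary
```

For example, resume_boundary(PrefixProof(128), StateProof(128, True)) returns 128, while a KV hit at 128 paired with a checkpoint at 64 returns None. In that case the caller would fall back to recomputing rather than combining mismatched state.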

Checkpoint ABI

The checkpoint ABI should define:

  • checkpoint key: boundary identity aligned with the same logical boundary discipline used by prefix caching
  • tensor metadata: tensor kind, layer id, shape, logical dtype, storage dtype, layout tag
  • payload: immutable authoritative payload bytes
  • hashes:
    • payload_hash for stored-payload immutability
    • value_hash for canonical logical equality
  • restore phases:
    1. payload → scratch
    2. scratch → live slot
    3. live slot → post-restore materialized live view
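
One possible shape for a checkpoint record with the fields listed above, as a sketch only. The class names, field names, string-typed dtypes, and the choice of SHA-256 are assumptions; the RFC leaves the concrete schema open for discussion.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorMeta:
    kind: str                # e.g. "recurrent_state" or "conv_buffer"
    layer_id: int
    shape: Tuple[int, ...]
    logical_dtype: str       # dtype the state means, e.g. "float32"
    storage_dtype: str       # dtype the payload bytes are stored in
    layout_tag: str          # backend-declared layout, e.g. "row_major"


@dataclass(frozen=True)
class Checkpoint:
    key: str                 # boundary identity, aligned with prefix caching
    meta: TensorMeta
    payload: bytes           # immutable authoritative bytes
    payload_hash: str        # proves the stored bytes have not changed
    value_hash: str          # proves canonical logical equality


def make_checkpoint(key: str, meta: TensorMeta, source, value_hash: str) -> Checkpoint:
    """Copy `source` into owned immutable bytes and record its payload hash."""
    payload = bytes(source)  # never keep a reference to a live buffer
    return Checkpoint(key, meta, payload,
                      hashlib.sha256(payload).hexdigest(), value_hash)


def payload_intact(cp: Checkpoint) -> bool:
    """Check stored-payload immutability only; says nothing about logical value."""
    return hashlib.sha256(cp.payload).hexdigest() == cp.payload_hash
```

Keeping payload_hash and value_hash as separate fields lets a backend re-pack the payload later without changing what the checkpoint means.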

Equality contract

The key proposal is to separate two notions of equality:

  • payload equality: the stored checkpoint bytes have not changed
  • logical value equality: the restored state means the same thing, even if the backend realizes it differently in memory

This explicitly allows backends to change pointer identity, physical layout, packing/contiguity, and page or block placement, as long as the canonical logical tensor value is preserved.
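
The difference between the two notions of equality can be shown with a small sketch. It stores the same 2x3 float32 tensor in two layouts: the payload hashes differ, but a hash over a canonical form agrees. The canonical form used here (shape header plus row-major little-endian float32) and the layout tag names are assumptions; the RFC leaves the canonicalization rules open.

```python
import hashlib
import struct


def encode(values, rows, cols, layout):
    """Pack a row-major list of floats into float32 bytes in the given layout."""
    if layout == "row_major":
        order = list(values)
    elif layout == "col_major":
        order = [values[r * cols + c] for c in range(cols) for r in range(rows)]
    else:
        raise ValueError(f"unknown layout tag: {layout}")
    return struct.pack(f"<{rows * cols}f", *order)


def decode(payload, rows, cols, layout):
    """Unpack payload bytes back into a row-major list of floats."""
    flat = struct.unpack(f"<{rows * cols}f", payload)
    if layout == "row_major":
        return list(flat)
    if layout == "col_major":
        return [flat[c * rows + r] for r in range(rows) for c in range(cols)]
    raise ValueError(f"unknown layout tag: {layout}")


def payload_hash(payload):
    """Hash of the stored bytes exactly as they are laid out."""
    return hashlib.sha256(payload).hexdigest()


def value_hash(payload, rows, cols, layout):
    """Hash of the canonical logical value, independent of storage layout."""
    canonical = decode(payload, rows, cols, layout)
    header = struct.pack("<2I", rows, cols)
    body = struct.pack(f"<{rows * cols}f", *canonical)
    return hashlib.sha256(header + body).hexdigest()


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
row_bytes = encode(values, 2, 3, "row_major")
col_bytes = encode(values, 2, 3, "col_major")

same_payload = payload_hash(row_bytes) == payload_hash(col_bytes)
same_value = value_hash(row_bytes, 2, 3, "row_major") == value_hash(col_bytes, 2, 3, "col_major")
print(same_payload, same_value)  # False True
```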

Design goals

  • preserve the existing KV ownership model
  • make non-KV state identity explicit and testable
  • support backend-specific realization without backend-specific correctness contracts
  • keep correctness work separate from optional performance work
  • allow derived mirrors / materialized caches without making them authoritative

Non-goals

This RFC does not attempt to:

  • replace existing KV cache managers
  • mandate a single physical storage format across all backends
  • force every backend to land the same implementation at once
  • claim that all hybrid cache cases are already solved

Backend split

Shared core answers:

  • what a checkpoint means
  • what metadata must be declared
  • what equality must hold
  • what restore phases are permitted

Each backend answers:

  • how payload bytes are materialized
  • how scratch tensors are built
  • how live slots are assigned
  • how any optional mirror / fast path is realized
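
As a sketch of this split, the backend-owned answers could be expressed as an abstract interface, with the shared core calling it. The class and method names below are hypothetical, and the toy backend (a list of floats stored as little-endian float32) exists only to make the example runnable.

```python
import struct
from abc import ABC, abstractmethod


class HybridCheckpointBackend(ABC):
    """Hypothetical per-backend hooks; meaning, metadata, equality and
    permitted restore phases would stay in the shared core."""

    @abstractmethod
    def materialize_payload(self, live_state) -> bytes:
        """Serialize live state into immutable payload bytes."""

    @abstractmethod
    def build_scratch(self, payload: bytes):
        """Build a private scratch tensor from payload bytes."""

    @abstractmethod
    def assign_live_slot(self, scratch, slot_id: int) -> None:
        """Install a scratch tensor into a live, mutable request slot."""


class ListBackend(HybridCheckpointBackend):
    """Toy backend: state is a list of floats stored as float32 bytes."""

    def __init__(self):
        self.slots = {}

    def materialize_payload(self, live_state) -> bytes:
        return struct.pack(f"<{len(live_state)}f", *live_state)

    def build_scratch(self, payload: bytes):
        return list(struct.unpack(f"<{len(payload) // 4}f", payload))

    def assign_live_slot(self, scratch, slot_id: int) -> None:
        self.slots[slot_id] = scratch


backend = ListBackend()
payload = backend.materialize_payload([0.5, 1.5, -2.0])
backend.assign_live_slot(backend.build_scratch(payload), slot_id=7)
```

An optional mirror or fast path would be an additional backend-local method in this sketch; whether it belongs in the shared abstraction is one of the open questions below.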

Acceptance criteria for any implementation

A backend implementation of this ABI should be able to show:

  1. repeated exact replay correctness at a fixed boundary
  2. stable first-step logits (or equivalent deterministic behavior) at that boundary
  3. fail-closed behavior when logical equality cannot be established
  4. a clean separation between authoritative payload, optional materialized mirror, and live mutable request state
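
Criteria 3 and 4 can be sketched as a restore routine that walks the three restore phases and fails closed. The function name, its signature, and the use of None to signal "do not resume" are assumptions; how a rejection should surface to the scheduler is an open question in this RFC. In this toy the canonical form is the identity, so the value hash equals the payload hash.

```python
import hashlib
from typing import Callable, Dict, Optional


def _sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def try_restore(payload: bytes, payload_hash: str, value_hash: str,
                live_slots: Dict[int, bytearray], slot_id: int,
                canonicalize: Callable[[bytes], bytes] = bytes,
                ) -> Optional[bytearray]:
    """Restore into live_slots[slot_id], or return None and leave no state behind."""
    # Gate: the authoritative payload must be byte-identical to what was stored.
    if _sha(payload) != payload_hash:
        return None
    # Phase 1: payload -> scratch (a private, mutable copy).
    scratch = bytearray(payload)
    # Phase 2: scratch -> live slot.
    live_slots[slot_id] = scratch
    # Phase 3: verify the post-restore live view against the logical hash.
    if _sha(canonicalize(bytes(live_slots[slot_id]))) != value_hash:
        del live_slots[slot_id]  # fail closed: no half-restored slot survives
        return None
    return live_slots[slot_id]


payload = b"\x00\x01\x02\x03"
slots: Dict[int, bytearray] = {}

ok = try_restore(payload, _sha(payload), _sha(payload), slots, slot_id=0)
tampered = try_restore(b"\x00\x01\x02\xff", _sha(payload), _sha(payload), slots, slot_id=1)
unproven = try_restore(payload, _sha(payload), "not-the-value-hash", slots, slot_id=2)
```

Here ok is a mutable live copy, while tampered and unproven are both None and leave slots 1 and 2 unassigned, so the caller can recompute from scratch.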

Questions for discussion

  1. What metadata is the minimum shared checkpoint schema?
  2. Should canonical value_hash rules be fully standardized or partially backend-defined?
  3. How should checkpoint boundaries align with the existing prefix-cache identity machinery?
  4. How should fail-closed resume surface to the scheduler/runtime?
  5. Should an optional read-only mirror be part of the shared abstraction, or remain purely backend-local?

About this RFC

I am opening this as a design discussion before submitting an implementation PR. A working Metal/MLX implementation exists with repeated-replay correctness verified, and is ready to be submitted as a correctness-first PR to vllm-project/vllm-metal. A parallel CUDA instantiation is in progress.

I would like to align with maintainers on the ABI shape before submitting code. Happy to iterate on scope, naming, or boundaries based on maintainer feedback.

Feedback Period.

1-2 weeks

CC List.

No response

Any Other Things.

A technical writeup of this work is being prepared for arXiv and will be linked here once posted. The Metal/MLX implementation is ready to submit to vllm-project/vllm-metal as a correctness-first PR, but I would prefer to get directional feedback on the ABI shape here before opening that PR.

extent analysis

TL;DR

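The RFC proposes a backend-agnostic correctness contract for hybrid prefix resume: a two-key system in which prefix identity (KV side) and state identity (non-KV checkpoint side) must both hold at the same boundary. In the reported Metal/MLX proof point, this gave repeated exact replay correctness and removed the semantic drift observed on the tested Hermes workload.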

Guidance

  • Review the proposed checkpoint ABI to ensure it meets the requirements for hybrid prefix resume, including defining checkpoint key, tensor metadata, payload, hashes, and restore phases.
  • Verify that the implementation separates payload equality from logical value equality, allowing backends to change pointer identity and physical layout while preserving canonical logical tensor value.
  • Ensure the implementation preserves the existing KV ownership model and makes non-KV state identity explicit and testable.
  • Test the implementation to show repeated exact replay correctness, stable first-step logits, and fail-closed behavior when logical equality cannot be established.

Example

No code example is provided as the issue does not include specific code snippets.

Notes

The proposed solution is based on the Metal/MLX proof point, which showed that storing live runtime-backed arrays in a hybrid checkpoint cache can be unsafe. The implementation should be tested thoroughly to ensure correctness and stability.

Recommendation

Adopt the proposed checkpoint ABI as the explicit correctness contract for hybrid prefix resume. On the reported Metal/MLX proof point it produced exact repeated replays with no semantic drift; other backends would still need to demonstrate the acceptance criteria listed in the RFC.
