vllm - ✅(Solved) Fix [Bug]: SimpleCPUOffloadScheduler misses final full block when request finishes in the same scheduler step [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#41704 (fetched 2026-05-06 06:15:19)
Comments: 0
Participants: 1
Timeline: 1 (cross-referenced ×1)
Reactions: 0

PR fix notes

PR #41777: [Bugfix] Flush final KV block when SimpleCPUOffload request finishes in same step as its last full block

Description (problem / solution / changelog)

Summary

Fixes #41704.

In eager mode, SimpleCPUOffloadScheduler silently drops the last full KV block of a request when that block is computed in the same scheduler step in which the request finishes.

Root cause: build_connector_meta → _prepare_eager_store_specs runs before _update_after_schedule advances request.num_computed_tokens, so the block confirmed by the current step is invisible to the store planner. When the request finishes in that same step, request_finished previously only deferred cleanup for stores that were already in flight and never flushed the newly confirmed block. The next build_connector_meta call never sees the request again (it is being cleaned up), so the block is permanently dropped.
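The visibility gap can be shown with simple arithmetic. The sketch below assumes a block size of 16 tokens and that the number of confirmed full blocks is num_computed_tokens // block_size; the helper name confirmed_full_blocks and the token counts are illustrative and are not taken from manager.py.

```python
BLOCK_SIZE = 16  # assumed block size, for illustration only


def confirmed_full_blocks(num_computed_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of full KV blocks covered by the confirmed token count."""
    return num_computed_tokens // block_size


# State when build_connector_meta runs: the current step's tokens are
# scheduled but not yet added to num_computed_tokens.
num_computed_tokens = 16
num_scheduled_tokens = 16

visible_to_planner = confirmed_full_blocks(num_computed_tokens)

# _update_after_schedule then advances the counter.
num_computed_tokens += num_scheduled_tokens
visible_after_update = confirmed_full_blocks(num_computed_tokens)

print(visible_to_planner, visible_after_update)
```

With these numbers the planner sees one full block, while a second block becomes complete only after the counter is advanced, which is too late if the request finishes in that step.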

Changes

vllm/v1/simple_kv_offload/manager.py

  • Add _pending_flush_{gpu,cpu,req}_* lists to buffer blocks queued by _flush_final_blocks_on_finish.
  • Add _flush_final_blocks_on_finish(request): called from request_finished (eager mode only) after _update_after_schedule has already advanced num_computed_tokens. Scans for any unconfirmed-but-now-complete full blocks using the updated token count, allocates CPU blocks, touch()-es the GPU blocks to prevent early eviction, and enqueues the pairs in _pending_flush_*.
  • request_finished: call _flush_final_blocks_on_finish before the cleanup check; defer _cleanup_store_request if a flush was queued (has_pending_flush), so the state stays alive until the flush store event completes.
  • build_connector_meta: after prepare_store_specs, merge any _pending_flush_* blocks into the regular store lists before creating the store event. The existing store_req_ids → store_state.store_events tracking then handles deferred cleanup correctly.
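The buffering idea in these changes can be modeled with a small toy class. This is a sketch under stated assumptions, not the vLLM implementation: ToyFlushingConnector, its attributes, the 16-token block size, and the simplification that a store completes immediately are all hypothetical. It omits CPU block allocation and the touch() call on GPU blocks.

```python
BLOCK_SIZE = 16  # assumed block size, for illustration only


class ToyFlushingConnector:
    """Toy model of the pending-flush idea; not the vLLM class."""

    def __init__(self):
        self.queued = {}              # req_id -> blocks already queued for store
        self.pending_flush = []       # (req_id, block_index) buffered at finish
        self.deferred_cleanup = set() # requests kept alive until their flush is stored
        self.store_events = []        # one list of (req_id, block_index) per event

    def build_connector_meta(self, computed_tokens_by_req):
        to_store = []
        for req_id, num_computed in computed_tokens_by_req.items():
            done = self.queued.get(req_id, 0)
            confirmed = num_computed // BLOCK_SIZE
            to_store += [(req_id, i) for i in range(done, confirmed)]
            self.queued[req_id] = max(done, confirmed)
        # Merge blocks buffered at finish time into this step's store list.
        to_store += self.pending_flush
        self.pending_flush = []
        if to_store:
            self.store_events.append(to_store)
        # Toy simplification: treat the store as complete immediately,
        # so deferred cleanup can run right away.
        for req_id in self.deferred_cleanup:
            self.queued.pop(req_id, None)
        self.deferred_cleanup.clear()

    def request_finished(self, req_id, num_computed_tokens):
        # Runs after the token counter was advanced, so the final block is visible.
        done = self.queued.get(req_id, 0)
        confirmed = num_computed_tokens // BLOCK_SIZE
        flush = [(req_id, i) for i in range(done, confirmed)]
        if flush:
            self.pending_flush += flush
            self.queued[req_id] = confirmed
            self.deferred_cleanup.add(req_id)  # keep state until the flush is stored
        else:
            self.queued.pop(req_id, None)


connector = ToyFlushingConnector()
connector.build_connector_meta({"req-0": 16})  # block 0 confirmed and queued
connector.request_finished("req-0", 32)        # block 1 completed in the finishing step
connector.build_connector_meta({})             # next step: request no longer scheduled
print(connector.store_events)
```

In this model the second build_connector_meta call emits a store event for block 1 even though the request is no longer scheduled, and the per-request state is released only afterwards.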

tests/v1/simple_kv_offload/test_scheduler.py

  • Add test_eager_flush_final_block_on_finish (Test 11): reproduces the exact scheduler ordering (build_connector_meta before _update_after_schedule), verifies that _pending_flush_gpu_blocks is populated after request_finished, that the next build_connector_meta emits a store event for the flush, and that block 1's hash is discoverable in the CPU cache after the store completes.

Test plan

  • pytest tests/v1/simple_kv_offload/test_scheduler.py -v — all 11 tests pass
  • Existing tests test_eager_store_and_load_roundtrip, test_eager_store_preemption_cleanup, test_inflight_finish_deferred_cleanup continue to pass (deferred-cleanup paths unchanged)

Changed files

  • tests/v1/simple_kv_offload/test_scheduler.py (modified, +76/-0)
  • vllm/v1/simple_kv_offload/manager.py (modified, +113/-3)


Describe the bug

In eager mode, SimpleCPUOffloadScheduler can miss the final full KV block of a request when that block is computed in the same scheduler step where the request finishes.

The store planner only considers blocks whose KV data has been confirmed computed before the current scheduler step. The code already documents this edge case in vllm/v1/simple_kv_offload/manager.py:

# Only considers blocks whose KV data has been **confirmed computed** by
# the GPU. This means blocks from the current step are NOT stored until the
# next step. If a request finishes in the same step as its last full block,
# that block may be missed. (TODO: flush on finish.)

When the request finishes in that same step, request_finished() clears the eager store tracking state before a later scheduler step can observe and offload the newly completed block.

Impact

A later request that extends the same prompt can get a shorter CPU KV cache hit than expected. The final full block of the prior request is recomputed instead of being loaded from CPU offload.

This affects the SimpleCPUOffloadScheduler / SimpleCPUOffloadConnector path, not the older weight offload path.

Why this happens

The scheduler builds connector metadata before it advances request.num_computed_tokens for the current step:

  1. Scheduler.schedule() calls connector.build_connector_meta(scheduler_output).
  2. Only after that, _update_after_schedule() increments request.num_computed_tokens by the scheduled token count.
  3. The model executes and the request may finish.
  4. _free_request() calls the connector's request_finished() path.
  5. SimpleCPUOffloadScheduler.request_finished() only defers cleanup for already in-flight stores. It does not flush the final block that became complete during the just-executed step.

So the current-step final block is visible too late for build_connector_meta(), and cleanup happens before the next opportunity to store it.
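The ordering above can be replayed with a toy model that reproduces the drop. ToyRequest, ToyEagerConnector, run_step, and the 16-token block size are hypothetical stand-ins chosen for illustration; the real scheduler and connector do considerably more.

```python
from dataclasses import dataclass

BLOCK_SIZE = 16  # assumed block size, for illustration only


@dataclass
class ToyRequest:
    num_tokens: int
    num_computed_tokens: int = 0


class ToyEagerConnector:
    """Toy stand-in for the pre-fix eager store planner (not the vLLM class)."""

    def __init__(self):
        self.tracked = {}  # req_id -> number of blocks already queued for store
        self.stored = []   # (req_id, block_index) pairs queued for store

    def build_connector_meta(self, req_id, request):
        done = self.tracked.setdefault(req_id, 0)
        confirmed = request.num_computed_tokens // BLOCK_SIZE
        for idx in range(done, confirmed):
            self.stored.append((req_id, idx))
        self.tracked[req_id] = max(done, confirmed)

    def request_finished(self, req_id):
        # Pre-fix behaviour: drop tracking state without a final flush.
        self.tracked.pop(req_id, None)


def run_step(connector, req_id, request, num_scheduled):
    connector.build_connector_meta(req_id, request)        # step 1
    request.num_computed_tokens += num_scheduled           # step 2
    if request.num_computed_tokens >= request.num_tokens:  # steps 3 and 4
        connector.request_finished(req_id)                 # step 5


request = ToyRequest(num_tokens=32)
connector = ToyEagerConnector()
run_step(connector, "req-0", request, 16)
run_step(connector, "req-0", request, 16)
print(connector.stored)
```

Only block 0 is ever queued: block 1 becomes complete after the second build_connector_meta call, and the tracking state is gone before a third call could pick it up.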

Minimal reproduction

This can be reproduced with the existing helpers in tests/v1/simple_kv_offload/test_scheduler.py:

  1. Create an eager SimpleCPUOffloadScheduler.
  2. Schedule a two-block request.
  3. Store block 1 normally after it is confirmed computed.
  4. Allocate block 2, but leave it unconfirmed when build_connector_meta() runs, matching the real scheduler order where connector metadata is built before _update_after_schedule().
  5. Simulate _update_after_schedule() advancing num_computed_tokens so the second block is now complete.
  6. Finish the request.
  7. Submit an extended prompt whose prefix includes the original two-block request plus an extra block.

Observed result:

missed_final_store_event -1
missed_final_store_gpu_blocks []
extended_prompt_hit_tokens_after_miss 16

The second block was not scheduled for store, so the extended prompt only gets one block of CPU cache hit.

For comparison, if the second block is already confirmed before build_connector_meta() runs, it is stored and the extended prompt can hit both blocks:

stored_second_store_event 1
stored_second_store_gpu_blocks [3]
extended_prompt_hit_tokens_when_second_cached 32
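The two hit counts follow from prefix matching at block granularity. The sketch below assumes a block size of 16 tokens (consistent with the 16 and 32 token hits reported here, but not stated in the issue) and uses placeholder hash strings; prefix_hit_tokens is a hypothetical helper, not a vLLM function.

```python
BLOCK_SIZE = 16  # assumed; consistent with the 16 and 32 token hits reported


def prefix_hit_tokens(prompt_block_hashes, cpu_cached_hashes, block_size=BLOCK_SIZE):
    """Tokens covered by the longest run of leading blocks found in the cache."""
    hit_blocks = 0
    for block_hash in prompt_block_hashes:
        if block_hash not in cpu_cached_hashes:
            break
        hit_blocks += 1
    return hit_blocks * block_size


# Extended prompt: the two original blocks plus one extra block.
extended_prompt = ["h0", "h1", "h2"]

hit_after_miss = prefix_hit_tokens(extended_prompt, {"h0"})          # final block missed
hit_when_cached = prefix_hit_tokens(extended_prompt, {"h0", "h1"})   # final block stored
print(hit_after_miss, hit_when_cached)
```

Because matching stops at the first missing block, losing the final block of the earlier request costs the later request one full block of recomputation.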

Expected behavior

When a request finishes after computing a final full block, eager simple KV offload should either flush/store that final block before cleanup or preserve enough state to store it safely after the step completes.

Related work / not duplicates

I checked open issues and PRs for SimpleCPUOffloadScheduler, simple kv offload missed block, eager offload last block, and flush on finish kv offload.

Related but different:

  • #39702: TOCTOU crash in update_state_after_alloc()
  • #41289: deduplicating in-flight CPU offload stores across scheduler steps
  • #39860: KV event support
  • #19854: general KV cache offloading RFC

extent analysis

TL;DR

To fix the issue where SimpleCPUOffloadScheduler misses the final full KV block of a request, consider modifying the request_finished() method to flush or preserve the state of the newly completed block before cleanup.

Guidance

  • Review the SimpleCPUOffloadScheduler code to understand how the request_finished() method handles the cleanup of eager store tracking state.
  • Consider adding a flush mechanism to store the final block before cleanup, as hinted by the TODO comment in vllm/v1/simple_kv_offload/manager.py.
  • Verify the fix by reproducing the issue using the provided test case in tests/v1/simple_kv_offload/test_scheduler.py and checking if the extended prompt can hit both blocks.
  • Check the related issues (#39702, #41289, #39860, #19854) to ensure the fix does not introduce any new problems or conflicts with existing solutions.

Example

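# Sketch of how request_finished() could flush the final block before cleanup.
# The helper name follows PR #41777; the signature and surrounding code are assumptions.
def request_finished(self, request):
    # ... existing code ...
    # Eager mode only: queue any newly completed full block for store.
    self._flush_final_blocks_on_finish(request)
    # ... existing cleanup, deferred while a flush is pending ...

Note: PR #41777 implements this step as _flush_final_blocks_on_finish(request) in vllm/v1/simple_kv_offload/manager.py. The body is omitted here because it depends on the scheduler's block-tracking state.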

Notes

The fix may require careful consideration of the scheduler's state and the interaction between the request_finished() method and the build_connector_meta() method.

Recommendation

Modify request_finished() to flush, or preserve the state of, the newly completed block before cleanup, which is the approach PR #41777 takes. The change is confined to the finish path, and the PR reports that the existing deferred-cleanup tests continue to pass.
