vllm - How to fix [Bug]: SimpleCPUOffloadScheduler crashes with AssertionError: Expected N hit tokens, got 0 (TOCTOU race in update_state_after_alloc)

GitHub stats
vllm-project/vllm#39702 (fetched 2026-04-15 06:20:50)
Comments: 0 · Participants: 1 · Timeline events: 5 (referenced ×5) · Reactions: 0

Error Message

AssertionError: Expected 19264 hit tokens, got 0
  File "vllm/v1/simple_kv_offload/manager.py", line 267, in update_state_after_alloc
    assert hit_length == num_external_tokens, (
        f"Expected {num_external_tokens} hit tokens, got {hit_length}"
    )

Root Cause

This is a TOCTOU (time-of-check to time-of-use) race between get_num_new_matched_tokens() and update_state_after_alloc():

1. scheduler.py:619
   get_num_new_matched_tokens()
   └─ find_longest_cache_hit() → finds N tokens in CPU LRU cache
      returns hit_length=N, BUT discards cpu_hit_blocks (_ = ...)
      *** no touch() called — blocks are NOT pinned ***

2. scheduler.py:746-754
   kv_cache_manager.allocate_slots()
   └─ may call cpu_block_pool.get_new_blocks()
      └─ triggers CPU LRU eviction
         *** the N blocks found in step 1 are evicted here ***

3. scheduler.py:771
   update_state_after_alloc(request, blocks, num_external_tokens=N)
   └─ find_longest_cache_hit() again → returns 0 (blocks evicted)
      assert 0 == N  →  CRASH

The issue is in get_num_new_matched_tokens() at manager.py:222:

_, hit_length = self.cpu_coordinator.find_longest_cache_hit(   # blocks discarded!
    remaining_hashes, max_hit_len
)

The found CPU blocks are discarded (_), so their ref_cnt is never incremented. When allocate_slots() runs between the two calls it can trigger CPU LRU eviction, silently removing those blocks before update_state_after_alloc() can use them.
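
The three steps above can be reproduced in isolation. The sketch below is a toy model, not vLLM code: ToyCpuPool, lookup() and allocate() are hypothetical names standing in for the CPU block pool, find_longest_cache_hit() and get_new_blocks(). It only illustrates how an unpinned lookup result can go stale once an allocation triggers LRU eviction.

```python
from collections import OrderedDict


class ToyCpuPool:
    """Hypothetical stand-in for a CPU block pool with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Maps block hash -> ref_cnt, ordered from least to most recently added.
        self.blocks = OrderedDict()

    def lookup(self, hashes):
        """Return the length of the longest cached prefix, without pinning."""
        hit = 0
        for h in hashes:
            if h not in self.blocks:
                break
            hit += 1
        return hit

    def allocate(self, block_hash):
        """Add a block, evicting the oldest unpinned block when the pool is full."""
        if len(self.blocks) >= self.capacity:
            victim = next(h for h, ref in self.blocks.items() if ref == 0)
            del self.blocks[victim]
        self.blocks[block_hash] = 0


pool = ToyCpuPool(capacity=2)
pool.allocate("a")
pool.allocate("b")

expected = pool.lookup(["a", "b"])  # step 1: check, blocks are not pinned
pool.allocate("c")                  # step 2: allocation evicts "a"
actual = pool.lookup(["a", "b"])    # step 3: use, the prefix is gone

print(f"Expected {expected} hit blocks, got {actual}")
```

In this toy model the second lookup finds nothing because the first block of the prefix was evicted, which mirrors the "Expected N hit tokens, got 0" message.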


Describe the bug

SimpleCPUOffloadScheduler.update_state_after_alloc() crashes with an AssertionError during long-running sessions when CPU KV offloading is enabled.

AssertionError: Expected 19264 hit tokens, got 0
  File "vllm/v1/simple_kv_offload/manager.py", line 267, in update_state_after_alloc
    assert hit_length == num_external_tokens, (
        f"Expected {num_external_tokens} hit tokens, got {hit_length}"
    )

The server crashes and all in-flight requests are lost. It reproduces reliably after extended use (typically after the CPU cache fills up and LRU eviction begins).

Proposed fix

Pin the found blocks immediately in get_num_new_matched_tokens(), cache the result, and reuse it in update_state_after_alloc() instead of searching again.

# In get_num_new_matched_tokens():
cpu_hit_blocks, hit_length = self.cpu_coordinator.find_longest_cache_hit(...)
if hit_length > 0:
    # Pin immediately to prevent LRU eviction before update_state_after_alloc()
    all_hit_blocks = [blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null]
    self.cpu_block_pool.touch(all_hit_blocks)          # temporary pin
    self._pending_cpu_hits[request.request_id] = (cpu_hit_blocks, hit_length)
    return hit_length, True

# In update_state_after_alloc():
# Pop cached result: no second find_longest_cache_hit() needed
pending = self._pending_cpu_hits.pop(req_id, None)
if pending is None:
    return  # graceful fallback instead of assert
cpu_hit_blocks, actual_hit_length = pending
# Rebuild the list that was pinned in get_num_new_matched_tokens()
all_hit_blocks = [blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null]
# ... use cpu_hit_blocks directly ...
self.cpu_block_pool.touch(cpu_blocks_to_touch)     # persistent pin for async load
self.cpu_block_pool.free_blocks(all_hit_blocks)    # release temporary pin

# In request_finished():
# Release temporary pin if request is preempted before update_state_after_alloc()
pending = self._pending_cpu_hits.pop(req_id, None)
if pending is not None:
    cpu_hit_blocks, _ = pending
    # Free the same non-null blocks that were pinned temporarily
    self.cpu_block_pool.free_blocks(
        [blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null]
    )

This closes the TOCTOU window: the blocks stay pinned from the moment they are found until either the persistent async-load pin takes over or the request finishes.
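
The pin-then-reuse pattern can be sketched outside vLLM as follows. This is a minimal illustration under stated assumptions, not the actual patch: PinnedPool, match_and_pin(), take_pinned() and unpin() are hypothetical names, and the reference count plays the role that touch() and free_blocks() play in the proposed fix.

```python
from collections import OrderedDict


class PinnedPool:
    """Hypothetical LRU pool in which pinned blocks (ref_cnt > 0) cannot be evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ref_cnt = OrderedDict()  # block hash -> reference count
        self.pending = {}             # request id -> hashes pinned at lookup time

    def allocate(self, block_hash):
        """Add a block; return False if the pool is full of pinned blocks."""
        if len(self.ref_cnt) >= self.capacity:
            victim = next((h for h, r in self.ref_cnt.items() if r == 0), None)
            if victim is None:
                return False
            del self.ref_cnt[victim]
        self.ref_cnt[block_hash] = 0
        return True

    def match_and_pin(self, request_id, hashes):
        """Find the longest cached prefix, pin it, and remember it for later."""
        hits = []
        for h in hashes:
            if h not in self.ref_cnt:
                break
            hits.append(h)
        for h in hits:
            self.ref_cnt[h] += 1
        if hits:
            self.pending[request_id] = hits
        return len(hits)

    def take_pinned(self, request_id):
        """Hand over the remembered (still pinned) blocks; no second lookup."""
        return self.pending.pop(request_id, [])

    def unpin(self, hashes):
        """Drop one reference from each block, making it evictable again at zero."""
        for h in hashes:
            self.ref_cnt[h] -= 1


pool = PinnedPool(capacity=2)
pool.allocate("a")
pool.allocate("b")

hit_length = pool.match_and_pin("req-1", ["a", "b"])  # check + pin
allocated = pool.allocate("c")                        # cannot evict pinned blocks
blocks = pool.take_pinned("req-1")                    # use the cached result
pool.unpin(blocks)                                    # release once the load is done
allocated_later = pool.allocate("c")                  # eviction is possible again
```

One consequence visible in the sketch: while the blocks are pinned, an allocation that needs to evict can fail instead of succeeding, so the caller has to handle a failed allocation. Whether that situation can arise in the real scheduler depends on pool sizing and is not covered by the issue.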

Configuration

VLLM_USE_SIMPLE_KV_OFFLOAD=1 vllm serve <model> \
  --kv-offloading-size 32 \
  --no-disable-hybrid-kv-cache-manager \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 262144

Environment

  • vLLM version: 0.19.1rc1.dev203+g0f3ce4c74.d20260411
  • GPU: 2× RTX 4090 (TP=2)
  • Model: Gemma4-31B AWQ-4bit
  • Python: 3.12
  • OS: Ubuntu 22.04 (Linux 6.8.0)
  • Introduced by: PR #37160 (merged 2026-04-01). SimpleCPUOffloadConnector is new, and manager.py has had no follow-up commits since.

Additional context

The assert at line 267 has been racy from the start: nothing pins the blocks between the two find_longest_cache_hit() calls, so the CPU LRU is free to evict them in between. The failure only manifests once the CPU cache is full and LRU eviction is active (typically after 10–30 minutes of use with a large context window).

Note: This is NOT related to the existing --cpu-offload-gb path (cpu_offload/ directory). This is specific to the simple_kv_offload/ path enabled by VLLM_USE_SIMPLE_KV_OFFLOAD=1 / --kv-offloading-size.

extent analysis

TL;DR

Pin the found CPU blocks immediately in get_num_new_matched_tokens() to prevent LRU eviction and cache the result for reuse in update_state_after_alloc().

Guidance

  • Identify the get_num_new_matched_tokens() function in manager.py and modify it to pin the found CPU blocks using self.cpu_block_pool.touch(all_hit_blocks).
  • Cache the result of find_longest_cache_hit() in get_num_new_matched_tokens() and reuse it in update_state_after_alloc() to avoid the TOCTOU race.
  • Implement a fallback mechanism in update_state_after_alloc() to handle cases where the cached result is not available.
  • Review the request_finished() function to ensure that temporary pins are released when requests are preempted.

Example

# In get_num_new_matched_tokens():
cpu_hit_blocks, hit_length = self.cpu_coordinator.find_longest_cache_hit(...)
if hit_length > 0:
    all_hit_blocks = [blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null]
    self.cpu_block_pool.touch(all_hit_blocks)  # temporary pin
    self._pending_cpu_hits[request.request_id] = (cpu_hit_blocks, hit_length)
    return hit_length, True

Notes

The proposed fix assumes that the cpu_block_pool.touch() method is thread-safe and can be used to pin the CPU blocks. Additionally, the fix relies on the request_finished() function to release temporary pins when requests are preempted.
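
Because the fix depends on every temporary pin being released exactly once, it may help to test the pin bookkeeping on its own. The sketch below is a hypothetical ledger (PinLedger, pin(), unpin() and leaked() are illustrative names, not vLLM APIs). It assumes that the matching step can run more than once for the same request before the request finishes, which is an assumption about scheduler behavior that the issue does not state.

```python
class PinLedger:
    """Hypothetical bookkeeping for temporary pins, keyed by request id."""

    def __init__(self):
        self.ref_cnt = {}   # block id -> number of outstanding pins
        self.pending = {}   # request id -> block ids pinned for that request

    def pin(self, request_id, block_ids):
        """Pin blocks for a request, replacing any earlier pin it still holds."""
        self.unpin(request_id)
        for b in block_ids:
            self.ref_cnt[b] = self.ref_cnt.get(b, 0) + 1
        self.pending[request_id] = list(block_ids)

    def unpin(self, request_id):
        """Release the request's pins; a no-op if it holds none."""
        for b in self.pending.pop(request_id, []):
            self.ref_cnt[b] -= 1

    def leaked(self):
        """Return the blocks that still carry a pin."""
        return {b: r for b, r in self.ref_cnt.items() if r > 0}


ledger = PinLedger()
ledger.pin("req-1", [1, 2, 3])   # first match
ledger.pin("req-1", [1, 2])      # matched again on a later scheduling step
ledger.unpin("req-1")            # request finished or preempted
ledger.unpin("req-1")            # a second release must be harmless
leaked = ledger.leaked()
```

Without the release at the top of pin(), the second match in this sketch would overwrite the pending entry and leave block 3 pinned with no owner, which is the kind of leak the request_finished() cleanup is meant to prevent.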

Recommendation

Apply the proposed fix: modify get_num_new_matched_tokens() and update_state_after_alloc() to pin and cache the found CPU blocks. This closes the TOCTOU window and prevents the AssertionError.
