vllm - 💡(How to fix) Fix [Bug]: SimpleCPUOffloadScheduler crashes with AssertionError: Expected N hit tokens, got 0 (TOCTOU race in update_state_after_alloc) [1 participants]

vllm 2026-04-13 12:46:23

GitHub stats

vllm-project/vllm#39702•Fetched 2026-04-15 06:20:50

Author

go-bai

Participants

go-bai

Timeline (top)

referenced ×5

Error Message

AssertionError: Expected 19264 hit tokens, got 0
  File "vllm/v1/simple_kv_offload/manager.py", line 267, in update_state_after_alloc
    assert hit_length == num_external_tokens, (
        f"Expected {num_external_tokens} hit tokens, got {hit_length}"
    )

Root Cause

This is a TOCTOU (time-of-check/time-of-use) race between get_num_new_matched_tokens() and update_state_after_alloc():

1. scheduler.py:619
   get_num_new_matched_tokens()
   └─ find_longest_cache_hit() → finds N tokens in CPU LRU cache
      returns hit_length=N, BUT discards cpu_hit_blocks (_ = ...)
      *** no touch() called — blocks are NOT pinned ***

2. scheduler.py:746-754
   kv_cache_manager.allocate_slots()
   └─ may call cpu_block_pool.get_new_blocks()
      └─ triggers CPU LRU eviction
         *** the N blocks found in step 1 are evicted here ***

3. scheduler.py:771
   update_state_after_alloc(request, blocks, num_external_tokens=N)
   └─ find_longest_cache_hit() again → returns 0 (blocks evicted)
      assert 0 == N  →  CRASH

The issue is in get_num_new_matched_tokens() at manager.py:222:

_, hit_length = self.cpu_coordinator.find_longest_cache_hit(   # blocks discarded!
    remaining_hashes, max_hit_len
)

The found CPU blocks are discarded (_), so their ref_cnt is never incremented. When allocate_slots() runs between the two calls it can trigger CPU LRU eviction, silently removing those blocks before update_state_after_alloc() can use them.
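
To make the sequence above concrete, here is a minimal, self-contained sketch of the same check-then-use pattern. ToyCpuCache, its capacity, and the block hashes are illustrative assumptions, not vLLM's classes; only the method names find_longest_cache_hit and get_new_blocks are borrowed from the issue.

```python
from collections import OrderedDict


class ToyCpuCache:
    """Toy LRU cache of block hashes; a stand-in, not vLLM's CPU block pool."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()  # block hash -> None, oldest entry first

    def find_longest_cache_hit(self, hashes: list[int]) -> int:
        """Return how many leading hashes are cached. Nothing is pinned."""
        hit = 0
        for h in hashes:
            if h not in self.blocks:
                break
            hit += 1
        return hit

    def get_new_blocks(self, hashes: list[int]) -> None:
        """Insert new blocks, evicting the oldest entry whenever the cache is full."""
        for h in hashes:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # LRU eviction
            self.blocks[h] = None


cache = ToyCpuCache(capacity=4)
cache.get_new_blocks([1, 2, 3, 4])  # the cache is now full

request_hashes = [1, 2, 3]
checked = cache.find_longest_cache_hit(request_hashes)  # step 1: check
cache.get_new_blocks([10, 11, 12, 13])                  # step 2: allocation evicts
used = cache.find_longest_cache_hit(request_hashes)     # step 3: use

# The real code asserts that these two values are equal.
print(checked, used)
```

In this toy model the first lookup reports three hits and the second reports none, because nothing stopped the allocation in between from evicting the blocks that were just found.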

Describe the bug

SimpleCPUOffloadScheduler.update_state_after_alloc() crashes with an AssertionError during long-running sessions when CPU KV offloading is enabled.

AssertionError: Expected 19264 hit tokens, got 0
  File "vllm/v1/simple_kv_offload/manager.py", line 267, in update_state_after_alloc
    assert hit_length == num_external_tokens, (
        f"Expected {num_external_tokens} hit tokens, got {hit_length}"
    )

The server crashes and all in-flight requests are lost. It reproduces reliably after extended use (typically after the CPU cache fills up and LRU eviction begins).

Proposed fix

Pin the found blocks immediately in get_num_new_matched_tokens(), cache the result, and reuse it in update_state_after_alloc() instead of searching again.

# In get_num_new_matched_tokens():
cpu_hit_blocks, hit_length = self.cpu_coordinator.find_longest_cache_hit(...)
if hit_length > 0:
    # Pin immediately to prevent LRU eviction before update_state_after_alloc()
    all_hit_blocks = [blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null]
    self.cpu_block_pool.touch(all_hit_blocks)          # temporary pin
    self._pending_cpu_hits[request.request_id] = (cpu_hit_blocks, hit_length)
    return hit_length, True

# In update_state_after_alloc():
# Pop the cached result: no second find_longest_cache_hit() is needed
pending = self._pending_cpu_hits.pop(req_id, None)
if pending is None:
    return  # graceful fallback instead of assert
cpu_hit_blocks, actual_hit_length = pending
# Rebuild the flat list that was pinned in get_num_new_matched_tokens()
all_hit_blocks = [blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null]
# ... use cpu_hit_blocks directly ...
self.cpu_block_pool.touch(cpu_blocks_to_touch)     # persistent pin for async load
self.cpu_block_pool.free_blocks(all_hit_blocks)    # release temporary pin

# In request_finished():
# Release temporary pin if request is preempted before update_state_after_alloc()
pending = self._pending_cpu_hits.pop(req_id, None)
if pending is not None:
    self.cpu_block_pool.free_blocks([...])

This completely eliminates the TOCTOU window: the blocks are pinned from the moment they are found until either the persistent async-load pin takes over or the request finishes.
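
The pin-then-reuse idea can be illustrated with a toy reference-counted cache. This is a sketch under stated assumptions, not the actual patch: PinnableCpuCache, the pending_hits dict, and the request id "req-0" are hypothetical, and touch() and free_blocks() here only mimic the roles the issue describes for the real pool methods.

```python
from collections import OrderedDict


class PinnableCpuCache:
    """Toy LRU cache with per-block reference counts (not vLLM's block pool)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.ref_cnt = OrderedDict()  # block hash -> ref count, oldest first

    def find_longest_cache_hit(self, hashes: list[int]) -> list[int]:
        """Return the leading hashes that are currently cached."""
        hit = []
        for h in hashes:
            if h not in self.ref_cnt:
                break
            hit.append(h)
        return hit

    def touch(self, hashes: list[int]) -> None:
        """Pin blocks by incrementing their reference counts."""
        for h in hashes:
            self.ref_cnt[h] += 1

    def free_blocks(self, hashes: list[int]) -> None:
        """Unpin blocks by decrementing their reference counts."""
        for h in hashes:
            self.ref_cnt[h] -= 1

    def get_new_blocks(self, hashes: list[int]) -> None:
        """Insert blocks, evicting the oldest block whose ref count is zero."""
        for h in hashes:
            if len(self.ref_cnt) >= self.capacity:
                victim = next((k for k, c in self.ref_cnt.items() if c == 0), None)
                if victim is None:
                    raise RuntimeError("every cached block is pinned")
                del self.ref_cnt[victim]
            self.ref_cnt[h] = 0


cache = PinnableCpuCache(capacity=6)
cache.get_new_blocks([1, 2, 3, 4, 5, 6])
pending_hits = {}  # hypothetical stand-in for self._pending_cpu_hits

# Lookup step: search once, pin the result, remember it.
hits = cache.find_longest_cache_hit([1, 2, 3])
cache.touch(hits)
pending_hits["req-0"] = hits

# Allocation step: eviction now has to skip the pinned blocks.
cache.get_new_blocks([10, 11, 12])

# Update step: reuse the cached result instead of searching again.
cached_hits = pending_hits.pop("req-0")
cache.touch(cached_hits)        # longer-lived pin for the load
cache.free_blocks(cached_hits)  # drop the temporary pin

print(cached_hits, sorted(cache.ref_cnt))
```

One consequence visible even in the toy version: pinned blocks shrink the evictable part of the cache, so an allocation can fail outright if everything is pinned. Whether the real pool can reach that state is not something the issue addresses.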

Configuration

VLLM_USE_SIMPLE_KV_OFFLOAD=1 vllm serve <model> \
  --kv-offloading-size 32 \
  --no-disable-hybrid-kv-cache-manager \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 262144

Environment

vLLM version: 0.19.1rc1.dev203+g0f3ce4c74.d20260411
GPU: 2× RTX 4090 (TP=2)
Model: Gemma4-31B AWQ-4bit
Python: 3.12
OS: Ubuntu 22.04 (Linux 6.8.0)
Introduced by: PR #37160 (merged 2026-04-01) — SimpleCPUOffloadConnector is new, manager.py has had zero follow-up commits

Additional context

The assert at line 267 was always racy by design — the two find_longest_cache_hit() calls have no mutual exclusion and nothing prevents the CPU LRU from evicting blocks between them. It only manifests once the CPU cache is full and LRU eviction is active (typically after 10–30 minutes of use with a large context window).

Note: This is NOT related to the existing --cpu-offload-gb path (cpu_offload/ directory). This is specific to the simple_kv_offload/ path enabled by VLLM_USE_SIMPLE_KV_OFFLOAD=1 / --kv-offloading-size.

extent analysis

TL;DR

Pin the found CPU blocks immediately in get_num_new_matched_tokens() to prevent LRU eviction and cache the result for reuse in update_state_after_alloc().

Guidance

Identify the get_num_new_matched_tokens() function in manager.py and modify it to pin the found CPU blocks using self.cpu_block_pool.touch(all_hit_blocks).
Cache the result of find_longest_cache_hit() in get_num_new_matched_tokens() and reuse it in update_state_after_alloc() to avoid the TOCTOU race.
Implement a fallback mechanism in update_state_after_alloc() to handle cases where the cached result is not available.
Review the request_finished() function to ensure that temporary pins are released when requests are preempted.

Example

# In get_num_new_matched_tokens():
cpu_hit_blocks, hit_length = self.cpu_coordinator.find_longest_cache_hit(...)
if hit_length > 0:
    all_hit_blocks = [blk for grp in cpu_hit_blocks for blk in grp if not blk.is_null]
    self.cpu_block_pool.touch(all_hit_blocks)  # temporary pin
    self._pending_cpu_hits[request.request_id] = (cpu_hit_blocks, hit_length)
    return hit_length, True

Notes

The proposed fix assumes that the cpu_block_pool.touch() method is thread-safe and can be used to pin the CPU blocks. Additionally, the fix relies on the request_finished() function to release temporary pins when requests are preempted.
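
Since the fix depends on every temporary pin being released exactly once, the bookkeeping can be sketched in isolation. PendingHitTracker below is a hypothetical helper, not vLLM code. It also assumes the lookup may run more than once for the same request before allocation succeeds, which the issue does not confirm; if it can, an earlier pin must be released before it is overwritten.

```python
class PendingHitTracker:
    """Tracks temporary pins per request; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self.pin_count = {}  # block id -> number of temporary pins
        self.pending = {}    # request id -> block ids pinned for that request

    def _unpin(self, block_ids):
        for b in block_ids:
            self.pin_count[b] -= 1
            if self.pin_count[b] == 0:
                del self.pin_count[b]

    def record(self, req_id, block_ids):
        """Lookup step: pin the blocks and remember them for this request."""
        # Drop any earlier entry first so a repeated lookup cannot leak pins.
        self.release(req_id)
        for b in block_ids:
            self.pin_count[b] = self.pin_count.get(b, 0) + 1
        self.pending[req_id] = list(block_ids)

    def consume(self, req_id):
        """Update step: hand back the cached blocks, or None if there are none.

        Real code would take its longer-lived pin before this temporary
        pin is dropped; the toy version only drops the temporary one.
        """
        blocks = self.pending.pop(req_id, None)
        if blocks is not None:
            self._unpin(blocks)
        return blocks

    def release(self, req_id):
        """Finish or preemption step: safe to call any number of times."""
        self.consume(req_id)


tracker = PendingHitTracker()

# Request "a" is looked up twice and then finishes without being scheduled.
tracker.record("a", [1, 2])
tracker.record("a", [1, 2, 3])
tracker.release("a")
tracker.release("a")  # the second call is a no-op

# Request "b" follows the normal path: lookup, update, finish.
tracker.record("b", [5])
blocks_b = tracker.consume("b")
tracker.release("b")

# A request that was never recorded falls back gracefully.
missing = tracker.consume("never-seen")

print(blocks_b, missing, tracker.pin_count)
```

Making release() idempotent means the finish path does not need to know whether the update step already ran for that request.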

Recommendation

Apply the proposed fix: pin and cache the found CPU blocks in get_num_new_matched_tokens(), then consume the cached result in update_state_after_alloc() instead of searching again. This closes the window in which LRU eviction can run between the lookup and its use, which is what triggers the AssertionError.
