vllm - ✅(Solved) Fix Reasoning model thinking tokens pollute prefix cache with unreachable entries [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#39321 · Fetched 2026-04-09 07:51:53
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Root Cause

Turn 2 arrives with [Q1, A1, Q2] (thinking stripped). Block hash computation: hash(Q1) → hash(Q1,A1) → ... — the second block's hash differs because its parent chain doesn't include T1. Only Q1 blocks match. A1 must be recomputed despite its KV being in the cache (stranded behind T1 in the hash chain).

PR fix notes

PR #39806: [Core] Immediately evict thinking token blocks from prefix cache

Description (problem / solution / changelog)

Summary

When serving reasoning models (DeepSeek-R1, QwQ, etc.) with --reasoning-parser, all output tokens — including <think>...</think> tokens — are cached in the prefix cache. Since most clients strip thinking tokens from subsequent turns (per DeepSeek API docs), these cached entries create dead branches that:

  1. Waste GPU memory: ~1.3–1.6 GB per dead branch (5000 thinking tokens × 256–320 KB/token)
  2. Block prefix matching: answer tokens stranded behind thinking branches are unreachable, forcing recomputation every turn
  3. Compete with useful entries in LRU eviction: thinking blocks are appended (append_n) to the tail of the free queue, the most-recently-used end, so they are the last to be evicted despite being unreachable

With 50 concurrent conversations, dead branches consume 65–80 GB.

A benchmark of the equivalent SGLang fix (sgl-project/sglang#22617), run with QwQ-32B on 2× B300 GPUs, measured:

| Metric | Baseline | With fix | Delta |
|---|---|---|---|
| Overall cache hit rate | 11.1% | 17.1% | +6.0% |
| TTFT p50 | 9.98s | 8.96s | −1.02s |

Approach

On request completion, if reasoning tokens are detected:

  • Prompt blocks → normal eviction path (tail of free queue, hash retained for prefix matching)
  • Thinking + answer blocks → immediately evicted: hash removed from cached_block_hash_to_block, block prepended to head of free queue (first to be reused)
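The two free paths above can be sketched with a toy pool. This is a minimal illustration, not vLLM's implementation: `ToyBlockPool`, its method names, and the block IDs are hypothetical stand-ins for `BlockPool`, `FreeKVCacheBlockQueue`, and `cached_block_hash_to_block`.

```python
from collections import deque


class ToyBlockPool:
    """Toy model of a prefix-cache block pool (hypothetical, not vLLM's classes)."""

    def __init__(self, free_block_ids=()):
        self.free_queue = deque(free_block_ids)  # head (left) is reused first
        self.hash_to_block = {}  # stands in for cached_block_hash_to_block

    def cache(self, block_id, block_hash):
        self.hash_to_block[block_hash] = block_id

    def free_normal(self, block_ids):
        # Normal path: hashes are kept, blocks go to the tail (evicted last).
        self.free_queue.extend(block_ids)

    def free_immediate_evict(self, block_ids):
        # Immediate eviction: hashes are dropped, blocks go to the head.
        evicted = set(block_ids)
        self.hash_to_block = {
            h: b for h, b in self.hash_to_block.items() if b not in evicted
        }
        # extendleft reverses its input, so reverse first to keep the order.
        self.free_queue.extendleft(reversed(block_ids))

    def allocate(self):
        # The next allocation always takes the block at the head.
        return self.free_queue.popleft()


pool = ToyBlockPool(free_block_ids=[100, 101])  # blocks that were already free
pool.cache(0, "Q1")
pool.cache(1, "Q1,T1")
pool.cache(2, "Q1,T1,A1")

pool.free_normal([0])              # prompt block
pool.free_immediate_evict([1, 2])  # thinking + answer blocks

queue_order = list(pool.free_queue)
remaining_hashes = dict(pool.hash_to_block)
first_reused = pool.allocate()
```

In this sketch the thinking and answer blocks end up ahead of every other free block and lose their hash entries, while the prompt block keeps its hash and sits at the tail.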

Answer tokens after thinking are also evicted because their RoPE positional encodings are mismatched — computed at positions [input_len + thinking_len, ...] but would appear at [input_len, ...] in the next turn without thinking.
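The position mismatch can be illustrated with a single rotary pair. This is a simplified sketch: real RoPE rotates many feature pairs at different frequencies, and the lengths, the `theta` value, and the key features below are made-up assumptions.

```python
import math


def rope_rotate(pair, position, theta=1.0):
    """Rotate one 2-D feature pair by position * theta, as RoPE does per frequency."""
    x, y = pair
    angle = position * theta
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


input_len, thinking_len = 20, 5000  # illustrative lengths
key_pair = (1.0, 0.0)               # made-up key features of the first answer token

cached_pos = input_len + thinking_len  # position used when the KV was computed in turn 1
needed_pos = input_len                 # position of the same token in turn 2, thinking stripped

cached_key = rope_rotate(key_pair, cached_pos)
needed_key = rope_rotate(key_pair, needed_pos)
keys_differ = not math.isclose(cached_key[0], needed_key[0], abs_tol=1e-3)
```

The same raw key is rotated by a different angle at the two positions, so the cached value cannot simply be reused at the new offset.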

Adds cache_reasoning_tokens config flag (default False) so users who include thinking tokens in subsequent prompts (e.g., MiniMax models that append thinking to visible content) can opt in to caching them.

Changes

| File | Change |
|---|---|
| vllm/config/reasoning.py | Add cache_reasoning_tokens: bool = False |
| vllm/v1/request.py | Add num_reasoning_tokens + depth-based counting method |
| vllm/v1/core/kv_cache_utils.py | Add FreeKVCacheBlockQueue.prepend_n() (head insertion) |
| vllm/v1/core/block_pool.py | Add BlockPool.free_blocks_immediate_evict() |
| vllm/v1/core/single_type_kv_cache_manager.py | Split free() into prompt vs thinking block paths |
| vllm/v1/core/kv_cache_coordinator.py | Pass through num_thinking_blocks |
| vllm/v1/core/kv_cache_manager.py | Pass through num_thinking_blocks |
| vllm/v1/core/sched/scheduler.py | Add _get_num_thinking_blocks() orchestration |
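The PR text does not show the depth-based counting method itself. One way such a counter might look is sketched below; the function name, the token IDs, and the choice to count the delimiter tokens are assumptions, not the PR's code.

```python
def count_reasoning_tokens(token_ids, think_start_id, think_end_id):
    """Count tokens inside <think>...</think> spans, delimiters included (assumption)."""
    depth = 0
    count = 0
    for tok in token_ids:
        if tok == think_start_id:
            depth += 1
            count += 1
        elif tok == think_end_id:
            if depth > 0:
                # Only a closing marker that matches an open span is counted.
                depth -= 1
                count += 1
        elif depth > 0:
            count += 1
    return count


START, END = 1, 2  # hypothetical token IDs for <think> and </think>

simple = count_reasoning_tokens([10, START, 11, 12, END, 13], START, END)
nested = count_reasoning_tokens([START, 11, START, 12, END, 13, END, 14], START, END)
unterminated = count_reasoning_tokens([10, START, 11, 12], START, END)
stray_close = count_reasoning_tokens([10, END, 11], START, END)
```

Tracking depth rather than a boolean keeps nested spans and unterminated spans (for example, generation cut off mid-thought) from being miscounted.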

Test plan

  • Unit tests for prepend_n() (basic, empty list, empty queue, ordering)
  • Unit tests for reasoning token counting (simple, nested, edge cases)
  • Integration tests for free_blocks_immediate_evict() (head ordering, hash removal)
  • Existing test_prefix_caching.py passes (56/56, no regressions)
  • Existing test_kv_cache_utils.py passes (no regressions)
  • E2E multi-turn test with reasoning model (needs GPU CI)

Related

AI assistance was used (Claude). All changes reviewed and tested by human submitter.

Changed files

  • tests/v1/core/test_thinking_token_cache.py (added, +273/-0)
  • vllm/config/reasoning.py (modified, +7/-0)
  • vllm/v1/core/block_pool.py (modified, +25/-0)
  • vllm/v1/core/kv_cache_coordinator.py (modified, +4/-2)
  • vllm/v1/core/kv_cache_manager.py (modified, +9/-2)
  • vllm/v1/core/kv_cache_utils.py (modified, +31/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +60/-1)
  • vllm/v1/core/single_type_kv_cache_manager.py (modified, +32/-7)
  • vllm/v1/request.py (modified, +56/-0)

When serving reasoning models (DeepSeek-R1, QwQ, etc.) with --reasoning-parser, all output tokens — including <think> tokens — are hashed and cached via Automatic Prefix Caching (APC). If the user does not include thinking tokens when constructing the next turn's prompt (e.g., DeepSeek's API docs explicitly say not to), these cached blocks will never be matched by any future prefix, becoming dead weight that wastes GPU memory until LRU eviction.

Code path

In request.py, append_output_token_ids adds output tokens to all_token_ids and calls update_block_hashes(), which hashes them into blocks eligible for prefix caching. When the request finishes, these blocks are freed (ref count decremented) but remain in cached_block_hash_to_block as eviction candidates — not deleted.

Since each block's hash depends on parent_block_hash (see hash_block_tokens in kv_cache_utils.py), the hash chain means blocks after the thinking tokens encode the entire prefix including thinking. A future request without thinking tokens computes different hashes — no match.

Why this causes wasted cache and recomputation

Consider a multi-turn conversation. Let Q = user prompt, T = thinking tokens, A = answer.

After turn 1, the block hash chain is: hash(Q1) → hash(Q1,T1) → ... → hash(Q1,T1,A1)

Turn 2 arrives with [Q1, A1, Q2] (thinking stripped). Block hash computation: hash(Q1) → hash(Q1,A1) → ... — the second block's hash differs because its parent chain doesn't include T1. Only Q1 blocks match. A1 must be recomputed despite its KV being in the cache (stranded behind T1 in the hash chain).

When turn 2 completes, its output creates a new hash chain: hash(Q1) → hash(Q1,A1) → hash(Q1,A1,Q2) → hash(Q1,A1,Q2,T2) → ...

The old T1 → A1 blocks from turn 1 are now permanently unreachable. This repeats every turn, accumulating dead blocks.
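The stranding effect can be reproduced with a toy parent-chained hash. This is a sketch under stated assumptions: `block_hashes` only mimics the idea behind `hash_block_tokens` (each block's hash covers its parent's hash), and the block size and token IDs are made up.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size for illustration


def block_hashes(token_ids):
    """Hash each full block together with its parent's hash."""
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


Q1 = [1, 2, 3, 4]
T1 = [50, 51, 52, 53]
A1 = [7, 8, 9, 10]
Q2 = [11, 12, 13, 14]

turn1 = block_hashes(Q1 + T1 + A1)  # chain cached after turn 1
turn2 = block_hashes(Q1 + A1 + Q2)  # prompt of turn 2, thinking stripped

# Prefix matching stops at the first block whose hash is not cached.
cached = set(turn1)
matched = 0
for h in turn2:
    if h not in cached:
        break
    matched += 1
```

Only the Q1 block matches. The A1 block holds the same tokens in both turns, but its hash differs because its parent chain differs, so the cached copy is never found.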

Memory impact

With ~5000 thinking tokens per turn (typical for DeepSeek-R1), each dead chain wastes:

  • QwQ-32B: 5000 × 256 KB/token ≈ 1.3 GB
  • DeepSeek-R1-Distill-70B: 5000 × 320 KB/token ≈ 1.6 GB

50 concurrent conversations with 1 recent dead chain each: 65-80 GB of dead KV cache competing with useful entries for eviction.
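The per-token figures are consistent with a standard KV-size calculation. The sketch below assumes fp16/bf16 KV (2 bytes per value) and model shapes that are not stated in the issue: 64 layers for QwQ-32B, 80 layers for the 70B distill, and 8 KV heads with head dimension 128 for both.

```python
KIB = 1024


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K and V stored per token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Model shapes are assumptions, not taken from the issue text.
models = {
    "QwQ-32B": dict(num_layers=64, num_kv_heads=8, head_dim=128),
    "DeepSeek-R1-Distill-70B": dict(num_layers=80, num_kv_heads=8, head_dim=128),
}
thinking_tokens = 5000
conversations = 50

waste = {}
for name, shape in models.items():
    per_token = kv_bytes_per_token(**shape)
    per_chain_gb = thinking_tokens * per_token / 1e9
    waste[name] = {
        "kib_per_token": per_token // KIB,
        "gb_per_chain": per_chain_gb,
        "gb_total": conversations * per_chain_gb,
    }
    print(f"{name}: {waste[name]['kib_per_token']} KiB/token, "
          f"{per_chain_gb:.2f} GB per dead chain, "
          f"{waste[name]['gb_total']:.1f} GB for {conversations} conversations")
```

Under these assumptions the per-token sizes come out at 256 KiB and 320 KiB, and the totals land close to the 65-80 GB range quoted above once the per-chain values are rounded.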

Proposal

The core issue is that what gets cached and what gets sent in future requests are decided independently. For example, vLLM's own chat completion API separates reasoning_content from content in responses via --reasoning-parser, yet the caching layer hashes all output tokens indiscriminately.

When --reasoning-parser is set, update_block_hashes should skip thinking tokens from the hash chain, so that only origin_input_ids + answer_tokens are cached. A flag (e.g., --cache-thinking-tokens) could let users opt in if they use custom prompt construction that includes thinking tokens.

Note: filed a parallel issue on SGLang (sgl-project/sglang#22373) where the same problem exists with RadixAttention.

Environment

  • vLLM: main branch (verified 2026-04-08)
  • Models: DeepSeek-R1, QwQ, and other reasoning models using --reasoning-parser

extent analysis

TL;DR

Modify the caching logic to exclude thinking tokens when --reasoning-parser is set, so that only prompt and answer tokens are hashed and cached.

Guidance

  • Locate update_block_hashes, which append_output_token_ids calls in request.py, and modify it to skip thinking tokens when --reasoning-parser is set.
  • Consider adding a flag, such as --cache-thinking-tokens, to provide users with control over caching behavior.
  • Review the hash_block_tokens function in kv_cache_utils.py to ensure it correctly handles the modified hash chain.
  • Test the changes with various models, including DeepSeek-R1 and QwQ, to verify the fix.

Example

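# Sketch of a modified update_block_hashes (hypothetical signature, not vLLM's)
def update_block_hashes(token_ids, think_start_id, think_end_id, reasoning_parser=False):
    if reasoning_parser:
        # Exclude every <think>...</think> span (delimiters included) from the hash chain
        kept, depth = [], 0
        for tok in token_ids:
            if tok == think_start_id:
                depth += 1
            elif tok == think_end_id:
                depth = max(depth - 1, 0)
            elif depth == 0:
                kept.append(tok)
        token_ids = kept
    # Proceed with hashing and caching
    ...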

Notes

The proposed fix assumes that the thinking tokens can be reliably identified and excluded from the hash chain. Additional testing may be necessary to ensure the correctness of the modified caching logic.

Recommendation

Apply the proposed workaround by modifying the update_block_hashes function to exclude thinking tokens when --reasoning-parser is set, as this directly addresses the identified issue and reduces memory waste.
