vllm - ✅(Solved) Fix Reasoning model thinking tokens pollute prefix cache with unreachable entries [1 pull requests, 1 participants]

vllm · 2026-04-08 17:12:58


GitHub stats

vllm-project/vllm#39321•Fetched 2026-04-09 07:51:53


Author

wenxinzhang0

Participants

wenxinzhang0

Root Cause

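Let Q = user prompt, T = thinking tokens, A = answer. After turn 1, the cache holds the hash chain for [Q1, T1, A1]. Turn 2 arrives with [Q1, A1, Q2] (thinking stripped). Block hashes are chained, hash(Q1) → hash(Q1,A1) → ..., so the second block's hash differs from every cached hash because its parent chain does not include T1. Only the Q1 blocks match. A1 must be recomputed even though its KV is still in the cache, stranded behind T1 in the hash chain.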

PR fix notes

PR #39806: [Core] Immediately evict thinking token blocks from prefix cache

Repository: vllm-project/vllm
Author: wenxinzhang0
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39806

Description (problem / solution / changelog)

Summary

When serving reasoning models (DeepSeek-R1, QwQ, etc.) with --reasoning-parser, all output tokens, including <think>...</think> tokens, are cached in the prefix cache. Since most clients strip thinking tokens from subsequent turns (per DeepSeek API docs), these cached entries create dead branches that:

- Waste GPU memory: ~1.3–1.6 GB per dead branch (5000 thinking tokens × 256–320 KB/token)
- Block prefix matching: answer tokens stranded behind thinking branches are unreachable, forcing recomputation every turn
- Compete with useful entries in LRU eviction: thinking blocks are added to the tail of the free queue via append_n (the most-recently-used end), so they are the last to be evicted despite being unreachable

With 50 concurrent conversations, dead branches consume 65–80 GB.

A benchmark of the equivalent SGLang fix (sgl-project/sglang#22617), run with QwQ-32B on 2× B300 GPUs, measured:

| Metric | Baseline | With fix | Delta |
|---|---|---|---|
| Overall cache hit rate | 11.1% | 17.1% | +6.0 pp |
| TTFT p50 | 9.98 s | 8.96 s | −1.02 s |

Approach

On request completion, if reasoning tokens are detected:

- Prompt blocks → normal eviction path (tail of free queue, hash retained for prefix matching)
- Thinking + answer blocks → immediately evicted: hash removed from cached_block_hash_to_block, block prepended to head of free queue (first to be reused)
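The two free paths above can be illustrated with a toy model. This is a minimal sketch, not the PR's code: `ToyBlockPool`, `free_normal`, and `free_immediate_evict` are hypothetical names, and a `collections.deque` stands in for vLLM's `FreeKVCacheBlockQueue`.

```python
from collections import deque


class ToyBlockPool:
    """Toy stand-in for a block pool; hypothetical, not vLLM's BlockPool."""

    def __init__(self):
        self.free_queue = deque()  # head (left end) is reused first
        self.hash_to_block = {}    # block hash -> block id

    def cache(self, block_id, block_hash):
        self.hash_to_block[block_hash] = block_id

    def free_normal(self, block_ids):
        # Normal path: append to the tail and keep the hashes, so the
        # blocks can still serve prefix hits until they are reused.
        self.free_queue.extend(block_ids)

    def free_immediate_evict(self, block_ids):
        # Eviction path: drop the hashes, then prepend the blocks so
        # they are the first ones handed out again.
        evicted = set(block_ids)
        self.hash_to_block = {
            h: b for h, b in self.hash_to_block.items() if b not in evicted
        }
        self.free_queue.extendleft(reversed(block_ids))

    def allocate(self):
        block_id = self.free_queue.popleft()
        # Reusing a block invalidates any hash that still points at it.
        for h, b in list(self.hash_to_block.items()):
            if b == block_id:
                del self.hash_to_block[h]
        return block_id


pool = ToyBlockPool()
for block_id, name in enumerate(["Q1", "T1", "A1"]):
    pool.cache(block_id, name)
pool.free_queue.append(99)          # an older free block already queued
pool.free_normal([0])               # prompt block
pool.free_immediate_evict([1, 2])   # thinking + answer blocks

print(list(pool.free_queue))  # [1, 2, 99, 0]
print(pool.hash_to_block)     # {'Q1': 0}
```

In this sketch the evicted blocks come out of the queue before the older free block 99, while the prompt block keeps its hash and stays at the tail.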

Answer tokens after thinking are also evicted because their RoPE positional encodings are mismatched — computed at positions [input_len + thinking_len, ...] but would appear at [input_len, ...] in the next turn without thinking.
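The position mismatch can be made concrete with a few numbers. This is a minimal sketch, assuming cached keys are tied to the absolute position at which they were computed; `answer_positions` and the lengths used are illustrative, not taken from the PR.

```python
def answer_positions(input_len, thinking_len, answer_len):
    """Absolute positions of the answer tokens that follow the thinking span."""
    start = input_len + thinking_len
    return list(range(start, start + answer_len))


input_len, thinking_len, answer_len = 10, 5000, 3

# Turn 1: the answer is generated after the thinking tokens.
cached = answer_positions(input_len, thinking_len, answer_len)
# Turn 2: the same answer tokens reappear with thinking stripped.
needed = answer_positions(input_len, 0, answer_len)

offset = cached[0] - needed[0]
print(cached, needed, offset)  # [5010, 5011, 5012] [10, 11, 12] 5000
```

Every answer token is shifted by exactly the thinking length, so under that assumption none of the cached answer entries can be reused in place.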

Adds cache_reasoning_tokens config flag (default False) so users who include thinking tokens in subsequent prompts (e.g., MiniMax models that append thinking to visible content) can opt in to caching them.

Changes

| File | Change |
|---|---|
| `vllm/config/reasoning.py` | Add `cache_reasoning_tokens: bool = False` |
| `vllm/v1/request.py` | Add `num_reasoning_tokens` + depth-based counting method |
| `vllm/v1/core/kv_cache_utils.py` | Add `FreeKVCacheBlockQueue.prepend_n()` (head insertion) |
| `vllm/v1/core/block_pool.py` | Add `BlockPool.free_blocks_immediate_evict()` |
| `vllm/v1/core/single_type_kv_cache_manager.py` | Split `free()` into prompt vs thinking block paths |
| `vllm/v1/core/kv_cache_coordinator.py` | Pass through `num_thinking_blocks` |
| `vllm/v1/core/kv_cache_manager.py` | Pass through `num_thinking_blocks` |
| `vllm/v1/core/sched/scheduler.py` | Add `_get_num_thinking_blocks()` orchestration |
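The "depth-based counting method" is not shown on this page, so the following is only a guess at its shape. It is a minimal sketch in which `count_reasoning_tokens`, `think_start_id`, and `think_end_id` are hypothetical names, and counting the delimiter tokens themselves is an assumption.

```python
def count_reasoning_tokens(output_token_ids, think_start_id, think_end_id):
    """Count tokens inside <think>...</think> spans, delimiters included."""
    depth = 0
    count = 0
    for tok in output_token_ids:
        if tok == think_start_id:
            depth += 1   # entering a (possibly nested) thinking span
            count += 1
        elif tok == think_end_id and depth > 0:
            depth -= 1   # leaving one nesting level
            count += 1
        elif depth > 0:
            count += 1   # ordinary token inside a thinking span
    return count


START, END = 1000, 1001  # made-up delimiter token IDs

simple = count_reasoning_tokens([START, 5, 6, END, 7, 8], START, END)
nested = count_reasoning_tokens([START, 5, START, 6, END, 7, END, 8], START, END)
unclosed = count_reasoning_tokens([START, 5, 6], START, END)
print(simple, nested, unclosed)  # 4 7 3
```

Tracking depth rather than a boolean keeps the count correct when a thinking span contains another opening delimiter, which matches the "simple, nested, edge cases" wording of the test plan.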

Test plan

- Unit tests for prepend_n() (basic, empty list, empty queue, ordering)
- Unit tests for reasoning token counting (simple, nested, edge cases)
- Integration tests for free_blocks_immediate_evict() (head ordering, hash removal)
- Existing test_prefix_caching.py passes (56/56, no regressions)
- Existing test_kv_cache_utils.py passes (no regressions)
- E2E multi-turn test with reasoning model (needs GPU CI)

Fixes https://github.com/vllm-project/vllm/issues/39321
Same problem addressed in SGLang: sgl-project/sglang#22373, sgl-project/sglang#22617

AI assistance was used (Claude). All changes reviewed and tested by human submitter.

Changed files

tests/v1/core/test_thinking_token_cache.py (added, +273/-0)
vllm/config/reasoning.py (modified, +7/-0)
vllm/v1/core/block_pool.py (modified, +25/-0)
vllm/v1/core/kv_cache_coordinator.py (modified, +4/-2)
vllm/v1/core/kv_cache_manager.py (modified, +9/-2)
vllm/v1/core/kv_cache_utils.py (modified, +31/-0)
vllm/v1/core/sched/scheduler.py (modified, +60/-1)
vllm/v1/core/single_type_kv_cache_manager.py (modified, +32/-7)
vllm/v1/request.py (modified, +56/-0)

Issue description (vllm-project/vllm#39321)

When serving reasoning models (DeepSeek-R1, QwQ, etc.) with --reasoning-parser, all output tokens — including <think> tokens — are hashed and cached via Automatic Prefix Caching (APC). If the user does not include thinking tokens when constructing the next turn's prompt (e.g., DeepSeek's API docs explicitly say not to), these cached blocks will never be matched by any future prefix, becoming dead weight that wastes GPU memory until LRU eviction.

Code path

In request.py, append_output_token_ids adds output tokens to all_token_ids and calls update_block_hashes(), which hashes them into blocks eligible for prefix caching. When the request finishes, these blocks are freed (ref count decremented) but remain in cached_block_hash_to_block as eviction candidates — not deleted.

Since each block's hash depends on parent_block_hash (see hash_block_tokens in kv_cache_utils.py), the hash chain means blocks after the thinking tokens encode the entire prefix including thinking. A future request without thinking tokens computes different hashes — no match.

Why this causes wasted cache and recomputation

Consider a multi-turn conversation. Let Q = user prompt, T = thinking tokens, A = answer.

After turn 1, the block hash chain is: hash(Q1) → hash(Q1,T1) → ... → hash(Q1,T1,A1)

When turn 2 completes, its output creates a new hash chain: hash(Q1) → hash(Q1,A1) → hash(Q1,A1,Q2) → hash(Q1,A1,Q2,T2) → ...

The old T1 → A1 blocks from turn 1 are now permanently unreachable. This repeats every turn, accumulating dead blocks.
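The effect of parent-chained hashes can be reproduced in a few lines. This is a minimal sketch, not vLLM's `hash_block_tokens`: the block size, the SHA-256 hashing scheme, and the token IDs are all illustrative assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; illustrative only


def chain_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with its parent block's hash."""
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def matched_blocks(cached, new_hashes):
    """Count the leading blocks of new_hashes that are present in the cache."""
    count = 0
    for h in new_hashes:
        if h not in cached:
            break
        count += 1
    return count


# Toy token IDs: each segment is exactly one block long.
Q1, T1, A1, Q2 = [1, 2, 3, 4], [90, 91, 92, 93], [5, 6, 7, 8], [9, 10, 11, 12]

turn1_hashes = chain_hashes(Q1 + T1 + A1)  # cached after turn 1
cached = set(turn1_hashes)

turn2_hashes = chain_hashes(Q1 + A1 + Q2)  # turn 2 prompt, thinking stripped
hits = matched_blocks(cached, turn2_hashes)
print(hits)  # 1
```

Only the Q1 block matches. The A1 block holds the same tokens in both turns, but its hash differs because its parent is hash(Q1,T1) in turn 1 and hash(Q1) in turn 2.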

Memory impact

With ~5000 thinking tokens per turn (typical for DeepSeek-R1), each dead chain wastes:

- QwQ-32B: 5000 × 256 KB/token ≈ 1.3 GB
- DeepSeek-R1-Distill-70B: 5000 × 320 KB/token ≈ 1.6 GB

50 concurrent conversations with 1 recent dead chain each: 65-80 GB of dead KV cache competing with useful entries for eviction.
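The arithmetic behind these figures can be checked directly. This is a minimal sketch that takes the per-token KV sizes from the text as given and assumes decimal units (1 KB = 1000 bytes, 1 GB = 1000³ bytes); `dead_chain_gb` is a hypothetical helper.

```python
KB = 1000       # assumption: decimal kilobytes
GB = 1000 ** 3  # assumption: decimal gigabytes


def dead_chain_gb(thinking_tokens, kv_kb_per_token):
    """KV cache size in GB held by one dead thinking chain."""
    return thinking_tokens * kv_kb_per_token * KB / GB


per_chain = {
    "QwQ-32B": dead_chain_gb(5000, 256),
    "DeepSeek-R1-Distill-70B": dead_chain_gb(5000, 320),
}
# 50 concurrent conversations, one recent dead chain each.
total = {name: 50 * gb for name, gb in per_chain.items()}

for name in per_chain:
    print(f"{name}: {per_chain[name]:.2f} GB per chain, {total[name]:.0f} GB total")
```

Under these unit assumptions the per-chain figures are 1.28 GB and 1.60 GB, and the totals are 64 GB and 80 GB. The 65 GB lower bound quoted above follows from rounding 1.28 up to 1.3 before multiplying by 50.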

Proposal

The core issue is that what gets cached and what gets sent in future requests are decided independently. For example, vLLM's own chat completion API separates reasoning_content from content in responses via --reasoning-parser, yet the caching layer hashes all output tokens indiscriminately.

When --reasoning-parser is set, update_block_hashes should skip thinking tokens from the hash chain, so that only origin_input_ids + answer_tokens are cached. A flag (e.g., --cache-thinking-tokens) could let users opt in if they use custom prompt construction that includes thinking tokens.

Note: filed a parallel issue on SGLang (sgl-project/sglang#22373) where the same problem exists with RadixAttention.

Environment

vLLM: main branch (verified 2026-04-08)
Models: DeepSeek-R1, QwQ, and other reasoning models using --reasoning-parser

extent analysis

TL;DR

Modify the caching logic to exclude thinking tokens when --reasoning-parser is set, allowing only relevant tokens to be hashed and cached.

Guidance

- Identify the update_block_hashes function in request.py and modify it to conditionally exclude thinking tokens based on the presence of --reasoning-parser.
- Consider adding a flag, such as --cache-thinking-tokens, to provide users with control over caching behavior.
- Review the hash_block_tokens function in kv_cache_utils.py to ensure it correctly handles the modified hash chain.
- Test the changes with various models, including DeepSeek-R1 and QwQ, to verify the fix.

Example

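# Modified update_block_hashes (illustrative sketch, not vLLM's actual signature).
# think_start_id / think_end_id are hypothetical IDs of the <think> and </think> tokens.
def update_block_hashes(token_ids, think_start_id, think_end_id, reasoning_parser=False):
    if reasoning_parser:
        kept, depth = [], 0
        for tok in token_ids:
            if tok == think_start_id:
                depth += 1          # entering a thinking span
            elif tok == think_end_id and depth > 0:
                depth -= 1          # leaving a thinking span
            elif depth == 0:
                kept.append(tok)    # keep only tokens outside thinking spans
        token_ids = kept
    return token_ids  # hashing and caching would proceed from these tokens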

Notes

The proposed fix assumes that the thinking tokens can be reliably identified and excluded from the hash chain. Additional testing may be necessary to ensure the correctness of the modified caching logic.

Recommendation

Apply the proposed workaround by modifying the update_block_hashes function to exclude thinking tokens when --reasoning-parser is set, as this directly addresses the identified issue and reduces memory waste.
