vllm: [Bug][PD] Bidirectional KV transfer produces incorrect results when reasoning traces are stripped between turns [5 comments, 4 participants]

GitHub stats
vllm-project/vllm#43094 · Fetched 2026-05-20 03:39:52
This is an issue to make sure this behavior is tracked and consistently addressed by upper-level routers.

Summary

Bidirectional KV transfer (PR #32553, RFC #32733) can produce incorrect inference results when used with reasoning models (e.g. DeepSeek-R1) whose thinking traces are stripped from the conversation history between turns.

Problem

When D (the decode instance) generates a response with thinking traces, its kv_transfer_params records:

  • remote_num_tokens = request.num_computed_tokens — covering [prompt | thinking_tokens | response_tokens]
  • remote_block_ids — physical blocks for the entire sequence

On the next turn, if the client strips thinking traces before sending, P (the prefill instance) receives a prompt like [prompt | response_tokens | new_user_msg]: the thinking tokens are missing from the middle.
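To make the mismatch concrete, here is a minimal sketch with toy token IDs. The helper `common_prefix_len` and all token values are hypothetical and are not part of vLLM; the sketch only shows where the turn-2 prompt stops agreeing with the sequence D computed KV for.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading positions at which `a` and `b` hold the same token."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy token IDs, purely illustrative.
prompt = [1, 2, 3, 4]
thinking = [90, 91, 92, 93, 94]
response = [10, 11, 12]
new_user_msg = [20, 21]

# Sequence that D's KV cache covers after turn 1.
d_sequence = prompt + thinking + response

# Turn-2 prompt when the client keeps the thinking trace.
p_prompt_kept = prompt + thinking + response + new_user_msg
# Turn-2 prompt when the client strips the thinking trace.
p_prompt_stripped = prompt + response + new_user_msg

kept_overlap = common_prefix_len(d_sequence, p_prompt_kept)
stripped_overlap = common_prefix_len(d_sequence, p_prompt_stripped)

print(kept_overlap == len(d_sequence))   # whole D sequence is reusable
print(stripped_overlap == len(prompt))   # agreement ends at the first thinking token
```

With the trace kept, the two sequences agree over everything D computed. With the trace stripped, they agree only up to the end of the original prompt, so any KV beyond that point belongs to different tokens.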

The block-alignment logic in _apply_prefix_caching (worker.py:2310-2316) does suffix trimming:

remote_block_ids[i] = remote_group[-num_local_blocks:]

This assumes P's prompt is a strict prefix of D's sequence — true in the normal case, but broken when tokens are removed from the middle (for whatever reason, like compacting history or dropping thinking traces).

Result: P loads KV cache computed for different tokens than its actual input, leading to silently incorrect inference.
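The effect of the suffix trim can be illustrated with a toy model. This is only a sketch: the block size, the token labels, and the `to_blocks` helper are assumptions made for illustration, and only the slicing expression mirrors the line quoted above. It is not a reproduction of the actual vLLM worker code.

```python
BLOCK_SIZE = 2  # assumed toy block size, not vLLM's default


def to_blocks(tokens: list[str], block_size: int = BLOCK_SIZE) -> list[tuple[str, ...]]:
    """Group a token list into fixed-size blocks (the last block may be short)."""
    return [tuple(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]


prompt = ["p0", "p1", "p2", "p3"]
thinking = ["t0", "t1", "t2", "t3"]
response = ["r0", "r1"]

# Blocks D holds: [prompt | thinking | response].
remote_group = to_blocks(prompt + thinking + response)

# Blocks P expects for the stripped history (new user message omitted for brevity).
local_blocks = to_blocks(prompt + response)
num_local_blocks = len(local_blocks)

# Same slicing form as the quoted line: keep the last num_local_blocks remote blocks.
trimmed = remote_group[-num_local_blocks:]

# Compare what P expects in each block with what the trimmed remote block holds.
mismatches = [
    (i, local, remote)
    for i, (local, remote) in enumerate(zip(local_blocks, trimmed))
    if local != remote
]
for i, local, remote in mismatches:
    print(f"block {i}: P expects {local}, remote block holds {remote}")
```

In this toy layout the first two blocks P loads hold KV for thinking tokens rather than prompt tokens. Even the block whose token labels match was computed at different positions and with the thinking tokens in its attention context, so it is not a valid substitute either.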

Suggested Fixes

Router-level detection (currently suggested approach): A production router can keep the previous turn's tokens, indexed by conversation_id. On turn 2, the router compares the incoming prompt tokens against them. If the prompt differs from what was expected (for example, because traces were stripped), the router clears the cached KV transfer params and lets P fall back to full recomputation.

vllm-level fix: This would need a thinking-token-aware design, possibly block-aligned so that only non-thinking tokens are pulled. It would require significantly more effort and an RFC first.
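The router-level detection described above could look roughly like the following sketch. The class `ConversationKVTracker`, its methods, and the `TurnState` record are hypothetical names invented for illustration; they are not part of vLLM or of any existing router. The sketch also assumes the router can see the exact token IDs of both turns.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TurnState:
    tokens: list[int]                   # tokens covered by D's KV cache after the turn
    kv_transfer_params: dict[str, Any]  # params D returned for that turn


class ConversationKVTracker:
    """Hypothetical router-side store, keyed by conversation_id."""

    def __init__(self) -> None:
        self._state: dict[str, TurnState] = {}

    def record_turn(self, conversation_id: str, tokens: list[int],
                    kv_transfer_params: dict[str, Any]) -> None:
        self._state[conversation_id] = TurnState(list(tokens), dict(kv_transfer_params))

    def params_for_next_turn(self, conversation_id: str,
                             prompt_tokens: list[int]) -> Optional[dict[str, Any]]:
        """Return cached params only if the new prompt extends the previous tokens."""
        state = self._state.get(conversation_id)
        if state is None:
            return None
        prev = state.tokens
        if len(prompt_tokens) >= len(prev) and prompt_tokens[:len(prev)] == prev:
            return state.kv_transfer_params
        # History was altered (e.g. thinking traces stripped): drop the cached
        # params so that P falls back to full recomputation.
        del self._state[conversation_id]
        return None


tracker = ConversationKVTracker()
tracker.record_turn("conv-1", [1, 2, 90, 91, 10], {"remote_block_ids": [[7, 8, 9]]})

unchanged = tracker.params_for_next_turn("conv-1", [1, 2, 90, 91, 10, 20])
tracker.record_turn("conv-1", [1, 2, 90, 91, 10], {"remote_block_ids": [[7, 8, 9]]})
stripped = tracker.params_for_next_turn("conv-1", [1, 2, 10, 20])

print(unchanged)  # cached params are forwarded
print(stripped)   # None: the router sends the request without KV transfer params
```

A strict token-prefix comparison is the conservative choice here: any edit to the history, not just stripped thinking traces, disables reuse for that turn. A real router would also need to account for re-tokenization differences introduced by the chat template.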
