vllm - 💡(How to fix) Fix [Bug][PD] Bidirectional KV transfer produces incorrect results when reasoning traces are stripped between turns [5 comments, 4 participants]

vllm · 2026-05-19 11:52:30

Issue: vllm-project/vllm#43094 (fetched 2026-05-20 03:39:52)

Summary

Bidirectional KV transfer (PR #32553, RFC #32733) can produce incorrect inference results when used with reasoning models (e.g. DeepSeek-R1) whose thinking traces are stripped from the conversation history between turns.

This issue tracks the behavior so that upper-level routers handle it consistently.

Problem

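In the prefill/decode (PD) setup, P is the prefill instance and D is the decode instance. When D generates a response with thinking traces, the kv_transfer_params it returns record: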

- remote_num_tokens = request.num_computed_tokens, covering [prompt | thinking_tokens | response_tokens]
- remote_block_ids: the physical blocks for the entire sequence

On the next turn, if the client strips the thinking traces from the history before sending it, P receives a prompt like [prompt | response_tokens | new_user_msg]: the thinking tokens are missing from the middle.
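A toy illustration of why the stripped prompt no longer lines up with what D computed. The token IDs are made up, and common_prefix_len is a hypothetical helper, not a vLLM function; it simply counts how many leading positions of D's sequence and P's prompt agree.

```python
# Made-up token IDs standing in for real tokenizer output.
PROMPT = [101, 102, 103]
THINKING = [201, 202]
RESPONSE = [301, 302]
NEW_USER_MSG = [401]

# Sequence D computed on turn 1 (what remote_num_tokens / remote_block_ids cover).
d_sequence = PROMPT + THINKING + RESPONSE

# Prompt P receives on turn 2, with the history intact vs. with thinking stripped.
p_prompt_intact = d_sequence + NEW_USER_MSG
p_prompt_stripped = PROMPT + RESPONSE + NEW_USER_MSG


def common_prefix_len(a, b):
    """Number of leading positions where the two token lists agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


print(len(d_sequence))                                   # 7
print(common_prefix_len(d_sequence, p_prompt_intact))    # 7: every position D computed lines up
print(common_prefix_len(d_sequence, p_prompt_stripped))  # 3: alignment breaks where thinking was
```

With the history intact, all 7 tokens D computed line up with P's prompt; once thinking is stripped, only the original prompt does.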

The block-alignment logic in _apply_prefix_caching (worker.py:2310-2316) does suffix trimming:

remote_block_ids[i] = remote_group[-num_local_blocks:]

This assumes that P's prompt is a strict prefix of D's sequence. That holds in the normal case but breaks when tokens are removed from the middle for any reason, such as compacting history or dropping thinking traces.

Result: P loads KV cache computed for different tokens than its actual input, leading to silently incorrect inference.
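The toy model below shows how the suffix trim can hand P the wrong blocks. The names remote_group and num_local_blocks mirror the quoted line, but their meaning here is an assumption: num_local_blocks is taken to be the number of blocks P fills from D, and P is assumed to load only the history part of its prompt. The block size, block IDs and to_blocks helper are hypothetical; the real bookkeeping in worker.py is more involved.

```python
BLOCK_SIZE = 2  # tiny, hypothetical block size so the example stays readable


def to_blocks(tokens, block_size=BLOCK_SIZE):
    """Split a token list into full blocks; a partial tail block is dropped."""
    n_full = len(tokens) // block_size
    return [tuple(tokens[i * block_size:(i + 1) * block_size]) for i in range(n_full)]


prompt, thinking, response = [1, 2, 3, 4], [5, 6, 7, 8], [9, 10]

# Turn 1 on D: made-up physical block IDs and the tokens each block holds.
d_tokens = prompt + thinking + response
remote_group = [70, 71, 72, 73, 74]
d_block_contents = dict(zip(remote_group, to_blocks(d_tokens)))

# Turn 2 on P with thinking stripped. Assumption: P wants D's KV for the
# history part (prompt + response) and computes the new user message itself.
p_blocks_wanted = to_blocks(prompt + response)
num_local_blocks = len(p_blocks_wanted)  # assumption: number of blocks P fills from D

# Suffix trimming, as in the line quoted from worker.py.
trimmed = remote_group[-num_local_blocks:]

matches = []
for pos, (want, block_id) in enumerate(zip(p_blocks_wanted, trimmed)):
    got = d_block_contents[block_id]
    matches.append(got == want)
    print(f"P block {pos}: needs {want}, loads block {block_id} holding {got}:",
          "same token IDs" if got == want else "different tokens")
```

Here the trim selects blocks 72, 73 and 74, so P's first two blocks receive KV for thinking tokens instead of the prompt. Even the third block, whose token IDs match, holds KV computed with the thinking tokens in its context and at different positions, so it is not equivalent to what P would compute itself.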

Suggested Fixes

Router-level detection (currently suggested approach): a production router can store the previous turn's tokens, indexed by conversation_id. On the next turn, the router compares the new prompt tokens against them. If the prompt differs from what was expected (for example, because traces were stripped), the router clears the cached KV transfer params and lets P fall back to full recomputation.
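A minimal sketch of the router-level check, assuming the router can see the full token IDs D computed on the previous turn (thinking included) together with the kv_transfer_params it returned. ConversationKVCache, TurnRecord, remember and params_for are hypothetical names, not part of vLLM or any existing router, and the params dict contents are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TurnRecord:
    tokens: List[int]                    # full sequence D computed last turn
    kv_transfer_params: Dict[str, Any]   # params D returned for that sequence


@dataclass
class ConversationKVCache:
    """Hypothetical router-side store keyed by conversation_id."""
    records: Dict[str, TurnRecord] = field(default_factory=dict)

    def remember(self, conversation_id: str, tokens: List[int],
                 kv_transfer_params: Dict[str, Any]) -> None:
        self.records[conversation_id] = TurnRecord(list(tokens), dict(kv_transfer_params))

    def params_for(self, conversation_id: str,
                   prompt_tokens: List[int]) -> Optional[Dict[str, Any]]:
        """Return cached params only if the new prompt extends the previous sequence unchanged."""
        record = self.records.get(conversation_id)
        if record is None:
            return None
        prev = record.tokens
        if prompt_tokens[:len(prev)] == prev:
            return record.kv_transfer_params
        # Prompt diverged (e.g. thinking stripped): drop the params so P recomputes.
        del self.records[conversation_id]
        return None


cache = ConversationKVCache()
cache.remember("conv-1", [1, 2, 3, 5, 6, 9, 10], {"remote_num_tokens": 7})
print(cache.params_for("conv-1", [1, 2, 3, 5, 6, 9, 10, 11]))  # {'remote_num_tokens': 7}
print(cache.params_for("conv-1", [1, 2, 3, 9, 10, 11]))        # None: fall back to recomputation
```

Comparing token IDs assumes the tokenizer and chat template reproduce the earlier tokens exactly. If retokenization differs at a message boundary, the check fails safe: the router drops the params and P recomputes, costing latency but not correctness.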

vLLM-level fix: this would require designing a thinking-token-aware solution, possibly block-aligned so that only non-thinking tokens are pulled. It would take significantly more effort and would need an RFC first.
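The thinking-aware design is still open. As a much simpler, conservative stand-in (not the design described above), a vLLM-side guard could reuse only the block-aligned prefix where D's tokens and P's prompt agree, because each token's KV entry depends on every token before it. reusable_block_ids is a hypothetical helper, and the token and block IDs are made up.

```python
from typing import List, Sequence


def reusable_block_ids(d_tokens: Sequence[int], remote_block_ids: List[int],
                       p_tokens: Sequence[int], block_size: int) -> List[int]:
    """Return D's block IDs whose tokens match P's prompt position for position,
    stopping at the first divergence (later KV depends on the diverged tokens)."""
    matched = 0
    for a, b in zip(d_tokens, p_tokens):
        if a != b:
            break
        matched += 1
    # Only whole blocks are reusable; a partially matching block is recomputed.
    n_blocks = matched // block_size
    return remote_block_ids[:n_blocks]


d_tokens = list(range(1, 11))            # prompt + thinking + response on D
remote_block_ids = [70, 71, 72, 73, 74]  # made-up physical block IDs
print(reusable_block_ids(d_tokens, remote_block_ids, [1, 2, 3, 4, 9, 10, 11, 12], block_size=2))  # [70, 71]
print(reusable_block_ids(d_tokens, remote_block_ids, d_tokens + [11, 12], block_size=2))          # [70, 71, 72, 73, 74]
```

This keeps the prompt's KV when thinking is stripped and recomputes everything after the divergence, trading some recomputation for correctness. It does not pull the post-thinking response KV, which is what a thinking-aware design would aim to recover.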
