vllm - [Bug]: [PD + SpecDec] Prefix-cache trimming drops wrong block when P has extra lookahead block



Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

While discussing #43733 with @benchislett, we realized that the original workaround introduced for PD (#22317) can cause similar silent KV cache corruption.

When P/D disaggregation is used with speculative decoding (EAGLE), the effective_lookahead_tokens workaround (scheduler.py:L700-707) only zeroes lookahead on D (gated by load_kv_async). P keeps its full num_lookahead_tokens, so at block boundaries P allocates one more block than D. The connector's prefix-cache trimming then drops the first data block instead of the extra lookahead block at the end.

Example

block_size=16, num_prompt_tokens=16, P has num_lookahead_tokens=1:

P allocates ceil(17/16) = 2 blocks: [b0, b1]
    b0: KV for tokens 0-15 (prompt data)
    b1: KV for token 16 (lookahead / draft)
D allocates ceil(16/16) = 1 block: [a0]

_apply_prefix_caching (worker.py:L2312-2316) sees 1 local < 2 remote, trims: remote[-1:] = [b1]
Transfer: P's b1 (lookahead KV for token 16) -> D's a0 (should hold KV for tokens 0-15)

D ends up with the wrong data: KV for a speculated token instead of the first block of prompt KV. This is a correctness bug at block boundaries (prompt length is an exact multiple of block_size).
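The arithmetic in this example can be reproduced with a small standalone sketch. It does not call any vLLM code: `num_blocks` and `trim_assuming_cached_prefix` are hypothetical helpers that only mirror the `ceil` block count and the `remote[-num_local:]` slice described in this report, and the block names are plain strings.

```python
import math


def num_blocks(num_tokens: int, block_size: int) -> int:
    """Blocks needed to hold num_tokens KV entries (hypothetical helper)."""
    return math.ceil(num_tokens / block_size)


def trim_assuming_cached_prefix(remote_blocks: list[str], num_local: int) -> list[str]:
    """Keep the last num_local remote blocks.

    Mirrors the remote[-num_local:] slice from the report; this is an
    illustration, not the actual vLLM connector code.
    """
    return remote_blocks[-num_local:]


block_size = 16
num_prompt_tokens = 16
num_lookahead_tokens = 1  # P keeps its lookahead, D has it zeroed

# P allocates for prompt + lookahead, D allocates for the prompt only.
p_blocks = [f"b{i}" for i in range(num_blocks(num_prompt_tokens + num_lookahead_tokens, block_size))]
d_blocks = [f"a{i}" for i in range(num_blocks(num_prompt_tokens, block_size))]

transferred = trim_assuming_cached_prefix(p_blocks, len(d_blocks))
expected = p_blocks[: len(d_blocks)]  # the blocks that actually hold prompt KV

print("P blocks:   ", p_blocks)
print("D blocks:   ", d_blocks)
print("transferred:", transferred)
print("expected:   ", expected)
```

In this toy model the slice selects `b1` (the lookahead block) while the prompt KV lives in `b0`, which matches the mismatch described above.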

Root cause

_apply_prefix_caching assumes num_local < num_remote means the extra remote blocks are a cached prefix at the front (to be skipped). It trims with remote[-num_local:]. But here the extra block is a lookahead block at the back, so the trim drops the wrong end.

Why it doesn't always manifest

The bug only triggers when num_prompt_tokens aligns exactly with a block boundary, causing the lookahead to spill into an extra block. Most test prompts don't hit this alignment.
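To see which prompt lengths would hit this alignment, the condition can be sketched as a comparison of block counts with and without lookahead. `spills_into_extra_block` is a hypothetical helper written for this illustration and assumes P allocates for `num_prompt_tokens + num_lookahead_tokens` while D allocates for `num_prompt_tokens` only, as in the example above.

```python
def spills_into_extra_block(
    num_prompt_tokens: int, block_size: int, num_lookahead_tokens: int
) -> bool:
    """True if lookahead makes P allocate more blocks than D (hypothetical model)."""
    # Integer ceiling division: -(-a // b) == ceil(a / b) for positive b.
    blocks_with_lookahead = -(-(num_prompt_tokens + num_lookahead_tokens) // block_size)
    blocks_without_lookahead = -(-num_prompt_tokens // block_size)
    return blocks_with_lookahead > blocks_without_lookahead


# Prompt lengths from 1 to 64 that trigger the mismatch with one lookahead token.
affected = [n for n in range(1, 65) if spills_into_extra_block(n, 16, 1)]
print(affected)
```

With `block_size=16` and a single lookahead token, only exact multiples of 16 are flagged. Under the same simplified model, a larger `num_lookahead_tokens` would widen the affected range to prompt lengths just below a boundary; whether that matches real scheduler behavior is an assumption of this sketch, not something stated in the report.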

Additionally, P wastes compute on drafter prefill + sampling + drafting that D discards entirely (to be addressed by something like #39266).

Proposed fix

I believe the specialized logic above introduces unnecessary complexity that we should address from a design perspective. Therefore I propose the following

Related

  • #22317 -- introduced the zeroing workaround
  • #33702 -- RFC for explicit P/D roles (would allow addressing this cleanly)
  • #43733 -- DFlash regression caused by the same num_computed_tokens == 0 condition
  • #39266 -- Skip draft propose for disagg prefill instance

