vllm - 💡(How to fix) Fix [Bug]: [PD + SpecDec] Prefix-cache trimming drops wrong block when P has extra lookahead block

vllm 2026-05-29 15:25:32

Root Cause

_apply_prefix_caching assumes that when num_local < num_remote, the extra remote blocks are a cached prefix at the front that can be skipped, so it trims with remote[-num_local:]. Here, however, the extra block is a lookahead block at the back, so the trim drops the wrong end.
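The two readings of a block-count mismatch can be sketched in a few lines of Python. This is a minimal illustration, not vLLM's actual _apply_prefix_caching: the function name trim_remote_blocks and the string block labels are hypothetical, and only the remote[-num_local:] slice is taken from the description above.

```python
def trim_remote_blocks(local_blocks, remote_blocks):
    """Hypothetical stand-in for the trimming step.

    Keeps the LAST len(local_blocks) remote blocks, i.e. it assumes any
    surplus remote blocks sit at the front. Assumes at least one local block.
    """
    num_local = len(local_blocks)
    if num_local < len(remote_blocks):
        return remote_blocks[-num_local:]
    return remote_blocks


# Case 1: D already has the first block cached, so it only allocates a
# local block for the uncached tail. Dropping the front is correct.
prefix_hit = trim_remote_blocks(["a1"], ["b0", "b1"])

# Case 2: the surplus remote block is a lookahead block at the back.
# The same slice now drops b0 (prompt KV) and keeps b1 (draft KV).
lookahead_case = trim_remote_blocks(["a0"], ["b0", "b1"])

print(prefix_hit)      # ['b1'], the uncached tail: what D needs
print(lookahead_case)  # ['b1'], lookahead KV: D needed b0 instead
```

Both calls return the same slice, which is the core of the problem: from block counts alone, the trimming step cannot tell a cached prefix from a trailing lookahead block.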

Fix Action

Fix / Workaround

Chatting with @benchislett re: #43733, we realized the original workaround introduced for PD (#22317) can cause similar silent KV Cache corruption issues.

When P/D disaggregation is used with speculative decoding (EAGLE), the effective_lookahead_tokens workaround (scheduler.py:L700-707) only zeroes lookahead on D (gated by load_kv_async). P keeps its full num_lookahead_tokens, so at block boundaries P allocates one more block than D. The connector's prefix-cache trimming then drops the first data block instead of the extra lookahead block at the end.

Related
#22317 -- introduced the zeroing workaround
#33702 -- RFC for explicit P/D roles (would allow addressing this cleanly)
#43733 -- DFlash regression caused by the same num_computed_tokens == 0 condition
#39266 -- Skip draft propose for disagg prefill instance

🐛 Describe the bug

Chatting with @benchislett re: #43733, we realized the original workaround introduced for PD (#22317) can cause similar silent KV Cache corruption issues.

Example

block_size=16, num_prompt_tokens=16, P has num_lookahead_tokens=1:

P allocates ceil(17/16) = 2 blocks: [b0, b1]
  b0: KV for tokens 0-15 (prompt data)
  b1: KV for token 16 (lookahead / draft)
D allocates ceil(16/16) = 1 block: [a0]
_apply_prefix_caching (worker.py:L2312-2316) sees 1 local < 2 remote, trims: remote[-1:] = [b1]
Transfer: P's b1 (lookahead KV for token 16) -> D's a0 (should hold KV for tokens 0-15)

D ends up with the wrong data: KV for a speculated token instead of the first block of prompt KV. This is a correctness bug at block boundaries (prompt length is an exact multiple of block_size).

Root cause

Why it doesn't always manifest

The bug only triggers when num_prompt_tokens aligns exactly with a block boundary, causing the lookahead to spill into an extra block. Most test prompts don't hit this alignment.
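To see which prompt lengths are affected, the block counts on each side can be compared directly. The sketch below assumes a plain ceil(tokens / block_size) allocation model, as in the example above; the helper names blocks_needed and p_has_extra_block are hypothetical and not part of vLLM.

```python
import math


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Blocks required to hold num_tokens under a simple ceil model (assumption)."""
    return math.ceil(num_tokens / block_size)


def p_has_extra_block(num_prompt_tokens: int, block_size: int,
                      num_lookahead_tokens: int) -> bool:
    """True when P (prompt + lookahead) allocates more blocks than D (prompt only)."""
    p_blocks = blocks_needed(num_prompt_tokens + num_lookahead_tokens, block_size)
    d_blocks = blocks_needed(num_prompt_tokens, block_size)
    return p_blocks > d_blocks


# Prompt lengths 1..64 with block_size=16 and a single lookahead token.
affected = [n for n in range(1, 65) if p_has_extra_block(n, 16, 1)]
print(affected)  # [16, 32, 48, 64]
```

Under the same model, a larger num_lookahead_tokens would widen the affected range: with a lookahead of 3, prompt lengths 14, 15, and 16 would all spill into an extra block.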

Additionally, P wastes compute on drafter prefill + sampling + drafting that D discards entirely (to be addressed by something like #39266).

Proposed fix

I believe the specialized logic above introduces unnecessary complexity that we should address from a design perspective. Therefore I propose the following:

Introduce roles as described in https://github.com/vllm-project/vllm/issues/43807 so that we have a reliable way to tell from the config whether the current instance is prefilling or decoding in a PD setup.
Skip sampling from the draft model entirely on P and set num_lookahead_tokens=0 on P, so that no extra lookahead tokens are ever allocated on P.
On D, the scheduler only allocates N blocks when loading from remote, then (1+)K on subsequent decoding steps (async loading takes one or more scheduler steps). The current fixed version https://github.com/vllm-project/vllm/blob/11dfa3169d16b85adf74f7a9fc386b50b67bc732/vllm/v1/core/sched/scheduler.py#L710 already accounts for this (and currently handles loading in chunks). Other PD connector classes won't need to handle this internally. @benchislett should we extend it to other non-EAGLE SD methods?
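The role-based behaviour proposed above might look roughly like the following. This is only a sketch of the idea: the Role enum, the lookahead_for_step function, and the loading_remote_kv flag are hypothetical names chosen for illustration and do not correspond to existing vLLM APIs.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical explicit instance role in a PD setup."""
    PREFILL = "prefill"
    DECODE = "decode"


def lookahead_for_step(role: Role, num_lookahead_tokens: int,
                       loading_remote_kv: bool) -> int:
    """Lookahead tokens to reserve when allocating blocks for one scheduler step."""
    if role is Role.PREFILL:
        # P never drafts, so it never reserves lookahead slots.
        return 0
    if loading_remote_kv:
        # D allocates exactly the blocks it is receiving from P.
        return 0
    # Ordinary decode steps on D reserve room for speculated tokens.
    return num_lookahead_tokens


print(lookahead_for_step(Role.PREFILL, 3, False))  # 0
print(lookahead_for_step(Role.DECODE, 3, True))    # 0
print(lookahead_for_step(Role.DECODE, 3, False))   # 3
```

With the role read from config, P and D would allocate the same number of blocks for the prompt, so the connector would have no surplus lookahead block to misinterpret.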
