vllm - 💡(How to fix) Fix [CI Failure][Bug] AsyncScheduler drops first post-resume token after pause_generation(mode="keep") + clear_cache

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Under async scheduling, pause_generation(mode="keep") with the default clear_cache=True causes AsyncScheduler to silently drop the first valid post-resume token. The remaining tokens line up shifted by one. Production blast radius is small (RLHF training absorbs the loss into normal noise), but it's deterministic and incorrect.

Root Cause

Scheduler.reset_prefix_cache(reset_running_requests=True) (scheduler.py:1813-1819) unconditionally sets request.discard_latest_async_tokens = True for each preempted running request. AsyncScheduler._update_request_with_output (async_scheduler.py:37-44) then drops the next output frame.

The flag is meant to discard in-flight async output frames invalidated by the reset — but it's set even when the engine has drained to zero in-flight frames (e.g. after pause_generation quiesces the engine). The first valid post-resume token is dropped instead. The flag is also under-specified for pipeline depth ≥ 2 (e.g. spec decode adds 1 + cur_num_spec_tokens placeholders per step).

Fix Action

Fix / Workaround

Distributed Tests (2 GPUs)(H100) running examples/rl/rlhf_async_new_apis.py started failing 0/13 prompts when #41421 flipped VLLM_USE_RAY_V2_EXECUTOR_BACKEND default to 1. RayExecutorV2 inherits async scheduling from MultiprocExecutor, which exposes a long-latent bug. #42042 disables async_scheduling in the example as a workaround.

Code Example

# scheduler.py reset_prefix_cache loop
request.async_tokens_to_discard = request.num_output_placeholders
request.num_output_placeholders = 0

---

# async_scheduler.py _update_request_with_output
if request.async_tokens_to_discard > 0:
    request.async_tokens_to_discard -= 1
    return [], False
RAW_BUFFERClick to expand / collapse

Summary

Under async scheduling, pause_generation(mode="keep") with the default clear_cache=True causes AsyncScheduler to silently drop the first valid post-resume token. The remaining tokens line up shifted by one. Production blast radius is small (RLHF training absorbs the loss into normal noise), but it's deterministic and incorrect.

Reproduction

Distributed Tests (2 GPUs)(H100) running examples/rl/rlhf_async_new_apis.py started failing 0/13 prompts when #41421 flipped VLLM_USE_RAY_V2_EXECUTOR_BACKEND default to 1. RayExecutorV2 inherits async scheduling from MultiprocExecutor, which exposes a long-latent bug. #42042 disables async_scheduling in the example as a workaround.

Root cause

Scheduler.reset_prefix_cache(reset_running_requests=True) (scheduler.py:1813-1819) unconditionally sets request.discard_latest_async_tokens = True for each preempted running request. AsyncScheduler._update_request_with_output (async_scheduler.py:37-44) then drops the next output frame.

The flag is meant to discard in-flight async output frames invalidated by the reset — but it's set even when the engine has drained to zero in-flight frames (e.g. after pause_generation quiesces the engine). The first valid post-resume token is dropped instead. The flag is also under-specified for pipeline depth ≥ 2 (e.g. spec decode adds 1 + cur_num_spec_tokens placeholders per step).

Suggested fix

Replace the boolean with a counter initialized from request.num_output_placeholders at reset time:

# scheduler.py reset_prefix_cache loop
request.async_tokens_to_discard = request.num_output_placeholders
request.num_output_placeholders = 0
# async_scheduler.py _update_request_with_output
if request.async_tokens_to_discard > 0:
    request.async_tokens_to_discard -= 1
    return [], False

This correctly handles the drained-engine case (count == 0, no spurious drop), the existing in-flight case (count == 1, one drop), and spec-decode pipelines (count == N, N drops).

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [CI Failure][Bug] AsyncScheduler drops first post-resume token after pause_generation(mode="keep") + clear_cache