vllm - 💡(How to fix) Fix [CI Failure][Bug] AsyncScheduler drops first post-resume token after pause_generation(mode="keep") + clear_cache

vllm, 2026-05-08 08:09:48

Under async scheduling, pause_generation(mode="keep") with the default clear_cache=True causes AsyncScheduler to silently drop the first valid post-resume token. The remaining tokens line up shifted by one. Production blast radius is small (RLHF training absorbs the loss into normal noise), but it's deterministic and incorrect.

Root Cause

Scheduler.reset_prefix_cache(reset_running_requests=True) (scheduler.py:1813-1819) unconditionally sets request.discard_latest_async_tokens = True for each preempted running request. AsyncScheduler._update_request_with_output (async_scheduler.py:37-44) then drops the next output frame.

The flag is meant to discard in-flight async output frames invalidated by the reset — but it's set even when the engine has drained to zero in-flight frames (e.g. after pause_generation quiesces the engine). The first valid post-resume token is dropped instead. The flag is also under-specified for pipeline depth ≥ 2 (e.g. spec decode adds 1 + cur_num_spec_tokens placeholders per step).
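The drained-engine case described above can be reproduced with a toy model. This is a minimal sketch, not vLLM code: `ToyRequest`, the two free functions, and the token values are hypothetical stand-ins that mirror only the behavior the issue describes (flag set unconditionally at reset, next output frame dropped).

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    # Hypothetical stand-in for vLLM's Request; only fields named in the issue.
    discard_latest_async_tokens: bool = False
    num_output_placeholders: int = 0
    output_token_ids: list = field(default_factory=list)


def reset_prefix_cache(request: ToyRequest) -> None:
    # Mirrors the described bug: the flag is set even with zero in-flight frames.
    request.discard_latest_async_tokens = True
    request.num_output_placeholders = 0


def update_request_with_output(request: ToyRequest, new_token_ids: list) -> list:
    # Mirrors the described drop of the next output frame when the flag is set.
    if request.discard_latest_async_tokens:
        request.discard_latest_async_tokens = False
        return []
    request.output_token_ids.extend(new_token_ids)
    return new_token_ids


req = ToyRequest()            # engine already drained: no in-flight frames
reset_prefix_cache(req)       # pause + cache clear
for frame in ([101], [102], [103]):   # valid frames produced after resume
    update_request_with_output(req, frame)

print(req.output_token_ids)   # [102, 103]: token 101 was valid but dropped
```

In this toy run nothing was in flight at reset time, yet the first post-resume frame is still discarded, which matches the shifted-by-one symptom.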

Fix Action

Fix / Workaround

The CI job Distributed Tests (2 GPUs)(H100), which runs examples/rl/rlhf_async_new_apis.py, started failing (0/13 prompts) when #41421 flipped the VLLM_USE_RAY_V2_EXECUTOR_BACKEND default to 1. RayExecutorV2 inherits async scheduling from MultiprocExecutor, which exposed this long-latent bug. As a workaround, #42042 disables async_scheduling in the example.

Suggested fix

Replace the boolean with a counter initialized from request.num_output_placeholders at reset time:

# scheduler.py reset_prefix_cache loop
request.async_tokens_to_discard = request.num_output_placeholders
request.num_output_placeholders = 0

# async_scheduler.py _update_request_with_output
if request.async_tokens_to_discard > 0:
    request.async_tokens_to_discard -= 1
    return [], False

This correctly handles the drained-engine case (count == 0, no spurious drop), the existing in-flight case (count == 1, one drop), and spec-decode pipelines (count == N, N drops).
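The three cases listed above can be checked against a toy model of the counter. This is a sketch under stated assumptions: `ToyRequest`, `run_case`, and the frame values are hypothetical, and the toy treats one placeholder as one output frame, which may not match how real spec-decode frames are batched.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    # Hypothetical stand-in for vLLM's Request; only fields named in the issue.
    num_output_placeholders: int = 0
    async_tokens_to_discard: int = 0
    output_token_ids: list = field(default_factory=list)


def reset_prefix_cache(request: ToyRequest) -> None:
    # Counter is initialized from the in-flight placeholder count at reset time.
    request.async_tokens_to_discard = request.num_output_placeholders
    request.num_output_placeholders = 0


def update_request_with_output(request: ToyRequest, new_token_ids: list):
    # Drop exactly as many frames as were in flight when the reset happened.
    if request.async_tokens_to_discard > 0:
        request.async_tokens_to_discard -= 1
        return [], False
    request.output_token_ids.extend(new_token_ids)
    return new_token_ids, False


def run_case(in_flight: int, frames: list) -> list:
    # Hypothetical helper: reset with `in_flight` placeholders, then feed frames.
    req = ToyRequest(num_output_placeholders=in_flight)
    reset_prefix_cache(req)
    for frame in frames:
        update_request_with_output(req, frame)
    return req.output_token_ids


frames = [[1], [2], [3], [4]]
drained = run_case(0, frames)      # nothing in flight: no frame dropped
one_in_flight = run_case(1, frames)    # one stale frame dropped
three_in_flight = run_case(3, frames)  # three stale frames dropped

print(drained, one_in_flight, three_in_flight)
# [1, 2, 3, 4] [2, 3, 4] [4]
```

With a counter, the number of dropped frames follows the number of placeholders outstanding at reset, so the drained case no longer loses a valid token.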
