vllm - ✅(Solved) Fix Flaky test: test_abort_during_final_step[False] fails intermittently [4 pull requests, 3 comments, 2 participants]

vllm-project/vllm#38221 (fetched 2026-04-08 01:37:16)

tests/v1/engine/test_abort_final_step.py::test_abort_during_final_step[False] fails intermittently in CI, causing the Engine (1 GPU) job to fail.

Error Message

AssertionError: Expected at least 1 captured finish status, got 0. File content: ['INIT:WORKER', 'INIT:SCHEDULER'] tests/v1/engine/test_abort_final_step.py:287

Fix Action

Fixed

PR fix notes

PR #38009: [bug] Fix remaining START_DP_WAVE pause race in _handle_client_request

Description (problem / solution / changelog)

Purpose

Closes https://github.com/vllm-project/vllm/issues/36594 (remaining race)

PR #37024 fixed the START_DP_WAVE / pause race in add_request() by checking scheduler.pause_state before setting engines_running = True. However, the same unguarded pattern exists in DPEngineCoreProc._handle_client_request().

When pause_generation() + collective_rpc() is used for online weight update, a late START_DP_WAVE from the DP coordinator can re-arm the engine loop via _handle_client_request while the engine is paused. The re-armed engine enters the dummy-batch ALLREDUCE while the peer engine is in collective_rpc, causing a one-sided collective deadlock (NCCL timeout after 600s).

Race timeline

Engine 0 (serving)       Coordinator            Engine 1 (serving)
     |                       |                       |
 new req ----FIRST_REQ ----->|--- START_DP_WAVE ---->| (queued in zmq)
     |                       |                       |
 pause_scheduler ------------|---------------------->| pause_scheduler
 (returns: idle fast path)   |                       | (returns: idle fast path)
     |                       |                       |
 collective_rpc -------------|                 _handle_client_request
     |                       |                   START_DP_WAVE arrives
     |                       |                   engines_running = True ← BUG
     |                       |                   → dummy ALLREDUCE
     |                       |                       |
 collective_rpc ALLREDUCE ←--X-- rank mismatch --→ dummy ALLREDUCE
     |                  NCCL TIMEOUT (600s)          |

Fix

Add the same PauseState.UNPAUSED guard to _handle_client_request that #37024 added to add_request.
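
The guard can be illustrated with a toy model. This is a minimal sketch, not vLLM code: ToyEngine, handle_start_dp_wave, and the PauseState.PAUSED member are hypothetical names introduced here; only PauseState.UNPAUSED, pause_state, and engines_running come from the PR description.

```python
from enum import Enum, auto


class PauseState(Enum):
    UNPAUSED = auto()
    PAUSED = auto()  # hypothetical member; the real enum may differ


class ToyEngine:
    """Hypothetical stand-in for the engine loop state; not vLLM code."""

    def __init__(self) -> None:
        self.pause_state = PauseState.UNPAUSED
        self.engines_running = False

    def handle_start_dp_wave(self, guarded: bool) -> None:
        # Unguarded: a late START_DP_WAVE always re-arms the loop.
        # Guarded: re-arm only while the scheduler is not paused.
        if not guarded or self.pause_state is PauseState.UNPAUSED:
            self.engines_running = True


def late_wave_rearms(guarded: bool) -> bool:
    engine = ToyEngine()
    engine.pause_state = PauseState.PAUSED  # the pause has already taken effect
    engine.handle_start_dp_wave(guarded)    # the queued START_DP_WAVE arrives late
    return engine.engines_running


print(late_wave_rearms(guarded=False))  # True: the paused engine is re-armed
print(late_wave_rearms(guarded=True))   # False: the late message is ignored
```

In the unguarded case the paused engine would go on to the dummy ALLREDUCE shown in the race timeline; with the guard it stays idle until it is resumed.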

Test Plan

Reproduced consistently with DPEP (DP=2, TP=8, EP) + online weight sync on MoE model. The race is timing-dependent — enabling --enable-return-routed-experts (which adds a small per-step GPU buffer write) widens the race window enough to hit it on ~100% of runs. Without the fix: 3/3 runs deadlocked within 5 minutes of first weight sync. With the fix: pending verification.

Test Result

Pending large-scale training run with fix applied.

Changed files

  • vllm/v1/engine/core.py (modified, +4/-1)

PR #37460: [Core][Metrics][BugFix] Replace num_cached_tokens/num_external_computed_tokens with PrefillStats

Description (problem / solution / changelog)

Related to the discussion in #36859 and the Counters can only be incremented by non-negative amounts error with the vllm:prompt_tokens_by_source_total metric.

In OutputProcessor, we take the first EngineCoreOutput as a signal that prefill has completed, and record certain statistics about it.

On the scheduler side, because of preemption, we might have prefills that are scheduled but never completed, or we might need to recompute an already completed prefill. To add clarity, we use PrefillStats to track these stats until the first prefill is completed, return this to the frontend via EngineCoreOutput, and then stop tracking PrefillStats.

num_cached_tokens was previously used for KV transfer failure recovery, but this is no longer true as of #38096.
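
The PrefillStats lifecycle described above can be sketched roughly as follows. This is an illustration under assumptions: the field names, ToyRequest, record_prefill, and take_prefill_stats are hypothetical, and overwriting the stats on a rescheduled prefill is a guess at the intended behavior rather than something the PR description states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefillStats:
    """Hypothetical fields; the real PrefillStats may track different values."""
    cached_tokens: int = 0
    computed_tokens: int = 0


class ToyRequest:
    """Hypothetical request object that tracks stats only until prefill completes."""

    def __init__(self) -> None:
        self.prefill_stats: Optional[PrefillStats] = PrefillStats()

    def record_prefill(self, cached: int, computed: int) -> None:
        # While prefill is incomplete, a rescheduled attempt replaces the
        # numbers from an earlier, preempted attempt (assumed behavior).
        if self.prefill_stats is not None:
            self.prefill_stats = PrefillStats(cached, computed)

    def take_prefill_stats(self) -> Optional[PrefillStats]:
        # Called when the first output is produced: hand the stats over
        # exactly once, then stop tracking.
        stats, self.prefill_stats = self.prefill_stats, None
        return stats


req = ToyRequest()
req.record_prefill(cached=0, computed=100)   # scheduled, then preempted
req.record_prefill(cached=64, computed=36)   # rescheduled and completed
first = req.take_prefill_stats()             # reported with the first output
later = req.take_prefill_stats()             # None: tracking has stopped
```

Because the stats are handed over once, the frontend only ever sees one set of non-negative values per request, which is the property a monotonically increasing counter needs.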

Changed files

  • tests/v1/core/test_async_scheduler.py (modified, +18/-11)
  • tests/v1/engine/test_output_processor.py (modified, +10/-4)
  • tests/v1/engine/utils.py (modified, +27/-6)
  • tests/v1/kv_connector/unit/test_invalid_blocks_correctness.py (modified, +59/-0)
  • tests/v1/metrics/test_stats.py (modified, +62/-29)
  • vllm/v1/core/sched/scheduler.py (modified, +16/-12)
  • vllm/v1/engine/__init__.py (modified, +4/-5)
  • vllm/v1/engine/output_processor.py (modified, +7/-3)
  • vllm/v1/metrics/stats.py (modified, +66/-24)
  • vllm/v1/request.py (modified, +14/-5)

PR #34789: [Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop

Description (problem / solution / changelog)

Purpose

Fix event loop blocking caused by multimodal request preprocessing (base64 decoding, image transforms, HF processor operations) and chat template rendering. Under high concurrency, these synchronous CPU-bound operations block the asyncio event loop, causing /health, /v1/models, and /metrics endpoints to become unresponsive (P95 latency >200ms, with spikes over 1s).

Changes:

  • Add a shared ThreadPoolExecutor on BaseRenderer (size controlled by --preprocessing-thread-pool-workers, default 1)
  • Always offload multimodal preprocessing to the shared thread pool to keep the event loop responsive
  • Wrap chat template rendering in HfRenderer, MistralRenderer, DeepseekV32Renderer, and Grok2Renderer with the shared executor via make_async
  • Consolidate MistralRenderer's separate ThreadPoolExecutor into the shared one
  • Serialize clear_mm_cache through the shared executor to avoid races with concurrent process_inputs on the mm_processor_cache
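
The offloading idea in the list above can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: blocking_preprocess, max_heartbeat_gap, and run_demo are hypothetical names, and time.sleep stands in for preprocessing work that releases the GIL.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_preprocess(duration_s: float) -> str:
    # Stand-in for synchronous preprocessing (e.g. image decoding).
    time.sleep(duration_s)
    return "done"


async def max_heartbeat_gap(offload: bool, executor: ThreadPoolExecutor) -> float:
    """Return the largest gap between event-loop heartbeats while work runs."""
    loop = asyncio.get_running_loop()
    gaps: list[float] = []
    stop = asyncio.Event()

    async def heartbeat() -> None:
        last = time.perf_counter()
        while not stop.is_set():
            await asyncio.sleep(0.01)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0.03)  # let the heartbeat start ticking
    if offload:
        # The event loop keeps running while the worker thread does the work.
        await loop.run_in_executor(executor, blocking_preprocess, 0.2)
    else:
        blocking_preprocess(0.2)  # runs on the event loop thread and blocks it
    await asyncio.sleep(0.03)
    stop.set()
    await hb
    return max(gaps)


def run_demo() -> tuple[float, float]:
    # One shared single-worker pool, mirroring the default of one worker.
    with ThreadPoolExecutor(max_workers=1) as pool:
        blocked = asyncio.run(max_heartbeat_gap(False, pool))
        offloaded = asyncio.run(max_heartbeat_gap(True, pool))
    return blocked, offloaded
```

Calling run_demo() returns the worst heartbeat gap for each mode. The blocked gap is at least the full 0.2 s of work, while the offloaded gap stays close to the heartbeat interval, which is the same effect a /health handler would see.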

Test Plan

Benchmarked on 1x NVIDIA A100-SXM4-80GB using vllm bench serve with --request-rate 20 --num-prompts 200 and a custom high-concurrency benchmark with PaddleOCR-VL-1.5.

Tests performed:

  1. vllm bench serve with Llama-3.1-8B-Instruct (text-only, --request-rate 20 --num-prompts 200)
  2. vllm bench serve with Qwen2.5-VL-7B-Instruct (multimodal, --request-rate 20 --num-prompts 200)
  3. Custom high-concurrency benchmark with PaddleOCR-VL-1.5 (500 real OmniDocBench images, 300 concurrency)
  4. --preprocessing-thread-pool-workers comparison (1 vs 2 vs 4) with PaddleOCR-VL-1.5

Test Results

1. Text-Only (meta-llama/Llama-3.1-8B-Instruct)

vllm bench serve --request-rate 20 --num-prompts 200

| Metric | This PR | Main | Diff |
|---|---|---|---|
| Throughput (req/s) | 14.77 | 14.80 | -0.2% |
| Output tok/s | 1,889.97 | 1,893.27 | -0.2% |
| Mean TTFT (ms) | 278.02 | 273.71 | +1.6% |
| P99 TTFT (ms) | 518.97 | 525.17 | -1.2% |
| Mean TPOT (ms) | 54.14 | 53.79 | +0.7% |

No regression. All metrics within noise.

2. Multimodal (Qwen/Qwen2.5-VL-7B-Instruct)

vllm bench serve --request-rate 20 --num-prompts 200 --backend openai-chat --dataset-name random-mm

| Metric | This PR | Main | Diff |
|---|---|---|---|
| Throughput (req/s) | 6.74 | 6.73 | +0.1% |
| Output tok/s | 862.46 | 861.79 | +0.1% |
| Mean TTFT (ms) | 7,489.43 | 7,483.03 | +0.1% |
| P99 TTFT (ms) | 17,005.28 | 16,974.36 | +0.2% |
| Mean TPOT (ms) | 125.91 | 126.07 | -0.1% |

No regression. All metrics within noise.

3. PaddleOCR-VL-1.5 High Concurrency (500 prompts, 300 concurrency)

Custom benchmark with real OmniDocBench document images. --max-num-batched-tokens 131072 --no-enable-prefix-caching --mm-processor-cache-gb 0 --gpu-memory-utilization 0.5

| Metric | This PR | Main | Diff |
|---|---|---|---|
| Throughput (req/s) | 5.88 | 5.67 | +3.7% |
| Token throughput (tok/s) | 4,712.97 | 4,504.46 | +4.6% |
| TTFT mean (ms) | 31,117.71 | 33,058.79 | -5.9% |
| TTFT P99 (ms) | 47,273.22 | 51,613.86 | -8.4% |
| /health median (ms) | 0.70 | 222.44 | 318x better |
| /health P99 (ms) | 18.88 | 1,641.53 | 87x better |

Event loop stays fully responsive under high multimodal concurrency. The /health endpoint drops from 222ms to <1ms median.

4. --preprocessing-thread-pool-workers Comparison (PaddleOCR-VL-1.5, 500 prompts, 300 concurrency)

| Metric | workers=1 | workers=2 | workers=4 |
|---|---|---|---|
| Throughput (req/s) | 5.86 | 5.92 | 5.87 |
| Token throughput (tok/s) | 4,624.78 | 4,559.03 | 4,554.05 |
| TTFT mean (ms) | 31,319 | 30,974 | 31,294 |
| TTFT P99 (ms) | 47,536 | 47,084 | 47,830 |
| /health median (ms) | 0.67 | 0.62 | 0.67 |
| /health P99 (ms) | 17.20 | 17.47 | 20.13 |

All worker counts perform identically. This is consistent with https://github.com/vllm-project/vllm/pull/34789#issuecomment-3055653700. The key improvement comes from offloading preprocessing off the event loop (so /health stays responsive), not from parallelizing it. Default of workers=1 is sufficient.

Summary

| What | Result |
|---|---|
| Event loop liveness (/health) | 318x improvement (222ms → 0.7ms median) |
| Request throughput (high concurrency) | +3.7% (5.67 → 5.88 req/s) |
| TTFT (high concurrency) | -5.9% (33.1s → 31.1s mean) |
| Text-only regression | None (-0.2% throughput, within noise) |
| Multimodal regression | None (+0.1% throughput, within noise) |
| "Already borrowed" errors | Zero across all tests |

Changed files

  • tests/entrypoints/openai/chat_completion/test_chat_error.py (modified, +1/-0)
  • tests/entrypoints/openai/chat_completion/test_serving_chat.py (modified, +1/-0)
  • tests/entrypoints/openai/completion/test_completion_error.py (modified, +1/-0)
  • tests/entrypoints/openai/completion/test_lora_resolvers.py (modified, +1/-0)
  • tests/renderers/test_completions.py (modified, +1/-0)
  • tests/renderers/test_mistral.py (modified, +1/-0)
  • vllm/config/model.py (modified, +4/-0)
  • vllm/engine/arg_utils.py (modified, +6/-0)
  • vllm/renderers/base.py (modified, +118/-6)
  • vllm/renderers/deepseek_v32.py (modified, +18/-4)
  • vllm/renderers/grok2.py (modified, +18/-4)
  • vllm/renderers/hf.py (modified, +20/-9)
  • vllm/renderers/mistral.py (modified, +1/-3)
  • vllm/utils/async_utils.py (modified, +3/-1)
  • vllm/v1/engine/async_llm.py (modified, +1/-1)

Summary

tests/v1/engine/test_abort_final_step.py::test_abort_during_final_step[False] fails intermittently in CI, causing the Engine (1 GPU) job to fail.

Failure Signature

AssertionError: Expected at least 1 captured finish status, got 0. 
File content: ['INIT:WORKER', 'INIT:SCHEDULER']
tests/v1/engine/test_abort_final_step.py:287

The test expects the KV connector's request_finished() method to be called with FINISHED_ABORTED status, but it's never invoked despite successful initialization.

Key Observations

  • test_abort_during_final_step[False] (async_scheduling=False) - fails intermittently
  • test_abort_during_final_step[True] (async_scheduling=True) - passes consistently

This indicates a race condition specific to synchronous scheduling mode.

Flakiness Pattern

Likely Cause

Race condition where the request is completed/freed before abort processing:

  1. Request reaches max_tokens=1 and is marked FINISHED_LENGTH_CAPPED
  2. Request is freed or removed from tracking
  3. When abort is processed, finish_requests() skips it (already finished or not found)
  4. Connector never notified

The async scheduling path likely has timing differences that avoid this race.
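
The suspected ordering can be modeled with a toy scheduler. This is a sketch of the hypothesis only, not the vLLM scheduler: ToyScheduler, step, run, and connector_events are hypothetical, and the model records only abort-path notifications.

```python
from enum import Enum, auto


class Status(Enum):
    RUNNING = auto()
    FINISHED_LENGTH_CAPPED = auto()
    FINISHED_ABORTED = auto()


class ToyScheduler:
    """Hypothetical model of the suspected race; not the vLLM scheduler."""

    def __init__(self) -> None:
        self.requests: dict[str, Status] = {}
        self.connector_events: list[tuple[str, Status]] = []

    def add_request(self, req_id: str) -> None:
        self.requests[req_id] = Status.RUNNING

    def step(self, req_id: str) -> None:
        # With max_tokens=1 the first step finishes the request, which is
        # then freed and removed from tracking (steps 1 and 2 above).
        if req_id in self.requests:
            self.requests[req_id] = Status.FINISHED_LENGTH_CAPPED
            del self.requests[req_id]

    def finish_requests(self, req_id: str, status: Status) -> None:
        if req_id not in self.requests:
            return  # already freed: the abort is skipped (step 3)
        del self.requests[req_id]
        self.connector_events.append((req_id, status))  # connector notified


def run(abort_first: bool) -> list[tuple[str, Status]]:
    sched = ToyScheduler()
    sched.add_request("req-0")
    if abort_first:
        sched.finish_requests("req-0", Status.FINISHED_ABORTED)
        sched.step("req-0")
    else:
        sched.step("req-0")
        sched.finish_requests("req-0", Status.FINISHED_ABORTED)
    return sched.connector_events
```

Here run(abort_first=True) records one FINISHED_ABORTED event, while run(abort_first=False) records none, which mirrors the "got 0" captured finish statuses in the failure signature.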

Related

  • PR #29987: Original abort handling fix
  • Test file: tests/v1/engine/test_abort_final_step.py
  • Code: vllm/v1/core/sched/scheduler.py

extent analysis

Fix Plan

To address the intermittent test failure caused by a race condition in synchronous scheduling mode, we need to ensure that the request_finished() method is called with FINISHED_ABORTED status even when the request is completed or freed before abort processing.

Step-by-Step Solution:

  1. Modify the finish_requests() function in vllm/v1/core/sched/scheduler.py to check for requests that have been marked as FINISHED_LENGTH_CAPPED and notify the KV connector accordingly.
  2. Add a check for FINISHED_LENGTH_CAPPED requests in the finish_requests() function. The snippet below is an untested sketch; its names and signature are illustrative and may not match the actual scheduler code:

```python
def finish_requests(self, status):
    # ... existing code ...
    for request in self.requests:
        if request.status == FINISHED_LENGTH_CAPPED:
            # Notify the KV connector with FINISHED_ABORTED status
            self.kv_connector.request_finished(request.id, FINISHED_ABORTED)
    # ... existing code ...
```

  3. Update the test case in tests/v1/engine/test_abort_final_step.py to account for the new behavior. wait_for_request_status() is a hypothetical helper, not an existing one:

```python
def test_abort_during_final_step(self, async_scheduling):
    # ... existing code ...
    if not async_scheduling:
        # Wait for the request to be marked as FINISHED_LENGTH_CAPPED
        self.wait_for_request_status(FINISHED_LENGTH_CAPPED)
    # ... existing code ...
```

Verification

To verify that the fix worked, run the test_abort_during_final_step test case with async_scheduling=False and check that the request_finished() method is called with FINISHED_ABORTED status.

Extra Tips

  • Make sure to update the test case to account for the new behavior and avoid false positives.
  • Consider adding additional logging or debugging statements to help identify and diagnose similar issues in the future.
  • Review the code changes and test updates to ensure they do not introduce any new regressions or issues.
