vllm - ✅(Solved) Fix Flaky test: test_abort_during_final_step[False] fails intermittently [4 pull requests, 3 comments, 2 participants]

markmc · 2026-03-26T11:28:48Z

`tests/v1/engine/test_abort_final_step.py::test_abort_during_final_step[False]` fails intermittently in CI, causing the `Engine (1 GPU)` job to fail.


vllm-project/vllm#38221•Fetched 2026-04-08 01:37:16


Author: markmc

Participants: markmc, njhill

Timeline (top): cross-referenced ×6, commented ×3, subscribed ×2, closed ×1

tests/v1/engine/test_abort_final_step.py::test_abort_during_final_step[False] fails intermittently in CI, causing the Engine (1 GPU) job to fail.

Error Message

AssertionError: Expected at least 1 captured finish status, got 0.
File content: ['INIT:WORKER', 'INIT:SCHEDULER']
tests/v1/engine/test_abort_final_step.py:287

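Root Cause

Not confirmed. The issue suspects a race condition specific to synchronous scheduling (async_scheduling=False): the request finishes and is freed before the abort is processed, so the KV connector's request_finished() is never called with FINISHED_ABORTED.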

Fix Action

Linked PRs. This page lists all three as fixes, but per the PR notes that follow, #38009 and #37460 were still open and unmerged when the page was fetched; only #34789 was merged.

PR #38009: [bug] Fix remaining START_DP_WAVE pause race in _handle_client_request (https://github.com/vllm-project/vllm/pull/38009)
PR #37460: [Core][Metrics][BugFix] Replace num_cached_tokens/num_external_computed_tokens with PrefillStats (https://github.com/vllm-project/vllm/pull/37460)
PR #34789: [Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop (https://github.com/vllm-project/vllm/pull/34789)

PR fix notes

PR #38009: [bug] Fix remaining START_DP_WAVE pause race in _handle_client_request

Repository: vllm-project/vllm
Author: junjzhang
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38009

Description (problem / solution / changelog)

Purpose

Closes https://github.com/vllm-project/vllm/issues/36594 (remaining race)

PR #37024 fixed the START_DP_WAVE / pause race in add_request() by checking scheduler.pause_state before setting engines_running = True. However, the same unguarded pattern exists in DPEngineCoreProc._handle_client_request().

When pause_generation() + collective_rpc() is used for online weight update, a late START_DP_WAVE from the DP coordinator can re-arm the engine loop via _handle_client_request while the engine is paused. The re-armed engine enters the dummy-batch ALLREDUCE while the peer engine is in collective_rpc, causing a one-sided collective deadlock (NCCL timeout after 600s).

Race timeline

Engine 0 (serving)       Coordinator            Engine 1 (serving)
     |                       |                       |
 new req ----FIRST_REQ ----->|--- START_DP_WAVE ---->| (queued in zmq)
     |                       |                       |
 pause_scheduler ------------|---------------------->| pause_scheduler
 (returns: idle fast path)   |                       | (returns: idle fast path)
     |                       |                       |
 collective_rpc -------------|                 _handle_client_request
     |                       |                   START_DP_WAVE arrives
     |                       |                   engines_running = True ← BUG
     |                       |                   → dummy ALLREDUCE
     |                       |                       |
 collective_rpc ALLREDUCE ←--X-- rank mismatch --→ dummy ALLREDUCE
     |                  NCCL TIMEOUT (600s)          |

Fix

Add the same PauseState.UNPAUSED guard to _handle_client_request that #37024 added to add_request.
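
The guard described here can be illustrated with a small toy model. This is a minimal sketch, not the code from the PR: `ToyScheduler`, `ToyDPEngine`, `handle_start_dp_wave`, and the `PAUSED` member are hypothetical stand-ins, and only `PauseState.UNPAUSED`, `pause_state`, and `engines_running` are names taken from the PR text.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class PauseState(Enum):
    # Only UNPAUSED is named in the PR text; PAUSED stands in for any other state.
    UNPAUSED = auto()
    PAUSED = auto()


@dataclass
class ToyScheduler:
    pause_state: PauseState = PauseState.UNPAUSED


@dataclass
class ToyDPEngine:
    scheduler: ToyScheduler = field(default_factory=ToyScheduler)
    engines_running: bool = False

    def handle_start_dp_wave(self) -> bool:
        """Handle a (possibly late) START_DP_WAVE; return True if the loop was armed."""
        if self.scheduler.pause_state is not PauseState.UNPAUSED:
            # Paused: do not re-arm. Otherwise this rank would enter the
            # dummy-batch all-reduce while its peer is inside collective_rpc.
            return False
        self.engines_running = True
        return True


paused = ToyDPEngine(ToyScheduler(PauseState.PAUSED))
serving = ToyDPEngine()
print(paused.handle_start_dp_wave(), paused.engines_running)
print(serving.handle_start_dp_wave(), serving.engines_running)
```

In this toy, a late wave message leaves the paused engine idle while an unpaused engine is armed as before, which is the behaviour the PR description asks for.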

Test Plan

Reproduced consistently with DPEP (DP=2, TP=8, EP) + online weight sync on MoE model. The race is timing-dependent — enabling --enable-return-routed-experts (which adds a small per-step GPU buffer write) widens the race window enough to hit it on ~100% of runs. Without the fix: 3/3 runs deadlocked within 5 minutes of first weight sync. With the fix: pending verification.

Test Result

Pending large-scale training run with fix applied.

Changed files

vllm/v1/engine/core.py (modified, +4/-1)

PR #37460: [Core][Metrics][BugFix] Replace num_cached_tokens/num_external_computed_tokens with PrefillStats

Repository: vllm-project/vllm
Author: markmc
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37460

Description (problem / solution / changelog)

Related to the discussion in #36859 and the Counters can only be incremented by non-negative amounts error with the vllm:prompt_tokens_by_source_total metric.

In OutputProcessor, we take the first EngineCoreOutput as a signal that prefill has completed, and record certain statistics about it.

On the scheduler side, because of preemption, we might have prefills that are scheduled but never completed, or we might need to recompute an already completed prefill. To add clarity, we use PrefillStats to track these stats until the first prefill is completed, return this to the frontend via EngineCoreOutput, and then stop tracking PrefillStats.

num_cached_tokens was previously used for KV transfer failure recovery, but this is no longer true as of #38096.
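
The "track until the first prefill completes, report once, then stop" lifecycle described above might look roughly like the following. This is a sketch under assumptions: the fields of `PrefillStats`, the `ToyRequest` class, and both method names are hypothetical, since the PR text does not show the real layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefillStats:
    # Hypothetical fields; the real PrefillStats layout is not shown in the PR text.
    num_prompt_tokens: int = 0
    num_cached_tokens: int = 0

    @property
    def num_computed_tokens(self) -> int:
        return self.num_prompt_tokens - self.num_cached_tokens


class ToyRequest:
    def __init__(self, num_prompt_tokens: int) -> None:
        self.num_prompt_tokens = num_prompt_tokens
        self.prefill_stats: Optional[PrefillStats] = None
        self.prefill_reported = False

    def schedule_prefill(self, num_cached_tokens: int) -> None:
        """(Re)start tracking; stats from a preempted prefill are overwritten."""
        if not self.prefill_reported:
            self.prefill_stats = PrefillStats(self.num_prompt_tokens, num_cached_tokens)

    def take_prefill_stats(self) -> Optional[PrefillStats]:
        """Return the stats once, with the first output, then stop tracking."""
        stats, self.prefill_stats = self.prefill_stats, None
        if stats is not None:
            self.prefill_reported = True
        return stats


req = ToyRequest(num_prompt_tokens=100)
req.schedule_prefill(num_cached_tokens=0)    # scheduled, then preempted before finishing
req.schedule_prefill(num_cached_tokens=64)   # rescheduled with a prefix-cache hit
first = req.take_prefill_stats()             # attached to the first output
req.schedule_prefill(num_cached_tokens=100)  # later recompute: no longer tracked
second = req.take_prefill_stats()
print(first, second)
```

Because the stats are handed over exactly once, a later recompute cannot produce a second, inconsistent report for the same request, which is one plausible way to avoid decrementing a counter.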

Changed files

tests/v1/core/test_async_scheduler.py (modified, +18/-11)
tests/v1/engine/test_output_processor.py (modified, +10/-4)
tests/v1/engine/utils.py (modified, +27/-6)
tests/v1/kv_connector/unit/test_invalid_blocks_correctness.py (modified, +59/-0)
tests/v1/metrics/test_stats.py (modified, +62/-29)
vllm/v1/core/sched/scheduler.py (modified, +16/-12)
vllm/v1/engine/__init__.py (modified, +4/-5)
vllm/v1/engine/output_processor.py (modified, +7/-3)
vllm/v1/metrics/stats.py (modified, +66/-24)
vllm/v1/request.py (modified, +14/-5)

PR #34789: [Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop

Repository: vllm-project/vllm
Author: scyyh11
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/34789

Description (problem / solution / changelog)

Purpose

Fix event loop blocking caused by multimodal request preprocessing (base64 decoding, image transforms, HF processor operations) and chat template rendering. Under high concurrency, these synchronous CPU-bound operations block the asyncio event loop, causing /health, /v1/models, and /metrics endpoints to become unresponsive (P95 latency >200ms, with spikes over 1s).

Changes:

- Add a shared ThreadPoolExecutor on BaseRenderer (size controlled by --preprocessing-thread-pool-workers, default 1)
- Always offload multimodal preprocessing to the shared thread pool to keep the event loop responsive
- Wrap chat template rendering in HfRenderer, MistralRenderer, DeepseekV32Renderer, and Grok2Renderer with the shared executor via make_async
- Consolidate MistralRenderer's separate ThreadPoolExecutor into the shared one
- Serialize clear_mm_cache through the shared executor to avoid races with concurrent process_inputs on the mm_processor_cache
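
The general technique behind these changes, moving synchronous work onto a shared thread pool so the asyncio event loop stays free, can be sketched with the standard library alone. This is not the PR's implementation: `blocking_preprocess`, `preprocess`, and `health` are hypothetical stand-ins, and `loop.run_in_executor` is used in place of vLLM's `make_async` helper.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# One shared single-worker pool, mirroring the PR's default of one worker.
_executor = ThreadPoolExecutor(max_workers=1)


def blocking_preprocess(payload: str) -> str:
    """Stand-in for synchronous work such as image decoding or template rendering."""
    time.sleep(0.2)
    return payload.upper()


async def preprocess(payload: str) -> str:
    loop = asyncio.get_running_loop()
    # The event loop can serve other coroutines while the worker thread runs.
    return await loop.run_in_executor(_executor, blocking_preprocess, payload)


async def health() -> float:
    """Pretend /health handler: returns how long it waited to run, in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0)
    return time.perf_counter() - start


async def main() -> tuple[str, float]:
    task = asyncio.create_task(preprocess("image-bytes"))
    await asyncio.sleep(0.01)  # let the preprocessing start
    latency = await health()   # runs while the worker thread is still busy
    return await task, latency


result, health_latency = asyncio.run(main())
print(result, f"{health_latency * 1000:.2f} ms")
```

With a single worker the preprocessing calls are still serialized; the benefit is only that the loop is not blocked while one of them runs, which matches the PR's observation that extra workers add little.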

Test Plan

Benchmarked on 1x NVIDIA A100-SXM4-80GB using vllm bench serve with --request-rate 20 --num-prompts 200 and a custom high-concurrency benchmark with PaddleOCR-VL-1.5.

Tests performed:

vllm bench serve with Llama-3.1-8B-Instruct (text-only, --request-rate 20 --num-prompts 200)
vllm bench serve with Qwen2.5-VL-7B-Instruct (multimodal, --request-rate 20 --num-prompts 200)
Custom high-concurrency benchmark with PaddleOCR-VL-1.5 (500 real OmniDocBench images, 300 concurrency)
--preprocessing-thread-pool-workers comparison (1 vs 2 vs 4) with PaddleOCR-VL-1.5

Test Results

1. Text-Only (`meta-llama/Llama-3.1-8B-Instruct`)

vllm bench serve --request-rate 20 --num-prompts 200

| Metric | This PR | Main | Diff |
| --- | --- | --- | --- |
| Throughput (req/s) | 14.77 | 14.80 | -0.2% |
| Output tok/s | 1,889.97 | 1,893.27 | -0.2% |
| Mean TTFT (ms) | 278.02 | 273.71 | +1.6% |
| P99 TTFT (ms) | 518.97 | 525.17 | -1.2% |
| Mean TPOT (ms) | 54.14 | 53.79 | +0.7% |

No regression. All metrics within noise.

2. Multimodal (`Qwen/Qwen2.5-VL-7B-Instruct`)

vllm bench serve --request-rate 20 --num-prompts 200 --backend openai-chat --dataset-name random-mm

| Metric | This PR | Main | Diff |
| --- | --- | --- | --- |
| Throughput (req/s) | 6.74 | 6.73 | +0.1% |
| Output tok/s | 862.46 | 861.79 | +0.1% |
| Mean TTFT (ms) | 7,489.43 | 7,483.03 | +0.1% |
| P99 TTFT (ms) | 17,005.28 | 16,974.36 | +0.2% |
| Mean TPOT (ms) | 125.91 | 126.07 | -0.1% |

No regression. All metrics within noise.

3. PaddleOCR-VL-1.5 High Concurrency (500 prompts, 300 concurrency)

Custom benchmark with real OmniDocBench document images. --max-num-batched-tokens 131072 --no-enable-prefix-caching --mm-processor-cache-gb 0 --gpu-memory-utilization 0.5

| Metric | This PR | Main | Diff |
| --- | --- | --- | --- |
| Throughput (req/s) | 5.88 | 5.67 | +3.7% |
| Token throughput (tok/s) | 4,712.97 | 4,504.46 | +4.6% |
| TTFT mean (ms) | 31,117.71 | 33,058.79 | -5.9% |
| TTFT P99 (ms) | 47,273.22 | 51,613.86 | -8.4% |
| /health median (ms) | 0.70 | 222.44 | 318x better |
| /health P99 (ms) | 18.88 | 1,641.53 | 87x better |

Event loop stays fully responsive under high multimodal concurrency. The /health endpoint drops from 222ms to <1ms median.
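
For readers who want to check how the Diff figures for this benchmark are derived, the arithmetic can be reproduced as follows. The helper names `pct_diff` and `speedup` are illustrative only; the input numbers are the ones reported for this benchmark.

```python
def pct_diff(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new / old - 1.0) * 100.0


def speedup(old: float, new: float) -> float:
    """How many times smaller `new` is than `old`."""
    return old / new


throughput_diff = round(pct_diff(5.88, 5.67), 1)          # This PR vs. Main, req/s
ttft_mean_diff = round(pct_diff(31117.71, 33058.79), 1)   # This PR vs. Main, ms
health_median_speedup = round(speedup(222.44, 0.70))      # Main vs. This PR, ms
health_p99_speedup = round(speedup(1641.53, 18.88))       # Main vs. This PR, ms
print(throughput_diff, ttft_mean_diff, health_median_speedup, health_p99_speedup)
# prints: 3.7 -5.9 318 87
```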

4. `--preprocessing-thread-pool-workers` Comparison (PaddleOCR-VL-1.5, 500 prompts, 300 concurrency)

| Metric | workers=1 | workers=2 | workers=4 |
| --- | --- | --- | --- |
| Throughput (req/s) | 5.86 | 5.92 | 5.87 |
| Token throughput (tok/s) | 4,624.78 | 4,559.03 | 4,554.05 |
| TTFT mean (ms) | 31,319 | 30,974 | 31,294 |
| TTFT P99 (ms) | 47,536 | 47,084 | 47,830 |
| /health median (ms) | 0.67 | 0.62 | 0.67 |
| /health P99 (ms) | 17.20 | 17.47 | 20.13 |

All worker counts perform within noise of each other. This is consistent with https://github.com/vllm-project/vllm/pull/34789#issuecomment-3055653700. The key improvement comes from moving preprocessing off the event loop (so /health stays responsive), not from parallelizing it. The default of workers=1 is sufficient.

Summary

| What | Result |
| --- | --- |
| Event loop liveness (`/health`) | 318x improvement (222ms → 0.7ms median) |
| Request throughput (high concurrency) | +3.7% (5.67 → 5.88 req/s) |
| TTFT (high concurrency) | -5.9% (33.1s → 31.1s mean) |
| Text-only regression | None (-0.2% throughput, within noise) |
| Multimodal regression | None (+0.1% throughput, within noise) |
| "Already borrowed" errors | Zero across all tests |


Changed files

tests/entrypoints/openai/chat_completion/test_chat_error.py (modified, +1/-0)
tests/entrypoints/openai/chat_completion/test_serving_chat.py (modified, +1/-0)
tests/entrypoints/openai/completion/test_completion_error.py (modified, +1/-0)
tests/entrypoints/openai/completion/test_lora_resolvers.py (modified, +1/-0)
tests/renderers/test_completions.py (modified, +1/-0)
tests/renderers/test_mistral.py (modified, +1/-0)
vllm/config/model.py (modified, +4/-0)
vllm/engine/arg_utils.py (modified, +6/-0)
vllm/renderers/base.py (modified, +118/-6)
vllm/renderers/deepseek_v32.py (modified, +18/-4)
vllm/renderers/grok2.py (modified, +18/-4)
vllm/renderers/hf.py (modified, +20/-9)
vllm/renderers/mistral.py (modified, +1/-3)
vllm/utils/async_utils.py (modified, +3/-1)
vllm/v1/engine/async_llm.py (modified, +1/-1)


Summary

tests/v1/engine/test_abort_final_step.py::test_abort_during_final_step[False] fails intermittently in CI, causing the Engine (1 GPU) job to fail.

Failure Signature

AssertionError: Expected at least 1 captured finish status, got 0. 
File content: ['INIT:WORKER', 'INIT:SCHEDULER']
tests/v1/engine/test_abort_final_step.py:287

The test expects the KV connector's request_finished() method to be called with FINISHED_ABORTED status, but it's never invoked despite successful initialization.

Key Observations

test_abort_during_final_step[False] (async_scheduling=False) - fails intermittently
test_abort_during_final_step[True] (async_scheduling=True) - passes consistently

This indicates a race condition specific to synchronous scheduling mode.

Flakiness Pattern

Buildkite Analytics shows recent failures
Example failure: Nightly build #58181

Likely Cause

Race condition where the request is completed/freed before abort processing:

1. Request reaches max_tokens=1 and is marked FINISHED_LENGTH_CAPPED
2. Request is freed or removed from tracking
3. When abort is processed, finish_requests() skips it (already finished or not found)
4. Connector never notified

The async scheduling path likely has timing differences that avoid this race.
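
The suspected ordering problem can be reproduced in a toy model. This is only an illustration of the hypothesis above, not vLLM's scheduler: `ToyScheduler`, its methods, and `connector_log` (standing in for the file the test inspects) are all hypothetical.

```python
from enum import Enum, auto


class Status(Enum):
    RUNNING = auto()
    FINISHED_LENGTH_CAPPED = auto()
    FINISHED_ABORTED = auto()


class ToyScheduler:
    def __init__(self) -> None:
        self.requests: dict[str, Status] = {}
        self.connector_log: list[str] = []  # stands in for the file the test reads

    def add(self, req_id: str) -> None:
        self.requests[req_id] = Status.RUNNING

    def step(self, req_id: str) -> None:
        """With max_tokens=1, the first step finishes the request and frees it."""
        if req_id in self.requests:
            self.requests[req_id] = Status.FINISHED_LENGTH_CAPPED
            del self.requests[req_id]

    def finish_requests(self, req_id: str, status: Status) -> None:
        if req_id not in self.requests:
            return  # already freed: the abort is skipped, connector not notified
        del self.requests[req_id]
        self.connector_log.append(f"{req_id}:{status.name}")


def run(abort_first: bool) -> list[str]:
    sched = ToyScheduler()
    sched.add("r0")
    if abort_first:
        sched.finish_requests("r0", Status.FINISHED_ABORTED)
        sched.step("r0")
    else:
        sched.step("r0")
        sched.finish_requests("r0", Status.FINISHED_ABORTED)
    return sched.connector_log


print(run(abort_first=True))   # abort wins the race
print(run(abort_first=False))  # step wins the race
```

When the step wins, the log stays empty, which mirrors the "got 0" captured finish statuses in the failure signature.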

Related

PR #29987: Original abort handling fix
Test file: tests/v1/engine/test_abort_final_step.py
Code: vllm/v1/core/sched/scheduler.py

extent analysis

Fix Plan

To address the intermittent test failure caused by a race condition in synchronous scheduling mode, we need to ensure that the request_finished() method is called with FINISHED_ABORTED status even when the request is completed or freed before abort processing.

Step-by-Step Solution:

1. **Modify `finish_requests()`** in `vllm/v1/core/sched/scheduler.py` so that requests already marked `FINISHED_LENGTH_CAPPED` still trigger a KV connector notification.
2. **Add a check** for `FINISHED_LENGTH_CAPPED` requests in `finish_requests()`. The fragment is a sketch of the plan, not vLLM's actual code; the attribute and method names come from the plan and have not been checked against the scheduler:
   ```python
   def finish_requests(self, status):
       # ... existing code ...
       for request in self.requests:
           if request.status == FINISHED_LENGTH_CAPPED:
               # Notify the KV connector with FINISHED_ABORTED status
               self.kv_connector.request_finished(request.id, FINISHED_ABORTED)
       # ... existing code ...
   ```
3. **Update the test case** in `tests/v1/engine/test_abort_final_step.py` to account for the new behavior (`wait_for_request_status` is a hypothetical helper from the plan, not an existing fixture):
   ```python
   def test_abort_during_final_step(self, async_scheduling):
       # ... existing code ...
       if not async_scheduling:
           # Wait for the request to be marked as FINISHED_LENGTH_CAPPED
           self.wait_for_request_status(FINISHED_LENGTH_CAPPED)
       # ... existing code ...
   ```

Verification

To verify that the fix worked, run the test_abort_during_final_step test case with async_scheduling=False and check that the request_finished() method is called with FINISHED_ABORTED status.

Extra Tips

Make sure to update the test case to account for the new behavior and avoid false positives.
Consider adding additional logging or debugging statements to help identify and diagnose similar issues in the future.
Review the code changes and test updates to ensure they do not introduce any new regressions or issues.
