vllm - ✅(Solved) Fix [Bug]: DPEngineCoreProc may re-arm DP wave while paused (START_DP_WAVE ignores pause state), causing collective timeout after pause_generation + collective_rpc [1 pull requests, 1 participants]

junjzhang · 2026-03-10T05:42:37Z


vllm-project/vllm#36594•Fetched 2026-04-08 00:36:07

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix DP wave race condition re-arming engine while paused (https://github.com/vllm-project/vllm/pull/36608)

PR fix notes

PR #36608: [Bugfix] Fix DP wave race condition re-arming engine while paused

Repository: vllm-project/vllm
Author: AjAnubolu
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36608

Description (problem / solution / changelog)

Gate START_DP_WAVE on scheduler pause state to prevent race with collective_rpc.

Closes #36594

Changed files

vllm/v1/engine/core.py (modified, +1/-1)


Title

[Bug]: DPEngineCoreProc may re-arm DP wave while paused (START_DP_WAVE ignores pause state), causing collective timeout after pause_generation + collective_rpc

Your current environment

The output of python collect_env.py:

OS                           : Ubuntu 24.04.3 LTS (x86_64)
PyTorch version              : 2.10.0a0+gitb0eb5f7
CUDA used to build PyTorch   : 12.9
Python version               : 3.12
Is CUDA available            : True
CUDA runtime version         : 12.9.86
GPU models and configuration :
GPU 0-7: NVIDIA H20
Nvidia driver version        : 570.124.06
vLLM Version                 : v0.16.1rc0

Model

MoE + DPEP setup (data_parallel_size > 1) using coordinator wave handling (internal DP load balancing / DPLBAsyncMPClient).

Bug description

With DPEP, calling:

1. pause_generation(mode="abort")
2. then collective_rpc(...) for online weight update

can intermittently fail with an NCCL ALLREDUCE timeout (600s).

Observed behavior: one engine enters the utility collective_rpc, while a peer engine can still be in dummy-batch ALLREDUCE, leading to collective mismatch and timeout.
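
The mismatch described above can be illustrated with a small toy model. This is only a sketch: it uses `threading.Barrier` as a stand-in for a collective, with no NCCL or vLLM involved, and the `run` helper and the collective names are hypothetical. Each "rank" blocks until every rank reaches the same barrier, so ranks that enter different collectives all time out.

```python
import threading


def run(ops: list[str], timeout_s: float = 0.2) -> dict[int, str]:
    """Run one thread per rank; rank i enters the collective named ops[i].

    Each collective name gets its own barrier sized to the full world,
    so a collective completes only if every rank enters that same one.
    """
    barriers = {name: threading.Barrier(len(ops)) for name in set(ops)}
    results: dict[int, str] = {}

    def rank(rank_id: int, op: str) -> None:
        try:
            barriers[op].wait(timeout=timeout_s)
            results[rank_id] = "completed"
        except threading.BrokenBarrierError:
            # Raised when the wait times out because a peer never arrived.
            results[rank_id] = "timed out"

    threads = [
        threading.Thread(target=rank, args=(i, op)) for i, op in enumerate(ops)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Both ranks enter the utility collective: it completes.
print(run(["collective_rpc", "collective_rpc"]))
# One rank is still in the dummy-batch collective: neither side completes.
print(run(["collective_rpc", "dummy_allreduce"]))
```

In the real system the timeout is the 600 s NCCL watchdog rather than a 0.2 s barrier timeout, which is why the failure is so costly when it occurs.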

Root cause

In DPEngineCoreProc._handle_client_request, START_DP_WAVE can set engines_running = True even when the scheduler is paused:

# vllm/v1/engine/core.py — DPEngineCoreProc._handle_client_request
if request_type == EngineCoreRequestType.START_DP_WAVE:
    new_wave, exclude_eng_index = request
    if exclude_eng_index != self.engine_index and new_wave >= self.current_wave:
        self.current_wave = new_wave
        if not self.engines_running:          # no pause-state check
            self.engines_running = True       # re-arms dummy-batch loop

This races with pause completion:

- pause_scheduler() may return immediately when the engine appears idle (has_work()==False), i.e. without taking the Future wait path.
- A late START_DP_WAVE from the coordinator can still arrive after the pause has returned.
- The peer engine re-enters the running loop and executes a dummy-batch ALLREDUCE.
- The next collective_rpc is issued while ranks are in different collectives, which leads to the timeout.

Race timeline

Engine 0 (idle)          Coordinator            Engine 1 (idle)
     |                       |                       |
 late req -> FIRST_REQ ----->|---- START_DP_WAVE -->| (queued)
     |                       |                       |
 pause_scheduler ------------|---------------------> pause_scheduler
 returns (idle fast path)                            returns (idle fast path)
     |                       |                       |
 pause_generation returns    |                 handles START_DP_WAVE
     |                       |                 engines_running=True
     |                       |                 -> dummy ALLREDUCE
     |                       |                       |
 collective_rpc ---------------------------------> rank mismatch
                                              -> NCCL timeout

Suggested fix

In DPEngineCoreProc._handle_client_request, gate the START_DP_WAVE re-arm on the scheduler pause state:

if not self.engines_running and not self.is_scheduler_paused():
    logger.debug("EngineCore starting idle loop for wave %d.", new_wave)
    self.engines_running = True

Why this should be safe:

1. Normal unpaused behavior is unchanged.
2. During a pause, the wave index (current_wave) is still updated, but the loop is not re-armed.
3. On resume, resume_scheduler() already sends start_wave when unfinished requests exist.

Note: this fix targets the pause/collective race. Request-admission semantics for late requests during a pause are a separate concern.
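
To make the safety argument concrete, here is a minimal toy model of the state involved. It is a sketch under stated assumptions, not vLLM code: `ToyEngine`, `gate_on_pause`, `has_unfinished_requests`, and `replay` are hypothetical names, and `resume()` sets `engines_running` directly as a simplification of resume_scheduler() sending start_wave. It replays the problematic ordering (pause returns, then a late START_DP_WAVE is handled) with and without the gate.

```python
from dataclasses import dataclass


@dataclass
class ToyEngine:
    """Hypothetical stand-in for the state touched by START_DP_WAVE."""

    gate_on_pause: bool  # True models the suggested fix
    paused: bool = False
    engines_running: bool = False
    current_wave: int = 0
    has_unfinished_requests: bool = False

    def pause(self) -> None:
        # Models the idle fast path: pause takes effect immediately.
        self.paused = True

    def handle_start_dp_wave(self, new_wave: int) -> None:
        if new_wave >= self.current_wave:
            self.current_wave = new_wave  # wave index is always tracked
            blocked = self.gate_on_pause and self.paused
            if not self.engines_running and not blocked:
                self.engines_running = True  # re-arms the dummy-batch loop

    def resume(self) -> None:
        self.paused = False
        # Simplification: restart the loop only if work is pending.
        if self.has_unfinished_requests:
            self.engines_running = True


def replay(gate_on_pause: bool) -> ToyEngine:
    """Pause first, then deliver a START_DP_WAVE that was already queued."""
    engine = ToyEngine(gate_on_pause=gate_on_pause)
    engine.pause()
    engine.handle_start_dp_wave(new_wave=1)
    return engine


buggy = replay(gate_on_pause=False)
fixed = replay(gate_on_pause=True)
print(buggy.engines_running, buggy.current_wave)  # True 1
print(fixed.engines_running, fixed.current_wave)  # False 1

# On resume with pending work, the gated engine starts running again.
fixed.has_unfinished_requests = True
fixed.resume()
print(fixed.engines_running)  # True
```

In this model the gated engine still records wave 1 while paused but stays out of the running loop until resume, which mirrors points 2 and 3 above.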

How to reproduce

1. Start DPEP MoE serving with data_parallel_size=2 and internal DP LB (DPLBAsyncMPClient).
2. Keep inference traffic running.
3. Trigger pause_generation(mode="abort") near a wave boundary or during in-flight transitions.
4. Immediately trigger an all-worker utility collective_rpc (e.g., online weight receive/update).
5. Intermittently observe an NCCL ALLREDUCE timeout (Timeout(ms)=600000).

The race is timing-dependent: it requires a late coordinator START_DP_WAVE that arrives after the pause has returned but before the utility collective is issued.

Relevant logs

[rank7]:[E ProcessGroupNCCL.cpp:N] [Rank 7] Watchdog caught collective operation timeout:
WorkNCCL(SeqNum=N, OpType=ALLREDUCE, ..., Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
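
When scanning long logs for this failure, a small parser can help. The sketch below assumes the watchdog message has the shape shown in the excerpt above (real messages carry additional fields where the excerpt shows "..."); the `find_watchdog_timeouts` helper is a hypothetical name, not part of vLLM or PyTorch.

```python
import re

# Matches the watchdog message whether or not it is wrapped onto two lines.
WATCHDOG_TIMEOUT = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout:\s*"
    r"WorkNCCL\(.*?OpType=(?P<op>\w+).*?Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
    r" ran for (?P<elapsed_ms>\d+) milliseconds",
    re.DOTALL,
)


def find_watchdog_timeouts(log_text: str) -> list[dict]:
    """Return one record per watchdog timeout found in the log text."""
    return [
        {
            "rank": int(m["rank"]),
            "op": m["op"],
            "timeout_ms": int(m["timeout_ms"]),
            "elapsed_ms": int(m["elapsed_ms"]),
        }
        for m in WATCHDOG_TIMEOUT.finditer(log_text)
    ]


sample = (
    "[rank7]:[E ProcessGroupNCCL.cpp:N] [Rank 7] Watchdog caught collective "
    "operation timeout:\n"
    "WorkNCCL(SeqNum=N, OpType=ALLREDUCE, ..., Timeout(ms)=600000) "
    "ran for 600025 milliseconds before timing out."
)
print(find_watchdog_timeouts(sample))
```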

Before submitting a new issue...

- [x] I have searched existing issues and confirmed this is not a duplicate.
- [x] I have verified the bug exists on the latest vLLM version.

extent analysis

Fix Plan

To resolve the issue, modify DPEngineCoreProc._handle_client_request so that START_DP_WAVE does not re-arm the dummy-batch loop while the scheduler is paused:

1. In the START_DP_WAVE branch, keep updating self.current_wave as before.
2. Set self.engines_running to True only when the engine is not already running and self.is_scheduler_paused() returns False.

Example code:

if request_type == EngineCoreRequestType.START_DP_WAVE:
    new_wave, exclude_eng_index = request
    if exclude_eng_index != self.engine_index and new_wave >= self.current_wave:
        self.current_wave = new_wave
        if not self.engines_running and not self.is_scheduler_paused():
            logger.debug("EngineCore starting idle loop for wave %d.", new_wave)
            self.engines_running = True

Verification

To verify the fix, repeat the reproduction steps with the patch applied:

1. Start DPEP MoE serving with data_parallel_size=2 and internal DP LB (DPLBAsyncMPClient).
2. Keep inference traffic running.
3. Trigger pause_generation(mode="abort") near a wave boundary or during in-flight transitions.
4. Immediately trigger an all-worker utility collective_rpc (e.g., online weight receive/update).
5. Monitor the logs for NCCL ALLREDUCE timeout errors.

If the fix works, the NCCL ALLREDUCE timeout should no longer appear. Because the race is intermittent, repeat the sequence several times before drawing a conclusion.
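
Since a single clean run proves little for an intermittent race, the pause and collective sequence can be looped. The following is a minimal sketch of such a driver, assuming the caller supplies three async callables: `pause` and `rpc` would wrap pause_generation(mode="abort") and collective_rpc(...) from the issue, while `resume` is a hypothetical placeholder for whatever resumes generation in your setup. The stubs exist only to make the example runnable. Note that a genuinely hung NCCL collective may not be cancellable from asyncio, so in practice the watchdog log remains the authoritative signal.

```python
import asyncio
from typing import Awaitable, Callable

AsyncStep = Callable[[], Awaitable[None]]


async def stress_pause_then_rpc(
    pause: AsyncStep,
    rpc: AsyncStep,
    resume: AsyncStep,
    iterations: int,
    rpc_timeout_s: float,
) -> int:
    """Repeat pause -> collective -> resume; count collectives that time out."""
    timeouts = 0
    for _ in range(iterations):
        await pause()
        try:
            await asyncio.wait_for(rpc(), timeout=rpc_timeout_s)
        except asyncio.TimeoutError:
            timeouts += 1
        await resume()
    return timeouts


# Stubs standing in for the real engine calls (assumptions for illustration).
async def noop() -> None:
    return None


async def hung_rpc() -> None:
    await asyncio.sleep(1.0)  # simulates a collective that never completes in time


healthy = asyncio.run(stress_pause_then_rpc(noop, noop, noop, 3, 0.01))
stuck = asyncio.run(stress_pause_then_rpc(noop, hung_rpc, noop, 3, 0.01))
print(healthy, stuck)
```

A non-zero count from the driver on a patched build would suggest the race, or a different hang, is still reachable.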

Extra Tips

- Test both affected paths: normal unpaused wave start-up, and resume after a pause with unfinished requests, to confirm the gate does not leave engines idle.
- Consider adding debug logging for a START_DP_WAVE that arrives while the scheduler is paused, so that skipped re-arms are visible when diagnosing future issues.
