vllm (solved): [Bug]: DPEngineCoreProc may re-arm DP wave while paused (START_DP_WAVE ignores pause state), causing collective timeout after pause_generation + collective_rpc [1 pull request, 1 participant]

Issue: vllm-project/vllm#36594 (fetched 2026-04-08 00:36:07)

Root Cause

In DPEngineCoreProc._handle_client_request, START_DP_WAVE can set engines_running = True even when scheduler is paused:

# vllm/v1/engine/core.py — DPEngineCoreProc._handle_client_request
if request_type == EngineCoreRequestType.START_DP_WAVE:
    new_wave, exclude_eng_index = request
    if exclude_eng_index != self.engine_index and new_wave >= self.current_wave:
        self.current_wave = new_wave
        if not self.engines_running:          # no pause-state check
            self.engines_running = True       # re-arms dummy-batch loop

This races with pause completion:

  • pause_scheduler() may return immediately when the engine appears idle (has_work() == False), so there is no Future to wait on.
  • A late START_DP_WAVE from the coordinator can still arrive after pause has returned.
  • The peer engine then re-enters the running loop and executes a dummy-batch ALLREDUCE.
  • The next collective_rpc is issued while ranks are in different collectives, which ends in a timeout.

Fix Action

Fixed

PR fix notes

PR #36608: [Bugfix] Fix DP wave race condition re-arming engine while paused

Description (problem / solution / changelog)

Gate START_DP_WAVE on scheduler pause state to prevent race with collective_rpc.

Closes #36594

Changed files

  • vllm/v1/engine/core.py (modified, +1/-1)


Title

[Bug]: DPEngineCoreProc may re-arm DP wave while paused (START_DP_WAVE ignores pause state), causing collective timeout after pause_generation + collective_rpc

Your current environment

The output of python collect_env.py:

OS                           : Ubuntu 24.04.3 LTS (x86_64)
PyTorch version              : 2.10.0a0+gitb0eb5f7
CUDA used to build PyTorch   : 12.9
Python version               : 3.12
Is CUDA available            : True
CUDA runtime version         : 12.9.86
GPU models and configuration :
GPU 0-7: NVIDIA H20
Nvidia driver version        : 570.124.06
vLLM Version                 : v0.16.1rc0

Model

MoE + DPEP setup (data_parallel_size > 1) using coordinator wave handling (internal DP load balancing / DPLBAsyncMPClient).

Bug description

With DPEP, calling:

  1. pause_generation(mode="abort")
  2. then collective_rpc(...) for online weight update

can intermittently fail with NCCL ALLREDUCE timeout (600s).

Observed behavior: one engine enters the utility collective_rpc, while a peer engine can still be in dummy-batch ALLREDUCE, leading to collective mismatch and timeout.
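
To make the collective mismatch concrete, here is a minimal sketch that compares the order in which each rank issues collectives and reports the first step where they disagree. The helper first_mismatch and the operation labels are hypothetical names for illustration, not vLLM or NCCL APIs.

```python
from itertools import zip_longest


def first_mismatch(ops_by_rank):
    """Return (step, {rank: op}) for the first step where ranks disagree, else None."""
    ranks = sorted(ops_by_rank)
    sequences = (ops_by_rank[r] for r in ranks)
    for step, ops in enumerate(zip_longest(*sequences)):
        if len(set(ops)) > 1:
            return step, dict(zip(ranks, ops))
    return None


# Hypothetical per-rank call order after pause_generation returns.
ops = {
    0: ["collective_rpc"],                     # engine 0 goes straight to the utility call
    1: ["dummy_allreduce", "collective_rpc"],  # engine 1 was re-armed first
}
print(first_mismatch(ops))
```

Each rank blocks in its own collective waiting for peers that are in a different one, so neither side completes until the watchdog timeout fires.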

Root cause

In DPEngineCoreProc._handle_client_request, START_DP_WAVE can set engines_running = True even when scheduler is paused:

# vllm/v1/engine/core.py — DPEngineCoreProc._handle_client_request
if request_type == EngineCoreRequestType.START_DP_WAVE:
    new_wave, exclude_eng_index = request
    if exclude_eng_index != self.engine_index and new_wave >= self.current_wave:
        self.current_wave = new_wave
        if not self.engines_running:          # no pause-state check
            self.engines_running = True       # re-arms dummy-batch loop

This races with pause completion:

  • pause_scheduler() may return immediately when the engine appears idle (has_work() == False), so there is no Future to wait on.
  • A late START_DP_WAVE from the coordinator can still arrive after pause has returned.
  • The peer engine then re-enters the running loop and executes a dummy-batch ALLREDUCE.
  • The next collective_rpc is issued while ranks are in different collectives, which ends in a timeout.

Race timeline

Engine 0 (idle)          Coordinator            Engine 1 (idle)
     |                       |                       |
 late req -> FIRST_REQ ----->|---- START_DP_WAVE -->| (queued)
     |                       |                       |
 pause_scheduler ------------|---------------------> pause_scheduler
 returns (idle fast path)                            returns (idle fast path)
     |                       |                       |
 pause_generation returns    |                 handles START_DP_WAVE
     |                       |                 engines_running=True
     |                       |                 -> dummy ALLREDUCE
     |                       |                       |
 collective_rpc ---------------------------------> rank mismatch
                                              -> NCCL timeout
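
The timeline above can be replayed with a toy model. This is only a sketch: ToyEngine is a hypothetical stand-in, not vLLM's DPEngineCoreProc, and its step() and collective_rpc() methods merely record which operation an engine would issue. Only the handler body mirrors the ungated code quoted in the root cause.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Toy stand-in for one DP engine (hypothetical, not vLLM code)."""
    engine_index: int
    current_wave: int = 0
    engines_running: bool = False
    paused: bool = False
    issued: list = field(default_factory=list)

    def pause_scheduler(self):
        # Idle fast path: nothing in flight, so pause returns immediately.
        self.paused = True

    def handle_start_dp_wave(self, new_wave, exclude_eng_index):
        # Mirrors the ungated handler: the pause state is never consulted.
        if exclude_eng_index != self.engine_index and new_wave >= self.current_wave:
            self.current_wave = new_wave
            if not self.engines_running:
                self.engines_running = True

    def step(self):
        # A running engine with no real work issues a dummy-batch all-reduce.
        if self.engines_running:
            self.issued.append("dummy_allreduce")

    def collective_rpc(self):
        self.issued.append("collective_rpc")


e0, e1 = ToyEngine(0), ToyEngine(1)
queued = (1, 0)                   # START_DP_WAVE(new_wave=1), engine 0 excluded
e0.pause_scheduler()
e1.pause_scheduler()
e1.handle_start_dp_wave(*queued)  # late message handled after pause returned
e1.step()                         # engine 1 is re-armed despite being paused
e0.collective_rpc()
print(e0.issued, e1.issued)
```

In this replay engine 0 records collective_rpc while engine 1 records dummy_allreduce, which is the rank mismatch shown at the bottom of the timeline.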

Suggested fix

In DPEngineCoreProc._handle_client_request, gate START_DP_WAVE re-arm while paused:

if not self.engines_running and not self.is_scheduler_paused():
    logger.debug("EngineCore starting idle loop for wave %d.", new_wave)
    self.engines_running = True

Why this should be safe:

  1. Normal unpaused behavior is unchanged.
  2. During pause, wave index (current_wave) is still updated, but loop is not re-armed.
  3. On resume, resume_scheduler() already sends start_wave when unfinished requests exist.

Note: this fix targets pause/collective race. Request-admission semantics for late requests during pause are a separate concern.
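
The three safety points above can be checked against a small model of the gated handler. This is a sketch under stated assumptions: GatedToyEngine and its has_unfinished flag are hypothetical, and resume_scheduler() here sets engines_running directly as a simplification of "sends start_wave when unfinished requests exist".

```python
class GatedToyEngine:
    """Toy model of the gated START_DP_WAVE handler (hypothetical, not vLLM code)."""

    def __init__(self, engine_index):
        self.engine_index = engine_index
        self.current_wave = 0
        self.engines_running = False
        self.paused = False
        self.has_unfinished = False

    def is_scheduler_paused(self):
        return self.paused

    def handle_start_dp_wave(self, new_wave, exclude_eng_index):
        if exclude_eng_index != self.engine_index and new_wave >= self.current_wave:
            self.current_wave = new_wave  # wave index is always updated
            if not self.engines_running and not self.is_scheduler_paused():
                self.engines_running = True  # re-arm only when not paused

    def resume_scheduler(self):
        self.paused = False
        # Simplification: model the start_wave sent on resume as a direct re-arm.
        if self.has_unfinished:
            self.engines_running = True


def demo():
    eng = GatedToyEngine(engine_index=1)
    eng.paused = True
    eng.handle_start_dp_wave(new_wave=3, exclude_eng_index=0)
    during_pause = (eng.current_wave, eng.engines_running)
    eng.has_unfinished = True
    eng.resume_scheduler()
    return during_pause, eng.engines_running


unpaused = GatedToyEngine(engine_index=1)
unpaused.handle_start_dp_wave(new_wave=1, exclude_eng_index=0)
print(unpaused.engines_running, demo())
```

The unpaused engine re-arms as before, the paused engine records the new wave without re-arming, and the loop restarts on resume when unfinished work exists.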

How to reproduce

  1. Start DPEP MoE serving with data_parallel_size=2 and internal DP LB (DPLBAsyncMPClient).
  2. Keep inference traffic running.
  3. Trigger pause_generation(mode="abort") near wave boundary / in-flight transitions.
  4. Immediately trigger all-worker utility collective_rpc (e.g., online weight receive/update).
  5. Intermittently see NCCL ALLREDUCE timeout (Timeout(ms)=600000).

Race is timing dependent: it requires a late coordinator START_DP_WAVE arriving after pause has returned but before utility collective is issued.

Relevant logs

[rank7]:[E ProcessGroupNCCL.cpp:N] [Rank 7] Watchdog caught collective operation timeout:
WorkNCCL(SeqNum=N, OpType=ALLREDUCE, ..., Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
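
When checking logs for this failure, a small parser can pull the rank, operation type, and timings out of watchdog lines. The sketch below assumes the line shape shown above; the regular expression and the function name find_watchdog_timeouts are illustrative and may need adjusting for other PyTorch versions.

```python
import re

# Matches the watchdog message whether or not it wraps before "WorkNCCL(".
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout:\s*"
    r"WorkNCCL\(.*?OpType=(?P<op>\w+).*?Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
    r" ran for (?P<elapsed_ms>\d+) milliseconds",
    re.DOTALL,
)


def find_watchdog_timeouts(log_text):
    """Return one dict per watchdog timeout found in log_text."""
    results = []
    for m in WATCHDOG_RE.finditer(log_text):
        results.append({
            "rank": int(m.group("rank")),
            "op": m.group("op"),
            "timeout_ms": int(m.group("timeout_ms")),
            "elapsed_ms": int(m.group("elapsed_ms")),
        })
    return results


sample = (
    "[rank7]:[E ProcessGroupNCCL.cpp:N] [Rank 7] Watchdog caught collective "
    "operation timeout:\n"
    "WorkNCCL(SeqNum=N, OpType=ALLREDUCE, ..., Timeout(ms)=600000) "
    "ran for 600025 milliseconds before timing out.\n"
)
print(find_watchdog_timeouts(sample))
```

An ALLREDUCE entry with timeout_ms=600000 shortly after a pause_generation plus collective_rpc sequence is the signature described in this report.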

Before submitting a new issue...

  • I have searched existing issues and confirmed this is not a duplicate.
  • I have verified the bug exists on the latest vLLM version.

extent analysis

Fix Plan

To resolve the issue, modify DPEngineCoreProc._handle_client_request so that START_DP_WAVE does not re-arm the dummy-batch loop while the scheduler is paused:

  • Check the scheduler pause state before re-arming the loop.
  • Set self.engines_running to True only when the scheduler is not paused; current_wave is still updated in either case.

Example code:

if request_type == EngineCoreRequestType.START_DP_WAVE:
    new_wave, exclude_eng_index = request
    if exclude_eng_index != self.engine_index and new_wave >= self.current_wave:
        self.current_wave = new_wave
        if not self.engines_running and not self.is_scheduler_paused():
            logger.debug("EngineCore starting idle loop for wave %d.", new_wave)
            self.engines_running = True

Verification

To verify the fix, follow these steps:

  • Start DPEP MoE serving with data_parallel_size=2 and internal DP LB (DPLBAsyncMPClient).
  • Keep inference traffic running.
  • Trigger pause_generation(mode="abort") near wave boundary / in-flight transitions.
  • Immediately trigger all-worker utility collective_rpc (e.g., online weight receive/update).
  • Monitor the logs for NCCL ALLREDUCE timeout errors.

If the fix is successful, you should no longer see NCCL ALLREDUCE timeout errors.

Extra Tips

  • Test the fix with and without a pause in progress to confirm that unpaused wave handling is unchanged and that no new hangs appear.
  • Consider logging when a START_DP_WAVE arrives while the scheduler is paused, so a skipped re-arm is visible when debugging related issues.
