vllm - ✅(Solved) Fix [Bug] External LB test_external_lb_dp[4] failing since shutdown timeout PR #34730 [5 pull requests, 3 comments, 2 participants]

GitHub stats
vllm-project/vllm#36624, fetched 2026-04-08 00:35:54
Comments: 3 · Participants: 2 · Timeline events: 11 · Reactions: 0
Timeline (top): cross-referenced ×5, commented ×3, closed ×1, mentioned ×1

test_external_lb_dp.py with api_server_count=4 has been failing consistently in every nightly/daily build since PR #34730 (27066d1b) landed on Mar 6.

The [1] variant (1 API server per DP rank) passes; only [4] (4 API servers per DP rank) fails.

Error Message

EngineCore_DP1: WorkerProc initialization failed due to an exception in a background process.
Worker_DP0_TP0: Parent process exited, terminating worker queues
Worker_DP0_TP0: BrokenPipeError: [Errno 32] Broken pipe
Failed to start server rank 0: Server exited unexpectedly
Exception: Servers failed to start

Root Cause

PR #34730 changes the server startup/shutdown chain:

  • run_multi_api_server(): New SIGTERM/SIGINT signal handler raising SystemExit; close() → shutdown(timeout=...)
  • run_engine_core(): Direct SystemExit signal handler replaced with SignalCallback thread + shutdown_state enum
  • EngineProcessManager: close() → shutdown() with _finalizer.detach() semantics

The [4] variant is more sensitive to timing because it launches 4 API servers per DP rank. During this heavier initialization, EngineCore_DP1's worker fails with BrokenPipeError because the parent process exits before the worker finishes initializing.
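The BrokenPipeError in the log is what a process sees when it writes to a pipe whose reading end has already gone away. The following is a minimal sketch of that mechanism only, not vLLM code: the function name is made up for illustration, and closing the read end stands in for the parent process exiting early.

```python
import errno
import os


def write_after_reader_gone() -> int:
    """Write to a pipe whose read end is closed and return the errno raised."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # stand-in for the parent process exiting
    try:
        # Python ignores SIGPIPE by default, so the failed write surfaces
        # as BrokenPipeError instead of killing the process.
        os.write(write_fd, b"worker ready")
    except BrokenPipeError as exc:
        return exc.errno
    finally:
        os.close(write_fd)
    return 0


if __name__ == "__main__":
    print(write_after_reader_gone() == errno.EPIPE)
```

A worker that is still initializing when its parent exits hits this same error on its next write to the parent, which matches the ordering of the log lines above.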

Fix Action

Fixed

PR fix notes

PR #36628: [Frontend][Core] Revert "Add shutdown timeout" (#34730 and #36270)

Description (problem / solution / changelog)

Distributed Test 4 GPUs is still failing. Testing whether these reverts fix it

Fixes #36624

Changed files

  • tests/entrypoints/openai/test_shutdown.py (modified, +0/-459)
  • tests/entrypoints/test_api_server_process_manager.py (modified, +7/-15)
  • vllm/config/vllm.py (modified, +0/-6)
  • vllm/engine/arg_utils.py (modified, +0/-11)
  • vllm/engine/protocol.py (modified, +0/-5)
  • vllm/entrypoints/cli/serve.py (modified, +6/-42)
  • vllm/entrypoints/launcher.py (modified, +5/-23)
  • vllm/v1/engine/__init__.py (modified, +0/-2)
  • vllm/v1/engine/async_llm.py (modified, +3/-2)
  • vllm/v1/engine/coordinator.py (modified, +2/-4)
  • vllm/v1/engine/core.py (modified, +43/-127)
  • vllm/v1/engine/core_client.py (modified, +12/-12)
  • vllm/v1/engine/utils.py (modified, +5/-34)
  • vllm/v1/utils.py (modified, +12/-19)

PR #36619: [DO NOT MERGE] Test revert "Remove busy loop from idle buffer readers" (#28053 and #36068)

Description (problem / solution / changelog)

Distributed Test 4 GPUs is still failing. Testing whether these reverts fix it

Changed files

  • tests/basic_correctness/test_basic_correctness.py (modified, +2/-0)
  • tests/distributed/test_shm_broadcast.py (modified, +8/-285)
  • vllm/distributed/device_communicators/shm_broadcast.py (modified, +64/-194)
  • vllm/envs.py (modified, +5/-0)
  • vllm/v1/executor/multiproc_executor.py (modified, +64/-109)

PR #36650: [DO NOT MERGE] Reapply "[BugFix] Fix engine hanging after KV cache initialization failure #35478"

Description (problem / solution / changelog)

Distributed Test 4 GPUs is still failing. Testing whether this change fixes it

Fixes https://github.com/vllm-project/vllm/issues/36624

See https://github.com/vllm-project/vllm/pull/36628#issuecomment-4030961404

The series of relevant PRs are:

So it appears that reverting #36262 might be sufficient

Changed files

  • vllm/v1/engine/core.py (modified, +55/-25)
  • vllm/v1/engine/utils.py (modified, +5/-0)

PR #36646: [DO NOT MERGE][Core] Revert "Fix benign error log during normal shutdown (#36270)"

Description (problem / solution / changelog)

Possible simpler alternative to #36628 fixing #36624

Changed files

  • vllm/v1/engine/core_client.py (modified, +9/-6)

PR #36666: [Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish

Description (problem / solution / changelog)

Re-apply #34730 and #36270 which were reverted by #36628 because of test_external_lb_dp.py failures in Distributed Tests (4 GPUs)


Relates to #24885 and supersedes a subset of #32420

🤖 Co-authored with Claude Code Co-authored-by: @njhill

Overview

This PR adds support for a shutdown timeout in the API server and engine core, allowing controlled handling of a SIGTERM signal - either immediate termination by default (--shutdown-timeout=0) or wait for in-flight requests to finish (--shutdown-timeout=SECONDS).

Background

Previously, SIGTERM to the API server or multi-API server launcher would cause immediate process termination without coordinating request handling. This PR addresses that by:

  1. Rejecting new requests when shutdown is initiated (not queuing them like pause does)
  2. Handling in-flight requests according to --shutdown-timeout:
    • Default, no timeout - abort all in-flight requests immediately
    • With timeout - Wait for in-flight requests to complete (up to --shutdown-timeout seconds)
  3. Using a separate shutdown state machine in the engine core to manage the shutdown lifecycle

Scope and Constraints

This PR is intentionally scoped to the API server use case with SIGTERM handling. The following are explicitly out of scope:

  • KV transfers: unlike #32420, prefill instances are not (yet) waiting for KV transfers to complete before shutting down
  • Library use case: No changes to direct LLM or AsyncLLM usage (though shutdown() methods are added for future use)
  • ScalingMiddleware reuse: Request rejection happens at engine core level, not middleware
  • Signal handling redesign: No new shutdown pipe mechanism; minimal changes to existing handlers
  • Ctrl-C behavior: Process groups and Ctrl-C handling remain unchanged; only SIGTERM is handled
  • /health endpoint changes: No new /live endpoint or changes to health check behavior
  • External/hybrid load balancer validation: Assumes parallel SIGTERM to multiple API servers works (untested, similar to pause behavior)

Implementation Approach

Core Mechanism: EngineShutdownState State Machine

The implementation uses a separate EngineShutdownState enum to manage shutdown lifecycle independently from pause state:

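from enum import IntEnum

class EngineShutdownState(IntEnum):
    RUNNING = 0        # normal operation
    REQUESTED = 1      # set by the SIGTERM handler
    SHUTTING_DOWN = 2  # new work rejected; exit when no work remains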

Why a separate state machine:

  • Shutdown has different semantics than pause (permanent vs temporary)
  • Shutdown needs request rejection, pause needs request queuing
  • Different lifecycle: pause can be resumed, shutdown cannot
  • Cleaner separation of concerns

How it works:

  1. SIGTERM signal handler sets engine_core.shutdown_state = EngineShutdownState.REQUESTED
  2. Engine core's _handle_shutdown() method checks state each iteration:
    • RUNNING: Continue normal operation
    • REQUESTED: Abort/wait based on config, transition to SHUTTING_DOWN, reject new requests
    • SHUTTING_DOWN: Exit when no work remaining
  3. Engine core's _handle_client_request() calls _reject_add_in_shutdown() and _reject_utility_in_shutdown() which check shutdown state and reject new work
  4. Engine core's run_busy_loop() exits when _handle_shutdown() returns False
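The state transitions listed above can be sketched with a toy engine. This is an illustration under stated assumptions, not the vLLM implementation: ToyEngine, its attributes, and the one-request-per-step model are hypothetical, the timeout itself is not modelled, and new requests are assumed to be rejected as soon as the state leaves RUNNING.

```python
from enum import IntEnum


class EngineShutdownState(IntEnum):
    RUNNING = 0
    REQUESTED = 1
    SHUTTING_DOWN = 2


class ToyEngine:
    """Hypothetical stand-in for the engine core; not vLLM's class."""

    def __init__(self, wait_for_inflight: bool):
        self.shutdown_state = EngineShutdownState.RUNNING
        self.wait_for_inflight = wait_for_inflight
        self.inflight = []
        self.aborted = []

    def add_request(self, req_id: str) -> bool:
        # Assumption: reject (rather than queue) once shutdown has begun.
        if self.shutdown_state != EngineShutdownState.RUNNING:
            return False
        self.inflight.append(req_id)
        return True

    def handle_shutdown(self) -> bool:
        """Return False once the busy loop should exit."""
        if self.shutdown_state == EngineShutdownState.RUNNING:
            return True
        if self.shutdown_state == EngineShutdownState.REQUESTED:
            if not self.wait_for_inflight:
                # Default behaviour: abort everything immediately.
                self.aborted.extend(self.inflight)
                self.inflight.clear()
            self.shutdown_state = EngineShutdownState.SHUTTING_DOWN
        # SHUTTING_DOWN: keep going only while work remains.
        return bool(self.inflight)

    def step(self) -> None:
        if self.inflight:
            self.inflight.pop(0)  # pretend one request finishes per step


def run_busy_loop(engine: ToyEngine) -> int:
    """Run until handle_shutdown() returns False; return the step count."""
    steps = 0
    while engine.handle_shutdown():
        engine.step()
        steps += 1
    return steps
```

With wait_for_inflight=True and two requests in flight, the loop runs two more steps before exiting; with wait_for_inflight=False it exits immediately and both requests end up in aborted.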

Idle Engine Wake-up

When the engine is idle and blocked on input_queue.get(), it cannot detect shutdown state changes. To deal with this, we:

  1. Add WAKEUP sentinel to EngineCoreRequestType for waking idle engine
  2. Add a new signal-context-safe SignalCallback to allow triggering a callback from a signal handler
  3. Add a signal callback that pushes WAKEUP to input_queue
  4. Trigger the signal callback from the signal handler
  5. Main loop wakes from blocking get(), checks shutdown_state, proceeds
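The wake-up idea can be illustrated with a plain queue and a sentinel object. This is a sketch under assumptions: IdleLoop and demo() are hypothetical names, and request_shutdown() is called from an ordinary thread here, whereas the PR routes it through its SignalCallback so that it is safe to trigger from a signal handler.

```python
import queue
import threading

WAKEUP = object()  # sentinel; the PR adds WAKEUP to EngineCoreRequestType


class IdleLoop:
    """Hypothetical loop that blocks on its input queue while idle."""

    def __init__(self):
        self.input_queue = queue.Queue()
        self.shutdown_requested = False
        self.processed = []

    def request_shutdown(self) -> None:
        self.shutdown_requested = True
        self.input_queue.put(WAKEUP)  # unblock a pending get()

    def run(self) -> None:
        while not self.shutdown_requested:
            item = self.input_queue.get()  # blocks while idle
            if item is not WAKEUP:
                self.processed.append(item)
            self.input_queue.task_done()


def demo():
    loop = IdleLoop()
    worker = threading.Thread(target=loop.run)
    worker.start()
    loop.input_queue.put("req-1")
    loop.input_queue.join()  # wait until req-1 has been handled
    loop.request_shutdown()  # without WAKEUP the worker could block forever
    worker.join(timeout=5)
    return loop.processed, worker.is_alive()
```

Setting the flag alone would not be enough: a loop that is already blocked in get() only re-checks the flag after something arrives on the queue, which is the role of the sentinel.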

Shutdown Flow

Single API Server:

SIGTERM → signal_handler() sets shutdown_event
       → _handle_shutdown() coroutine wakes up
       → Retrieves timeout from vllm_config.shutdown_config.wait_timeout
       → Calls engine_client.shutdown(timeout=timeout)
       → Shutdown propagates through client hierarchy:
          - AsyncLLM.shutdown() → EngineCoreClient.shutdown()
          - MPClient detaches finalizer, calls engine_manager.shutdown()
          - CoreEngineProcManager.shutdown() terminates/waits for processes
       → Then cancels server tasks and stops SSL refresher
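The single-server flow above can be approximated with asyncio primitives. In this sketch FakeEngineClient, handle_shutdown() and main() are hypothetical, and setting the event directly stands in for the SIGTERM handler; it only shows the ordering of event, engine shutdown, and task cancellation.

```python
import asyncio


class FakeEngineClient:
    """Hypothetical client that records the timeout it was shut down with."""

    def __init__(self):
        self.shutdown_timeout = None

    def shutdown(self, timeout):
        self.shutdown_timeout = timeout


async def handle_shutdown(shutdown_event, engine_client, timeout, server_task):
    await shutdown_event.wait()             # sleeps until the handler fires
    engine_client.shutdown(timeout=timeout)  # engine first
    server_task.cancel()                     # then the server task


async def main(timeout):
    shutdown_event = asyncio.Event()
    engine_client = FakeEngineClient()
    server_task = asyncio.create_task(asyncio.sleep(3600))  # fake server
    watcher = asyncio.create_task(
        handle_shutdown(shutdown_event, engine_client, timeout, server_task)
    )
    shutdown_event.set()  # stand-in for the SIGTERM signal handler
    await watcher
    try:
        await server_task
    except asyncio.CancelledError:
        pass
    return engine_client.shutdown_timeout, server_task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(main(timeout=5)))
```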

Multi-API Server Launcher:

SIGTERM → signal_handler() sets shutdown_requested flag, raises SystemExit
       → finally block executes
       → Computes shared deadline from configured wait_timeout
       → Calls shutdown(deadline=deadline) on all managers:
          - api_server_manager.shutdown(deadline=deadline)
          - local_engine_manager.shutdown(deadline=deadline) if present
          - coordinator.shutdown(deadline=deadline) if present
       → Each manager terminates/waits for processes within deadline
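Passing one shared deadline, rather than the same timeout to each manager, keeps the total shutdown time bounded by wait_timeout. The sketch below shows that arithmetic with a fake clock so it runs deterministically; FakeClock, FakeManager and shutdown_all() are hypothetical names, and a real launcher would presumably use a monotonic clock.

```python
class FakeClock:
    """Deterministic clock for the example."""

    def __init__(self):
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class FakeManager:
    """Hypothetical manager whose processes need exit_time seconds to stop."""

    def __init__(self, name, exit_time, clock):
        self.name = name
        self.exit_time = exit_time
        self.clock = clock
        self.budget = None

    def shutdown(self, deadline: float) -> None:
        # Each manager only gets whatever is left before the shared deadline.
        self.budget = max(0.0, deadline - self.clock())
        self.clock.now += min(self.exit_time, self.budget)


def shutdown_all(managers, wait_timeout, clock):
    deadline = clock() + wait_timeout  # computed once, shared by everyone
    for manager in managers:
        manager.shutdown(deadline=deadline)
    return [manager.budget for manager in managers]
```

With a 10-second wait_timeout and managers that need 3, 4 and 5 seconds, the budgets are 10, 7 and 3 seconds, so the last manager is cut off at the deadline instead of extending the total to 12 seconds.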

Headless Mode:

SIGTERM → signal_handler() sets shutdown_requested flag, raises SystemExit
       → finally block executes
       → Retrieves timeout from vllm_config.shutdown_config.wait_timeout
       → Calls engine_manager.shutdown(timeout=timeout)
       → Process manager terminates/waits for processes with timeout
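The "handler raises SystemExit, finally block cleans up" pattern used by the multi-API and headless flows can be sketched as follows. FakeEngineManager, run_headless() and demo() are hypothetical, and the handler is invoked directly instead of being registered with signal.signal() so that the example does not depend on real signal delivery.

```python
import signal


class FakeEngineManager:
    """Hypothetical manager that records shutdown calls."""

    def __init__(self):
        self.shutdown_calls = []

    def shutdown(self, timeout):
        self.shutdown_calls.append(timeout)


def run_headless(manager, wait_timeout, serve):
    def signal_handler(signum, frame):
        # Raising SystemExit unwinds the stack into the finally block below.
        raise SystemExit

    # A real launcher would register it here:
    #   signal.signal(signal.SIGTERM, signal_handler)
    try:
        serve(signal_handler)  # stand-in for the blocking serve loop
    finally:
        manager.shutdown(timeout=wait_timeout)


def demo():
    manager = FakeEngineManager()

    def serve(handler):
        handler(signal.SIGTERM, None)  # simulate SIGTERM arriving mid-serve

    try:
        run_headless(manager, 5, serve)
    except SystemExit:
        exited = True
    else:
        exited = False
    return exited, manager.shutdown_calls
```

One consequence of this pattern is that the finally block runs whenever the handler fires, including during startup, which is consistent with the timing sensitivity described in the issue analysis.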


Description

test_external_lb_dp.py with api_server_count=4 has been failing consistently in every nightly/daily build since PR #34730 (27066d1b) landed on Mar 6.

The [1] variant (1 API server per DP rank) passes; only [4] (4 API servers per DP rank) fails.

Affected tests

  • test_external_lb_server_info[4]
  • test_external_lb_single_completion[4-ibm-research/PowerMoE-3b]
  • test_external_lb_completion_streaming[4-ibm-research/PowerMoE-3b]

Error

EngineCore_DP1: WorkerProc initialization failed due to an exception in a background process.
Worker_DP0_TP0: Parent process exited, terminating worker queues
Worker_DP0_TP0: BrokenPipeError: [Errno 32] Broken pipe
Failed to start server rank 0: Server exited unexpectedly
Exception: Servers failed to start

Bisection

  • Last passing build: #54783 (daily Mar 5, commit a97954b6)
  • First failing build: #54882 (nightly Mar 6, commit 5afb387b)
  • Culprit: PR #34730 (27066d1b) — "[Frontend][Core] Add shutdown timeout - allowing in-flight requests to finish"
  • Failing in every nightly/daily since: #54882, #55011, #55067, #55110, #55120, #55158, #55214, #55354, #55437

Analysis

PR #34730 changes the server startup/shutdown chain:

  • run_multi_api_server(): New SIGTERM/SIGINT signal handler raising SystemExit; close() → shutdown(timeout=...)
  • run_engine_core(): Direct SystemExit signal handler replaced with SignalCallback thread + shutdown_state enum
  • EngineProcessManager: close() → shutdown() with _finalizer.detach() semantics

The [4] variant is more sensitive to timing because it launches 4 API servers per DP rank. During this heavier initialization, EngineCore_DP1's worker fails with BrokenPipeError because the parent process exits before the worker finishes initializing.

Note

This was already flagged on Mar 6 by @markmc:

Watch out for shutdown related regressions that may have been caused by 34730

Failing build examples

cc @markmc

extent analysis

Fix Plan

The steps below are a speculative mitigation, not the change that resolved the issue: the fix that landed was the revert in PR #36628, and the feature was later re-applied in PR #36666. The idea is to give the heavier initialization of 4 API servers per DP rank more time before a shutdown path can tear down the parent process:

  • Increase the shutdown timeout in run_multi_api_server() so the API servers have enough time to initialize.
  • Modify the SignalCallback thread in run_engine_core() to wait for worker initialization to complete before shutting down.
  • Update EngineProcessManager to use a longer timeout when calling shutdown(), so worker processes have enough time to finish initializing.

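Example code changes (illustrative pseudocode; wait_for_worker_init(), _shutdown_timeout and _shutdown() are hypothetical names, not confirmed vLLM APIs):

# In run_multi_api_server()
shutdown(timeout=30)  # Increase timeout to 30 seconds

# In run_engine_core()
signal_callback = SignalCallback()
signal_callback.wait_for_worker_init()  # Hypothetical: wait for worker initialization to complete

# In EngineProcessManager
def shutdown(self):
    self._finalizer.detach()
    self._shutdown_timeout = 30  # Hypothetical attribute: 30-second shutdown timeout
    self._shutdown()  # Hypothetical helper that performs the actual teardown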

Verification

To verify that the fix worked, run the failing tests (test_external_lb_server_info[4], test_external_lb_single_completion[4-ibm-research/PowerMoE-3b], and test_external_lb_completion_streaming[4-ibm-research/PowerMoE-3b]) and check that they pass without errors.

Extra Tips

  • Monitor the tests for any shutdown-related regressions and adjust the timeouts as needed.
  • Consider adding logging to track the initialization time of the API servers and workers to help identify any future issues.
