vllm - 💡(How to fix) Fix [RFC]: Async parallel startup for EngineCore processes in DP/TP scenarios [2 comments, 2 participants]

GitHub stats
vllm-project/vllm#39678 · Fetched 2026-04-14 05:38:13
Comments: 2 · Participants: 2 · Timeline: 12 · Reactions: 0
Timeline (top): mentioned ×3, subscribed ×3, commented ×2, labeled ×2

Root Cause

Correctness bug on non-CUDA platforms. The current implementation temporarily patches CUDA_VISIBLE_DEVICES (or the platform equivalent) in the parent process via patch.dict(os.environ, ...) before calling proc.start(). When multiple EngineCore processes are started concurrently (or even sequentially with tight timing), this creates a race condition: a child process may inherit the environment variable intended for a different child. This affects Ascend NPU (ASCEND_RT_VISIBLE_DEVICES), ROCm (HIP_VISIBLE_DEVICES), and Ray-launched deployments.

Fix Action

Fix / Workaround

Replace the parent-side patch.dict(os.environ, ...) + set_device_control_env_var() context manager with a child-side bootstrap shim, _enginecore_bootstrap, that:

  1. Receives the environment variable name (evar) and value (value) as arguments, pre-computed in the parent before any concurrency begins
  2. Sets os.environ[evar] = value inside the child process before invoking the real engine entry point
  3. Eliminates the race condition because each child sets its own environment independently

Code Example

def _enginecore_bootstrap(
    *,
    evar: str,
    value: str,
    target_fn: Callable[..., Any],
    target_kwargs: dict[str, Any],
) -> None:
    os.environ[evar] = value      # set inside child, not in parent — no race
    target_fn(**target_kwargs)

---

data_parallel = vllm_config.parallel_config.data_parallel_size > 1
need_env_control = data_parallel and (
    not current_platform.is_cuda_alike()
    or vllm_config.parallel_config.use_ray
)


Motivation.

In multi-GPU deployments using Data Parallelism (DP) or Tensor Parallelism (TP), vLLM starts subprocess workers sequentially. Each proc.start() / WorkerProc.make_worker_process() call takes approximately 12 seconds on H100 (NCCL init + device setup), so total startup time grows linearly with the number of processes:

| Configuration | Sequential startup time |
|---|---|
| DP=8, TP=1 (8 EngineCore procs) | ~96 s |
| DP=1, TP=8 (8 Worker procs per EngineCore) | ~96 s |

This O(N) startup latency is a significant barrier for large-scale production deployments, especially in elastic scaling scenarios where new DP shards need to come online quickly.

Correctness bug on non-CUDA platforms. The current implementation temporarily patches CUDA_VISIBLE_DEVICES (or the platform equivalent) in the parent process via patch.dict(os.environ, ...) before calling proc.start(). When multiple EngineCore processes are started concurrently (or even sequentially with tight timing), this creates a race condition: a child process may inherit the environment variable intended for a different child. This affects Ascend NPU (ASCEND_RT_VISIBLE_DEVICES), ROCm (HIP_VISIBLE_DEVICES), and Ray-launched deployments.
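
The interleaving behind this race can be reproduced in miniature without any accelerator. The sketch below is an illustration under stated assumptions, not vLLM code: shared_env is a plain dict standing in for the parent's os.environ, DEMO_VISIBLE_DEVICES is a hypothetical variable name, and threads stand in for concurrent launch paths. A barrier forces every launcher to apply its patch before any launcher reads the value its child would inherit.

```python
import threading
from unittest.mock import patch

EVAR = "DEMO_VISIBLE_DEVICES"  # hypothetical name, for illustration only


def observe_parent_side_patching(n_launchers: int) -> dict[int, str]:
    """Return the value each launcher would hand to its child process."""
    shared_env: dict[str, str] = {}  # stand-in for the parent's os.environ
    barrier = threading.Barrier(n_launchers)
    seen: dict[int, str] = {}

    def launch(rank: int) -> None:
        with patch.dict(shared_env, {EVAR: str(rank)}):
            barrier.wait()  # every launcher has applied its patch
            seen[rank] = shared_env[EVAR]  # what proc.start() would inherit now
            barrier.wait()  # hold all patches until every launcher has read

    threads = [threading.Thread(target=launch, args=(r,)) for r in range(n_launchers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    seen = observe_parent_side_patching(4)
    print(seen)  # all ranks report the same value, so at most one is correct
```

Because the patched mapping is shared, whichever launcher wrote last wins for everyone. Real startup does not force this interleaving on every run, which is why the bug can be intermittent.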

Proposed Change.

We propose two independent async startup optimizations, each targeting a different parallelism dimension, plus a correctness fix for the environment variable race condition.

1. EngineCore parallel startup (DP dimension)

File: vllm/v1/engine/utils.py

In CoreEngineProcManager.__init__, we replace the sequential for proc in processes: proc.start() loop with concurrent startup via asyncio.gather + asyncio.to_thread when local_engine_count > 1.

Key design decisions:

  • Each proc.start() is wrapped in a _start_with_numa closure and offloaded to a thread-pool thread via asyncio.to_thread, so that per-process NUMA binding context managers (numa_utils.configure_subprocess) can be applied on the same thread without blocking the event loop.
  • When the caller is already inside a running event loop (e.g., AsyncLLM), we spin up a dedicated background thread with its own event loop via concurrent.futures.ThreadPoolExecutor, avoiding nested asyncio.run() calls.
  • The serial path is preserved for single-engine deployments (local_engine_count == 1) where async overhead is unnecessary.

New functions:

| Function | Purpose |
|---|---|
| _enginecore_bootstrap() | Child-side bootstrap shim that sets the device visibility env var before invoking the real engine entry point (see §3) |
| _run_async_startup() | Event loop wrapper that handles the already-in-loop case |
| _start_processes_async() | Core async method that starts all EngineCore processes concurrently via asyncio.gather |
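
A minimal sketch of the pattern described above, assuming only the standard library. The function names echo the table, but the bodies are an illustrative guess rather than the RFC's implementation; the Startable protocol is hypothetical, and NUMA binding is omitted.

```python
import asyncio
import concurrent.futures
from collections.abc import Sequence
from typing import Protocol


class Startable(Protocol):
    """Anything with a blocking start(), e.g. multiprocessing.Process."""

    def start(self) -> None: ...


async def start_processes_async(procs: Sequence[Startable]) -> None:
    # Each blocking start() runs on a thread-pool thread; gather waits for all.
    await asyncio.gather(*(asyncio.to_thread(p.start) for p in procs))


def run_async_startup(procs: Sequence[Startable]) -> None:
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        asyncio.run(start_processes_async(procs))
        return
    # A loop is already running here: use a fresh loop on a helper thread
    # instead of nesting asyncio.run(). The caller blocks until startup ends.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        pool.submit(asyncio.run, start_processes_async(procs)).result()
```

Calling run_async_startup(procs) from synchronous code takes the first branch; calling it from inside a coroutine takes the helper-thread branch. In both cases the call returns only after every start() has returned, which matches the blocking behaviour of the serial loop it replaces.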

2. Worker parallel startup (TP dimension)

File: vllm/v1/executor/multiproc_executor.py

In MultiprocExecutor._init_executor, we replace the sequential worker creation loop with concurrent startup via asyncio.gather + asyncio.to_thread when all of the following conditions are met:

  • Multiprocessing context is spawn (not fork)
  • Platform is not CPU
  • local_world_size > 1

Why these constraints:

| Constraint | Reason |
|---|---|
| Not fork | Fork mode requires incremental inherited_fds tracking: each worker's death_writer and ready_pipe fds must be added to the list before the next worker is created. This is inherently sequential. |
| Not CPU | CPU platform uses om.run() (OMP manager) for thread affinity binding, which is not compatible with async startup. |
| local_world_size > 1 | Single-worker case has no parallelism opportunity. |
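
The three constraints combine into a single gate. The helper below is a hypothetical sketch (the name and the plain-argument signature are assumptions, not part of the RFC) showing which configurations would take the async path.

```python
def can_start_workers_async(mp_method: str, is_cpu: bool, local_world_size: int) -> bool:
    """Return True only when all three conditions listed above hold."""
    return mp_method == "spawn" and not is_cpu and local_world_size > 1


if __name__ == "__main__":
    for config in [("spawn", False, 8), ("fork", False, 8), ("spawn", True, 8), ("spawn", False, 1)]:
        path = "async" if can_start_workers_async(*config) else "serial"
        print(config, "->", path)
```

Any configuration that fails the gate falls back to the existing serial make_worker_process() loop.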

New functions:

| Function | Purpose |
|---|---|
| _run_async_workers_startup() | Event loop wrapper (same pattern as EngineCore) |
| _start_workers_async() | Core async method that starts all Worker processes concurrently via asyncio.gather, passing inherited_fds=None (spawn mode only) |

3. Race condition fix (_enginecore_bootstrap)

We replace the parent-side patch.dict(os.environ, ...) + set_device_control_env_var() context manager with a child-side bootstrap shim _enginecore_bootstrap that:

  1. Receives the environment variable name (evar) and value (value) as arguments, pre-computed in the parent before any concurrency begins
  2. Sets os.environ[evar] = value inside the child process before invoking the real engine entry point
  3. Eliminates the race condition entirely because each child sets its own environment independently
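
import os
from collections.abc import Callable
from typing import Any


def _enginecore_bootstrap(
    *,
    evar: str,
    value: str,
    target_fn: Callable[..., Any],
    target_kwargs: dict[str, Any],
) -> None:
    # Runs in the child: set the variable here, not in the parent, so no race.
    os.environ[evar] = value
    # Hand off to the real engine entry point.
    target_fn(**target_kwargs)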

This shim is only used when need_env_control is True (non-CUDA platforms or Ray launcher). CUDA platforms without Ray use torch.cuda.set_device() inside the worker and are completely unaffected.

data_parallel = vllm_config.parallel_config.data_parallel_size > 1
need_env_control = data_parallel and (
    not current_platform.is_cuda_alike()
    or vllm_config.parallel_config.use_ray
)
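
To make the wiring concrete, here is a hedged sketch of how a parent might choose between the shim and the plain entry point. Everything here except the shim's keyword arguments is an assumption for illustration: needs_env_control takes plain booleans instead of vllm_config and current_platform, and make_process_args and device_ids are hypothetical names.

```python
import os
from collections.abc import Callable, Sequence
from typing import Any


def enginecore_bootstrap(*, evar: str, value: str,
                         target_fn: Callable[..., Any],
                         target_kwargs: dict[str, Any]) -> None:
    os.environ[evar] = value  # executes in the child process
    target_fn(**target_kwargs)


def needs_env_control(*, dp_size: int, is_cuda_alike: bool, use_ray: bool) -> bool:
    # Same boolean structure as the snippet above, with plain arguments.
    return dp_size > 1 and (not is_cuda_alike or use_ray)


def make_process_args(*, env_control: bool, evar: str, device_ids: Sequence[int],
                      target_fn: Callable[..., Any],
                      target_kwargs: dict[str, Any]):
    """Return the (target, kwargs) pair to hand to the process constructor."""
    if not env_control:
        return target_fn, target_kwargs
    value = ",".join(str(i) for i in device_ids)  # computed once, in the parent
    return enginecore_bootstrap, {
        "evar": evar,
        "value": value,
        "target_fn": target_fn,
        "target_kwargs": target_kwargs,
    }
```

The returned pair would be passed along as multiprocessing.Process(target=target, kwargs=kwargs). With the spawn start method, target_fn and everything in target_kwargs must be picklable, so target_fn needs to be a module-level function.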

Architecture diagram

                        ┌─────────────────────────────────────────┐
                        │         AsyncLLM / LLMEngine            │
                        └──────────────┬──────────────────────────┘
                        ┌──────────────▼──────────────────────────┐
                        │     CoreEngineProcManager.__init__()     │
                        │                                          │
                        │  local_engine_count > 1?                 │
                        │  ├─ YES → _run_async_startup()           │
                        │  │         └─ _start_processes_async()   │
                        │  │            └─ asyncio.gather(          │
                        │  │                 proc0.start(),         │
                        │  │                 proc1.start(), ...)    │
                        │  └─ NO  → serial proc.start()            │
                        └──────────────┬──────────────────────────┘
                                       │ (each EngineCore process)
                        ┌──────────────▼──────────────────────────┐
                        │   MultiprocExecutor._init_executor()     │
                        │                                          │
                        │  spawn + GPU/NPU + local_world_size > 1? │
                        │  ├─ YES → _run_async_workers_startup()   │
                        │  │         └─ _start_workers_async()     │
                        │  │            └─ asyncio.gather(          │
                        │  │                 worker0, worker1, ...) │
                        │  └─ NO  → serial make_worker_process()   │
                        └─────────────────────────────────────────┘

Feedback Period.

2 weeks

extent analysis

TL;DR

To address the O(N) startup latency and correctness bug in multi-GPU deployments, implement async startup optimizations for EngineCore and Worker processes, and fix the environment variable race condition by setting the variable inside the child process.

Guidance

  1. Implement async EngineCore startup: Replace the sequential for proc in processes: proc.start() loop with concurrent startup via asyncio.gather + asyncio.to_thread in CoreEngineProcManager.__init__.
  2. Implement async Worker startup: Replace the sequential worker creation loop with concurrent startup via asyncio.gather + asyncio.to_thread in MultiprocExecutor._init_executor, considering the constraints (spawn mode, non-CPU platform, and local_world_size > 1).
  3. Fix environment variable race condition: Use the _enginecore_bootstrap shim to set the environment variable inside the child process, eliminating the race condition.
  4. Verify the fix: Test the async startup optimizations and environment variable fix on different platforms and configurations to ensure correctness and performance improvements.

Example

def _enginecore_bootstrap(
    *,
    evar: str,
    value: str,
    target_fn: Callable[..., Any],
    target_kwargs: dict[str, Any],
) -> None:
    os.environ[evar] = value      # set inside child, not in parent — no race
    target_fn(**target_kwargs)

Notes

The async Worker path applies only in spawn mode on non-CPU platforms with local_world_size > 1; fork mode, CPU, and single-worker setups keep the serial path. The bootstrap shim is used only when need_env_control is True. Both the async and serial paths need testing on each target platform to confirm correctness and the expected startup-time improvement.

Recommendation

Adopt the proposed async startup optimizations together with the child-side environment variable fix: they address the O(N) startup latency and the correctness bug, and they scale better for large production deployments.
