vllm - 💡(How to fix) Fix [RFC]: Async parallel startup for EngineCore processes in DP/TP scenarios [2 comments, 2 participants]

vllm · 2026-04-13 07:46:25

GitHub stats

vllm-project/vllm#39678•Fetched 2026-04-14 05:38:13

Author

hwhaokun

Participants

github-actions[bot]

hwhaokun

Timeline (top)

mentioned ×3, subscribed ×3, commented ×2, labeled ×2

Root Cause

Correctness bug on non-CUDA platforms. The current implementation temporarily patches CUDA_VISIBLE_DEVICES (or the platform equivalent) in the parent process via patch.dict(os.environ, ...) before calling proc.start(). When multiple EngineCore processes are started concurrently (or even sequentially with tight timing), this creates a race condition: a child process may inherit the environment variable intended for a different child. This affects Ascend NPU (ASCEND_RT_VISIBLE_DEVICES), ROCm (HIP_VISIBLE_DEVICES), and Ray-launched deployments.
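
The overlap described above can be reproduced deterministically in a single thread. The sketch below is a simulation, not vLLM code: it nests two patch.dict contexts to stand in for two startups whose parent-side patches overlap, and HYPOTHETICAL_VISIBLE_DEVICES is a made-up variable name. Reading os.environ inside the overlap plays the role of a child copying the parent's environment when it starts.

```python
import os
from unittest.mock import patch

# Hypothetical variable name, standing in for e.g. ASCEND_RT_VISIBLE_DEVICES.
EVAR = "HYPOTHETICAL_VISIBLE_DEVICES"


def simulate_overlapping_patches() -> dict[str, str]:
    """Return the value each simulated child would inherit from the parent."""
    inherited: dict[str, str] = {}
    with patch.dict(os.environ, {EVAR: "0"}):      # patch intended for child 0
        with patch.dict(os.environ, {EVAR: "1"}):  # applied before child 0 starts
            # Both children start only now and copy the parent's environment.
            inherited["child0"] = os.environ[EVAR]
            inherited["child1"] = os.environ[EVAR]
    return inherited


result = simulate_overlapping_patches()
print(result)  # {'child0': '1', 'child1': '1'}
```

Child 0 ends up with the device value intended for child 1, which is the failure mode the RFC attributes to patching the parent's environment.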

Motivation.

In multi-GPU deployments using Data Parallelism (DP) or Tensor Parallelism (TP), vLLM starts subprocess workers sequentially. Each proc.start() / WorkerProc.make_worker_process() call takes approximately 12 seconds on H100 (NCCL init + device setup), so total startup time grows linearly with the number of processes:

| Configuration | Sequential startup time |
| --- | --- |
| DP=8, TP=1 (8 EngineCore procs) | ~96 s |
| DP=1, TP=8 (8 Worker procs per EngineCore) | ~96 s |

This O(N) startup latency is a significant barrier for large-scale production deployments, especially in elastic scaling scenarios where new DP shards need to come online quickly.
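
The figures in the table follow from multiplying the process count by the per-process cost. The sketch below reproduces that arithmetic; both function names are hypothetical, and the parallel figure is an idealized lower bound that assumes the starts overlap perfectly. It is not a measurement from the RFC.

```python
def sequential_startup_seconds(num_procs: int, per_proc_seconds: float) -> float:
    """Sequential startup: each process waits for the previous one."""
    return num_procs * per_proc_seconds


def ideal_parallel_startup_seconds(num_procs: int, per_proc_seconds: float) -> float:
    """Idealized assumption: all starts overlap, so the cost is one start."""
    return per_proc_seconds if num_procs > 0 else 0.0


# 8 processes at roughly 12 s each, as quoted in the RFC.
print(sequential_startup_seconds(8, 12.0))      # 96.0
print(ideal_parallel_startup_seconds(8, 12.0))  # 12.0
```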

Proposed Change.

We propose two independent async startup optimizations, each targeting a different parallelism dimension, plus a correctness fix for the environment variable race condition.

1. EngineCore parallel startup (DP dimension)

File: vllm/v1/engine/utils.py

In CoreEngineProcManager.__init__, we replace the sequential for proc in processes: proc.start() loop with concurrent startup via asyncio.gather + asyncio.to_thread when local_engine_count > 1.

Key design decisions:

Each proc.start() is wrapped in a _start_with_numa closure and offloaded to a thread-pool thread via asyncio.to_thread, so that per-process NUMA binding context managers (numa_utils.configure_subprocess) can be applied on the same thread without blocking the event loop.
When the caller is already inside a running event loop (e.g., AsyncLLM), we spin up a dedicated background thread with its own event loop via concurrent.futures.ThreadPoolExecutor, avoiding nested asyncio.run() calls.
The serial path is preserved for single-engine deployments (local_engine_count == 1) where async overhead is unnecessary.
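
The two event-loop cases above can be sketched as follows. This is a minimal illustration, not the RFC's implementation: start_all_async and run_startup are hypothetical names, and each element of starters stands in for a blocking call such as proc.start(), optionally wrapped in its NUMA context manager.

```python
import asyncio
import concurrent.futures
from typing import Callable, Sequence


async def start_all_async(starters: Sequence[Callable[[], None]]) -> None:
    # Run each blocking start call on a thread-pool thread and wait for all.
    await asyncio.gather(*(asyncio.to_thread(start) for start in starters))


def run_startup(starters: Sequence[Callable[[], None]]) -> None:
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run is safe.
        asyncio.run(start_all_async(starters))
        return
    # A loop is already running here: run a fresh loop on a helper thread
    # instead of nesting asyncio.run, and block until it finishes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        pool.submit(asyncio.run, start_all_async(starters)).result()
```

Note that in the already-in-loop branch the calling thread still blocks on .result() until every starter has returned, so the caller's event loop makes no progress during startup.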

New functions:

| Function | Purpose |
| --- | --- |
| `_enginecore_bootstrap()` | Child-side bootstrap shim that sets the device visibility env var before invoking the real engine entry point (see §3) |
| `_run_async_startup()` | Event loop wrapper that handles the already-in-loop case |
| `_start_processes_async()` | Core async method that starts all EngineCore processes concurrently via `asyncio.gather` |

2. Worker parallel startup (TP dimension)

File: vllm/v1/executor/multiproc_executor.py

In MultiprocExecutor._init_executor, we replace the sequential worker creation loop with concurrent startup via asyncio.gather + asyncio.to_thread when all of the following conditions are met:

Multiprocessing context is spawn (not fork)
Platform is not CPU
local_world_size > 1

Why these constraints:

| Constraint | Reason |
| --- | --- |
| Not fork | Fork mode requires incremental `inherited_fds` tracking: each worker's `death_writer` and `ready_pipe` fds must be added to the list before the next worker is created. This is inherently sequential. |
| Not CPU | CPU platform uses `om.run()` (OMP manager) for thread affinity binding, which is not compatible with async startup. |
| `local_world_size > 1` | Single-worker case has no parallelism opportunity. |
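
The three constraints combine into a single gate. A minimal sketch follows, assuming a hypothetical helper and parameter names; per the RFC the real check sits in MultiprocExecutor._init_executor, whose exact form is not shown here.

```python
def can_start_workers_async(
    mp_start_method: str,
    is_cpu_platform: bool,
    local_world_size: int,
) -> bool:
    """Return True only when all three conditions from the RFC hold."""
    return (
        mp_start_method == "spawn"   # fork needs sequential inherited_fds tracking
        and not is_cpu_platform      # CPU path binds thread affinity serially
        and local_world_size > 1     # a single worker has nothing to parallelize
    )


print(can_start_workers_async("spawn", False, 8))  # True
print(can_start_workers_async("fork", False, 8))   # False
```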

New functions:

| Function | Purpose |
| --- | --- |
| `_run_async_workers_startup()` | Event loop wrapper (same pattern as EngineCore) |
| `_start_workers_async()` | Core async method that starts all Worker processes concurrently via `asyncio.gather`, passing `inherited_fds=None` (spawn mode only) |

3. Race condition fix (`_enginecore_bootstrap`)

We replace the parent-side patch.dict(os.environ, ...) + set_device_control_env_var() context manager with a child-side bootstrap shim _enginecore_bootstrap that:

Receives the environment variable name (evar) and value (value) as arguments, pre-computed in the parent before any concurrency begins
Sets os.environ[evar] = value inside the child process before invoking the real engine entry point
Eliminates the race condition entirely because each child sets its own environment independently

def _enginecore_bootstrap(
    *,
    evar: str,
    value: str,
    target_fn: Callable[..., Any],
    target_kwargs: dict[str, Any],
) -> None:
    os.environ[evar] = value      # set inside child, not in parent — no race
    target_fn(**target_kwargs)

This shim is only used when need_env_control is True (non-CUDA platforms or Ray launcher). CUDA platforms without Ray use torch.cuda.set_device() inside the worker and are completely unaffected.

data_parallel = vllm_config.parallel_config.data_parallel_size > 1
need_env_control = data_parallel and (
    not current_platform.is_cuda_alike()
    or vllm_config.parallel_config.use_ray
)

Architecture diagram

                        ┌─────────────────────────────────────────┐
                        │         AsyncLLM / LLMEngine            │
                        └──────────────┬──────────────────────────┘
                                       │
                        ┌──────────────▼──────────────────────────┐
                        │     CoreEngineProcManager.__init__()     │
                        │                                          │
                        │  local_engine_count > 1?                 │
                        │  ├─ YES → _run_async_startup()           │
                        │  │         └─ _start_processes_async()   │
                        │  │            └─ asyncio.gather(          │
                        │  │                 proc0.start(),         │
                        │  │                 proc1.start(), ...)    │
                        │  └─ NO  → serial proc.start()            │
                        └──────────────┬──────────────────────────┘
                                       │ (each EngineCore process)
                        ┌──────────────▼──────────────────────────┐
                        │   MultiprocExecutor._init_executor()     │
                        │                                          │
                        │  spawn + GPU/NPU + local_world_size > 1? │
                        │  ├─ YES → _run_async_workers_startup()   │
                        │  │         └─ _start_workers_async()     │
                        │  │            └─ asyncio.gather(          │
                        │  │                 worker0, worker1, ...) │
                        │  └─ NO  → serial make_worker_process()   │
                        └─────────────────────────────────────────┘

Feedback Period.

2 weeks

extent analysis

TL;DR

To address the O(N) startup latency and correctness bug in multi-GPU deployments, implement async startup optimizations for EngineCore and Worker processes, and fix the environment variable race condition by setting the variable inside the child process.

Guidance

Implement async EngineCore startup: Replace the sequential for proc in processes: proc.start() loop with concurrent startup via asyncio.gather + asyncio.to_thread in CoreEngineProcManager.__init__.
Implement async Worker startup: Replace the sequential worker creation loop with concurrent startup via asyncio.gather + asyncio.to_thread in MultiprocExecutor._init_executor, considering the constraints (spawn mode, non-CPU platform, and local_world_size > 1).
Fix environment variable race condition: Use the _enginecore_bootstrap shim to set the environment variable inside the child process, eliminating the race condition.
Verify the fix: Test the async startup optimizations and environment variable fix on different platforms and configurations to ensure correctness and performance improvements.

Notes

The async paths apply only under the stated conditions: local_engine_count > 1 for EngineCore startup, and spawn mode, a non-CPU platform, and local_world_size > 1 for Worker startup. The serial path remains in every other case, so both paths need testing on each target platform.

Recommendation

Apply the workaround by implementing the proposed async startup optimizations and environment variable fix, as they address the O(N) startup latency and correctness bug, and provide a more scalable solution for large-scale production deployments.
