vllm - ✅(Solved) Fix [BUG] Port-allocation race between ApiServer processes in hybrid-LB mode (ZMQError: Address already in use) [1 pull requests, 1 participants]

vllm-project/vllm#40443 (fetched 2026-04-22 07:45:35)

Multi-node hybrid-LB deployments (DP > 1 across 2+ nodes, default api_server_count) intermittently fail during ApiServer startup with:

(ApiServer_N pid=...) zmq.error.ZMQError: Address already in use (addr='tcp://<host>:<port>')
...
RuntimeError: Process ApiServer_N (PID: ...) died with exit code 1
RuntimeError: Engine core initialization failed. See root cause above.

Happens after weights/compile finish, before Application startup complete. Observed consistently on MoE / multimodal models (Qwen3-30B-A3B-Instruct-2507, gpt-oss-120b, gemma-4-31B-it) across 2n and 4n jobs; not observed on single-node or on small dense models.

Fix Action

Workaround

--api-server-count 1 — single ApiServer, no concurrent bind. Costs HTTP throughput.

PR fix notes

PR #40596: [Bugfix] Close ApiServer ZMQ bind race with wildcard bind + pipe-back

Description (problem / solution / changelog)

Summary

Multi-ApiServer hybrid-LB launches (multi-node MoE / multimodal) intermittently crash with zmq.error.ZMQError: Address already in use. The root cause is a TOCTOU window in _get_open_port(): it binds port 0, closes the probe socket, and returns the port number, so the port goes back into the kernel's ephemeral pool between the launcher's probe and the ApiServer child's zmq.bind(). Another probe (or any other process) can claim that port first.

Resolves #40443.

Approach

Let each ApiServer child bind wildcard tcp://host:0 so the kernel atomically assigns a port that the socket immediately owns — no window. Actual endpoints are reported back to the launcher via multiprocessing.Pipe, mirroring the existing DPCoordinator._wait_for_zmq_addrs pattern (same file tree). Engines learn the real endpoints through the normal handshake init_message with no protocol change.

Key points:

  • `get_engine_zmq_addresses()` returns `tcp://host:0` placeholders on the TCP path.
  • `APIServerProcessManager` opens a `Pipe(duplex=False)` per child, collects the actual endpoints with `connection.wait([pipe, proc.sentinel], timeout=30)`, and writes them back into the `input_addresses` / `output_addresses` lists in place.
  • `MPClient` (child side) reads `getsockopt(zmq.LAST_ENDPOINT)` after `bind()`, sends via pipe.
  • `MPClient`'s else-branch (single-ApiServer inline DP=1) reads `last_endpoint` in place so that path is TOCTOU-free too.
  • Ray backend is explicitly gated to the legacy pre-allocate path because `launch_core_engines` spawns engine actors with the placeholder addresses before `APIServerProcessManager` collects real endpoints, which would break the Ray actor connect.
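
The bind-then-report flow described above can be sketched with the standard library alone. This is an illustration under loud assumptions, not the PR's code: plain TCP sockets stand in for ZMQ sockets, threads stand in for ApiServer child processes (so there is no `proc.sentinel` to wait on), and the names `child` and `launch` are hypothetical.

```python
import socket
import threading
from multiprocessing import Pipe
from multiprocessing.connection import wait


def child(send_end, stop: threading.Event) -> None:
    """Stand-in for an ApiServer child: bind first, then report the endpoint."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))          # kernel assigns a port the socket owns
        host, port = s.getsockname()
        send_end.send(f"tcp://{host}:{port}")
        send_end.close()
        stop.wait()                        # keep the socket open until told to exit


def launch(n: int, timeout: float = 30.0) -> list:
    """Stand-in for the launcher: one receive-only pipe per child."""
    stop = threading.Event()
    workers, readers, endpoints = [], [], []
    for _ in range(n):
        recv_end, send_end = Pipe(duplex=False)   # (receive end, send end)
        t = threading.Thread(target=child, args=(send_end, stop))
        t.start()
        workers.append(t)
        readers.append(recv_end)
    try:
        for recv_end in readers:
            if not wait([recv_end], timeout=timeout):
                raise TimeoutError("child did not report an endpoint")
            endpoints.append(recv_end.recv())
    finally:
        stop.set()
        for t in workers:
            t.join()
        for r in readers:
            r.close()
    return endpoints


endpoints = launch(4)
print(endpoints)
```

Because every child still holds its socket when the launcher reads the endpoints, the reported ports cannot collide with each other; the launcher never has to guess a free port in advance.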

Relation to #39166

#39166 addresses the same bug with a different strategy: pre-allocate plain TCP sockets at the launcher and hand the socket objects through (msgspec / Ray pickle) until ZMQ bind. That works but is more invasive (~416 lines across 8 files, adds `getstate`/`setstate` for socket serialization; bundles a separate DP master IP env-var fix).

This PR is 83 lines across 3 files, reuses the pattern already present in `DPCoordinator`, and doesn't require crossing the msgspec/Ray serialization boundary with live sockets. I'm not blocking on #39166 — flagging here so reviewers can pick whichever approach they prefer.

Test plan

Stress test on 4 nodes × 8 GPU, TP=2, DP=16, api_server_count=16, gpt-oss-120b (MoE):

| Scenario | Result |
| --- | --- |
| baseline (no fix), parallel 5 launches | 4/5 race (80% hit rate) |
| this PR, serial 10 launches | 10/10 ready |
| this PR, parallel 5 launches | 5/5 ready |

Also validated end-to-end by `POST /v1/chat/completions` returning a normal completion after startup.

Duplicate-work check

  • `gh issue view 40443 --repo vllm-project/vllm --comments` — no open comments resolving it yet.
  • `gh pr list --search "40443 in:body"` — no PR currently links the issue.
  • `gh pr list --search "ApiServer Address already in use"` — #39166 found; explicitly discussed above.

Disclosure

AI assistance was used (Claude) for code navigation, initial drafting, and test scripting. All changes were reviewed and tested by me end-to-end on the target workload.

Changed files

  • vllm/v1/engine/core_client.py (modified, +25/-0)
  • vllm/v1/engine/utils.py (modified, +16/-2)
  • vllm/v1/utils.py (modified, +42/-1)


Minimal reproduction (no GPUs, no multi-node)

Run anywhere vLLM is installed:

from vllm.v1.utils import get_engine_client_zmq_addr

def probe_once(n_addrs: int) -> bool:
    addrs = [
        get_engine_client_zmq_addr(local_only=False, host="127.0.0.1")
        for _ in range(n_addrs)
    ]
    return len(set(addrs)) < len(addrs)

for n in (8, 16, 32):
    trials = 1000
    dup = sum(probe_once(n) for _ in range(trials))
    print(f"n={n:3d} -> dup trials: {dup}/{trials} ({100*dup/trials:.1f}%)")

On my box:

n=  8 -> dup trials: 0/1000 (0.0%)
n= 16 -> dup trials: 8/1000 (0.8%)
n= 32 -> dup trials: 38/1000 (3.8%)

get_engine_zmq_addresses calls get_engine_client_zmq_addr 2 * num_api_servers times in a list comprehension to fill inputs=[...] and outputs=[...]. Each call goes through _get_open_port, which binds port 0, reads the number, closes, returns — so the port is back in the ephemeral pool before the next call, and the OS can hand the same port to two calls. In hybrid-LB multi-node those duplicates become two ApiServers ZMQ-binding the same TCP port, one succeeds, the other crashes.
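
To make the difference concrete, here is a minimal stdlib sketch that contrasts the probe-and-close pattern with simply keeping the sockets open. It does not call into vLLM; `probe_and_close` and `bind_and_hold` are hypothetical names that only mimic the behaviour described above.

```python
import socket


def probe_and_close() -> int:
    """Mimic the racy pattern: bind port 0, read the port, close the socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]   # the port is released once 's' closes


def bind_and_hold(n: int) -> list:
    """Keep every socket open so the kernel cannot hand out a port twice."""
    socks = []
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("127.0.0.1", 0))
        socks.append(s)
    return socks


probed_ports = [probe_and_close() for _ in range(32)]   # duplicates are possible
held = bind_and_hold(32)
held_ports = [s.getsockname()[1] for s in held]         # distinct while held open
for s in held:
    s.close()

print("probe-and-close duplicates:", len(probed_ports) - len(set(probed_ports)))
print("bind-and-hold duplicates:  ", len(held_ports) - len(set(held_ports)))
```

The first count may or may not be zero on a given run, depending on how the OS recycles ephemeral ports; the second is always zero because each port stays owned by an open socket.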

The low per-trial duplicate rate in this single-process probe (≤4%) is roughly consistent with the 20–80% hit rate we see in multi-node serve: DP barriers synchronize ApiServer init, so many calls arrive in rapid succession and the per-call duplicate probability compounds.
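
One simple way to read "compounds" is the probability of at least one collision across several opportunities. The sketch below assumes the opportunities are independent, which the issue does not claim; the 3.8% figure is the n=32 measurement above, and the number of opportunities is a purely hypothetical parameter.

```python
def p_any_collision(p_single: float, opportunities: int) -> float:
    """Chance of at least one collision, assuming independent opportunities."""
    return 1.0 - (1.0 - p_single) ** opportunities


# 0.038 is the measured n=32 rate; the opportunity counts are illustrative only.
for k in (1, 4, 16, 32):
    print(f"opportunities={k:2d} -> P(at least one collision) = "
          f"{p_any_collision(0.038, k):.3f}")
```

Under that assumption, a few-percent per-opportunity rate reaches tens of percent after a few dozen opportunities, which is the same order of magnitude as the observed launch failure rate.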

End-to-end reproduction

Multi-node jobs (kjobctl/SLURM-like), head node:

vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 2 --data-parallel-size 8 \
  --data-parallel-size-local 4 \
  --data-parallel-address <head-ip> --data-parallel-rpc-port 13345 \
  --enable-expert-parallel

Workers: same command with --headless --data-parallel-start-rank {4,8,12}. Also repros with Qwen3-30B-A3B-Instruct-2507 and gemma-4-31B-it.

Environment

  • vLLM v0.19.1 (hit rate much higher than on v0.19.0, though this path has been racy all along — it just wasn't triggering often before)
  • transformers 5.5.4
  • GPU: H200 SXM, CUDA 12.9
  • non-Ray DP (mp), TP=2, DP_LOCAL=4, 2/4 nodes

Likely cause

vllm/utils/network_utils.py::_get_open_port:

with socket.socket(AF_INET, SOCK_STREAM) as s:
    s.bind(("", 0))             # OS picks port X
    return s.getsockname()[1]   # 'with' closes socket → X goes back in the pool
                                 # caller then ZMQ-binds X separately

Classic TOCTOU. The probe socket closes before the caller rebinds via ZMQ, so another concurrent caller (or its own bind(0)) can grab the same port in the gap. SO_REUSEADDR isn't set on the ZMQ bind, so the second rebind hits EADDRINUSE.

Call path: vllm/v1/engine/core_client.py::DPLBAsyncMPClient.__init__ → make_zmq_socket(ctx, output_address, zmq.PULL), with output_address coming from get_engine_zmq_addresses → get_engine_client_zmq_addr → _get_open_port. With N ApiServers racing on the head, this collides whenever their timings coincide (e.g. when a cross-node DP barrier releases them together).
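
The failure mode itself is easy to reproduce without ZMQ. The following sketch uses two plain TCP sockets as stand-ins for two ApiServers that were handed the same port; `second_bind_errno` is a hypothetical helper, and neither socket sets SO_REUSEADDR.

```python
import errno
import socket
from typing import Optional


def second_bind_errno(host: str = "127.0.0.1") -> Optional[int]:
    """Bind one socket, then try to bind a second to the same port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))               # the "winner" owns the port
        port = first.getsockname()[1]
        try:
            second.bind((host, port))       # the "loser" was given the same port
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


err = second_bind_errno()
print("second bind failed with EADDRINUSE:", err == errno.EADDRINUSE)
```

This is the same error the ZMQ bind surfaces as `Address already in use` in the ApiServer traceback.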

Workaround

--api-server-count 1 — single ApiServer, no concurrent bind. Costs HTTP throughput.

Suggested fix

Let ZMQ wildcard-bind and read back the kernel-assigned endpoint in a single atomic step — no probe socket, no close/rebind window:

# before (racy)
addr = f"tcp://{host}:{_get_open_port()}"
sock.bind(addr)

# after (atomic)
sock.bind(f"tcp://{host}:*")
addr = sock.last_endpoint.decode()

Happy to put up a PR if this approach looks right.

extent analysis

TL;DR

The most likely fix is to stop pre-allocating ports with _get_open_port on this path and instead let ZMQ wildcard-bind and read back the kernel-assigned endpoint in a single atomic step.

Guidance

  • The issue is caused by a Time-of-Check-to-Time-of-Use (TOCTOU) bug in the _get_open_port function, where the port is released back to the pool before the caller can bind to it, allowing another concurrent caller to grab the same port.
  • To verify the issue, run the provided minimal reproduction code, which demonstrates the problem by probing for duplicate addresses.
  • A temporary workaround is to use the --api-server-count 1 flag, which runs a single ApiServer and so avoids the concurrent bind, at the cost of HTTP throughput.
  • To mitigate the issue at the source, have get_engine_zmq_addresses stop handing out pre-probed ports (PR #40596 returns tcp://host:0 placeholders instead), or use a centralized allocation mechanism that keeps each port reserved until its consumer binds it.

Example

The suggested fix replaces the _get_open_port probe at the bind site with a single atomic step that binds and then reads back the kernel-assigned endpoint:

# before (racy)
addr = f"tcp://{host}:{_get_open_port()}"
sock.bind(addr)

# after (atomic)
sock.bind(f"tcp://{host}:*")
addr = sock.last_endpoint.decode()

Notes

The provided fix approach looks promising, but it's essential to test and verify its correctness before merging it into the main codebase.

Recommendation

Apply the suggested fix by having the bind site wildcard-bind and read back the kernel-assigned endpoint instead of calling _get_open_port first, as this approach removes the close/rebind window that is the root cause of the issue.
