vllm - ✅(Solved) Fix [Bug]: data_parallel_rpc_port is not robust to invalid traffic and can crash multi-node startup [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#38677 · Fetched 2026-04-08 01:58:36
Comments: 0 · Participants: 1 · Timeline events: 3 · Reactions: 0
Timeline (top): cross-referenced ×1, labeled ×1, referenced ×1

Error Message

RuntimeError: Message from engine with unexpected data parallel rank: <very large integer>

Root Cause

The root cause is that a metrics scraper probed port 13345, which is a ZMQ handshake port rather than an HTTP endpoint, and vLLM tried to decode the resulting frame as an integer.

Fix Action

Fix / Workaround

During startup, the leader pod fails with logs like:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 114, in cmd
    run_multi_api_server(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 282, in run_multi_api_server
    with launch_core_engines(
         ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
    wait_for_engine_startup(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1042, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Message from engine with unexpected data parallel rank: 1686319630778293995078299103167563579499603905256250717064909206856998694298046203802202091868134188419865871146820170319286026737070985330274755728732960911242422067284

PR fix notes

PR #38731: [Bugfix] Harden DP handshake port against non-engine traffic

Description (problem / solution / changelog)

Summary

In multi-node DP serving on Kubernetes, the ZMQ handshake port (--data-parallel-rpc-port) crashes startup if it receives a non-engine TCP connection (e.g. Prometheus metrics scraper, HTTP health check). The raw bytes get decoded as an integer identity, producing:

RuntimeError: Message from engine with unexpected data parallel rank: 1686319630778293995...

This PR makes the handshake loop resilient to stray traffic:

  • Catch ZMQError on recv_multipart
  • Reject frames that don't match the expected 2-part (identity, payload) layout
  • Log a warning and continue (instead of crashing) when the identity doesn't match any known engine

All three paths log at WARNING level so operators can spot misconfigured scrapers without losing the serving pod.
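
The frame-layout and unknown-identity checks described above can be sketched in isolation. This is a minimal illustration, not the code from the PR: the function name `parse_engine_rank`, the `known_ranks` parameter, and the little-endian byte order are assumptions made for the example. The ZMQError path is left out because it needs a live socket.

```python
import logging
from typing import Optional, Sequence, Set

logger = logging.getLogger("dp_handshake_sketch")


def parse_engine_rank(
    frames: Sequence[bytes],
    known_ranks: Set[int],
    byteorder: str = "little",  # assumption: byte order is not stated in the source
) -> Optional[int]:
    """Return the engine rank for a well-formed handshake message, else None."""
    # Reject anything that is not exactly (identity, payload).
    if len(frames) != 2:
        logger.warning(
            "Ignoring message with %d frame(s) on the handshake port", len(frames)
        )
        return None

    identity, _payload = frames
    rank = int.from_bytes(identity, byteorder)

    # Stray HTTP bytes decode to a huge integer that matches no engine.
    if rank not in known_ranks:
        logger.warning(
            "Ignoring message with unknown identity (%d bytes)", len(identity)
        )
        return None
    return rank
```

A caller would keep looping when this returns `None` instead of raising, which is the behavioral change the summary describes.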

Fixes #38677

Test plan

  • ruff check and ruff format --check pass
  • Verified the crash traceback from the issue matches the changed code path
  • The fix only touches the error-handling branch; the happy path (valid engine traffic) is unchanged
  • Manual verification: would require a multi-node DP setup with a metrics scraper hitting the RPC port

Changed files

  • vllm/v1/engine/utils.py (modified, +21/-3)

Your current environment

vLLM: current main

Deployment: Kubernetes + LWS, Nodes: 2

DP: 16, Local DP per node: 8, TP: 1, EP enabled, all2all backend: deepep_low_latency

vllm serve /??? \
      --served-model-name /??? \
      --tensor-parallel-size 1 \
      --data-parallel-size 16 \
      --data-parallel-size-local 8 \
      --enable-expert-parallel \
      --all2all-backend deepep_low_latency \
      --data-parallel-address ${LWS_LEADER_ADDRESS} \
      --data-parallel-rpc-port 13345 \
      --data-parallel-hybrid-lb \
      --api-server-count 8 \
      --trust-remote-code \
      --port 8000 \
      --nnodes ${LWS_GROUP_SIZE} \
      --node-rank ${LWS_WORKER_INDEX}

🐛 Describe the bug

In multi-node DP serving, vLLM crashes during engine startup if the DP handshake port (--data-parallel-rpc-port) receives a non-engine TCP connection such as an HTTP metrics scrape.

Instead of ignoring or rejecting invalid traffic, the launcher interprets the incoming bytes as an engine identity and raises:

RuntimeError: Message from engine with unexpected data parallel rank: <very large integer>

During startup, the leader pod fails with logs like:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 114, in cmd
    run_multi_api_server(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 282, in run_multi_api_server
    with launch_core_engines(
         ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 972, in launch_core_engines
    wait_for_engine_startup(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1042, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Message from engine with unexpected data parallel rank: 1686319630778293995078299103167563579499603905256250717064909206856998694298046203802202091868134188419865871146820170319286026737070985330274755728732960911242422067284

The large integer decodes to the bytes of an HTTP request sent to the DP RPC port, for example:

T /metrics HTTP/1.1
Host: 172.26.43.196:13345
User-Agent: vm_promscr
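
To see why an HTTP request turns into such a large "rank", it helps to round-trip a few bytes through an integer. The sketch below assumes little-endian decoding, which the issue text does not state; the helper names are hypothetical.

```python
def identity_to_int(raw: bytes, byteorder: str = "little") -> int:
    """Interpret raw frame bytes as an unsigned integer."""
    return int.from_bytes(raw, byteorder)


def int_to_identity(value: int, byteorder: str = "little") -> bytes:
    """Recover the bytes that produced a given integer."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, byteorder)


# First line of the HTTP request quoted in the issue.
sample = b"T /metrics HTTP/1.1"
as_int = identity_to_int(sample)

print(as_int)                   # a very large integer, not a plausible DP rank
print(int_to_identity(as_int))  # the original request bytes
```

Running `int_to_identity` on the integer from the traceback, with whichever byte order the decoder actually uses, is a quick way to identify the client that hit the port.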

The root cause is that a metrics scraper probed port 13345, which is a ZMQ handshake port rather than an HTTP endpoint, and vLLM tried to decode the resulting frame as an integer.

https://github.com/vllm-project/vllm/blob/b5e608258e7b5e4abadf84ffee36e584d7e00b7d/vllm/v1/engine/utils.py#L1138-L1145

I am not saying the RPC port should be treated as a metrics endpoint. The issue is that invalid traffic on that port currently crashes startup instead of being safely rejected.


extent analysis

TL;DR

The vLLM startup crash can be fixed by safely rejecting or ignoring invalid traffic on the DP handshake port (--data-parallel-rpc-port) instead of interpreting it as an engine identity.

Guidance

  • Modify the vllm launcher to ignore or reject non-engine TCP connections on the DP handshake port to prevent crashes during engine startup.
  • Consider adding a validation check for incoming traffic on the DP handshake port to ensure it conforms to the expected engine identity format.
  • Review wait_for_engine_startup (called from launch_core_engines) in vllm/v1/engine/utils.py, which is where the traceback raises the RuntimeError, so that an unexpected data parallel rank is logged and skipped rather than raised.
  • Ensure that the metrics scraper is configured to target the correct HTTP endpoint, rather than the ZMQ handshake port.
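
On the operator side, the last item can be checked before rollout with a few lines of Python. This is a hypothetical pre-flight helper, not part of vLLM or of any scraper. The target list reuses the address and ports that appear in this issue.

```python
from typing import Iterable, List


def targets_hitting_port(targets: Iterable[str], port: int) -> List[str]:
    """Return the "host:port" targets that point at the given port."""
    flagged = []
    for target in targets:
        _host, sep, port_text = target.rpartition(":")
        if sep and port_text.isdigit() and int(port_text) == port:
            flagged.append(target)
    return flagged


# 8000 is the --port value and 13345 the --data-parallel-rpc-port value
# from the serve command in this issue.
scrape_targets = ["172.26.43.196:8000", "172.26.43.196:13345"]
print(targets_hitting_port(scrape_targets, 13345))
```

Any target this returns should be removed from the scrape configuration or moved to the HTTP port.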

Example

The issue itself does not include a patch. The change described in PR #38731 touches vllm/v1/engine/utils.py.

Notes

The solution requires careful handling of invalid traffic on the DP handshake port to prevent crashes and ensure safe rejection or ignoring of non-engine TCP connections.

Recommendation

Until a vLLM build containing the fix from PR #38731 is deployed, keep metrics scrapers and HTTP health checks off the DP handshake port (--data-parallel-rpc-port) and point them at the correct HTTP endpoint instead. Stray traffic on the handshake port can otherwise crash multi-node startup.
