vllm: fix(nixl): Handshake race when same-node workers re-register with new engine IDs

GitHub stats
vllm-project/vllm#38840 (fetched 2026-04-08 02:34:35)
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Error Message

Error: AssertionError at nixl_connector.py:2108 — layer count mismatch during re-handshake

Bug

NIXL handshake assertion failure when a worker on the same node restarts and registers with a new engine ID.

Error: AssertionError at nixl_connector.py:2108 — layer count mismatch during re-handshake

Environment

  • vLLM 0.16.0 / 0.18.0 with NixlConnector
  • Same-node disaggregated workers (both prefill and decode on one p5.48xlarge)
  • Dynamo with etcd-based service discovery

Reproduction

  1. Deploy prefill + decode workers on the same node using NixlConnector
  2. Kill and restart one worker (it receives a new engine_id from the scheduler)
  3. The restarted worker's handshake conflicts with the stale metadata from the old engine_id

The old handshake entry still exists in the connector's metadata store. When the new worker registers from the same host:port but with a different engine_id, the layer count from the old registration does not match the new one, triggering the assertion at line 2108.
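The failure mode described above can be illustrated with a small toy model. This is only a sketch: `NaiveMetadataStore`, `HandshakeEntry`, the address, and the layer counts are hypothetical stand-ins and are not taken from nixl_connector.py. It shows how a store keyed by host:port that never drops old entries ends up comparing a new registration against metadata left by the previous engine.

```python
from dataclasses import dataclass


@dataclass
class HandshakeEntry:
    # Hypothetical record; the real connector metadata is richer than this.
    engine_id: str
    num_layers: int


class NaiveMetadataStore:
    """Toy store keyed by (host, port) that never drops entries from old engines."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, int], HandshakeEntry] = {}

    def register(self, host: str, port: int, engine_id: str, num_layers: int) -> None:
        key = (host, port)
        stale = self._entries.get(key)
        # Stand-in for the failing check in the report: the new registration
        # is validated against whatever the previous engine left behind.
        if stale is not None and stale.num_layers != num_layers:
            raise AssertionError("layer count mismatch during re-handshake")
        self._entries[key] = HandshakeEntry(engine_id, num_layers)


store = NaiveMetadataStore()
store.register("10.0.0.1", 5600, engine_id="engine-old", num_layers=32)

try:
    # Restarted worker: same host:port, new engine_id, different layer count.
    store.register("10.0.0.1", 5600, engine_id="engine-new", num_layers=48)
    failed = False
except AssertionError:
    failed = True
```

In this model the second `register` call fails, and the store still holds the entry for `engine-old`, which mirrors the stale state the report describes.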

Expected Behavior

Re-registration from the same host:port should invalidate the stale handshake entry and accept the new metadata cleanly.

Actual Behavior

AssertionError on layer count mismatch. The connector holds stale metadata from the previous engine instance.

Suggested Fix

When a new engine_id registers from a host:port that already has a handshake entry, invalidate the old entry before processing the new registration. This handles the common case of worker restarts in orchestrated environments (Kubernetes, Ray).
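A minimal sketch of the suggested behavior, assuming handshake entries can be keyed by host:port. `HandshakeRegistry`, `HandshakeEntry`, and their methods are hypothetical names for illustration; the actual data structures in nixl_connector.py may be organized differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandshakeEntry:
    # Hypothetical record; not the connector's real metadata type.
    engine_id: str
    num_layers: int


class HandshakeRegistry:
    """Toy registry that invalidates a stale entry on re-registration."""

    def __init__(self) -> None:
        self._by_addr: dict[tuple[str, int], HandshakeEntry] = {}

    def register(self, host: str, port: int, engine_id: str,
                 num_layers: int) -> Optional[str]:
        """Store the new entry; return the engine_id that was invalidated, if any."""
        key = (host, port)
        old = self._by_addr.get(key)
        invalidated = None
        if old is not None and old.engine_id != engine_id:
            # Same host:port but a different engine_id: treat the old entry
            # as belonging to a dead worker and drop it before registering.
            del self._by_addr[key]
            invalidated = old.engine_id
        self._by_addr[key] = HandshakeEntry(engine_id, num_layers)
        return invalidated

    def lookup(self, host: str, port: int) -> Optional[HandshakeEntry]:
        return self._by_addr.get((host, port))


registry = HandshakeRegistry()
first = registry.register("10.0.0.1", 5600, "engine-old", 32)
second = registry.register("10.0.0.1", 5600, "engine-new", 48)
current = registry.lookup("10.0.0.1", 5600)
```

Returning the invalidated engine_id lets a caller release any resources tied to the old engine. Whether the real connector needs that extra cleanup is not stated in the report.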

Impact

Breaks NIXL connectivity on pod restart or worker crash recovery. Requires full teardown and redeploy of both workers to recover.

extent analysis

TL;DR

Invalidate the old handshake entry when a new engine_id registers from the same host:port to resolve the layer count mismatch assertion failure.

Guidance

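  • Locate the assertion reported at nixl_connector.py:2108 (the line number may differ between vLLM versions) and trace back to the code path that registers handshake metadata, since that is where a new engine_id arriving from a host:port with an existing handshake entry needs to be handled.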
  • Implement a check to invalidate the old handshake entry before processing the new registration to prevent stale metadata from causing the assertion failure.
  • Verify that the fix works by reproducing the error and checking that the new worker can register cleanly without triggering the AssertionError.
  • Consider adding logging or monitoring to detect and handle similar issues in the future, especially in environments where worker restarts are common.
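The logging suggestion in the last bullet might look like the following sketch. The logger name, the `note_reregistration` helper, and the counter dictionary are all hypothetical; only the standard-library `logging` module is used.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("nixl_handshake_sketch")


def note_reregistration(addr, old_engine_id, new_engine_id, counters):
    """Log and count a re-registration from an address that changed engine_id.

    Returns True when a stale entry was detected, False otherwise.
    """
    if old_engine_id is None or old_engine_id == new_engine_id:
        return False
    counters[addr] = counters.get(addr, 0) + 1
    logger.warning(
        "Invalidating stale handshake for %s: engine_id %s -> %s (restart #%d)",
        addr, old_engine_id, new_engine_id, counters[addr],
    )
    return True


restart_counts = {}
addr = ("10.0.0.1", 5600)
first_seen = note_reregistration(addr, None, "engine-old", restart_counts)
restarted = note_reregistration(addr, "engine-old", "engine-new", restart_counts)
```

A per-address counter like this makes repeated restarts of the same worker visible, which can help distinguish a single crash recovery from a crash loop.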

Example

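# Pseudo-code sketch: invalidate a stale handshake entry before re-registering.
# All names are illustrative, not actual nixl_connector.py identifiers.
existing_entry = handshake_entries.get((host, port))
if existing_entry is not None and existing_entry.engine_id != new_engine_id:
    # Same host:port but a different engine_id: the old entry is stale
    invalidate_handshake_entry(existing_entry)
# Process the new registration whether or not an old entry existed
process_new_registration(new_engine_id)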

Notes

This fix assumes that the nixl_connector.py code has the necessary functionality to invalidate handshake entries and handle new registrations. The exact implementation details may vary depending on the specific requirements and constraints of the system.

Recommendation

Modify the registration path in nixl_connector.py so that the old handshake entry is invalidated when a new engine_id registers from the same host:port. This targets the root cause described in the report, stale metadata from the previous engine instance, rather than the assertion itself.
