vllm - 💡(How to fix) Fix NIXL KV transfer crash with asymmetric TP (prefill TP=4, decode TP=1) — vLLM 0.1.dev1+g2b51d23f6, NIXL 0.8.0

When running PD disaggregation with asymmetric TP (prefill TP=4, decode TP=1), the decode pod crashes with a KeyError in nixl_connector.py during KV transfer. The decode engine can't find the prefill engine's block size in its remote_block_size map.

The crash happens on the first request that completes prefill and attempts KV handoff to decode. All decode pods crash, leaving only prefill pods running.

Error Message

File "/opt/vllm-source/vllm/v1/worker/gpu_model_runner.py", line 2671, in execute_model
    return self.kv_connector_no_forward(
File "/opt/vllm-source/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 87, in kv_connector_no_forward
    KVConnectorModelRunnerMixin._get_kv_connector_output(
File "/opt/vllm-source/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 137, in _get_kv_connector_output
    kv_connector.get_finished(scheduler_output.finished_req_ids)
File "/opt/vllm-source/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 265, in get_finished
    return self.connector_worker.get_finished()
File "/opt/vllm-source/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 1758, in get_finished
    block_size_ratio = self.kv_topo.block_size_ratio_from_engine_id(
File "/opt/vllm-source/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 762, in block_size_ratio_from_engine_id
    remote_block_size = self.remote_block_size[remote_engine_id]
KeyError: 'f6c8f19f-4e41-41e4-a41a-b535b09adb00'

Root Cause

The decode pod's NIXL KVTopo.remote_block_size dict doesn't contain the prefill engine's ID. The likely cause is the NIXL handshake: the prefill engine (TP=4) registers with a block size that differs from what the decode engine (TP=1) expects, and the handshake either fails silently or the registration is rejected because of the TP mismatch. Either way, no remote_block_size entry is recorded for that engine ID.

When get_finished() tries to compute block_size_ratio_from_engine_id() for a completed KV transfer, it can't find the remote engine → KeyError → EngineDeadError → all subsequent requests fail.
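
The failure pattern can be illustrated in isolation. The following is a minimal sketch, not vLLM's code: ToyKVTopo, UnknownRemoteEngineError, and both method names are hypothetical stand-ins, and the ratio is assumed to be the local block size divided by the remote one. It contrasts the bare dict lookup seen in the traceback with a lookup that raises a descriptive error when no handshake entry exists.

```python
from dataclasses import dataclass, field


class UnknownRemoteEngineError(RuntimeError):
    """Raised when a transfer references an engine with no recorded handshake."""


@dataclass
class ToyKVTopo:
    """Hypothetical stand-in for the connector's topology bookkeeping."""

    block_size: int
    remote_block_size: dict[str, int] = field(default_factory=dict)

    def block_size_ratio_unchecked(self, remote_engine_id: str) -> int:
        # Same shape as the failing line in the traceback: a bare dict lookup
        # that raises KeyError if the handshake never recorded this engine.
        remote = self.remote_block_size[remote_engine_id]
        # Assumption: the ratio is local block size over remote block size.
        return self.block_size // remote

    def block_size_ratio_checked(self, remote_engine_id: str) -> int:
        # Defensive variant: report which engine is missing and which are known.
        remote = self.remote_block_size.get(remote_engine_id)
        if remote is None:
            known = sorted(self.remote_block_size)
            raise UnknownRemoteEngineError(
                f"no block size recorded for engine {remote_engine_id!r}; "
                f"known engines: {known}"
            )
        return self.block_size // remote
```

A descriptive error like this would not make asymmetric TP work, but it would point at the missing handshake entry instead of surfacing as an unexplained KeyError.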

Bug: NIXL KeyError in block_size_ratio_from_engine_id with asymmetric TP

Reproduction

  • Model: RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8-dynamic (dense, NOT MoE)
  • Architecture: PD disaggregation (separate prefill and decode LWS)
  • Prefill: 3 pods × TP=4 (12 GPUs)
  • Decode: 4 pods × TP=1 (4 GPUs)
  • GPU: NVIDIA H200 141GB
  • KV Connector: NixlConnector, kv_role=kv_both
  • max_model_len: 2205
  • Workload: 100 concurrent users, ISL=2000, OSL=100

Symmetric TP configurations (TP=1/TP=1, TP=2/TP=2, TP=4/TP=4) work fine. The crash only occurs when prefill_tp > decode_tp.
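
The report does not include the launch commands, so the following is only a sketch of how the two roles might be started with the settings listed above. The helper name build_serve_args is hypothetical, and the real deployment (llm-d on LeaderWorkerSet) almost certainly passes additional flags and environment variables that are not shown here.

```python
import json


def build_serve_args(model: str, tp_size: int, max_model_len: int = 2205) -> list[str]:
    """Assemble a `vllm serve` argument list for one role (hypothetical helper)."""
    # Connector settings taken from the reproduction list above.
    kv_transfer_config = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--max-model-len", str(max_model_len),
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]


MODEL = "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8-dynamic"

# Asymmetric setup from the report: prefill TP=4, decode TP=1.
prefill_args = build_serve_args(MODEL, tp_size=4)
decode_args = build_serve_args(MODEL, tp_size=1)
```

Setting both tp_size values to the same number gives the symmetric configurations that are reported to work.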

Impact

  • All decode pods crash on the first KV transfer attempt
  • 5884 out of 5897 requests errored in our benchmark run
  • Only 13 requests completed (those served directly by prefill pods before any KV handoff)
  • The crash is deterministic for any prefill_tp > decode_tp configuration

Environment

  • vLLM: 0.1.dev1+g2b51d23f6 (precompiled, from ghcr.io/llm-d/llm-d-cuda:v0.4.0)
  • NIXL: 0.8.0
  • GPU: NVIDIA H200 141GB
  • Kubernetes: CoreWeave (vanilla K8s)
  • LWS: LeaderWorkerSet v1

Expected Behavior

Asymmetric TP should either:

  1. Work correctly — NIXL handles the block size mapping between different TP values
  2. Fail gracefully during handshake — reject the registration and log a warning instead of crashing at runtime
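
Option 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the connector's actual handshake: register_remote_engine and its parameters are hypothetical, and the compatibility rule (one block size must be an integer multiple of the other) is assumed for the example.

```python
import logging

logger = logging.getLogger("kv_handshake_sketch")


def register_remote_engine(
    registry: dict[str, int],
    local_block_size: int,
    engine_id: str,
    remote_block_size: int,
) -> bool:
    """Record a remote engine's block size, or reject it with a warning.

    Returns True if the engine was registered, False if it was rejected.
    """
    # Assumed rule: sizes are compatible when one evenly divides the other.
    compatible = remote_block_size > 0 and (
        local_block_size % remote_block_size == 0
        or remote_block_size % local_block_size == 0
    )
    if not compatible:
        logger.warning(
            "Rejecting handshake from engine %s: remote block size %d is "
            "incompatible with local block size %d",
            engine_id, remote_block_size, local_block_size,
        )
        return False
    registry[engine_id] = remote_block_size
    return True
```

With a check like this, a caller can refuse to schedule transfers from a rejected engine instead of discovering the missing entry later in get_finished().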
