vllm - Fix NIXL KV transfer crash with asymmetric TP (prefill TP=4, decode TP=1)

Error Message

File "/opt/vllm-source/vllm/v1/worker/gpu_model_runner.py", line 2671, in execute_model
    return self.kv_connector_no_forward(
File "/opt/vllm-source/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 87, in kv_connector_no_forward
    KVConnectorModelRunnerMixin._get_kv_connector_output(
File "/opt/vllm-source/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 137, in _get_kv_connector_output
    kv_connector.get_finished(scheduler_output.finished_req_ids)
File "/opt/vllm-source/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 265, in get_finished
    return self.connector_worker.get_finished()
File "/opt/vllm-source/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 1758, in get_finished
    block_size_ratio = self.kv_topo.block_size_ratio_from_engine_id(
File "/opt/vllm-source/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 762, in block_size_ratio_from_engine_id
    remote_block_size = self.remote_block_size[remote_engine_id]
KeyError: 'f6c8f19f-4e41-41e4-a41a-b535b09adb00'

Root Cause

The decode pod's NIXL KVTopo.remote_block_size dict does not contain the prefill engine's ID. During the NIXL handshake, the prefill engine (TP=4) registers with a different block size than the decode engine (TP=1) expects. The handshake then either fails silently or the registration is rejected because of the TP mismatch, so remote_block_size never gets an entry for that engine ID.

When get_finished() calls block_size_ratio_from_engine_id() for a completed KV transfer, the lookup of the remote engine ID raises KeyError. The uncaught exception surfaces as EngineDeadError, and all subsequent requests fail.
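The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not vLLM code: the class KVTopoSketch, its methods, and the definition of the ratio (local block size divided by remote block size) are assumptions made for illustration. Only the attribute name remote_block_size and the plain dict indexing are taken from the traceback.

```python
class KVTopoSketch:
    """Hypothetical stand-in for the connector's topology bookkeeping."""

    def __init__(self, local_block_size: int) -> None:
        self.local_block_size = local_block_size
        # Filled in only when a handshake with a remote engine succeeds.
        self.remote_block_size: dict[str, int] = {}

    def register_remote(self, engine_id: str, block_size: int) -> None:
        """Record a remote engine after a successful handshake."""
        self.remote_block_size[engine_id] = block_size

    def block_size_ratio_unguarded(self, engine_id: str) -> int:
        # Same pattern as the failing line: plain indexing raises KeyError
        # when the engine was never registered.
        return self.local_block_size // self.remote_block_size[engine_id]

    def block_size_ratio_guarded(self, engine_id: str) -> int:
        # Assumed alternative: turn the missing entry into a descriptive
        # error that a caller could handle per request.
        remote = self.remote_block_size.get(engine_id)
        if remote is None:
            raise RuntimeError(
                f"no handshake recorded for remote engine {engine_id!r}"
            )
        return self.local_block_size // remote


topo = KVTopoSketch(local_block_size=16)  # 16 is an arbitrary example value
try:
    topo.block_size_ratio_unguarded("f6c8f19f-4e41-41e4-a41a-b535b09adb00")
except KeyError as exc:
    print("unguarded lookup failed:", exc)
```

Whether a guarded lookup is the right fix depends on why the handshake entry is missing in the first place; the sketch only shows that an unregistered engine ID is enough to trigger the crash.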

Bug: NIXL KeyError in block_size_ratio_from_engine_id with asymmetric TP

Description

When running PD disaggregation with asymmetric TP (prefill TP=4, decode TP=1), the decode pod crashes with a KeyError in nixl_connector.py during KV transfer. The decode engine can't find the prefill engine's block size in its remote_block_size map.

The crash happens on the first request that completes prefill and attempts KV handoff to decode. All decode pods crash, leaving only prefill pods running.

Reproduction

Model: RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8-dynamic (dense, NOT MoE)
Architecture: PD disaggregation (separate prefill and decode LWS)
Prefill: 3 pods × TP=4 (12 GPUs)
Decode: 4 pods × TP=1 (4 GPUs)
GPU: NVIDIA H200 141GB
KV Connector: NixlConnector, kv_role=kv_both
max_model_len: 2205
Workload: 100 concurrent users, ISL=2000, OSL=100

Symmetric TP configurations (TP=1/TP=1, TP=2/TP=2, TP=4/TP=4) work fine. The crash only occurs when prefill_tp > decode_tp.
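The report does not include the launch commands. As a rough sketch of how the settings above map onto a vLLM invocation, the snippet below builds one command per role. It assumes the `vllm serve` CLI with the `--tensor-parallel-size`, `--max-model-len`, and `--kv-transfer-config` flags; the helper build_serve_command is hypothetical, and the reporter's actual deployment (llm-d on LWS) may pass further options that are not shown here.

```python
import json
import shlex

MODEL = "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8-dynamic"
# Connector settings as listed in the reproduction details.
KV_TRANSFER_CONFIG = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}


def build_serve_command(tp_size: int) -> list[str]:
    """Hypothetical helper: assemble a serve command for one role."""
    return [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(tp_size),
        "--max-model-len", "2205",
        "--kv-transfer-config", json.dumps(KV_TRANSFER_CONFIG),
    ]


prefill_cmd = build_serve_command(tp_size=4)  # prefill pods: TP=4
decode_cmd = build_serve_command(tp_size=1)   # decode pods: TP=1

print(shlex.join(prefill_cmd))
print(shlex.join(decode_cmd))
```

Setting both tp_size values to the same number corresponds to the symmetric configurations that are reported to work.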

Impact

All decode pods crash on the first KV transfer attempt
5884 out of 5897 requests errored in our benchmark run
Only ~13 requests completed (those served directly by prefill pods before any KV handoff)
The crash is deterministic for any prefill_tp > decode_tp configuration

Environment

vLLM: 0.1.dev1+g2b51d23f6 (precompiled, from ghcr.io/llm-d/llm-d-cuda:v0.4.0)
NIXL: 0.8.0
GPU: NVIDIA H200 141GB
Kubernetes: CoreWeave (vanilla K8s)
LWS: LeaderWorkerSet v1

Expected Behavior

Asymmetric TP should either:

Work correctly — NIXL handles the block size mapping between different TP values
Fail gracefully during handshake — reject the registration and log a warning instead of crashing at runtime
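The second option, rejecting an incompatible engine at handshake time, could look roughly like the sketch below. Everything in it is hypothetical: RemoteMeta, HandshakeRegistry, and their fields are illustrative names, not NIXL or vLLM APIs. The compatibility rule is passed in as a callable because the report does not say which TP combinations NIXL is meant to support; the usage example plugs in "same TP only" purely because symmetric configurations are the ones reported to work.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable

logger = logging.getLogger("kv_handshake_sketch")


@dataclass(frozen=True)
class RemoteMeta:
    """Hypothetical handshake payload; field names are illustrative."""
    engine_id: str
    tp_size: int
    block_size: int


@dataclass
class HandshakeRegistry:
    """Hypothetical registry that rejects incompatible engines up front."""
    is_compatible: Callable[[RemoteMeta], bool]
    remote_block_size: dict[str, int] = field(default_factory=dict)
    rejected: set[str] = field(default_factory=set)

    def try_register(self, meta: RemoteMeta) -> bool:
        """Record the engine, or log a warning and remember the rejection."""
        if not self.is_compatible(meta):
            logger.warning(
                "rejecting remote engine %s (tp_size=%d, block_size=%d)",
                meta.engine_id, meta.tp_size, meta.block_size,
            )
            self.rejected.add(meta.engine_id)
            return False
        self.remote_block_size[meta.engine_id] = meta.block_size
        return True

    def can_transfer_from(self, engine_id: str) -> bool:
        """True only for engines whose handshake was accepted."""
        return engine_id in self.remote_block_size


LOCAL_TP = 1  # decode side in the reported setup
registry = HandshakeRegistry(is_compatible=lambda m: m.tp_size == LOCAL_TP)
accepted = registry.try_register(
    RemoteMeta("prefill-0", tp_size=4, block_size=16)  # example values
)
print("accepted:", accepted)
```

With a check like can_transfer_from() in front of each transfer, a rejected engine would cause individual requests to fail with a clear reason instead of killing the decode engine.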
