vllm - ✅(Solved) Fix [RFC]: [KV Connector]: Support KV push from Prefill to Decode node using Nixl Connector [1 pull requests, 10 comments, 2 participants]

vllm-project/vllm#36923, fetched 2026-04-08 00:43:34


PR fix notes

PR #35264: [KV Connector]: Support KV push from Prefill to Decode node using Nixl KV Connector

Description (problem / solution / changelog)


Implemented a KV push feature in which the Prefill node pushes its KV blocks to the Decode node as soon as the model executor completes the forward pass and finishes the request. The implementation supports heterogeneous TP and heterogeneous block sizes between P and D nodes.

It covers both scenarios:

  • Scenario 1: D registers blocks with P before P finishes generating KV
  • Scenario 2: P has the KV ready before D registers

Purpose

To improve TTFT for P-D disaggregated inference deployment.

Test Plan

Manually tested inference on a P-D disaggregated setup.

Test Result

Tested a 1P1D configuration, pushing KV from the prefill node to the decode node, and validated the accuracy of the results.

Benchmarking results

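I ran vllm bench serve with the sonnet dataset for different input and output token lengths. KV push mode (this PR) showed 1.2x - 3.0x improvements over pull mode across different input and output lengths. The numbers below were measured on an AWS P5en instance; similar improvements were observed on AWS Trn2 instances, where the feature was originally developed.

| Mode | Prefill/Decode TP | Input Len | Output Len | QPS | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Req/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pull | 4 | 512 | 64 | 4 | 110.94 | 85.07 | 740.03 | 12.51 | 3.86 |
| push | 4 | 512 | 64 | 4 | 74.46 | 77.13 | 100.24 | 12.50 | 3.87 |
| pull | 4 | 512 | 128 | 4 | 301.92 | 89.22 | 1125.98 | 12.66 | 3.63 |
| push | 4 | 512 | 128 | 4 | 93.77 | 79.95 | 302.37 | 12.59 | 3.75 |
| pull | 4 | 512 | 128 | 8 | 2948.11 | 2682.32 | 7258.96 | 12.70 | 4.65 |
| push | 4 | 512 | 128 | 8 | 1212.48 | 1275.05 | 2826.72 | 12.67 | 5.92 |
| pull | 4 | 1024 | 128 | 4 | 330.94 | 91.26 | 1163.70 | 12.74 | 3.62 |
| push | 4 | 1024 | 128 | 4 | 273.83 | 84.10 | 1096.88 | 12.79 | 3.67 |
| pull | 4 | 2048 | 128 | 4 | 352.22 | 98.53 | 1225.20 | 12.88 | 3.61 |
| push | 4 | 2048 | 128 | 4 | 239.65 | 88.23 | 1042.32 | 12.85 | 3.73 |
| pull | 8 | 512 | 64 | 8 | 85.77 | 65.69 | 257.58 | 8.95 | 7.62 |
| push | 8 | 512 | 64 | 8 | 59.41 | 60.32 | 71.88 | 8.84 | 7.63 |
| pull | 8 | 512 | 128 | 16 | 3672.89 | 3628.99 | 7880.98 | 9.07 | 6.45 |
| push | 8 | 512 | 128 | 16 | 1653.70 | 1791.73 | 3681.04 | 9.06 | 9.04 |
| pull | 8 | 1024 | 128 | 8 | 986.94 | 776.67 | 2764.92 | 9.10 | 6.03 |
| push | 8 | 1024 | 128 | 8 | 674.82 | 551.53 | 2082.18 | 9.09 | 6.37 |
| pull | 8 | 2048 | 128 | 8 | 1048.93 | 828.82 | 2905.03 | 9.20 | 5.99 |
| push | 8 | 2048 | 128 | 8 | 648.08 | 371.25 | 1859.21 | 9.19 | 6.46 |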

TODO:

  1. Run unit tests

Phase2:

  1. Add layer-wise KV push to bring out the actual performance gains from kv push model


Changed files

  • vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (modified, +961/-66)


Motivation.

The current vLLM P/D disaggregation architecture uses a pull-based KV transfer model: the Decode (D) node initiates a NIXL READ to pull KV blocks from the Prefill (P) node after P finishes computation. While functional, this approach has a key limitation: D must wait for P to finish before it can begin transferring KV data, which adds latency to the critical path.

In the pull model, the sequence is strictly serial:

  1. Proxy sends request to P
  2. P computes KV and finishes the request
  3. Proxy sends request to D with P's kv_transfer_params
  4. D allocates blocks, handshakes with P, and initiates NIXL READ
  5. D begins decoding after transfer completes

This serialization means D sits idle during P's computation phase. For large prompts or high-TP configurations, this idle time is significant.

Advantages of Push Mode

Push-based KV transfer builds on the existing pull infrastructure and offers several benefits:

1. Reduced TTFT through concurrent operation
With push, D can register its blocks with P while P is still computing. The moment P finishes, it WRITEs KV directly to D's pre-registered memory -- no round-trip through the proxy to relay kv_transfer_params. This overlaps block allocation with computation and shortens the critical path to first token.

2. Natural fit for pipelined (layer-by-layer) transfer
Push is a stepping stone toward layer-wise KV transfer, where P sends layer 0's KV while layer 1 is still computing. Since P knows exactly when each layer completes, it can initiate a WRITE immediately, overlapping communication with computation without requiring D to poll for readiness.

3. P controls NIC scheduling under fan-out
When P pushes to multiple D nodes, it decides when and how each WRITE is issued, enabling pacing and prioritization. This gives P explicit control over its outbound NIC bandwidth rather than servicing concurrent inbound READ requests from multiple D nodes.

4. Proxy exits the transfer critical path
Push allows the proxy to dispatch to P and D simultaneously. D registers blocks directly with P via a ZMQ side channel, and P pushes when ready. The transfer coordination happens peer-to-peer between P and D, removing the proxy as a serialization point.

5. Faster KV block release on P
P can free its KV blocks immediately after the WRITE completes, rather than holding them until D finishes reading. This improves P's memory utilization, especially under high request concurrency.

6. Amortized transfer latency for long contexts
For long-context requests (100K+ tokens), KV caches can reach gigabytes. Because D's blocks are pre-registered before P finishes, push hides part of the transfer setup behind P's computation, reducing the effective transfer time visible on the critical path.

Proposed Change.

Implement a push-based KV transfer mode in the Nixl KV Connector where P initiates NIXL WRITE operations to transfer KV blocks to D, complementing the existing pull-based READ mode.

Architecture Overview

                 +----------------------------------------------+
                 |                Proxy Server                  |
                 |  1. Send to P (do_remote_decode=True)        |
                 |  2. Send to D (remote_block_ids=[], push)    |
                 |     (can be simultaneous)                    |
                 +----------+--------------+--------------------+
                            |              |
                 +----------v------+  +----v---------------+
                 |   P (Prefill)   |  |   D (Decode)       |
                 |                 |  |                    |
                 | 1. Compute KV   |  | 1. Allocate blocks |
                 | 2. Finish req   |  | 2. REGISTER_BLOCKS |
                 |                 |  |    -> P scheduler  |
                 | <---------------+--+                    |
                 | 3. Push KV -----+-->  3. Receive KV     |
                 |    (NIXL WRITE) |  | 4. Begin decode    |
                 +-----------------+  +--------------------+

Timing Comparison: Pull vs Push

PULL (current):
  time ----->
  P:  [====== compute ======]
  Proxy:                      [relay params]
  D:                                        [alloc]-[handshake]-[RDMA READ]-[decode]
      |                      |              |                   |           |
      t0                     t1             t2                  t3         TTFT

PUSH (proposed):
  time ----->
  P:  [====== compute ======][NIXL WRITE]
  D:  [alloc + register]                 [receive]-[decode]
      |                      |           |         |
      t0                     t1          t2       TTFT

In pull mode, TTFT = t0 -> t1 (P compute) + t1 -> t2 (proxy relay) + t2 -> t3 (D alloc + handshake + RDMA READ) + decode start.

In push mode, D's alloc + register overlaps with P's compute. Once P finishes, it pushes immediately. TTFT = t0 -> t1 (P compute) + t1 -> t2 (NIXL WRITE) + decode start. The proxy relay and D-side alloc/handshake are removed from the critical path.
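The two critical-path formulas above can be checked with a little arithmetic. The sketch below is illustrative only: the StageTimesMs class, its field names, and the sample durations are assumptions made up for this example, not measurements from the RFC. The max() term in the push case is also an assumption; it generalizes the RFC's formula to the case where D's registration takes longer than P's compute.

```python
from dataclasses import dataclass


@dataclass
class StageTimesMs:
    """Hypothetical per-stage durations in milliseconds (not measured values)."""
    p_compute: float          # t0 -> t1 in both timelines
    proxy_relay: float        # pull only: proxy relays kv_transfer_params to D
    d_alloc_handshake: float  # pull: serial; push: overlapped with P's compute
    transfer: float           # RDMA READ (pull) or NIXL WRITE (push)


def pull_critical_path(t: StageTimesMs) -> float:
    # In pull mode every stage runs back to back.
    return t.p_compute + t.proxy_relay + t.d_alloc_handshake + t.transfer


def push_critical_path(t: StageTimesMs) -> float:
    # D's alloc + register runs concurrently with P's compute, so only the
    # longer of the two contributes; the proxy relay disappears entirely.
    return max(t.p_compute, t.d_alloc_handshake) + t.transfer


sample = StageTimesMs(p_compute=50, proxy_relay=2, d_alloc_handshake=5, transfer=10)
print(pull_critical_path(sample), push_critical_path(sample))
```

With these made-up inputs the difference between the two paths is exactly the proxy relay plus D's alloc/handshake time, which is the saving the timing diagram describes.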

Two Scenarios Covered

Since P's computation and D's block registration happen concurrently, the implementation must handle two ordering scenarios:

<details> <summary><b>Scenario 1: D registers before P finishes</b></summary>
  • D sends REGISTER_BLOCKS_MSG to P's scheduler via ZMQ side channel
  • P stores the registration data
  • When P finishes the request (request_finished), it finds the stored registration
  • P enqueues to the push dispatcher thread, which sends a PUSH_TRIGGER_MSG via ZMQ TCP to all TP workers
</details> <details> <summary><b>Scenario 2: P finishes before D registers</b></summary>
  • P finishes the request and stores block IDs in _finished_request_blocks
  • When D's REGISTER_BLOCKS_MSG arrives at P's listener thread, it enqueues to the push dispatcher
  • The dispatcher fuzzy-matches the registration to the finished request and sends the push trigger
</details>
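The ordering problem in the two scenarios can be sketched as a small rendezvous object. This is a simplified model, not the connector's code: the PushRendezvous class and its method names are hypothetical, and it matches on exact request IDs inside the callbacks, whereas the RFC has the listener always enqueue and lets the dispatcher thread do the (fuzzy) matching.

```python
import queue
import threading


class PushRendezvous:
    """Toy model: whichever side arrives second triggers the push."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._registered_blocks: dict[str, dict] = {}
        self._finished_request_blocks: dict[str, list[int]] = {}
        self.dispatch_queue: queue.Queue = queue.Queue()

    def on_register(self, request_id: str, registration: dict) -> None:
        """Listener-thread path: D's REGISTER_BLOCKS_MSG has arrived."""
        with self._lock:
            block_ids = self._finished_request_blocks.pop(request_id, None)
            if block_ids is None:
                # Scenario 1: P is still computing, so park the registration.
                self._registered_blocks[request_id] = registration
                return
        # Scenario 2: P already finished, so the push can be dispatched now.
        self.dispatch_queue.put((request_id, block_ids, registration))

    def on_request_finished(self, request_id: str, block_ids: list[int]) -> None:
        """Scheduler-thread path: P finished the request."""
        with self._lock:
            registration = self._registered_blocks.pop(request_id, None)
            if registration is None:
                # Scenario 2: D has not registered yet, so park the block IDs.
                self._finished_request_blocks[request_id] = block_ids
                return
        # Scenario 1: the registration was waiting, so dispatch the push.
        self.dispatch_queue.put((request_id, block_ids, registration))
```

Holding one lock around both dictionaries is what guarantees that each request is dispatched exactly once regardless of which event arrives first.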

Implementation Details

<details> <summary><b>1. TransferMode Enum</b></summary>

A new TransferMode enum distinguishes pull vs push transfers throughout the connector:

import enum

class TransferMode(enum.Enum):
    PULL = "pull"   # D reads from P (existing behavior)
    PUSH = "push"   # P writes to D (new behavior)
</details> <details> <summary><b>2. ZMQ TCP Push Trigger (Scheduler -> Worker Communication)</b></summary>

Two Threads on P Scheduler

+-------------------------------------------------------------------+
|                        P Scheduler Process                        |
|                                                                   |
|  +--------------------------+   +-----------------------------+   |
|  |  Listener Thread         |   |  Push Dispatcher Thread     |   |
|  |  (ZMQ ROUTER socket)     |   |  (reads _push_dispatch_queue|   |
|  |                          |   |   + sends PUSH_TRIGGER_MSG  |   |
|  |  Handles:                |   |   via TCP to all TP workers)|   |
|  |  - GET_META_MSG          |   |                             |   |
|  |  - REGISTER_BLOCKS_MSG   |   |  - Fuzzy-matches registra-  |   |
|  |                          |   |    tion to finished requests|   |
|  |  On REGISTER_BLOCKS:     |   |  - Fans out push trigger    |   |
|  |  -> stores registration  |   |    to all TP ranks          |   |
|  |  -> enqueues to          |   +-----------------------------+   |
|  |     dispatcher queue     |                                     |
|  +--------------------------+                                     |
|                                                                   |
|  Scheduler main thread calls request_finished():                  |
|  -> stores block_ids                                              |
|  -> checks registered_blocks                                      |
|  -> enqueues to dispatcher queue                                  |
+-------------------------------------------------------------------+

TCP Addressing

Each TP worker binds a ZMQ PULL socket at a deterministic TCP port derived from the engine_id using hashlib.md5:

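import hashlib

PUSH_TRIGGER_BASE_PORT = 29600

def _push_trigger_addr(engine_id: str, tp_rank: int = 0) -> str:
    # Hash the engine ID into one of 1000 port slots above the base port,
    # then offset by the TP rank so each worker gets its own port.
    h = int(hashlib.md5(engine_id.encode()).hexdigest(), 16)
    port = PUSH_TRIGGER_BASE_PORT + (h % 1000) + tp_rank
    # Loopback only: scheduler and TP workers are assumed to share a host.
    return f"tcp://127.0.0.1:{port}"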
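To see what the addressing scheme implies, the port arithmetic can be pulled out on its own. The helper names below (push_trigger_port, ports_for_engine, overlapping_ports) are hypothetical and only reuse the formula from _push_trigger_addr. One thing the arithmetic suggests, though the RFC does not discuss it, is that two engines on the same host could hash to overlapping port ranges, since there are only 1000 base slots; the sketch gives a way to check for that.

```python
import hashlib

PUSH_TRIGGER_BASE_PORT = 29600


def push_trigger_port(engine_id: str, tp_rank: int = 0) -> int:
    """Same arithmetic as _push_trigger_addr, returning only the port."""
    h = int(hashlib.md5(engine_id.encode()).hexdigest(), 16)
    return PUSH_TRIGGER_BASE_PORT + (h % 1000) + tp_rank


def ports_for_engine(engine_id: str, tp_size: int) -> set[int]:
    """All ports one engine would bind, one per TP rank."""
    return {push_trigger_port(engine_id, rank) for rank in range(tp_size)}


def overlapping_ports(engine_a: str, engine_b: str, tp_size: int) -> set[int]:
    """Ports that both engines would try to bind on the same host."""
    return ports_for_engine(engine_a, tp_size) & ports_for_engine(engine_b, tp_size)


# Example engine IDs are made up; an empty set means no conflict for this pair.
print(overlapping_ports("prefill-engine-0", "prefill-engine-1", tp_size=8))
```

Because the mapping is deterministic, the scheduler and every worker can compute the same address independently, which is why no extra discovery step is needed.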

Push Trigger Message Format

payload = msgspec.msgpack.encode((PUSH_TRIGGER_MSG, {
    "request_id": request_id,
    "block_ids": block_ids,
    "registration_data": registration_data,
}))

The scheduler fans out the message to all TP ranks via per-rank PUSH sockets (non-blocking zmq.NOBLOCK).

Push Listener Thread on Worker

Each TP worker runs a push listener thread that:

  1. Binds a ZMQ PULL socket at its deterministic TCP address
  2. Polls for PUSH_TRIGGER_MSG messages (500ms timeout)
  3. Calls start_push_kv directly on the listener thread
</details> <details> <summary><b>3. Block Registration Protocol (D -> P)</b></summary>

D's scheduler calls _register_blocks_with_prefill() which sends a REGISTER_BLOCKS_MSG to P's ZMQ side channel (ROUTER socket) with:

registration_data = {
    "request_id": str,           # P's request ID (remote_request_id)
    "decode_request_id": str,    # D's local request ID
    "decode_engine_id": str,     # D's engine ID
    "decode_tp_size": int,       # D's TP degree
    "local_block_ids": list[int],# D's allocated physical block IDs
    "num_blocks": int,           # Number of blocks allocated
    "remote_host": str,          # D's side channel host
    "remote_port": int,          # D's side channel port
}

Note: remote_host/remote_port here refer to D's own ZMQ side channel address (not P's), so that P can handshake back with D to register D's memory regions for the NIXL WRITE.

P's listener thread handles REGISTER_BLOCKS_MSG:

  • Validates required fields
  • Sends ACK with P's block_size and tp_size (so D can populate kv_topo without a full handshake)
  • Stores registration in _registered_blocks (thread-safe dict with lock)
  • Enqueues (request_id, registration_data) to the push dispatcher queue
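The "validates required fields" step could look roughly like the following. This is a sketch under assumptions: the validate_registration function and its error strings are hypothetical, the field list is taken from the registration_data layout above, and the check that num_blocks equals the number of block IDs is a guess at the intended invariant rather than something the RFC states.

```python
# Field names and types as listed in the registration_data layout.
REQUIRED_FIELDS: dict[str, type] = {
    "request_id": str,
    "decode_request_id": str,
    "decode_engine_id": str,
    "decode_tp_size": int,
    "local_block_ids": list,
    "num_blocks": int,
    "remote_host": str,
    "remote_port": int,
}


def validate_registration(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    if errors:
        return errors
    if not all(isinstance(b, int) for b in data["local_block_ids"]):
        errors.append("local_block_ids: expected a list of int")
    # Assumption: num_blocks is meant to equal the number of block IDs sent.
    if data["num_blocks"] != len(data["local_block_ids"]):
        errors.append("num_blocks does not match len(local_block_ids)")
    return errors
```

Rejecting a malformed registration before the ACK is sent keeps bad data out of _registered_blocks and out of the dispatcher queue.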
</details> <details> <summary><b>4. D-side Registration Flow</b></summary>

When D's scheduler processes a push-mode request (indicated by do_remote_prefill=True with empty remote_block_ids):

  1. Allocates local blocks normally
  2. Creates a ReqMeta with the allocated block IDs
  3. Calls _register_blocks_with_prefill() -> sends REGISTER_BLOCKS_MSG to P
  4. Receives ACK with P's block_size and tp_size
  5. Stores these as remote_block_size and remote_tp_size in kv_transfer_params, which flow to the worker via ReqMeta to configure the transfer topology
</details> <details> <summary><b>5. P-side Push Dispatch</b></summary>

Both scenarios converge on the push dispatcher thread via _push_dispatch_queue:

In request_finished() (Scenario 1):

registration_data = self.pop_registered_blocks(request.request_id)
if registration_data:
    self._push_dispatch_queue.put(
        (request.request_id, registration_data)
    )

In _nixl_handshake_listener (Scenario 2):

# After storing registration and sending ACK
scheduler_instance._push_dispatch_queue.put(
    (request_id, registration_data)
)

The dispatcher thread dequeues work items, resolves block_ids (with fuzzy matching via _get_base_request_id for request ID variants), and calls _send_push_trigger() which fans out TCP messages to all TP workers.
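The fuzzy matching step can be pictured as follows. The RFC only says that _get_base_request_id performs UUID extraction; the regular expression, the base_request_id and resolve_block_ids helpers, and the example ID format below are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: request IDs embed a canonical 8-4-4-4-12 hex UUID somewhere.
_UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def base_request_id(request_id: str) -> str:
    """Return the embedded UUID if there is one, else the ID unchanged."""
    match = _UUID_RE.search(request_id)
    return match.group(0).lower() if match else request_id


def resolve_block_ids(
    request_id: str, finished: dict[str, list[int]]
) -> Optional[list[int]]:
    """Look up P's block IDs, trying an exact match before a base-ID match."""
    if request_id in finished:
        return finished[request_id]
    base = base_request_id(request_id)
    for key, block_ids in finished.items():
        if base_request_id(key) == base:
            return block_ids
    return None
```

Trying the exact key first keeps the common case cheap and makes the linear scan a fallback for ID variants only.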

</details> <details> <summary><b>6. Worker-side Push (start_push_kv)</b></summary>

Called directly by the push listener thread on each TP worker:

def start_push_kv(self, request_id, local_block_ids, registration_data):
    # 1. Handshake with D if not already done (one-time per D engine_id)
    # 2. Convert logical -> physical block IDs
    # 3. Call _xfer_blocks_for_req with mode=PUSH
    ...  # body elided; only the three steps above are specified
</details> <details> <summary><b>7. Unified Transfer Method (_xfer_blocks_for_req / _xfer_blocks)</b></summary>

The existing _read_blocks_for_req and _read_blocks are refactored into:

  • _xfer_blocks_for_req(): Orchestrates transfers across TP ranks (handles heterogeneous TP)
  • _xfer_blocks(): Issues a single NIXL transfer operation

Both accept a mode parameter:

  • PULL -> nixl_wrapper.make_prepped_xfer("READ", ...)
  • PUSH -> nixl_wrapper.make_prepped_xfer("WRITE", ...)

Key differences in push mode:

  • Block alignment is reversed (truncate local blocks to match remote count, vs truncate remote to match local in pull)
  • Transfer handles are tracked in _sending_transfers (separate from _recving_transfers) so P can free blocks once the RDMA WRITE completes
  • Error handling skips _invalid_block_ids tracking (blocks belong to D, not P)
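The mode-dependent behaviour described in this section can be condensed into two small functions. This is a sketch, not the refactored methods: nixl_op and align_blocks are hypothetical names, and keeping the trailing blocks when truncating is an assumption, since the RFC states which side is truncated but not which end is kept.

```python
import enum


class TransferMode(enum.Enum):
    PULL = "pull"   # D reads from P
    PUSH = "push"   # P writes to D


def nixl_op(mode: TransferMode) -> str:
    """Operation string passed to make_prepped_xfer for each mode."""
    return "WRITE" if mode is TransferMode.PUSH else "READ"


def align_blocks(
    local: list[int], remote: list[int], mode: TransferMode
) -> tuple[list[int], list[int]]:
    """Equalize the two block lists by truncating the mode-specific side.

    PUSH truncates local to the remote count; PULL truncates remote to the
    local count. Assumption: the trailing blocks are the ones kept.
    """
    if mode is TransferMode.PUSH and len(local) > len(remote):
        local = local[len(local) - len(remote):]
    elif mode is TransferMode.PULL and len(remote) > len(local):
        remote = remote[len(remote) - len(local):]
    return local, remote
```

In both modes "local" means the blocks owned by the node issuing the transfer, so in push mode it is P's list that shrinks to fit what D registered.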
</details> <details> <summary><b>8. Push Transfer Completion</b></summary>

On P side, get_finished() polls _sending_transfers for completed WRITE handles. Once done, P frees its blocks immediately rather than waiting for the 480s timeout.

On D side, NIXL sends a notification when P's WRITE completes. The worker's _get_new_notifs() recognizes push-mode completions:

if req_id in self._recving_metadata and meta.mode == TransferMode.PUSH:
    # Push complete -- create empty handle list so
    # _pop_done_transfers immediately marks it as done
    _ = self._recving_transfers[req_id]
    continue
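The bare read of self._recving_transfers[req_id] in the snippet above only makes sense if the container creates entries on access. The sketch below assumes _recving_transfers is a collections.defaultdict(list) and that _pop_done_transfers treats an entry with no pending handles as complete; both are inferred from the comment in the snippet rather than confirmed by the RFC. The function names and the callable-based handle model are simplifications for illustration.

```python
from collections import defaultdict
from typing import Callable

# Assumption: each handle is modelled as a callable that returns True once
# the underlying transfer has completed.
Handle = Callable[[], bool]


def mark_push_complete(recving_transfers: defaultdict, req_id: str) -> None:
    # Reading a missing key from a defaultdict(list) inserts an empty list,
    # which is the side effect the original snippet relies on.
    _ = recving_transfers[req_id]


def pop_done_transfers(transfers: defaultdict) -> set[str]:
    """Remove and return the request IDs whose handles have all completed."""
    done = set()
    for req_id in list(transfers):
        # all() over an empty list is True, so a push-mode entry with no
        # handles is reported as done on the first poll.
        if all(handle() for handle in transfers[req_id]):
            done.add(req_id)
            del transfers[req_id]
    return done
```

This lets D reuse the existing completion path for pushed requests without tracking any transfer handles of its own, since the WRITE was issued by P.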
</details> <details> <summary><b>9. Handshake in Push Mode</b></summary>

The existing _nixl_handshake method accepts a mode parameter. In push mode, P initiates the handshake with D (reverse of pull mode where D initiates with P). The add_remote_agent method's head-slice offset logic works symmetrically for both directions.

</details> <details> <summary><b>10. Proxy Server</b></summary>
  1. Send to P with do_remote_decode=True for normal prefill computation
  2. Send to D with remote_block_ids=[] and P's metadata to trigger push-mode registration
  3. D and P coordinate the push transfer directly without further proxy involvement
</details>

Files Changed

| File | Change |
| --- | --- |
| vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py | TransferMode enum, ZMQ TCP push trigger (scheduler->worker), push dispatcher thread, REGISTER_BLOCKS_MSG/PUSH_TRIGGER_MSG protocols, _register_blocks_with_prefill, start_push_kv, start_push_listener, _push_listener_loop, _push_dispatcher_loop, _send_push_trigger, refactored _xfer_blocks_for_req/_xfer_blocks, push-mode notification handling, extended handshake listener |

Compatibility

  • Push mode is opt-in via proxy behavior. The existing pull-based flow is unchanged.
  • The TransferMode enum defaults to PULL everywhere, maintaining backward compatibility.
  • _reqs_need_send cleanup is deferred (not cleared in build_connector_meta) to support the async push trigger. Entries are cleaned up when push completes or times out.
  • Heterogeneous TP and heterogeneous block sizes are supported in both push and pull modes.

Limitations and Future Work

  • Single-node TP only: The current TCP addressing uses 127.0.0.1. For multi-node TP, the address needs to be resolved from TP group rank-to-host mapping.
  • No pipelined push: Currently P pushes all blocks after the full forward pass. Future work could push blocks incrementally as layers complete.
  • No push failure recovery: If push fails, D must fall back to pull or recompute. The current implementation logs errors but doesn't implement automatic fallback.
  • Fuzzy request ID matching: The dispatcher uses _get_base_request_id (UUID extraction) to handle cases where the request ID format differs between registration and completion.

Benchmarking results

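I ran vllm bench serve with the sonnet dataset for different input and output token lengths. KV push mode (with https://github.com/vllm-project/vllm/pull/35264) showed 1.2x - 3.0x improvements over pull mode across different input and output lengths. The numbers below were measured on AWS P5en instances; similar improvements were observed on AWS Trn2 instances, where the feature was originally developed.

| Mode | Prefill/Decode TP | Input Len | Output Len | QPS | Mean TTFT (ms) | Median TTFT (ms) | P99 TTFT (ms) | Mean TPOT (ms) | Req/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pull | 4 | 512 | 64 | 4 | 110.94 | 85.07 | 740.03 | 12.51 | 3.86 |
| push | 4 | 512 | 64 | 4 | 74.46 | 77.13 | 100.24 | 12.50 | 3.87 |
| pull | 4 | 512 | 128 | 4 | 301.92 | 89.22 | 1125.98 | 12.66 | 3.63 |
| push | 4 | 512 | 128 | 4 | 93.77 | 79.95 | 302.37 | 12.59 | 3.75 |
| pull | 4 | 512 | 128 | 8 | 2948.11 | 2682.32 | 7258.96 | 12.70 | 4.65 |
| push | 4 | 512 | 128 | 8 | 1212.48 | 1275.05 | 2826.72 | 12.67 | 5.92 |
| pull | 4 | 1024 | 128 | 4 | 330.94 | 91.26 | 1163.70 | 12.74 | 3.62 |
| push | 4 | 1024 | 128 | 4 | 273.83 | 84.10 | 1096.88 | 12.79 | 3.67 |
| pull | 4 | 2048 | 128 | 4 | 352.22 | 98.53 | 1225.20 | 12.88 | 3.61 |
| push | 4 | 2048 | 128 | 4 | 239.65 | 88.23 | 1042.32 | 12.85 | 3.73 |
| pull | 8 | 512 | 64 | 8 | 85.77 | 65.69 | 257.58 | 8.95 | 7.62 |
| push | 8 | 512 | 64 | 8 | 59.41 | 60.32 | 71.88 | 8.84 | 7.63 |
| pull | 8 | 512 | 128 | 16 | 3672.89 | 3628.99 | 7880.98 | 9.07 | 6.45 |
| push | 8 | 512 | 128 | 16 | 1653.70 | 1791.73 | 3681.04 | 9.06 | 9.04 |
| pull | 8 | 1024 | 128 | 8 | 986.94 | 776.67 | 2764.92 | 9.10 | 6.03 |
| push | 8 | 1024 | 128 | 8 | 674.82 | 551.53 | 2082.18 | 9.09 | 6.37 |
| pull | 8 | 2048 | 128 | 8 | 1048.93 | 828.82 | 2905.03 | 9.20 | 5.99 |
| push | 8 | 2048 | 128 | 8 | 648.08 | 371.25 | 1859.21 | 9.19 | 6.46 |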

Alternate Solutions Considered

1. Distributed KV store
Use a distributed KV store like Mooncake Store as an intermediary. P writes KV to the store, D reads from it. This decouples P and D but adds an extra hop and requires additional infrastructure. Push-to-D is simpler and lower latency for the common single-P-to-single-D case.

Feedback Period.

No response

CC List.

@yewentao256 , @NickLucche , @robertgshaw2-redhat

Any Other Things.

PR: https://github.com/vllm-project/vllm/pull/35264


extent analysis

Fix Plan

To implement a push-based KV transfer mode, follow these steps:

  • Introduce a TransferMode enum to distinguish between pull and push transfers.
  • Implement a block registration protocol where D registers its blocks with P before P finishes computation.
  • Modify P's listener thread to handle REGISTER_BLOCKS_MSG and trigger push when the request is finished.
  • Update D's scheduler to send REGISTER_BLOCKS_MSG to P and receive ACK with P's block size and TP size.
  • Implement the start_push_kv method on the worker, which the push listener thread calls directly to initiate the push transfer.
  • Refactor the existing _read_blocks_for_req and _read_blocks methods to support both pull and push modes.
  • Update the proxy server to send requests to P and D simultaneously and trigger push-mode registration.

Code Changes

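The following are illustrative sketches rather than code from the PR. Helper names such as _send_registration_data, _store_registration_data, _request_finished, _trigger_push, _process_request, and _allocate_blocks do not appear in the RFC.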

# Introduce TransferMode enum
class TransferMode(enum.Enum):
    PULL = "pull"
    PUSH = "push"

# Implement block registration protocol
def _register_blocks_with_prefill(self, request_id, local_block_ids):
    registration_data = {
        "request_id": request_id,
        "decode_request_id": self.decode_request_id,
        "decode_engine_id": self.engine_id,
        "decode_tp_size": self.tp_size,
        "local_block_ids": local_block_ids,
        "num_blocks": len(local_block_ids),
        "remote_host": self.side_channel_host,
        "remote_port": self.side_channel_port,
    }
    self._send_registration_data(registration_data)

# Modify P's listener thread to handle REGISTER_BLOCKS_MSG
def _nixl_handshake_listener(self):
    # Handle REGISTER_BLOCKS_MSG
    if msg_type == "REGISTER_BLOCKS_MSG":
        registration_data = msg_data
        self._store_registration_data(registration_data)
        if self._request_finished(registration_data["request_id"]):
            self._trigger_push(registration_data)

# Update D's scheduler to send REGISTER_BLOCKS_MSG to P
def _process_request(self, request):
    if request.do_remote_prefill:
        local_block_ids = self._allocate_blocks(request)
        self._register_blocks_with_prefill(request.request_id, local_block_ids)

# Implement start_push_kv on the worker (called by the push listener thread)
def start_push_kv(self, request_id, local_block_ids, registration_data):
    # Handshake with D if not already done (PUSH mode)
    # Convert logical → physical block IDs
    # Call _xfer_blocks_for_req with mode=PUSH
    self._xfer_blocks_for_req(request_id, local_block_ids, registration_data, mode=TransferMode.PUSH)

Verification

To verify that the fix worked, you can test the push-based KV transfer mode by:

  • Checking that D can register its blocks with P before P finishes computation.
  • Verifying that P triggers the
