vllm - [RFC]: Deprecate `kv_both` for NIXLConnector and Enforce Explicit P/D Roles

Fix Action

Fix / Workaround

Code clarity: not an optimization per se, but much of the confusion in reading and maintaining the NIXL code stems from the unclear distinction between P, D, and shared behavior.

Speculative decoding asymmetry: P can auto-adjust its speculator config (e.g. set num_speculative_tokens = 1) and skip sampling, since it never uses draft tokens. This would enable PRs such as #39266 to skip proposing via a clear "is_producer" interface. It eliminates the block-count mismatch that currently requires a scheduler-level workaround (#22317) and avoids regressions for other spec-decode methods (#43733).

Future optimizations: any P-vs-D asymmetric behavior (chunked transfer strategies, priority scheduling, health-check semantics) can be cleanly gated on role.
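To illustrate the speculative decoding point, here is a minimal sketch of gating a speculator setting on role. `SpecDecodeSettings` and `adjust_for_role` are hypothetical names made up for this example, not vLLM APIs; only `num_speculative_tokens` and the "is_producer" idea come from the text above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpecDecodeSettings:
    """Hypothetical stand-in for a speculator config (not a vLLM class)."""
    num_speculative_tokens: int


def adjust_for_role(settings: SpecDecodeSettings, is_producer: bool) -> SpecDecodeSettings:
    """Return settings adjusted for the instance role.

    A producer (P) never consumes draft tokens, so this sketch clamps
    num_speculative_tokens to 1, as suggested in the RFC text.
    """
    if is_producer:
        return replace(settings, num_speculative_tokens=1)
    # Consumers (D) keep whatever the operator configured.
    return settings


p_settings = adjust_for_role(SpecDecodeSettings(num_speculative_tokens=4), is_producer=True)
d_settings = adjust_for_role(SpecDecodeSettings(num_speculative_tokens=4), is_producer=False)
```

The point of the sketch is only that the decision can be made once, at config time, instead of being inferred per request.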

Motivation.

Today, NIXL P/D instances are all configured with kv_role: "kv_both". The actual prefill-vs-decode behavior is determined at runtime by each incoming request, via kv_transfer_params (do_remote_prefill / do_remote_decode). This means there is no reliable way for an instance to know whether it is a P or D before the handshake.
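As a rough sketch of what "determined at runtime" means, the helper below classifies a single request from its kv_transfer_params. The function name and return labels are hypothetical, and the flag semantics reflect my reading of the two flags named above (a request marked do_remote_decode is being prefilled here, one marked do_remote_prefill is being decoded here).

```python
from typing import Any, Optional


def infer_request_role(kv_transfer_params: Optional[dict[str, Any]]) -> str:
    """Classify one request from its kv_transfer_params (illustrative helper)."""
    params = kv_transfer_params or {}
    if params.get("do_remote_decode"):
        # Decode happens on another instance, so this instance acts as P.
        return "prefill"
    if params.get("do_remote_prefill"):
        # Prefill happened on another instance, so this instance acts as D.
        return "decode"
    # No P/D flags: the request is served entirely on this instance.
    return "local"
```

Note that the answer is only available per request; nothing in this scheme tells the instance its role at startup.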

In practice, to the best of my knowledge, kv_both does not provide real value:

Instances are never repurposed at runtime. P and D are already started with distinct configurations: low-latency/high-throughput, different tensor-parallel degrees, or even different GPU types. The proxy maintains separate P and D endpoint lists. I have observed no real-world scenario where a running P instance gets repurposed as D or vice versa. If such a need exists, please comment on this RFC! :)

Role ambiguity blocks config-time optimizations. Any optimization that depends on knowing producer vs consumer semantics -- memory allocation strategies, model loading decisions, scheduler behavior -- must either be deferred to request time (adding per-request overhead) or resort to brittle heuristics like num_computed_tokens == 0 to infer role.

Other connectors already enforce roles. MooncakeConnector, LMCacheConnector, and others already expect explicit kv_producer / kv_consumer roles. NIXL is the outlier.

Proposed Change.

Phase 1: Deprecation warning (non-breaking)

NIXLConnector validates kv_role at init:

kv_producer or kv_consumer: recorded and used going forward.
kv_both: logs a deprecation warning pointing to the migration guide, then continues with today's behavior. No functional change for existing deployments.

The kv_both literal remains valid in the KVRole type system -- offloading, LMCache, FlexKV, and other connectors that genuinely need dual-role semantics are unaffected. The deprecation is NIXL-specific.
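The Phase 1 check could look roughly like the following. This is a sketch under assumptions, not the connector's actual code: `validate_nixl_kv_role`, the logger name, and the warning text are all made up for illustration.

```python
import logging

logger = logging.getLogger("nixl_role_check")  # hypothetical logger name

EXPLICIT_ROLES = ("kv_producer", "kv_consumer")


def validate_nixl_kv_role(kv_role: str) -> str:
    """Phase 1: accept explicit roles, warn on kv_both (hypothetical helper)."""
    if kv_role in EXPLICIT_ROLES:
        # Explicit role: record it and use it going forward.
        return kv_role
    if kv_role == "kv_both":
        # Non-breaking: warn, then keep today's runtime-determined behavior.
        logger.warning(
            "kv_role='kv_both' is deprecated for NIXLConnector; "
            "set 'kv_producer' or 'kv_consumer' (see the migration guide)."
        )
        return kv_role
    raise ValueError(f"Unsupported kv_role for NIXL: {kv_role!r}")
```

Because the check lives in the NIXL connector rather than in the shared role type, other connectors that accept kv_both would not see the warning.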

Migration cost for operators (llm-d/dynamo): one config field change per instance. Since the proxy already separates P and D endpoint lists, operators already know which instances serve which role.
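For illustration, the one-field migration might look like this when the transfer config is built as a dict. The `migrate_kv_transfer_config` helper is hypothetical, and the exact connector name string ("NixlConnector") is an assumption; only the kv_role values come from this RFC.

```python
import json


def migrate_kv_transfer_config(config: dict, role: str) -> dict:
    """Return a copy of config with an explicit kv_role (hypothetical helper)."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"role must be explicit, got {role!r}")
    migrated = dict(config)  # shallow copy; the legacy dict is left untouched
    migrated["kv_role"] = role
    return migrated


# Assumed shape of today's config, shared by P and D instances alike.
legacy = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}

prefill_cfg = migrate_kv_transfer_config(legacy, "kv_producer")
decode_cfg = migrate_kv_transfer_config(legacy, "kv_consumer")

print(json.dumps(prefill_cfg))
print(json.dumps(decode_cfg))
```

The operator applies the producer variant to instances in the proxy's P endpoint list and the consumer variant to those in the D list.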

Phase 2: Role-aware optimizations

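With the role known at config time, a number of optimizations become possible, including the code-clarity, speculative-decoding, and future P-vs-D asymmetric optimizations described at the top of this document.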

Phase 3: Hard deprecation (breaking)

After a deprecation period (2-3 minor releases), NIXL rejects kv_both with a clear error message pointing to the migration guide.
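A minimal sketch of the Phase 3 behavior, assuming the rejection is a plain ValueError raised at connector init. The function name and message wording are hypothetical.

```python
def require_explicit_nixl_role(kv_role: str) -> str:
    """Phase 3: reject anything but an explicit P/D role (hypothetical helper)."""
    if kv_role not in ("kv_producer", "kv_consumer"):
        raise ValueError(
            f"NIXLConnector no longer accepts kv_role={kv_role!r}; "
            "use 'kv_producer' or 'kv_consumer' (see the migration guide)."
        )
    return kv_role
```

Failing at init rather than at the first request keeps a misconfigured instance from joining the P or D endpoint list at all.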

Feedback Period.

1 week

CC List.

@youkaichao @robertgshaw2-redhat @markmc @benchislett @snadampal @ZhanqiuHu @TheEpicDolphin

Any Other Things.

No response

