vllm: [RFC]: Deprecate `kv_both` for NIXLConnector and Enforce Explicit P/D Roles



Motivation.

Today, NIXL P/D instances are all configured with kv_role: "kv_both". The actual prefill-vs-decode behavior is determined by the incoming request at runtime, via kv_transfer_params (do_remote_prefill / do_remote_decode). This means an instance has no reliable way to know whether it is a P or a D before the handshake.
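To make the runtime ambiguity concrete, here is a minimal sketch of how a kv_both instance can only learn its role one request at a time. The helper name and return values are hypothetical; only the two flag names come from the text above. It also assumes that do_remote_decode marks the prefill side (decode happens elsewhere) and do_remote_prefill marks the decode side.

```python
from typing import Any, Optional


def infer_role_from_request(kv_transfer_params: Optional[dict]) -> str:
    """Guess which role a kv_both instance plays for ONE request.

    Hypothetical helper for illustration; not part of any real API.
    """
    params: dict[str, Any] = kv_transfer_params or {}
    if params.get("do_remote_decode"):
        # Assumption: decode runs elsewhere, so this instance prefills.
        return "producer"
    if params.get("do_remote_prefill"):
        # Assumption: prefill ran elsewhere, so this instance decodes.
        return "consumer"
    # No transfer params: the role cannot be determined from the request.
    return "unknown"
```

Nothing in this sketch is available at startup, which is the point: every role-dependent decision has to wait for a request to arrive.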

In practice, to the best of my knowledge, kv_both does not provide real value:

Instances are never re-purposed at runtime. P and D are already started with distinct configurations: low-latency vs. high-throughput tuning, different tensor-parallel degrees, or even different GPU types. The proxy maintains separate P and D endpoint lists. I have observed no real-world scenario where a running P instance gets repurposed as D or vice versa. If such a need exists, please comment on this RFC! :)

Role ambiguity blocks config-time optimizations. Any optimization that depends on knowing producer vs consumer semantics -- memory allocation strategies, model loading decisions, scheduler behavior -- must either be deferred to request time (adding per-request overhead) or resort to brittle heuristics like num_computed_tokens == 0 to infer role.

Other connectors already enforce roles. MooncakeConnector, LMCacheConnector, and others already expect explicit kv_producer / kv_consumer roles. NIXL is the outlier.

Proposed Change.

Phase 1: Deprecation warning (non-breaking)

NIXLConnector validates kv_role at init:

  • kv_producer or kv_consumer: recorded and used going forward.
  • kv_both: logs a deprecation warning pointing to the migration guide, then continues with today's behavior. No functional change for existing deployments.
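The Phase 1 check could look roughly like the sketch below. The function name, logger name, and warning text are assumptions made for illustration; the RFC only specifies the behavior (accept explicit roles, warn on kv_both, change nothing else).

```python
import logging

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("nixl_role_sketch")

EXPLICIT_ROLES = frozenset({"kv_producer", "kv_consumer"})


def validate_kv_role_phase1(kv_role: str) -> str:
    """Phase 1: accept explicit roles, warn on kv_both, keep behavior."""
    if kv_role in EXPLICIT_ROLES:
        # Explicit role: record it and use it going forward.
        return kv_role
    if kv_role == "kv_both":
        # Non-breaking: warn, then continue with today's behavior.
        logger.warning(
            "kv_role='kv_both' is deprecated for the NIXL connector; "
            "set 'kv_producer' or 'kv_consumer' (see the migration guide)."
        )
        return kv_role
    raise ValueError(f"Unsupported kv_role: {kv_role!r}")
```

Returning the role unchanged for kv_both keeps existing deployments working while still surfacing the warning in logs.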

The kv_both literal remains valid in the KVRole type system -- offloading, LMCache, FlexKV, and other connectors that genuinely need dual-role semantics are unaffected. The deprecation is NIXL-specific.

Migration cost for operators (llm-d/dynamo): one config field change per instance. Since the proxy already separates P and D endpoint lists, operators already know which instances serve which role.
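As an illustration of the one-field migration, the sketch below rewrites a transfer config for each side. The helper is hypothetical, and the kv_connector key and its value are assumptions about a typical deployment; check the exact connector name your setup uses. Only kv_role and its three values come from the RFC.

```python
import json


def migrate_kv_transfer_config(config: dict, role: str) -> dict:
    """Return a copy of `config` with an explicit kv_role (hypothetical helper)."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"Expected an explicit role, got {role!r}")
    migrated = dict(config)  # shallow copy; the original is left untouched
    migrated["kv_role"] = role
    return migrated


# Assumed shape of an existing config; the connector name is illustrative.
old_config = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}

prefill_config = migrate_kv_transfer_config(old_config, "kv_producer")
decode_config = migrate_kv_transfer_config(old_config, "kv_consumer")

# Serialized form, e.g. for passing as a JSON string on a command line.
print(json.dumps(prefill_config))
print(json.dumps(decode_config))
```

Which instance gets which role follows directly from the proxy's existing P and D endpoint lists.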

Phase 2: Role-aware optimizations

With the role known at config time, a number of optimizations become possible:

  • Code clarity: not an optimization per se, but much of the confusion in reading and maintaining NIXL code stems from the unclear distinction between P, D, and shared behavior.
  • Speculative decoding asymmetry: P can auto-adjust its speculator config (e.g. set num_speculative_tokens = 1) and skip sampling, since it never uses draft tokens. This would enable PRs such as #39266 to skip proposing via a clear "is_producer" interface. It eliminates the block-count mismatch that currently requires a scheduler-level workaround (#22317) and avoids regressions for other spec-decode methods (#43733).
  • Future optimizations: any P-vs-D asymmetric behavior (chunked transfer strategies, priority scheduling, health-check semantics) can be cleanly gated on role.
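A role-gated speculator adjustment might be sketched as follows. SpecConfig, is_producer, and adjust_spec_config are hypothetical stand-ins written for this example, not the actual vLLM types; only the idea of setting num_speculative_tokens = 1 on P comes from the RFC.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpecConfig:
    """Hypothetical stand-in for a speculative decoding config."""
    num_speculative_tokens: int


def is_producer(kv_role: str) -> bool:
    # Known at config time once kv_both is no longer used.
    return kv_role == "kv_producer"


def adjust_spec_config(kv_role: str, spec: SpecConfig) -> SpecConfig:
    """Shrink the speculator on P, which never consumes draft tokens."""
    if is_producer(kv_role):
        return replace(spec, num_speculative_tokens=1)
    # D (or any other role) keeps the configured value.
    return spec


print(adjust_spec_config("kv_producer", SpecConfig(4)))
print(adjust_spec_config("kv_consumer", SpecConfig(4)))
```

Because the decision depends only on the configured role, it can run once at startup instead of being inferred per request.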

Phase 3: Hard deprecation (breaking)

After a deprecation period (two to three minor releases), NIXL rejects kv_both with a clear error message pointing to the migration guide.
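For comparison with the earlier phases, a hard rejection could be as small as the sketch below. The function name and error wording are assumptions; the RFC only requires that kv_both fail with a clear message pointing to the migration guide.

```python
def validate_kv_role_phase3(kv_role: str) -> str:
    """Phase 3: kv_both is rejected outright (hypothetical helper)."""
    if kv_role == "kv_both":
        raise ValueError(
            "kv_role='kv_both' is no longer supported by the NIXL connector; "
            "use 'kv_producer' or 'kv_consumer' (see the migration guide)."
        )
    if kv_role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"Unsupported kv_role: {kv_role!r}")
    return kv_role
```

Failing at init rather than at the first request means a misconfigured instance never joins the serving pool.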

Feedback Period.

1 week

CC List.

@youkaichao @robertgshaw2-redhat @markmc @benchislett @snadampal @ZhanqiuHu @TheEpicDolphin

