vllm - ✅ (Solved) [Feature]: Support `routed_experts` export in disaggregated Prefill/Decode serving [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#38523 · Fetched 2026-04-08 01:53:34
Comments: 0 · Participants: 1 · Timeline: 2 (assigned ×1, labeled ×1) · Reactions: 0

Root Cause

The out-of-tree workaround is difficult for downstream users to maintain because it touches scheduler, serving, protocol, and RequestOutput.add() behavior.

Fix Action

Fix / Workaround

An out-of-tree implementation can patch this behavior by serializing prefill routed experts into kv_transfer_params, merging them on the decode/serving side, and aggregating routed experts across streamed outputs.

PR fix notes

PR #39289: [Feature]: Support routed_experts export in disaggregated Prefill/Decode serving

Description (problem / solution / changelog)

Purpose

FIX #38523

Test Plan

# Proxy
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --prefiller-hosts 172.16.1.247 --decoder-hosts 172.16.1.247 --decoder-ports 8002 --prefiller-ports 8001 --port 8003

# Decode
vllm serve /mnt/data3/models/Qwen/Qwen3.5-35B-A3B --enable-auto-tool-choice --tool-call-parser qwen3_coder  --served-model-name my-model  --enable_return_routed_experts --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}' --port 8002 --no-disable-hybrid-kv-cache-manager


# Prefill
vllm serve /mnt/data3/models/Qwen/Qwen3.5-35B-A3B --enable-auto-tool-choice --tool-call-parser qwen3_coder  --served-model-name my-model  --enable_return_routed_experts --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}' --port 8001 --no-disable-hybrid-kv-cache-manager

Test Result

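from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8003/v1"   # Proxy port

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Non-streaming request, so response.choices can be iterated directly.
stream = False

messages = [{"role": "user", "content": "9.11 and 9.8, which is lower?"}]
response = client.chat.completions.create(
    model="my-model",
    messages=messages,
    stream=stream,
    n=1,
    max_tokens=5
)

for choice in response.choices:
    print("---")
    # routed_experts is the extra field returned when the servers run
    # with --enable_return_routed_experts.
    print(len(choice.routed_experts))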

---
28


Changed files

  • vllm/entrypoints/openai/chat_completion/protocol.py (modified, +8/-0)
  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +10/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +16/-0)

🚀 The feature, motivation and pitch

enable_return_routed_experts works for single-instance deployments, but it does not work end-to-end in disaggregated Prefill/Decode (PD) serving today.

The current routed-experts implementation is based on local shared memory and local KV slot mapping, while the PD path only transfers kv_transfer_params. As a result, routed expert data produced on the prefill instance is not transferred to the decode instance, and the final response cannot return a complete routed-expert trace for the whole request.

It would be helpful if vLLM could officially support routed-experts export in PD deployments, for example by:

  • carrying prefill-side routed expert data across the PD boundary
  • merging prefill and decode routed experts in the final output
  • exposing an optional routed_experts field in the response schema
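The third point can be pictured with a small sketch. It is not vLLM's protocol code: `ChoiceSketch` and `to_response_dict` are hypothetical names, and the layout of `routed_experts` (a list with one entry per token position) is an assumption made only for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ChoiceSketch:
    """Hypothetical stand-in for a response choice; not vLLM's class."""
    index: int
    text: str
    # Assumed layout: one entry per token position.
    # None means the server did not return routing data.
    routed_experts: Optional[list] = None


def to_response_dict(choice: ChoiceSketch) -> dict:
    """Omit routed_experts from the payload when it was not produced."""
    payload = asdict(choice)
    if payload["routed_experts"] is None:
        del payload["routed_experts"]
    return payload
```

Keeping the field optional means clients that never ask for routing data see the same response shape as before.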

This would make enable_return_routed_experts usable for MoE debugging, router replay, and expert-load analysis in PD deployments as well.

Alternatives

An out-of-tree implementation can patch this behavior by serializing prefill routed experts into kv_transfer_params, merging them on the decode/serving side, and aggregating routed experts across streamed outputs.

However, this is difficult for downstream users to maintain because it touches scheduler, serving, protocol, and RequestOutput.add() behavior.

Additional context

We have validated this approach in an internal fork based on the current vLLM routed-experts implementation:

  • prefill routed experts are serialized into kv_transfer_params
  • decode-side routed experts are concatenated with the prefill payload
  • RequestOutput.add() aggregates routed_experts
  • PD decode skips prompt-token ranges when reading local routed experts
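A minimal sketch of the concatenation and prompt-skipping steps listed above follows. The function name and the data layout are assumptions: each trace is taken to be a plain list with one entry per token position, and the decode instance's local trace is assumed to include placeholder entries for the prompt positions whose KV was transferred.

```python
def merge_pd_routed_experts(prefill_trace, decode_local_trace, num_prompt_tokens):
    """Concatenate the prefill payload with decode-only entries.

    Assumption: decode_local_trace is indexed by token position and its first
    num_prompt_tokens entries cover the prompt, which the decode instance
    never routed itself. Those entries are skipped to avoid duplicates.
    """
    decode_only = decode_local_trace[num_prompt_tokens:]
    return list(prefill_trace) + list(decode_only)


# Example: 3 prompt tokens routed on prefill, 2 tokens generated on decode.
prefill = [[0, 1], [2, 3], [4, 5]]
decode_local = [None, None, None, [6, 7], [8, 9]]
merged = merge_pd_routed_experts(prefill, decode_local, num_prompt_tokens=3)
```

With these inputs `merged` holds five entries: the three prefill entries followed by the two decode entries.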

It would be great to have an official upstream implementation for this workflow.


extent analysis

Fix Plan

To make enable_return_routed_experts work in disaggregated Prefill/Decode (PD) serving, three things are needed:

  • Carry prefill-side routed expert data across the PD boundary
  • Merge prefill and decode routed experts in the final output
  • Expose an optional routed_experts field in the response schema

Here are the concrete steps:

  • Serialize prefill routed experts: Modify the prefill instance to serialize routed experts into kv_transfer_params.
  • Deserialize and merge routed experts: Update the decode instance to deserialize prefill routed experts from kv_transfer_params and merge them with decode-side routed experts.
  • Aggregate routed experts: Modify RequestOutput.add() to aggregate routed_experts across streamed outputs.

Example code snippets:

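import pickle

# Illustrative sketches only; these are not the functions added by the PR.

# Serialize prefill routed experts into bytes.
# pickle is used for brevity; only unpickle data from trusted peers.
def serialize_routed_experts(routed_experts):
    return pickle.dumps(routed_experts)

# Deserialize the prefill payload and merge it with decode-side routed experts
def merge_routed_experts(prefill_payload, decode_routed_experts):
    prefill_routed_experts = pickle.loads(prefill_payload)
    return list(prefill_routed_experts) + list(decode_routed_experts)

# Aggregate routed experts across streamed outputs (simplified stand-in class)
class RequestOutput:
    def __init__(self):
        self.routed_experts = []

    def add(self, routed_experts):
        self.routed_experts.extend(routed_experts)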
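The pickle-based snippet above produces raw bytes. If kv_transfer_params has to travel as JSON between the prefill instance, the proxy, and the decode instance (an assumption about the deployment, not something the issue states), a text-safe encoding may be easier to carry. The sketch below uses only the standard library; the key name `routed_experts` and both helper names are hypothetical.

```python
import base64
import json
import zlib

ROUTED_EXPERTS_KEY = "routed_experts"  # hypothetical key name


def pack_routed_experts(kv_transfer_params: dict, routed_experts: list) -> dict:
    """Return a copy of kv_transfer_params with a compressed, text-safe payload."""
    raw = json.dumps(routed_experts, separators=(",", ":")).encode("utf-8")
    blob = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return {**kv_transfer_params, ROUTED_EXPERTS_KEY: blob}


def unpack_routed_experts(kv_transfer_params: dict) -> list:
    """Inverse of pack_routed_experts."""
    blob = kv_transfer_params[ROUTED_EXPERTS_KEY]
    raw = zlib.decompress(base64.b64decode(blob))
    return json.loads(raw)


params = pack_routed_experts({"some_existing_field": 1}, [[0, 3], [1, 2]])
wire = json.dumps(params)  # the whole dict remains JSON-serializable
restored = unpack_routed_experts(json.loads(wire))
```

Unlike pickle, this format cannot execute code on load, which matters when the payload crosses a process or network boundary.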

Verification

To verify the fix, test the enable_return_routed_experts feature in a PD deployment and check that the final response contains a complete routed-expert trace for the whole request.
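One way to make "complete" checkable is to compare the PD response against a reference length, for example the length returned by a single-instance run of the same request. The helper below is a hypothetical sketch; it assumes only that `routed_experts` is a sequence and leaves the choice of expected length to the caller.

```python
def check_routed_experts_trace(routed_experts, expected_len):
    """Return a list of problems; an empty list means the check passed."""
    if routed_experts is None:
        return ["routed_experts is missing from the response"]
    problems = []
    if len(routed_experts) != expected_len:
        problems.append(
            f"expected {expected_len} entries, got {len(routed_experts)}"
        )
    return problems


# Usage: baseline_len would come from a single-instance run of the same prompt.
baseline_len = 4
issues = check_routed_experts_trace([[0], [1], [2], [3]], baseline_len)
```

A trace that is shorter than the baseline by roughly the prompt length suggests the prefill-side data was not carried across the PD boundary.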

Extra Tips

  • Keep the routed-expert payload carried in kv_transfer_params compact, since it grows with prompt length, and make sure serialization and deserialization stay cheap for long prompts.
  • Consider adding error handling and logging to handle cases where routed expert data is missing or corrupted.
  • Review the response schema to ensure that the routed_experts field is properly defined and documented.
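The error-handling tip could look like the following sketch on the decode side. The helper name, the key name, and the assumed payload shape (a list of lists) are all hypothetical; the point is only that a missing or malformed payload is logged and ignored rather than failing the request.

```python
import logging

logger = logging.getLogger("routed_experts_sketch")


def read_prefill_trace(kv_transfer_params, key="routed_experts"):
    """Return the prefill trace, or None (with a warning) if it is unusable."""
    if not kv_transfer_params or key not in kv_transfer_params:
        logger.warning("no prefill routed experts found in kv_transfer_params")
        return None
    trace = kv_transfer_params[key]
    # Assumed shape: a list with one list entry per token position.
    if not isinstance(trace, list) or not all(isinstance(e, list) for e in trace):
        logger.warning("prefill routed experts payload is malformed; ignoring it")
        return None
    return trace
```

Whether to degrade gracefully like this or to fail the request is a design choice; for router replay, a hard failure may be preferable to a silently partial trace.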
