vllm - ✅(Solved) Fix [Feature]: Support `routed_experts` export in disaggregated Prefill/Decode serving [1 pull requests, 1 participants]

vllm 2026-03-30 07:38:10

GitHub stats

vllm-project/vllm#38523 • Fetched 2026-04-08 01:53:34

Author: Lecooo
Participants: Lecooo
Assignees: chaunceyjiang
Timeline (top): assigned ×1, labeled ×1

Root Cause

An out-of-tree patch is difficult for downstream users to maintain because it touches the scheduler, serving, protocol, and RequestOutput.add() behavior.

Fix / Workaround

An out-of-tree implementation can patch this behavior by serializing prefill routed experts into kv_transfer_params, merging them on the decode/serving side, and aggregating routed experts across streamed outputs.
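
As a rough illustration of the first step, the sketch below packs a routed-experts trace into a kv_transfer_params-style dict and reads it back. It assumes the dict has to stay JSON-serializable, so the pickled bytes are base64-encoded. The key name, both helper names, and the sample dict contents are hypothetical and are not taken from vLLM or from the PR.

```python
import base64
import json
import pickle

# Hypothetical key name; the source does not say which key is used.
ROUTED_EXPERTS_KEY = "routed_experts"


def pack_routed_experts(kv_transfer_params, routed_experts):
    """Return a copy of kv_transfer_params carrying the prefill routed experts."""
    payload = base64.b64encode(pickle.dumps(routed_experts)).decode("ascii")
    return {**kv_transfer_params, ROUTED_EXPERTS_KEY: payload}


def unpack_routed_experts(kv_transfer_params):
    """Return the prefill routed experts, or None when the key is absent."""
    payload = kv_transfer_params.get(ROUTED_EXPERTS_KEY)
    if payload is None:
        return None
    # pickle.loads must only see payloads from a trusted prefill instance.
    return pickle.loads(base64.b64decode(payload))


# Placeholder dict contents, for illustration only.
params = pack_routed_experts({"example_key": "demo"}, [[0, 3], [1, 2]])
as_json = json.dumps(params)  # the packed dict is still JSON-serializable
restored = unpack_routed_experts(json.loads(as_json))
```

Pickle is used here only because the example snippets further down this page use it; a plain nested list of integers would travel as JSON directly and avoid unpickling data received over the network.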

PR fix notes

PR #39289: [Feature]: Support routed_experts export in disaggregated Prefill/Decode serving

Repository: vllm-project/vllm
Author: chaunceyjiang
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39289

Description (problem / solution / changelog)

Purpose

FIX #38523

Test Plan

# Proxy
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --prefiller-hosts 172.16.1.247 --decoder-hosts 172.16.1.247 --decoder-ports 8002 --prefiller-ports 8001 --port 800

# Decode
vllm serve /mnt/data3/models/Qwen/Qwen3.5-35B-A3B --enable-auto-tool-choice --tool-call-parser qwen3_coder  --served-model-name my-model  --enable_return_routed_experts --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}' --port 8002 --no-disable-hybrid-kv-cache-manager


# Prefill
vllm serve /mnt/data3/models/Qwen/Qwen3.5-35B-A3B --enable-auto-tool-choice --tool-call-parser qwen3_coder  --served-model-name my-model  --enable_return_routed_experts --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail"}' --port 8001 --no-disable-hybrid-kv-cache-manager

Test Result

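from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8003/v1"   # Proxy port
stream = False  # non-streaming, so response.choices can be iterated below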

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
messages = [{"role": "user", "content": "9.11 and 9.8, which is lower?"}]
response = client.chat.completions.create(
    model="my-model",
    messages=messages,
    stream=stream,
    n=1,
    max_tokens=5
)

for choice in response.choices:
    print("---")
    print(len(choice.routed_experts))

---
28

Changed files

vllm/entrypoints/openai/chat_completion/protocol.py (modified, +8/-0)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +10/-0)
vllm/v1/core/sched/scheduler.py (modified, +16/-0)

🚀 The feature, motivation and pitch

enable_return_routed_experts works for single-instance deployments, but it does not work end-to-end in disaggregated Prefill/Decode (PD) serving today.

The current routed-experts implementation is based on local shared memory and local KV slot mapping, while the PD path only transfers kv_transfer_params. As a result, routed expert data produced on the prefill instance is not transferred to the decode instance, and the final response cannot return a complete routed-expert trace for the whole request.

It would be helpful if vLLM could officially support routed-experts export in PD deployments, for example by:

carrying prefill-side routed expert data across the PD boundary
merging prefill and decode routed experts in the final output
exposing an optional routed_experts field in the response schema

This would make enable_return_routed_experts usable for MoE debugging, router replay, and expert-load analysis in PD deployments as well.
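
The optional routed_experts field requested above could look roughly like the sketch below. It uses a plain dataclass as a stand-in; the class name and its other fields are hypothetical and do not reflect vLLM's actual protocol classes. Defaulting the field to None is one way to leave responses unchanged for clients that never enable the feature.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ChoiceSketch:
    """Hypothetical stand-in for a chat completion choice, not vLLM's class."""

    index: int
    text: str
    # None when routed-experts export is disabled, so existing clients
    # see the same payload shape as before.
    routed_experts: Optional[list] = None


plain = ChoiceSketch(index=0, text="9.8")
traced = ChoiceSketch(index=0, text="9.8", routed_experts=[[0, 3], [1, 2]])
traced_dict = asdict(traced)
```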

Alternatives

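An out-of-tree implementation can patch this behavior by serializing prefill routed experts into kv_transfer_params, merging them on the decode/serving side, and aggregating routed experts across streamed outputs. However, this is difficult for downstream users to maintain because it touches scheduler, serving, protocol, and RequestOutput.add() behavior.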

Additional context

We have validated this approach in an internal fork based on the current vLLM routed-experts implementation:

prefill routed experts are serialized into kv_transfer_params
decode-side routed experts are concatenated with the prefill payload
RequestOutput.add() aggregates routed_experts
PD decode skips prompt-token ranges when reading local routed experts

It would be great to have an official upstream implementation for this workflow.
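
The concatenation and prompt-range skipping described above can be sketched as follows. This is a simplified illustration, not the fork's code: it assumes one trace entry per token position on both sides, and the function name and sample data are made up for the example.

```python
def merge_pd_routed_experts(prefill_experts, decode_local_experts, num_prompt_tokens):
    """Concatenate prefill entries with the decode entries past the prompt range.

    Assumes one entry per token position on both sides (an assumption made
    for illustration; the real layout may differ).
    """
    # Skip the prompt-token range on the decode side: in this sketch those
    # positions were computed on the prefill instance, so the decode
    # instance's local entries for them are treated as unusable.
    decode_only = decode_local_experts[num_prompt_tokens:]
    return list(prefill_experts) + list(decode_only)


prefill = [[0, 1], [2, 3], [4, 5]]                 # 3 prompt tokens
decode_local = [None, None, None, [6, 7], [0, 2]]  # prompt slots unused locally
merged = merge_pd_routed_experts(prefill, decode_local, num_prompt_tokens=3)
```

The merged list then has one entry per prompt token followed by one per generated token, which is the "complete trace" the issue asks for.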

extent analysis

Fix Plan

To make enable_return_routed_experts work in disaggregated Prefill/Decode (PD) serving, three things are needed:

Carry prefill-side routed expert data across the PD boundary
Merge prefill and decode routed experts in the final output
Expose an optional routed_experts field in the response schema

Here are the concrete steps:

Serialize prefill routed experts: Modify the prefill instance to serialize routed experts into kv_transfer_params.
Deserialize and merge routed experts: Update the decode instance to deserialize prefill routed experts from kv_transfer_params and merge them with decode-side routed experts.
Aggregate routed experts: Modify RequestOutput.add() to aggregate routed_experts across streamed outputs.

Example code snippets:

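import pickle

# Serialize prefill routed experts
def serialize_routed_experts(routed_experts):
    return pickle.dumps(routed_experts)

# Deserialize and merge routed experts
def merge_routed_experts(prefill_payload, decode_routed_experts):
    # Only unpickle payloads that come from a trusted prefill instance.
    prefill_routed_experts = pickle.loads(prefill_payload)
    return prefill_routed_experts + decode_routed_experts

# Aggregate routed experts (illustrative stand-in, not vLLM's actual RequestOutput)
class RequestOutput:
    def __init__(self):
        self.routed_experts = []

    def add(self, routed_experts):
        # Append the entries from the next streamed output
        self.routed_experts.extend(routed_experts)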

Verification

To verify the fix, test the enable_return_routed_experts feature in a PD deployment and check that the final response contains a complete routed-expert trace for the whole request.
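
One way to turn this check into something scriptable is a small helper like the one below. It is only a sketch: the helper name is hypothetical, and the expected length is supplied by the caller because this page does not document what one trace entry corresponds to (the test result above printed a length of 28 for a request with max_tokens=5).

```python
def check_routed_experts_trace(routed_experts, expected_len):
    """Return a list of problems found in a routed-experts trace (empty if none)."""
    problems = []
    if routed_experts is None:
        problems.append("routed_experts is missing from the response")
        return problems
    if len(routed_experts) != expected_len:
        problems.append(
            f"expected {expected_len} entries, got {len(routed_experts)}"
        )
    return problems


ok = check_routed_experts_trace([[0, 1], [2, 3]], expected_len=2)
missing = check_routed_experts_trace(None, expected_len=2)
short = check_routed_experts_trace([[0, 1]], expected_len=2)
```

Against a live deployment, the first argument would be choice.routed_experts from the client loop shown in the test result.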

Extra Tips

Ensure that the kv_transfer_params serialization and deserialization are efficient and scalable.
Consider adding error handling and logging to handle cases where routed expert data is missing or corrupted.
Review the response schema to ensure that the routed_experts field is properly defined and documented.
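
The error-handling tip could be sketched as below: a decoder that logs a warning and returns None instead of failing the request when the prefill payload is missing or corrupted. The payload encoding (base64 over pickle), the logger name, and the function name are assumptions made for this example, not details from the PR.

```python
import base64
import logging
import pickle

logger = logging.getLogger("routed_experts_sketch")  # hypothetical logger name


def safe_decode_routed_experts(payload):
    """Decode a base64-encoded pickle payload; return None if missing or corrupt."""
    if payload is None:
        logger.warning("prefill routed experts missing from kv_transfer_params")
        return None
    try:
        raw = base64.b64decode(payload, validate=True)
        # Only unpickle payloads that come from a trusted prefill instance.
        return pickle.loads(raw)
    except Exception as exc:
        # Malformed input can raise several exception types (bad base64,
        # truncated pickle, and so on), so the catch is deliberately broad.
        logger.warning("prefill routed experts corrupted: %s", exc)
        return None


good = base64.b64encode(pickle.dumps([[0, 3]])).decode("ascii")
decoded = safe_decode_routed_experts(good)
```

Returning None lets the caller decide whether to fall back to the decode-side trace alone or to surface an error to the client.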
