vllm - ✅(Solved) Fix [Bug]: parity with CUDA & parity with rocm sglang: vLLM router doesn't current support MoRI kvcache connector [1 pull requests, 7 comments, 4 participants]

functionstackx · 2026-04-01T05:30:51Z


vllm-project/vllm#38692•Fetched 2026-04-08 02:23:30

Fix Action

Fixed

Fixed by PR: [Draft, no merge] MVP for vLLM Disagg (https://github.com/SemiAnalysisAI/InferenceX/pull/948)

PR fix notes

PR #948: [Draft, no merge] MVP for vLLM Disagg

Repository: SemiAnalysisAI/InferenceX
Author: chunfangamd
State: open | merged: False
Link: https://github.com/SemiAnalysisAI/InferenceX/pull/948

Description (problem / solution / changelog)

We prototyped prefill/decode (PD) disaggregation on DeepSeek. So far, we have done the following:

Framework Integration: Successfully integrated vLLM’s Prefill/Decode disaggregated architecture into the InferenceX framework.
MoRI/Deep EP Integration on DPSK V3: for MoE All-to-All and for KV Cache transfer.
Confirmed MoRI-IO outperforms the standard NixlConnector in TTFT (Time to First Token).
Stability & Bug Fixes:
Resolved hang issues at high concurrency (CONC512) by fixing KV cache leaks and optimizing the "Reaper" logic for memory block release.
Fixed hardware compatibility issues, including PCI topology failures on Broadcom switches and error handling for mlx5 NICs.
Cluster Validation: Verified the multi-node deployment recipe on the SA-9N (mia1) cluster.
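
The TTFT comparison mentioned above rests on a simple measurement: the time between sending a request and receiving its first streamed token. A minimal sketch of that arithmetic follows. The function names `ttft_seconds` and `summarize` are hypothetical, the timestamps are illustrative values rather than results from this PR, and nothing here is taken from the benchmark scripts in the changed files.

```python
from statistics import mean, median


def ttft_seconds(request_sent_at: float, token_arrival_times: list[float]) -> float:
    """Return time to first token: earliest token arrival minus request send time.

    Both arguments are timestamps in seconds from the same clock
    (for example, values returned by time.perf_counter()).
    """
    if not token_arrival_times:
        raise ValueError("no tokens were received for this request")
    return min(token_arrival_times) - request_sent_at


def summarize(ttfts: list[float]) -> dict[str, float]:
    """Aggregate per-request TTFT values so two connectors can be compared."""
    return {"mean": mean(ttfts), "median": median(ttfts)}


# Illustrative timestamps only, not measurements.
one_request = ttft_seconds(10.0, [10.25, 10.5])
stats = summarize([0.25, 0.5, 0.75])
```

Running the same workload once per KV connector and comparing the two summaries is one way to check a claim like the one above; percentiles are often reported as well.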

co-authors:

@ichbinblau
@ChuanLi1101
@billishyahao

Thanks to @functionstackx

Changed files

.github/configs/amd-master.yaml (modified, +75/-0)
benchmarks/multi_node/dsr1_fp8_mi355x_vllm-disagg.sh (added, +79/-0)
benchmarks/multi_node/vllm_disagg_utils/bench.sh (added, +75/-0)
benchmarks/multi_node/vllm_disagg_utils/env.sh (added, +98/-0)
benchmarks/multi_node/vllm_disagg_utils/job.slurm (added, +358/-0)
benchmarks/multi_node/vllm_disagg_utils/models.yaml (added, +41/-0)
benchmarks/multi_node/vllm_disagg_utils/moriio_proxy.py (added, +326/-0)
benchmarks/multi_node/vllm_disagg_utils/server.sh (added, +490/-0)
benchmarks/multi_node/vllm_disagg_utils/setup_deps.sh (added, +848/-0)
benchmarks/multi_node/vllm_disagg_utils/start_etcd.sh (added, +47/-0)
benchmarks/multi_node/vllm_disagg_utils/submit.sh (added, +166/-0)
benchmarks/multi_node/vllm_disagg_utils/sync.py (added, +201/-0)
runners/launch_mi355x-amds.sh (modified, +12/-3)
utils/bench_serving/backend_request_func.py (modified, +104/-66)
utils/bench_serving/benchmark_serving.py (modified, +43/-15)

Your current environment

all the nightly images in https://hub.docker.com/r/vllm/vllm-openai-rocm/tags as of April 1st, 2026

vllm/vllm-openai-rocm:v0.18.1

vllm/vllm-openai-rocm:v0.18.0

🐛 Describe the bug

hi @hongxiayang

+viz @powderluv @chunfangamd @andyluo7 @ChuanLi1101

vLLM router does not currently work with the MoRI kvcache connector for ROCm disagg, whereas on the CUDA side, vLLM router works with NIXL, the NVIDIA equivalent of the MoRI kvcache connector.

On the ROCm vLLM stack, users can currently only use RIXL (a second-class fork of NIXL, which unfortunately isn't included out of the box in the docker image) or the MORIIO kvcache toy proxy server, which is not prod ready.

On the ROCm SGLang stack, the SGLang equivalent of vLLM router, called the sglang model gateway server, already supports MoRI kvcache transfer. Let's ensure that the ROCm vLLM experience is at parity with the ROCm SGLang experience: https://github.com/sgl-project/sglang/pull/14626

Can you look into having vLLM router support MoRI kvcache transfer? Thanks.

The expected user experience is that the python wheel from upstream PyPI and the docker image should support it: https://hub.docker.com/r/vllm/vllm-router/tags

https://github.com/vllm-project/vllm/blob/main/examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py
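
For readers unfamiliar with what a router or proxy does in a disaggregated setup, here is a minimal sketch of the routing step only: each request is assigned one prefill instance and one decode instance. It assumes simple round-robin selection. The class `PDRouter` and the URLs are hypothetical, and this is neither the vLLM router nor the MORIIO toy proxy linked above. The KV cache transfer between the two instances is handled by the connector (MoRI-IO, NIXL, or RIXL) and is not modeled here.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Iterator


@dataclass
class PDRouter:
    """Hypothetical round-robin selector of (prefill, decode) instance pairs."""

    prefill_urls: list[str]
    decode_urls: list[str]
    _prefill: Iterator[str] = field(init=False, repr=False)
    _decode: Iterator[str] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        if not self.prefill_urls or not self.decode_urls:
            raise ValueError("need at least one prefill and one decode instance")
        # cycle() repeats each pool endlessly, which gives round-robin order.
        self._prefill = cycle(self.prefill_urls)
        self._decode = cycle(self.decode_urls)

    def pick_pair(self) -> tuple[str, str]:
        """Return the next (prefill_url, decode_url) pair for a new request."""
        return next(self._prefill), next(self._decode)


# Placeholder addresses, assumed for illustration.
router = PDRouter(
    prefill_urls=["http://prefill-0:8000", "http://prefill-1:8000"],
    decode_urls=["http://decode-0:8000"],
)
first = router.pick_pair()
second = router.pick_pair()
third = router.pick_pair()
```

A production router would add health checks, load-aware selection, and connector-specific handshaking, which is the gap the issue describes for MoRI on ROCm.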

extent analysis

TL;DR

The vLLM router needs to be updated to support MoRI kvcache transfer on ROCm, to reach parity with both the CUDA vLLM experience (which uses NIXL) and the ROCm SGLang experience.

Guidance

Review the sglang model gateway server implementation to understand how MoRI kvcache transfer is supported, as seen in https://github.com/sgl-project/sglang/pull/14626.
Investigate modifying the vLLM router to include support for MoRI kvcache connector, potentially using the MORIIO kvcache toy proxy server as a reference.
Consider including RIXL in the docker image or providing clear instructions for users to set it up, as it is currently the only alternative to the MORIIO kvcache toy proxy server.
Verify that the python wheel from upstream pypi and the docker image support the updated vLLM router with MoRI kvcache transfer.

Notes

The vLLM router does not currently support the MoRI kvcache connector. On ROCm, users are limited to RIXL, which is not included in the docker image, or the MORIIO kvcache toy proxy server, which is not production-ready.

Recommendation

Add MoRI kvcache transfer support to the vLLM router to reach parity with the ROCm SGLang experience. Until that support lands, RIXL and the MORIIO kvcache toy proxy server remain the only options on ROCm.
