vllm - ✅(Solved) Fix [Roadmap] DeepSeek V4 [1 pull requests, 1 participants]

ivanium · 2026-04-26T05:29:46Z



GitHub stats

vllm-project/vllm#40902 · Fetched 2026-04-27 05:29:27

Author: ivanium
Participants: ivanium
Timeline (top): subscribed ×10, mentioned ×4, labeled ×3, cross-referenced ×1

Fix Action

Fix / Workaround

PD + High throughput Optimization

Nvlink_one_sided a2a: support bf16 and mxfp8 dispatch
Nvlink_one_sided a2a: support FP8 quantized combine

PR fix notes

PR #40860: [Feat] DeepSeek V4 Rebased

Repository: vllm-project/vllm
Author: ivanium
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/40860

Description (problem / solution / changelog)

Purpose

Rebased version of #40760

Roadmap: https://github.com/vllm-project/vllm/issues/40902

Co-authored by: Bugen Zhao, Giancarlo Delfin, Jie Li, Kaichao You, Roy Wang, Woosuk Kwon, Yifan Qiao, Yongye Zhu, Zhewen Li, Zijing Liu, Zixi Qi

Test Plan

Test Result


Changed files

CMakeLists.txt (modified, +5/-2)
cmake/external_projects/deepgemm.cmake (modified, +6/-1)
cmake/external_projects/flashmla.cmake (modified, +1/-1)
csrc/cpu/pos_encoding.cpp (modified, +6/-1)
csrc/cpu/torch_bindings.cpp (modified, +2/-1)
csrc/fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu (added, +477/-0)
csrc/layernorm_kernels.cu (modified, +15/-7)
csrc/layernorm_quant_kernels.cu (modified, +28/-10)
csrc/moe/moe_ops.h (modified, +9/-0)
csrc/moe/topk_softplus_sqrt_kernels.cu (added, +715/-0)
csrc/moe/torch_bindings.cpp (modified, +8/-0)
csrc/ops.h (modified, +7/-1)
csrc/persistent_topk.cuh (modified, +17/-16)
csrc/pos_encoding_kernels.cu (modified, +39/-33)
csrc/sampler.cu (modified, +7/-1)
csrc/topk.cu (modified, +59/-35)
csrc/torch_bindings.cpp (modified, +15/-1)
docs/design/attention_backends.md (modified, +1/-1)
docs/models/supported_models.md (modified, +4/-3)
requirements/cuda.txt (modified, +2/-0)
tests/compile/fusions_e2e/conftest.py (modified, +5/-0)
tests/kernels/attention/test_deepgemm_attention.py (modified, +22/-12)
tests/kernels/core/test_fused_q_kv_rmsnorm.py (added, +81/-0)
tests/kernels/moe/test_deepgemm.py (modified, +195/-1)
tests/kernels/moe/test_topk_softplus_sqrt.py (added, +186/-0)
tests/kernels/test_compressor_kv_cache.py (added, +311/-0)
tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py (added, +359/-0)
tests/kernels/test_fused_indexer_q_rope_quant.py (added, +98/-0)
tests/kernels/test_fused_inv_rope_fp8_quant.py (added, +908/-0)
tests/kernels/test_top_k_per_row.py (modified, +5/-10)
tests/model_executor/test_routed_experts_capture.py (modified, +3/-1)
tests/models/registry.py (modified, +9/-0)
tests/models/test_deepseek_v4_mega_moe.py (added, +184/-0)
tests/reasoning/test_deepseekv3_reasoning_parser.py (modified, +7/-0)
tests/tokenizers_/fixtures/deepseek_v4/test_input_1.json (added, +81/-0)
tests/tokenizers_/fixtures/deepseek_v4/test_input_2.json (added, +24/-0)
tests/tokenizers_/fixtures/deepseek_v4/test_input_3.json (added, +159/-0)
tests/tokenizers_/fixtures/deepseek_v4/test_input_4.json (added, +28/-0)
tests/tokenizers_/fixtures/deepseek_v4/test_output_1.txt (added, +36/-0)
tests/tokenizers_/fixtures/deepseek_v4/test_output_2.txt (added, +1/-0)
tests/tokenizers_/fixtures/deepseek_v4/test_output_3.txt (added, +38/-0)
tests/tokenizers_/fixtures/deepseek_v4/test_output_4.txt (added, +29/-0)
tests/tokenizers_/test_deepseek_v4.py (added, +224/-0)
tests/tool_parsers/test_deepseekv4_tool_parser.py (added, +123/-0)
tests/v1/attention/test_indexer_deepseek_v4_slot_mapping.py (added, +92/-0)
tests/v1/core/test_kv_cache_utils.py (modified, +3/-2)
tests/v1/core/test_prefix_caching.py (modified, +19/-20)
tests/v1/core/test_scheduler.py (modified, +2/-0)
tests/v1/kv_connector/unit/test_mooncake_connector.py (modified, +27/-23)
tests/v1/kv_connector/unit/test_mooncake_connector_hma.py (added, +410/-0)
tests/v1/streaming_input/test_scheduler_streaming.py (modified, +1/-0)
tools/install_deepgemm.sh (modified, +1/-1)
vllm/_custom_ops.py (modified, +41/-3)
vllm/config/attention.py (modified, +3/-0)
vllm/config/cache.py (modified, +14/-0)
vllm/config/compilation.py (modified, +1/-0)
vllm/config/kernel.py (modified, +6/-4)
vllm/config/model.py (modified, +5/-1)
vllm/config/speculative.py (modified, +11/-1)
vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py (modified, +143/-40)
vllm/entrypoints/chat_utils.py (modified, +9/-0)
vllm/model_executor/kernels/linear/scaled_mm/deep_gemm.py (modified, +2/-0)
vllm/model_executor/layers/attention/mla_attention.py (modified, +16/-0)
vllm/model_executor/layers/deepseek_compressor.py (added, +438/-0)
vllm/model_executor/layers/deepseek_v4_attention.py (added, +1076/-0)
vllm/model_executor/layers/fused_moe/config.py (modified, +45/-1)
vllm/model_executor/layers/fused_moe/experts/deep_gemm_moe.py (modified, +234/-1)
vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py (modified, +193/-2)
vllm/model_executor/layers/fused_moe/experts/trtllm_mxfp4_moe.py (modified, +84/-60)
vllm/model_executor/layers/fused_moe/fused_marlin_moe.py (modified, +19/-5)
vllm/model_executor/layers/fused_moe/fused_moe_method_base.py (modified, +1/-0)
vllm/model_executor/layers/fused_moe/layer.py (modified, +7/-0)
vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +409/-10)
vllm/model_executor/layers/fused_moe/router/base_router.py (modified, +5/-1)
vllm/model_executor/layers/fused_moe/router/custom_routing_router.py (modified, +2/-0)
vllm/model_executor/layers/fused_moe/router/fused_moe_router.py (modified, +2/-0)
vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py (modified, +84/-16)
vllm/model_executor/layers/fused_moe/router/fused_topk_router.py (modified, +2/-0)
vllm/model_executor/layers/fused_moe/router/grouped_topk_router.py (modified, +3/-0)
vllm/model_executor/layers/fused_moe/router/router_factory.py (modified, +9/-2)
vllm/model_executor/layers/fused_moe/router/routing_simulator_router.py (modified, +2/-0)
vllm/model_executor/layers/fused_moe/router/zero_expert_router.py (modified, +2/-0)
vllm/model_executor/layers/fused_moe/runner/moe_runner.py (modified, +13/-0)
vllm/model_executor/layers/fused_moe/runner/moe_runner_interface.py (modified, +1/-0)
vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +1/-0)
vllm/model_executor/layers/fused_moe/utils.py (modified, +18/-0)
vllm/model_executor/layers/mhc.py (added, +450/-0)
vllm/model_executor/layers/quantization/__init__.py (modified, +3/-0)
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_nvfp4.py (modified, +1/-0)
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a8_int8.py (modified, +1/-0)
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w8a8_fp8.py (modified, +1/-0)
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w8a8_mxfp8.py (modified, +1/-0)
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16_marlin.py (modified, +1/-0)
vllm/model_executor/layers/quantization/fp8.py (modified, +3/-0)
vllm/model_executor/layers/quantization/modelopt.py (modified, +3/-0)
vllm/model_executor/layers/quantization/mxfp4.py (modified, +333/-0)
vllm/model_executor/layers/quantization/online/moe_base.py (modified, +1/-0)
vllm/model_executor/layers/quantization/quark/quark_moe.py (modified, +1/-0)
vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +201/-10)
vllm/model_executor/layers/rotary_embedding/__init__.py (modified, +16/-7)
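To see where the bulk of the change sits, the entries above can be tallied per top-level directory. The following is a minimal sketch, assuming the `path (status, +N/-M)` line format used in this list; the names `parse_entry` and `churn_by_top_dir` are hypothetical helpers written for illustration and are not part of vLLM or the PR.

```python
import re
from collections import defaultdict

# Matches one entry in the "path (status, +A/-D)" form used by the list above.
ENTRY_RE = re.compile(
    r"^(?P<path>\S+) \((?P<status>\w+), \+(?P<added>\d+)/-(?P<deleted>\d+)\)$"
)


def parse_entry(line):
    """Return (path, status, added, deleted), or None if the line does not match."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return m["path"], m["status"], int(m["added"]), int(m["deleted"])


def churn_by_top_dir(lines):
    """Sum added and deleted line counts per top-level path component."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue  # skip headings, blank lines, and truncated entries
        path, _status, added, deleted = entry
        # A root-level file such as CMakeLists.txt becomes its own bucket.
        top = path.split("/", 1)[0]
        totals[top][0] += added
        totals[top][1] += deleted
    return {top: tuple(counts) for top, counts in totals.items()}


# A few entries copied from the list above.
sample = [
    "csrc/moe/topk_softplus_sqrt_kernels.cu (added, +715/-0)",
    "csrc/topk.cu (modified, +59/-35)",
    "vllm/model_executor/layers/mhc.py (added, +450/-0)",
    "tools/install_deepgemm.sh (modified, +1/-1)",
]
summary = churn_by_top_dir(sample)
for top, (added, deleted) in sorted(summary.items()):
    print(f"{top}: +{added}/-{deleted}")
```

Feeding the full list through `churn_by_top_dir` gives a quick split between kernel sources (`csrc/`), tests, and Python model code. Since the list on this page is cut off, any totals computed from it are lower bounds.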

extent analysis

TL;DR

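PR #40860 (merged) is the rebased DeepSeek V4 feature PR for vLLM: https://github.com/vllm-project/vllm/pull/40860. Review its implementation, and use the roadmap issue (https://github.com/vllm-project/vllm/issues/40902) to track the remaining performance and compatibility work.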

Guidance

Examine the Performance Dashboard on InferenceX to identify areas for optimization: https://inferencex.semianalysis.com/inference
Investigate the FP4 Indexer and MegaMoE support implemented in https://github.com/vllm-project/vllm/pull/40860 for potential improvements
Review the Roadmap to prioritize tasks, focusing on Low Latency Optimization and PD + High throughput Optimization
Consider collaborating with contributors like @zyongye and @WoosukKwon for specific tasks, such as NVFP4 support and Model Runner V2 Integration

Notes

The issue lacks specific error messages or technical details, making it challenging to provide a targeted solution. The guidance provided is based on the available information and may require further investigation to address the underlying issues.

Recommendation

Apply workaround: Review and integrate the implementation updates from https://github.com/vllm-project/vllm/pull/40860 to address potential performance and compatibility issues.

