vllm - ✅ (Solved) Fix [Roadmap] DeepSeek V4 [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#40902 · Fetched 2026-04-27 05:29:27
Comments: 0
Participants: 1
Timeline: 19
Reactions: 10
Timeline (top): subscribed ×10, mentioned ×4, labeled ×3, cross-referenced ×1

Fix Action

Fix / Workaround

PD + High throughput Optimization

  • Nvlink_one_sided a2a: support bf16 and mxfp8 dispatch
  • Nvlink_one_sided a2a: support FP8 quantized combine

PR fix notes

PR #40860: [Feat] DeepSeek V4 Rebased

Description (problem / solution / changelog)

Purpose

Rebased version of #40760

Roadmap: https://github.com/vllm-project/vllm/issues/40902

Co-authored by: Bugen Zhao, Giancarlo Delfin, Jie Li, Kaichao You, Roy Wang, Woosuk Kwon, Yifan Qiao, Yongye Zhu, Zhewen Li, Zijing Liu, Zixi Qi

Test Plan

Test Result



Changed files

  • CMakeLists.txt (modified, +5/-2)
  • cmake/external_projects/deepgemm.cmake (modified, +6/-1)
  • cmake/external_projects/flashmla.cmake (modified, +1/-1)
  • csrc/cpu/pos_encoding.cpp (modified, +6/-1)
  • csrc/cpu/torch_bindings.cpp (modified, +2/-1)
  • csrc/fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu (added, +477/-0)
  • csrc/layernorm_kernels.cu (modified, +15/-7)
  • csrc/layernorm_quant_kernels.cu (modified, +28/-10)
  • csrc/moe/moe_ops.h (modified, +9/-0)
  • csrc/moe/topk_softplus_sqrt_kernels.cu (added, +715/-0)
  • csrc/moe/torch_bindings.cpp (modified, +8/-0)
  • csrc/ops.h (modified, +7/-1)
  • csrc/persistent_topk.cuh (modified, +17/-16)
  • csrc/pos_encoding_kernels.cu (modified, +39/-33)
  • csrc/sampler.cu (modified, +7/-1)
  • csrc/topk.cu (modified, +59/-35)
  • csrc/torch_bindings.cpp (modified, +15/-1)
  • docs/design/attention_backends.md (modified, +1/-1)
  • docs/models/supported_models.md (modified, +4/-3)
  • requirements/cuda.txt (modified, +2/-0)
  • tests/compile/fusions_e2e/conftest.py (modified, +5/-0)
  • tests/kernels/attention/test_deepgemm_attention.py (modified, +22/-12)
  • tests/kernels/core/test_fused_q_kv_rmsnorm.py (added, +81/-0)
  • tests/kernels/moe/test_deepgemm.py (modified, +195/-1)
  • tests/kernels/moe/test_topk_softplus_sqrt.py (added, +186/-0)
  • tests/kernels/test_compressor_kv_cache.py (added, +311/-0)
  • tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py (added, +359/-0)
  • tests/kernels/test_fused_indexer_q_rope_quant.py (added, +98/-0)
  • tests/kernels/test_fused_inv_rope_fp8_quant.py (added, +908/-0)
  • tests/kernels/test_top_k_per_row.py (modified, +5/-10)
  • tests/model_executor/test_routed_experts_capture.py (modified, +3/-1)
  • tests/models/registry.py (modified, +9/-0)
  • tests/models/test_deepseek_v4_mega_moe.py (added, +184/-0)
  • tests/reasoning/test_deepseekv3_reasoning_parser.py (modified, +7/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_input_1.json (added, +81/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_input_2.json (added, +24/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_input_3.json (added, +159/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_input_4.json (added, +28/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_output_1.txt (added, +36/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_output_2.txt (added, +1/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_output_3.txt (added, +38/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_output_4.txt (added, +29/-0)
  • tests/tokenizers_/test_deepseek_v4.py (added, +224/-0)
  • tests/tool_parsers/test_deepseekv4_tool_parser.py (added, +123/-0)
  • tests/v1/attention/test_indexer_deepseek_v4_slot_mapping.py (added, +92/-0)
  • tests/v1/core/test_kv_cache_utils.py (modified, +3/-2)
  • tests/v1/core/test_prefix_caching.py (modified, +19/-20)
  • tests/v1/core/test_scheduler.py (modified, +2/-0)
  • tests/v1/kv_connector/unit/test_mooncake_connector.py (modified, +27/-23)
  • tests/v1/kv_connector/unit/test_mooncake_connector_hma.py (added, +410/-0)
  • tests/v1/streaming_input/test_scheduler_streaming.py (modified, +1/-0)
  • tools/install_deepgemm.sh (modified, +1/-1)
  • vllm/_custom_ops.py (modified, +41/-3)
  • vllm/config/attention.py (modified, +3/-0)
  • vllm/config/cache.py (modified, +14/-0)
  • vllm/config/compilation.py (modified, +1/-0)
  • vllm/config/kernel.py (modified, +6/-4)
  • vllm/config/model.py (modified, +5/-1)
  • vllm/config/speculative.py (modified, +11/-1)
  • vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py (modified, +143/-40)
  • vllm/entrypoints/chat_utils.py (modified, +9/-0)
  • vllm/model_executor/kernels/linear/scaled_mm/deep_gemm.py (modified, +2/-0)
  • vllm/model_executor/layers/attention/mla_attention.py (modified, +16/-0)
  • vllm/model_executor/layers/deepseek_compressor.py (added, +438/-0)
  • vllm/model_executor/layers/deepseek_v4_attention.py (added, +1076/-0)
  • vllm/model_executor/layers/fused_moe/config.py (modified, +45/-1)
  • vllm/model_executor/layers/fused_moe/experts/deep_gemm_moe.py (modified, +234/-1)
  • vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py (modified, +193/-2)
  • vllm/model_executor/layers/fused_moe/experts/trtllm_mxfp4_moe.py (modified, +84/-60)
  • vllm/model_executor/layers/fused_moe/fused_marlin_moe.py (modified, +19/-5)
  • vllm/model_executor/layers/fused_moe/fused_moe_method_base.py (modified, +1/-0)
  • vllm/model_executor/layers/fused_moe/layer.py (modified, +7/-0)
  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +409/-10)
  • vllm/model_executor/layers/fused_moe/router/base_router.py (modified, +5/-1)
  • vllm/model_executor/layers/fused_moe/router/custom_routing_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/router/fused_moe_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py (modified, +84/-16)
  • vllm/model_executor/layers/fused_moe/router/fused_topk_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/router/grouped_topk_router.py (modified, +3/-0)
  • vllm/model_executor/layers/fused_moe/router/router_factory.py (modified, +9/-2)
  • vllm/model_executor/layers/fused_moe/router/routing_simulator_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/router/zero_expert_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/runner/moe_runner.py (modified, +13/-0)
  • vllm/model_executor/layers/fused_moe/runner/moe_runner_interface.py (modified, +1/-0)
  • vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +1/-0)
  • vllm/model_executor/layers/fused_moe/utils.py (modified, +18/-0)
  • vllm/model_executor/layers/mhc.py (added, +450/-0)
  • vllm/model_executor/layers/quantization/__init__.py (modified, +3/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_nvfp4.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a8_int8.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w8a8_fp8.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w8a8_mxfp8.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16_marlin.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/fp8.py (modified, +3/-0)
  • vllm/model_executor/layers/quantization/modelopt.py (modified, +3/-0)
  • vllm/model_executor/layers/quantization/mxfp4.py (modified, +333/-0)
  • vllm/model_executor/layers/quantization/online/moe_base.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/quark/quark_moe.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +201/-10)
  • vllm/model_executor/layers/rotary_embedding/__init__.py (modified, +16/-7)
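
The file list above is long, so a per-status tally can help when deciding where to start a review. The following is a minimal sketch, assuming each entry keeps the "path (status, +added/-deleted)" shape shown above; the names `ENTRY` and `tally` are hypothetical helpers written for this illustration and are not part of vLLM.

```python
import re
from collections import defaultdict

# Matches entries such as "  • CMakeLists.txt (modified, +5/-2)".
# The leading bullet is optional so plain "path (status, +a/-b)" lines also match.
ENTRY = re.compile(
    r"^\s*(?:•\s*)?(?P<path>\S+) "
    r"\((?P<status>\w+), \+(?P<added>\d+)/-(?P<deleted>\d+)\)\s*$"
)


def tally(lines):
    """Return {status: (file_count, lines_added, lines_deleted)}."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        match = ENTRY.match(line)
        if match is None:
            continue  # skip headings, blank lines, and anything else
        bucket = totals[match["status"]]
        bucket[0] += 1
        bucket[1] += int(match["added"])
        bucket[2] += int(match["deleted"])
    return {status: tuple(values) for status, values in totals.items()}


# Three entries copied from the list above, used as sample input.
sample = [
    "  • CMakeLists.txt (modified, +5/-2)",
    "  • csrc/moe/topk_softplus_sqrt_kernels.cu (added, +715/-0)",
    "  • vllm/model_executor/layers/mhc.py (added, +450/-0)",
]

for status, (files, added, deleted) in sorted(tally(sample).items()):
    print(f"{status}: {files} file(s), +{added}/-{deleted}")
```

Feeding the full list into `tally` gives the same breakdown for every entry captured on this page.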

Implementation

The latest implementation updates are tracked in https://github.com/vllm-project/vllm/pull/40860

Performance Dashboard

Please refer to InferenceX: https://inferencex.semianalysis.com/inference

Roadmap

Core Model Support

Low Latency Optimization

  • Multi-stream 4 GEMM in C4A and C128A (Compressor WKV+W_S, SWA WQA+WKV, Indexer W, Indexer Compressor WKV+WS)
  • Fast topk kernel @WoosukKwon
  • Faster fp8 group quantization kernel
  • Enable Allreduce + RMSnorm + fp8 quant fusion
  • Indexer topk + page table transform fusion
  • Specialized pre-Attention GEMM for low BS
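
For readers unfamiliar with the operations named in this list, here is a pure-Python reference of the unfused math behind two of the items: RMSNorm followed by per-group FP8 scaling. It is a sketch under stated assumptions, not the vLLM kernel: it assumes the FP8 format is e4m3 (largest finite value 448), the group size is a free parameter, the function names are hypothetical, and the final rounding to the FP8 grid is deliberately left out.

```python
import math

# Largest finite magnitude of the float8 e4m3fn format (assumed target format).
FP8_E4M3_MAX = 448.0


def rms_norm(x, weight, eps=1e-6):
    """Standard RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def group_quant_scales(x, group_size):
    """Split x into contiguous groups and scale each group into the FP8 range.

    Returns (scaled_values, scales). A real kernel would also round the
    scaled values to the FP8 grid; that cast is omitted here.
    """
    if len(x) % group_size != 0:
        raise ValueError("len(x) must be a multiple of group_size")
    scaled, scales = [], []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        amax = max(abs(v) for v in group)
        # An all-zero group gets scale 1.0 to avoid division by zero.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        for v in group:
            # Clamp to guard against floating-point overshoot.
            scaled.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)))
    return scaled, scales


hidden = [3.0, 4.0, 0.0, 0.0]
normed = rms_norm(hidden, weight=[1.0] * 4, eps=0.0)
scaled, scales = group_quant_scales(normed, group_size=2)
print(normed, scales)
```

Run back to back, the two steps each read and write the full activation. Fusing them, as the roadmap item proposes, would avoid that intermediate round trip; the allreduce step is not modeled here.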

PD + High throughput Optimization

  • Nvlink_one_sided a2a: support bf16 and mxfp8 dispatch
  • Nvlink_one_sided a2a: support FP8 quantized combine

Ultra Long Context

  • Decode Context Parallel Support

Runtime and Parallelism

  • Model Runner V2 Integration @WoosukKwon
  • MTP optimizations
  • Pipeline parallelism support

Kernel Integration

KV Cache

Hardware Support

Extended analysis

TL;DR

DeepSeek V4 support is being implemented in https://github.com/vllm-project/vllm/pull/40860, and this issue is the roadmap that tracks the remaining optimization and integration work. Review that pull request before integrating the changes.

Guidance

  • Examine the Performance Dashboard on InferenceX to identify areas for optimization: https://inferencex.semianalysis.com/inference
  • Investigate the FP4 Indexer and MegaMoE support implemented in https://github.com/vllm-project/vllm/pull/40860 for potential improvements
  • Review the Roadmap to prioritize tasks, focusing on Low Latency Optimization and PD + High throughput Optimization
  • Consider collaborating with contributors like @zyongye and @WoosukKwon for specific tasks, such as NVFP4 support and Model Runner V2 Integration

Notes

This issue is a roadmap rather than a bug report, so it contains no error messages or reproduction steps. The guidance above is based on the roadmap and the linked pull request, and may need further investigation before it is acted on.

Recommendation

Apply workaround: Review and integrate the implementation updates from https://github.com/vllm-project/vllm/pull/40860 to address potential performance and compatibility issues.
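
One way to review the pull request locally is to fetch its head ref, which GitHub exposes as `pull/<number>/head`. The sketch below only builds and prints the git commands rather than running them. It assumes a remote named `origin` that points at vllm-project/vllm; the branch name `deepseek-v4-pr` and the helper `pr_fetch_commands` are hypothetical.

```python
def pr_fetch_commands(pr_number, local_branch, remote="origin"):
    """Return the git argv lists that fetch a GitHub PR into a local branch."""
    refspec = f"pull/{pr_number}/head:{local_branch}"
    return [
        ["git", "fetch", remote, refspec],
        ["git", "checkout", local_branch],
    ]


# Print the commands for PR #40860 so they can be inspected before running.
for argv in pr_fetch_commands(40860, "deepseek-v4-pr"):
    print(" ".join(argv))
```

Each argv list can be passed to `subprocess.run(argv, check=True)` from inside a clone of the repository once the commands look right.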
