vllm - ✅ (Solved) Fix [Feature]: Support DeepSeek V4 flash on SM120 with Triton fallback [2 pull requests, 1 comment, 1 participant]

GitHub stats
vllm-project/vllm#40928 · Fetched 2026-04-27 05:29:15
Comments: 1 · Participants: 1 · Timeline events: 6 · Reactions: 0
Timeline (top): renamed ×3, commented ×1, cross-referenced ×1, labeled ×1

Root Cause

The DeepSeek V4 path depends on optimized kernels such as DeepGEMM and FlashMLA, which do not appear to support SM120 yet. Because of this, DeepSeek V4 cannot run on SM120 even though the GPUs have enough memory and compute capability for the model. It would be very helpful if vLLM could support DeepSeek V4 on SM120, or provide a compatible execution path when DeepGEMM / FlashMLA are unavailable for this architecture.

Fix Action

Fix / Workaround

  • Use GPU architectures already supported by DeepGEMM / FlashMLA.
  • Wait for DeepGEMM / FlashMLA to add SM120 support.
  • Maintain a local patch to bypass unsupported kernel paths.

PR fix notes

PR #40929: [WIP]Support DeepSeek V4 flash on SM120 with Triton fallback

Description (problem / solution / changelog)

Issue: https://github.com/vllm-project/vllm/issues/40928
This PR is based on https://github.com/vllm-project/vllm/pull/40760
Tested on 2 x RTX Pro 6000 (SM120)

Summary

Support Triton fallback ops for DeepSeek V4 flash when DeepGEMM or FlashMLA is not available.

This PR adds a generic Triton implementation path for the DeepSeek V4 branch, including fallback kernels for sparse MLA attention, decode sparse attention, FP8 einsum, sparse attention indexer logits, and MHC prenorm GEMM. The existing optimized DeepGEMM / FlashMLA paths are still preferred when available; the Triton path is only used as a fallback.

Why

My approach for running DeepSeek V4 flash on SM120 is to provide a generic Triton implementation instead of hard-blocking execution on DeepGEMM or FlashMLA availability.

I think this is a reasonable fit for the vLLM DeepSeek V4 branch: when FlashMLA or DeepGEMM does not support a device yet, vLLM should still have a portable implementation that lets users run the model. Triton gives us a more general compatibility layer across GPU architectures, including SM120 and future SM architectures.

The goal of this PR is not to replace the optimized kernels. DeepGEMM and FlashMLA should remain the preferred paths when they are supported. However, when they are unavailable, the Triton fallback gives users a working implementation, even if there is still room for performance optimization.

This also keeps the migration cost low. If DeepGEMM adds SM120 support in the future, vLLM can switch SM120 back to the DeepGEMM path with minimal changes, while still keeping Triton as a portable fallback for other unsupported architectures.

Change

This PR supports DeepSeek V4 flash on SM120 by adding a generic Triton fallback path for kernels that currently depend on DeepGEMM or FlashMLA.

Main changes include:

  • Add Triton fallback kernels for DeepSeek V4 sparse MLA attention and decode sparse attention.
  • Add a Triton fallback implementation for the DeepSeek V4 FP8 einsum path.
  • Add Triton fallback kernels for sparse attention indexer logits.
  • Add a Triton fallback path for MHC prenorm GEMM.
  • Keep DeepGEMM / FlashMLA as the preferred optimized paths when available.
  • Fall back to Triton automatically when DeepGEMM or FlashMLA is unavailable, enabling DeepSeek V4 to run on SM120 and other future unsupported SM architectures.
  • Keep the implementation compatible with future migration to DeepGEMM once SM120 support becomes available.
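
The selection rule in the list above (prefer the optimized kernel, otherwise fall back to Triton) can be sketched roughly as follows. This is a minimal illustration, not code from the PR: the op labels, the `PREFERRED_BACKEND` table, the assignment of each op to DeepGEMM or FlashMLA, and the helper names are all assumptions made for the example.

```python
# Illustrative op -> preferred optimized backend table. The op labels and the
# library assigned to each op are assumptions, not taken from the PR diff.
PREFERRED_BACKEND = {
    "sparse_mla_attention": "flashmla",
    "decode_sparse_attention": "flashmla",
    "fp8_einsum": "deepgemm",
    "indexer_logits": "deepgemm",
    "mhc_prenorm_gemm": "deepgemm",
}


def select_backend(op: str, usable: set[str]) -> str:
    """Return the optimized backend for `op` if it is usable, else "triton"."""
    preferred = PREFERRED_BACKEND[op]
    return preferred if preferred in usable else "triton"


def plan(usable: set[str]) -> dict[str, str]:
    """Resolve every listed op to a backend for one device."""
    return {op: select_backend(op, usable) for op in PREFERRED_BACKEND}


# SM120 as described in the issue: neither optimized library is usable.
sm120_plan = plan(set())

# Hypothetical future in which DeepGEMM gains SM120 support: only the
# `usable` set changes, which is the low migration cost the PR argues for.
sm120_future_plan = plan({"deepgemm"})
```

Keeping the choice per op means a device can mix backends, so an architecture supported by only one of the two libraries would still use that library where it can.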

Serving benchmark

random input len: 1024
random output len: 1024
num prompts: 32
max_model_len=8192
gpu_memory_utilization=0.9

TP=2, PP=1

max concurrency   duration (h)   Throughput (tok/s)   Output throughput (tok/s)
1                 0.1736         104.89               52.44
4                 0.0548         332.01               166.00
8                 0.0329         553.16               276.58
16                0.0227         800.37               400.19
32                0.0169         1076.35              538.17
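
The benchmark columns can be cross-checked against the stated settings with a little arithmetic. The sketch below assumes that "Throughput" counts input plus output tokens and that every prompt uses the full 1024 input and 1024 output tokens. The PR does not state either point, but the reported durations match under that reading.

```python
NUM_PROMPTS = 32
INPUT_LEN = 1024
OUTPUT_LEN = 1024

# (max concurrency, duration in hours, total tok/s, output tok/s) as reported.
ROWS = [
    (1, 0.1736, 104.89, 52.44),
    (4, 0.0548, 332.01, 166.00),
    (8, 0.0329, 553.16, 276.58),
    (16, 0.0227, 800.37, 400.19),
    (32, 0.0169, 1076.35, 538.17),
]

# Assumption: total throughput covers input and output tokens together.
TOTAL_TOKENS = NUM_PROMPTS * (INPUT_LEN + OUTPUT_LEN)


def implied_duration_hours(total_tok_per_s: float) -> float:
    """Run time implied by processing TOTAL_TOKENS at the given rate."""
    return TOTAL_TOKENS / total_tok_per_s / 3600.0


for concurrency, hours, total_tps, output_tps in ROWS:
    implied = implied_duration_hours(total_tps)
    output_share = output_tps / total_tps
    print(f"concurrency={concurrency:>2}  reported={hours:.4f} h  "
          f"implied={implied:.4f} h  output share={output_share:.3f}")
```

Because input and output lengths are equal, output throughput is about half of total throughput in every row.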

Changed files

  • CMakeLists.txt (modified, +5/-2)
  • cmake/external_projects/deepgemm.cmake (modified, +6/-1)
  • cmake/external_projects/flashmla.cmake (modified, +1/-1)
  • csrc/cpu/pos_encoding.cpp (modified, +6/-1)
  • csrc/cpu/torch_bindings.cpp (modified, +2/-1)
  • csrc/fused_deepseek_v4_qnorm_rope_kv_insert_kernel.cu (added, +477/-0)
  • csrc/layernorm_kernels.cu (modified, +15/-7)
  • csrc/layernorm_quant_kernels.cu (modified, +25/-10)
  • csrc/moe/moe_ops.h (modified, +9/-0)
  • csrc/moe/topk_softplus_sqrt_kernels.cu (added, +715/-0)
  • csrc/moe/torch_bindings.cpp (modified, +8/-0)
  • csrc/ops.h (modified, +7/-1)
  • csrc/persistent_topk.cuh (modified, +17/-16)
  • csrc/pos_encoding_kernels.cu (modified, +39/-33)
  • csrc/sampler.cu (modified, +7/-1)
  • csrc/topk.cu (modified, +59/-35)
  • csrc/torch_bindings.cpp (modified, +15/-1)
  • docs/design/attention_backends.md (modified, +1/-1)
  • docs/models/supported_models.md (modified, +4/-3)
  • requirements/cuda.txt (modified, +2/-0)
  • tests/compile/fusions_e2e/conftest.py (modified, +5/-0)
  • tests/kernels/attention/test_deepgemm_attention.py (modified, +22/-12)
  • tests/kernels/core/test_fused_q_kv_rmsnorm.py (added, +81/-0)
  • tests/kernels/moe/test_deepgemm.py (modified, +195/-1)
  • tests/kernels/moe/test_topk_softplus_sqrt.py (added, +186/-0)
  • tests/kernels/test_compressor_kv_cache.py (added, +311/-0)
  • tests/kernels/test_fused_deepseek_v4_qnorm_rope_kv_insert.py (added, +359/-0)
  • tests/kernels/test_fused_indexer_q_rope_quant.py (added, +98/-0)
  • tests/kernels/test_fused_inv_rope_fp8_quant.py (added, +908/-0)
  • tests/kernels/test_top_k_per_row.py (modified, +5/-10)
  • tests/model_executor/test_routed_experts_capture.py (modified, +3/-1)
  • tests/models/registry.py (modified, +9/-0)
  • tests/models/test_deepseek_v4_mega_moe.py (added, +184/-0)
  • tests/reasoning/test_deepseekv3_reasoning_parser.py (modified, +7/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_input_1.json (added, +81/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_input_2.json (added, +24/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_input_3.json (added, +159/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_input_4.json (added, +28/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_output_1.txt (added, +36/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_output_2.txt (added, +1/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_output_3.txt (added, +38/-0)
  • tests/tokenizers_/fixtures/deepseek_v4/test_output_4.txt (added, +29/-0)
  • tests/tokenizers_/test_deepseek_v4.py (added, +224/-0)
  • tests/tool_parsers/test_deepseekv4_tool_parser.py (added, +123/-0)
  • tests/v1/attention/test_indexer_deepseek_v4_slot_mapping.py (added, +92/-0)
  • tests/v1/core/test_kv_cache_utils.py (modified, +3/-2)
  • tests/v1/core/test_prefix_caching.py (modified, +19/-20)
  • tests/v1/core/test_scheduler.py (modified, +2/-0)
  • tests/v1/kv_connector/unit/test_mooncake_connector.py (modified, +27/-23)
  • tests/v1/kv_connector/unit/test_mooncake_connector_hma.py (added, +410/-0)
  • tests/v1/streaming_input/test_scheduler_streaming.py (modified, +1/-0)
  • tools/install_deepgemm.sh (modified, +1/-1)
  • vllm/_custom_ops.py (modified, +41/-3)
  • vllm/config/attention.py (modified, +3/-0)
  • vllm/config/cache.py (modified, +14/-0)
  • vllm/config/compilation.py (modified, +1/-0)
  • vllm/config/kernel.py (modified, +3/-1)
  • vllm/config/model.py (modified, +5/-1)
  • vllm/config/speculative.py (modified, +11/-1)
  • vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py (modified, +143/-40)
  • vllm/entrypoints/chat_utils.py (modified, +9/-0)
  • vllm/envs.py (modified, +6/-0)
  • vllm/model_executor/kernels/linear/scaled_mm/deep_gemm.py (modified, +2/-0)
  • vllm/model_executor/layers/attention/mla_attention.py (modified, +16/-0)
  • vllm/model_executor/layers/deepseek_compressor.py (added, +438/-0)
  • vllm/model_executor/layers/deepseek_v4_attention.py (added, +1139/-0)
  • vllm/model_executor/layers/deepseek_v4_triton_kernels.py (added, +1035/-0)
  • vllm/model_executor/layers/fused_moe/config.py (modified, +45/-1)
  • vllm/model_executor/layers/fused_moe/experts/deep_gemm_moe.py (modified, +234/-1)
  • vllm/model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py (modified, +193/-2)
  • vllm/model_executor/layers/fused_moe/experts/trtllm_mxfp4_moe.py (modified, +84/-60)
  • vllm/model_executor/layers/fused_moe/fused_marlin_moe.py (modified, +19/-5)
  • vllm/model_executor/layers/fused_moe/fused_moe_method_base.py (modified, +1/-0)
  • vllm/model_executor/layers/fused_moe/layer.py (modified, +7/-0)
  • vllm/model_executor/layers/fused_moe/oracle/mxfp4.py (modified, +409/-10)
  • vllm/model_executor/layers/fused_moe/router/base_router.py (modified, +5/-1)
  • vllm/model_executor/layers/fused_moe/router/custom_routing_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/router/fused_moe_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py (modified, +84/-16)
  • vllm/model_executor/layers/fused_moe/router/fused_topk_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/router/grouped_topk_router.py (modified, +3/-0)
  • vllm/model_executor/layers/fused_moe/router/router_factory.py (modified, +9/-2)
  • vllm/model_executor/layers/fused_moe/router/routing_simulator_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/router/zero_expert_router.py (modified, +2/-0)
  • vllm/model_executor/layers/fused_moe/runner/moe_runner.py (modified, +13/-0)
  • vllm/model_executor/layers/fused_moe/runner/moe_runner_interface.py (modified, +1/-0)
  • vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +1/-0)
  • vllm/model_executor/layers/fused_moe/utils.py (modified, +18/-0)
  • vllm/model_executor/layers/mhc.py (added, +463/-0)
  • vllm/model_executor/layers/quantization/__init__.py (modified, +3/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_nvfp4.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a8_int8.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w8a8_fp8.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w8a8_mxfp8.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16_marlin.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/fp8.py (modified, +3/-0)
  • vllm/model_executor/layers/quantization/modelopt.py (modified, +3/-0)
  • vllm/model_executor/layers/quantization/mxfp4.py (modified, +333/-0)
  • vllm/model_executor/layers/quantization/online/moe_base.py (modified, +1/-0)
  • vllm/model_executor/layers/quantization/quark/quark_moe.py (modified, +1/-0)

🚀 The feature, motivation and pitch

Support DeepSeek V4 on SM120

I am trying to run DeepSeek V4 / DeepSeek-V4-Flash on NVIDIA SM120 GPUs. Currently, the DeepSeek V4 path depends on optimized kernels such as DeepGEMM and FlashMLA, but these kernels do not appear to support SM120 yet.

Because of this, DeepSeek V4 cannot run on SM120 even though the GPUs have enough memory and compute capability for the model. It would be very helpful if vLLM could support DeepSeek V4 on SM120, or provide a compatible execution path when DeepGEMM / FlashMLA are unavailable for this architecture.

SM120 GPUs are becoming available in workstation and server environments, so supporting this architecture would make DeepSeek V4 usable on newer NVIDIA hardware.

Alternatives

The current alternatives seem to be:

  • Use GPU architectures already supported by DeepGEMM / FlashMLA.
  • Wait for DeepGEMM / FlashMLA to add SM120 support.
  • Maintain a local patch to bypass unsupported kernel paths.

I am not sure whether SM120 support is currently planned for DeepSeek V4 in vLLM.

Additional context

Environment:

  • GPU: NVIDIA RTX PRO 6000 96GB x 2
  • GPU architecture: SM120
  • Model: DeepSeek V4 / DeepSeek-V4-Flash
  • Backend: vLLM DeepSeek V4 branch
  • Expected behavior: DeepSeek V4 can run on SM120
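
To confirm that a machine matches the environment above, the compute capability can be mapped to an SM name. The sketch below uses `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple, so SM120 corresponds to `(12, 0)`. The helper names and the `EXAMPLE_SUPPORTED` set are hypothetical; the architectures the optimized kernels actually support should be checked against the DeepGEMM and FlashMLA projects.

```python
from typing import FrozenSet, Optional, Tuple

# Assumption for illustration only: architectures the optimized kernels
# support. Verify against the DeepGEMM / FlashMLA documentation.
EXAMPLE_SUPPORTED: FrozenSet[str] = frozenset({"SM90", "SM100"})


def sm_name(capability: Tuple[int, int]) -> str:
    """Turn a (major, minor) compute capability into a name like 'SM120'."""
    major, minor = capability
    return f"SM{major}{minor}"


def needs_fallback(capability: Tuple[int, int],
                   supported: FrozenSet[str] = EXAMPLE_SUPPORTED) -> bool:
    """True when the device architecture is outside the supported set."""
    return sm_name(capability) not in supported


def current_capability() -> Optional[Tuple[int, int]]:
    """Capability of the current CUDA device, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = current_capability()
    if cap is None:
        print("No CUDA device visible")
    else:
        print(sm_name(cap), "needs fallback:", needs_fallback(cap))
```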

extent analysis

TL;DR

Modify the DeepSeek V4 execution path so that it runs on SM120 GPUs, either by adding SM120 support to DeepGEMM and FlashMLA or by providing a compatible fallback path when those kernels are unavailable.

Guidance

  • Investigate the feasibility of adding SM120 support to DeepGEMM and FlashMLA kernels.
  • Explore alternative optimized kernels that already support SM120 architecture.
  • Consider maintaining a local patch to bypass unsupported kernel paths as a temporary workaround.
  • Verify the compute capability and memory requirements of DeepSeek V4 to ensure SM120 GPUs can handle the model.

Example

No specific code snippet is provided due to the lack of technical details, but modifying the kernel support or execution path would likely involve updating the DeepSeek V4 codebase to accommodate SM120 architecture.
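
One small piece of such a change can still be sketched without the full codebase: probing whether the optional kernel packages are importable before choosing an execution path. The module names `deep_gemm` and `flash_mla` below are assumptions about how the packages are exposed in Python, and the helper names are hypothetical. An importable package does not by itself mean its kernels support the current GPU architecture, so a real check would also need to consider the device.

```python
import importlib.util
from typing import Iterable, List

# Assumed top-level module names for the optional kernel libraries.
OPTIONAL_KERNEL_MODULES = ("deep_gemm", "flash_mla")


def module_available(name: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for malformed or partially initialised modules.
        return False


def missing_kernel_modules(
        names: Iterable[str] = OPTIONAL_KERNEL_MODULES) -> List[str]:
    """List the optional kernel modules that are not installed."""
    return [name for name in names if not module_available(name)]


if __name__ == "__main__":
    missing = missing_kernel_modules()
    if missing:
        print("Missing optional kernels, a fallback path is needed:", missing)
    else:
        print("Optional kernel packages are importable")
```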

Notes

The solution may depend on the specific requirements and constraints of the DeepSeek V4 model and the vLLM backend, which are not fully detailed in the issue.

Recommendation

Until DeepGEMM and FlashMLA officially support SM120, apply a workaround: either maintain a local patch that bypasses the unsupported kernel paths, or try the work-in-progress Triton fallback from PR #40929. Either option allows temporary use of DeepSeek V4 on SM120 GPUs while waiting for official support.
