vllm - [Feature]: DeepSeek V4 w4a4 MegaMoE support



Problem

DeepSeek V4 uses a MoE architecture where expert weights are already quantized to FP4 (w4), but activations in the MegaMoE pre-dispatch stage are still packed into the symmetric buffer as FP8 (E4M3) — one byte per activation value.

On Blackwell (SM100), DeepGEMM already supports FP4 activations. The mismatch is wasteful in two ways:

  1. Buffer footprint: FP8 packing uses 2× the memory that FP4 would require.
  2. Mainloop efficiency: FP8 activations force the kind::mxf8f6f4 mainloop (K=32 with padding). FP4 activations unlock the kind::mxf4 mainloop (K=64 dense), doubling compute density.
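
As a rough illustration of the footprint claim, the per-token activation payload follows directly from the hidden size. This is a sketch only: `activation_bytes_per_token` is a hypothetical helper, the hidden sizes are the V4-Flash and V4-Pro values quoted in the accuracy table of this issue, and the scale factors and routing metadata that share the buffer are ignored.

```python
def activation_bytes_per_token(hidden_size: int, bits_per_value: int) -> int:
    """Bytes needed to store one token's activations, excluding scale factors."""
    return hidden_size * bits_per_value // 8


# Hidden sizes taken from the accuracy table (V4-Flash and V4-Pro).
for name, hidden in (("V4-Flash", 4096), ("V4-Pro", 7168)):
    fp8 = activation_bytes_per_token(hidden, 8)  # E4M3: one byte per value
    fp4 = activation_bytes_per_token(hidden, 4)  # E2M1: two values per byte
    print(f"{name}: FP8 {fp8} B/token, FP4 {fp4} B/token")
```

The 2× ratio applies to the activation payload alone; the overall buffer saving is somewhat smaller once scale factors and metadata are included.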

The goal of this issue is to support packing activations as FP4 (E2M1) in the MegaMoE pre-dispatch stage, turning w4a8 into true w4a4.


Root Cause Analysis

Current data flow

In _run_mega_moe(), the current path is:

  1. A Triton kernel (_stage_deepseek_v4_mega_moe_inputs) quantizes hidden_states to FP8 and packs them into the x slot of the symmetric buffer, along with topk_idx and topk_weights.
  2. fp8_fp4_mega_moe runs the actual MoE computation.

The Triton kernel can only emit FP8. Extending it to FP4 E2M1 is non-trivial because the E2M1 scale-factor layout differs from the FP8 one, so the change is more than a dtype cast.
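
To make the packing difference concrete, here is a minimal pure-Python reference for block-scaled E2M1 packing. It assumes the OCP Microscaling convention of one power-of-two scale per 32-element block and low-nibble-first packing. The actual DeepGEMM scale-factor layout and nibble order may differ, and all function names are illustrative.

```python
import math

# The eight non-negative magnitudes representable in E2M1 (code = index).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # assumed MX block size: one shared scale per 32 values


def quantize_block(block):
    """Return (scale_exponent, 4-bit codes) for one block of floats."""
    amax = max(abs(v) for v in block)
    # Smallest power-of-two scale such that amax / scale <= 6 (no clipping).
    exp = 0 if amax == 0.0 else math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** exp
    codes = []
    for v in block:
        mag = abs(v) / scale
        # Round to the nearest representable magnitude.
        idx = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
        codes.append((0x8 if v < 0 else 0x0) | idx)  # bit 3 is the sign
    return exp, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (an assumption)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequantize_block(exp, packed):
    """Inverse of the two steps above, for checking the round trip."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_MAGNITUDES[code & 0x7] * 2.0 ** exp
            out.append(-mag if code & 0x8 else mag)
    return out


block = [6.0, -3.0, 0.5, 0.0] * (BLOCK // 4)
exp, codes = quantize_block(block)
packed = pack_nibbles(codes)
print(len(packed), "bytes for", len(block), "values, scale exponent", exp)
```

In this sketch, 32 values occupy 16 bytes plus one shared scale exponent, whereas FP8 stores the same 32 values in 32 bytes. The shared per-block scale is what makes the conversion more than a cast.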

FP4 packing already exists in DeepGEMM

DeepGEMM already has a CUDA implementation of mega_moe_pre_dispatch() (sm100_mega_moe_pre_dispatch.cuh) that natively supports E2M1 packing. It does the same job as the Triton kernel — quantize and pack activations, route metadata — but outputs FP4 with the correct scale factor layout.

The natural fix: in FP4 mode, replace the Triton kernel call with deep_gemm.mega_moe_pre_dispatch().

Upstream DeepGEMM constraint

The vendored DeepGEMM in vllm (commit 891d57b, from deepseek-ai/DeepGEMM) is a pybind11 C++ extension. Inspecting the relevant files shows:

  • csrc/apis/mega.hpp: only registers get_symm_buffer_size_for_mega_moe and fp8_fp4_mega_moe — no mega_moe_pre_dispatch.
  • csrc/python_api.cpp: the pybind11 module does not expose this API.
  • get_symm_buffer_size_for_mega_moe hardcodes the FP8 buffer layout (hidden_size bytes/token) and does not support FP4 (hidden_size / 2 bytes/token).

Upstream DeepGEMM has not yet exposed mega_moe_pre_dispatch to Python. The vendored version inherits this limitation.

sgl-deep-gemm (SGLang's fork, v0.1.0) has a complete implementation of mega_moe_pre_dispatch including FP4 E2M1 packing and FP4-aware buffer sizing. It uses tvm_ffi instead of pybind11.


Proposed Solution

Three options were considered:

| # | Option | Pros | Cons |
|---|--------|------|------|
| 1 | Wait for upstream to merge | No external dep | Uncontrolled timeline |
| 2 | Implement in vendored C++ | No external dep | Large C++ diff, high maintenance cost |
| 3 | Use external package as transitional bridge | Python-only, minimal diff, designed for rollback | Requires sgl-deep-gemm |

Option 3 is chosen. Changes are confined to the Python layer (import branching + call adaptation). No C++ changes, no vendored package modification. Once upstream DeepGEMM merges mega_moe_pre_dispatch, vllm bumps the vendored commit and removes the import branches — env var interface, dispatch logic, and env forwarding are all reused with no rewrites.

Implementation details

Import branching: When VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1, import external deep_gemm (sgl-deep-gemm). Otherwise, use vllm.third_party.deep_gemm as before. Three branch sites in total.
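
The branch described above could look roughly like the following. This is a sketch, not the code from the change: the helper names are hypothetical, and the `"1"` string comparison is an assumption about how the flag is parsed (the real declaration lives in vllm/envs.py).

```python
import importlib
import os


def use_fp4_acts(environ=None) -> bool:
    """True when the FP4-activation flag is set; parsing rule is assumed."""
    env = os.environ if environ is None else environ
    return env.get("VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS", "0") == "1"


def deep_gemm_module_name(environ=None) -> str:
    """Pick the external package in FP4 mode, the vendored one otherwise."""
    return "deep_gemm" if use_fp4_acts(environ) else "vllm.third_party.deep_gemm"


def load_deep_gemm(environ=None):
    """Import lazily so the default path never touches the external package."""
    return importlib.import_module(deep_gemm_module_name(environ))
```

Keeping the module choice in one helper means each of the three branch sites reduces to a single call, and the rollback only has to delete the helper.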

Buffer allocation: External deep_gemm.get_symm_buffer_for_mega_moe() reads DG_USE_FP4_ACTS to allocate an FP4-layout buffer (hidden_size / 2 bytes/token). vllm forwards the setting via os.environ.setdefault, so an explicit DG_USE_* value set by the user still wins.

topk_ids dtype: vllm uses int64 for hash-based routing. mega_moe_pre_dispatch requires int32 for the input topk_idx. Cast before calling.

Buffer tensor type: External deep_gemm returns tvm_ffi.core.Tensor fields which do not support Python slice syntax ([:num_tokens]). Pass the full buffer fields and use the num_tokens parameter to control the valid range. This approach also works with torch.Tensor after upstream merges — no change needed at rollback time.

Env var design: Two new vllm-namespaced env vars (VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS, VLLM_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND), both default off. Forwarded to DG_USE_* via os.environ.setdefault with a module-level once-only guard.
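
A minimal sketch of the forwarding behaviour follows, under stated assumptions: only the `VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS` to `DG_USE_FP4_ACTS` pair is named in this issue, so the map contains just that entry, and the function names and the `"1"` parsing rule are hypothetical.

```python
import os

# vLLM-side name -> DeepGEMM-side name. Only the pair named in the issue is listed.
_FORWARD_MAP = {"VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS": "DG_USE_FP4_ACTS"}
_forwarded = False  # module-level once-only guard


def forward_flags(env) -> None:
    """Copy enabled vLLM flags to DG_USE_* without overriding explicit values."""
    for vllm_name, dg_name in _FORWARD_MAP.items():
        if env.get(vllm_name, "0") == "1":
            # setdefault leaves a user-provided DG_USE_* value untouched.
            env.setdefault(dg_name, "1")


def forward_flags_once(env=None) -> bool:
    """Run forward_flags at most once per process; return True if it ran."""
    global _forwarded
    if _forwarded:
        return False
    forward_flags(os.environ if env is None else env)
    _forwarded = True
    return True
```

Using `setdefault` rather than plain assignment is what lets a user who exports `DG_USE_FP4_ACTS` directly keep control of the value.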

Zero-impact default path

Both env vars default to False. All FP4 logic is gated behind if branches. Without setting these vars, the code path is identical to the pre-change state — same imports, same buffer allocation, same Triton kernel.


Rollback / Migration Path

Once upstream DeepGEMM merges mega_moe_pre_dispatch and FP4 buffer sizing:

  1. Bump vendored commit in cmake/external_projects/deepgemm.cmake
  2. Remove 3 import deep_gemm branches — all paths use vllm.third_party.deep_gemm
  3. Drop sgl-deep-gemm dependency
  4. Promote env vars from experimental to stable, remove [Experimental] tag

Estimated diff at migration time: ~15 lines deleted, 0 lines changed in dispatch/env logic.


Hardware and Software Requirements

  • Blackwell (SM100) or newer
  • DeepSeek V4 series (V4-Flash, V4-Pro, etc.) with expert_dtype=fp4
  • moe_backend=deep_gemm_mega_moe
  • pip install sgl-deep-gemm==0.1.0

Not applicable: non-Blackwell hardware, non-MegaMoE mode, or bit-exact precision requirements.


Accuracy Validation

GSM8K (1319 samples, greedy decoding, max_tokens=2048, concurrency=32, 8× B200):

| Model | w4a8 | w4a4 | Delta |
|-------|------|------|-------|
| V4-Flash (256 experts, hidden=4096, 43L, DP=4) | 93.86% (1238/1319) | 94.39% (1245/1319) | +0.53% |
| V4-Pro (384 experts, hidden=7168, 61L, TP=8) | 91.81% (1211/1319) | 91.13% (1202/1319) | -0.68% |

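Both deltas are within noise: they correspond to +7 and -9 correct answers out of 1319 and point in opposite directions. There is no sign of systematic accuracy degradation from FP4 activation packing.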

Note: the +0.53% on V4-Flash does not indicate w4a4 is "better" — greedy decoding accumulates minor numerical differences across quantization paths that can flip borderline cases either way. More benchmarks and multi-sample evaluation would be needed to draw stronger conclusions.
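
One way to put a number on "within noise" is a binomial standard error on each difference. The sketch below uses only the counts reported above and treats the two runs as independent samples, which is a simplifying assumption (the runs share the same 1319 questions, so a paired test would be more precise). The helper name is illustrative.

```python
import math

N = 1319  # GSM8K samples per run
# (correct with w4a8, correct with w4a4), taken from the table above.
RESULTS = {"V4-Flash": (1238, 1245), "V4-Pro": (1211, 1202)}


def delta_and_stderr(correct_a: int, correct_b: int, n: int):
    """Accuracy difference and its unpaired standard error, in percentage points."""
    p_a, p_b = correct_a / n, correct_b / n
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return (p_b - p_a) * 100, se * 100


for model, (a, b) in RESULTS.items():
    delta, se = delta_and_stderr(a, b, N)
    print(f"{model}: delta {delta:+.2f} pp, standard error {se:.2f} pp")
```

Under this approximation each standard error is about one percentage point, so both observed deltas fall inside one standard error of zero.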


Files Changed

  • vllm/envs.py: env var declarations and lambda definitions
  • vllm/model_executor/models/deepseek_v4.py: imports, env-forwarding helper, conditional import in get_symm_buffer(), FP4 dispatch branch in _run_mega_moe()

