vllm [Feature]: DeepSeek V4 w4a4 MegaMoE support

Fix / Workaround

DeepSeek V4 uses a MoE architecture where expert weights are already quantized to FP4 (w4), but activations in the MegaMoE pre-dispatch stage are still packed into the symmetric buffer as FP8 (E4M3) — one byte per activation value.

The goal of this issue is to support packing activations as FP4 (E2M1) in the MegaMoE pre-dispatch stage, turning w4a8 into true w4a4.

DeepGEMM already has a CUDA implementation of mega_moe_pre_dispatch() (sm100_mega_moe_pre_dispatch.cuh) that natively supports E2M1 packing. It does the same job as the Triton kernel — quantize and pack activations, route metadata — but outputs FP4 with the correct scale factor layout.

Problem

On Blackwell (SM100), DeepGEMM already supports FP4 activations. The mismatch is wasteful in two ways:

- Buffer footprint: FP8 packing uses 2× the memory that FP4 would require.
- Mainloop efficiency: FP8 activations force the kind::mxf8f6f4 mainloop (K=32 with padding). FP4 activations unlock the kind::mxf4 mainloop (K=64 dense), doubling compute density.
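
The footprint claim can be checked with simple arithmetic. The sketch below counts only the activation payload (scale factors and routing metadata are left out) and uses the hidden sizes quoted in the accuracy table of this issue. The helper name is illustrative and is not part of vllm or DeepGEMM.

```python
def activation_bytes_per_token(hidden_size: int, bits_per_value: int) -> int:
    """Bytes needed for one token's activation payload (scale factors excluded)."""
    total_bits = hidden_size * bits_per_value
    assert total_bits % 8 == 0, "payload must be byte-aligned"
    return total_bits // 8


# Hidden sizes taken from the accuracy table in this issue.
for name, hidden in (("V4-Flash", 4096), ("V4-Pro", 7168)):
    fp8 = activation_bytes_per_token(hidden, 8)  # E4M3: one byte per value
    fp4 = activation_bytes_per_token(hidden, 4)  # E2M1: two values per byte
    print(f"{name}: FP8 {fp8} B/token, FP4 {fp4} B/token, ratio {fp8 / fp4:.1f}x")
```

This matches the hidden_size versus hidden_size / 2 bytes per token figures given for the two buffer layouts.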

Root Cause Analysis

Current data flow

In _run_mega_moe(), the current path is:

1. A Triton kernel (_stage_deepseek_v4_mega_moe_inputs) quantizes hidden_states to FP8 and packs them into the x slot of the symmetric buffer, along with topk_idx and topk_weights.
2. fp8_fp4_mega_moe runs the actual MoE computation.

The Triton kernel only outputs FP8 — this is a hard constraint. Changing the kernel to output FP4 E2M1 is non-trivial: the scale factor layout for E2M1 is fundamentally different from FP8, not a simple dtype cast.
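
To make concrete why FP4 output is more than a dtype cast, here is a minimal pure-Python sketch of E2M1 quantization with a shared power-of-two block scale and two codes packed per byte. It is not DeepGEMM's kernel and does not reproduce its scale factor layout. The block length, the nibble order (low nibble first), and the tie-breaking rule are assumptions chosen for illustration.

```python
import math

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# indexed by the 3-bit magnitude code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def encode_e2m1(value: float) -> int:
    """Return the 4-bit E2M1 code nearest to value, saturating at +/-6.

    Ties resolve to the smaller magnitude here; hardware rounding may differ.
    """
    sign = 0x8 if value < 0 else 0x0
    mag = min(abs(value), E2M1_MAGNITUDES[-1])
    code = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | code


def decode_e2m1(code: int) -> float:
    """Map a 4-bit E2M1 code back to its float value."""
    mag = E2M1_MAGNITUDES[code & 0x7]
    return -mag if code & 0x8 else mag


def quantize_block(values):
    """Quantize one block to (scale exponent, packed bytes)."""
    assert len(values) % 2 == 0, "two codes are packed per byte"
    amax = max(abs(v) for v in values)
    # Smallest power-of-two scale that brings the block maximum within +/-6.
    exp = math.ceil(math.log2(amax / 6.0)) if amax > 0 else 0
    scale = 2.0 ** exp
    codes = [encode_e2m1(v / scale) for v in values]
    # Assumed nibble order: even element in the low nibble, odd in the high.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return exp, packed


def dequantize_block(exp, packed):
    """Invert quantize_block up to quantization error."""
    scale = 2.0 ** exp
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0xF) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

Each byte now carries two values, and every block needs its own scale stored somewhere next to the payload, which is the part of the layout that differs from the FP8 path.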

FP4 packing already exists in DeepGEMM

The natural fix: in FP4 mode, replace the Triton kernel call with deep_gemm.mega_moe_pre_dispatch().

Upstream DeepGEMM constraint

The vendored DeepGEMM in vllm (commit 891d57b, from deepseek-ai/DeepGEMM) is a pybind11 C++ extension. Checking the relevant files shows:

- csrc/apis/mega.hpp: registers only get_symm_buffer_size_for_mega_moe and fp8_fp4_mega_moe; there is no mega_moe_pre_dispatch.
- csrc/python_api.cpp: the pybind11 module does not expose this API.
- get_symm_buffer_size_for_mega_moe hardcodes the FP8 buffer layout (hidden_size bytes per token) and does not support the FP4 layout (hidden_size / 2 bytes per token).

Upstream DeepGEMM has not yet exposed mega_moe_pre_dispatch to Python. The vendored version inherits this limitation.

sgl-deep-gemm (SGLang's fork, v0.1.0) has a complete implementation of mega_moe_pre_dispatch including FP4 E2M1 packing and FP4-aware buffer sizing. It uses tvm_ffi instead of pybind11.

Proposed Solution

Three options were considered:

Option	Pros	Cons
Wait for upstream to merge	No external dep	Uncontrolled timeline
Implement in vendored C++	No external dep	Large C++ diff, high maintenance cost
Use external package as transitional bridge	Python-only, minimal diff, designed for rollback	Requires `sgl-deep-gemm`

The third option (external package as a transitional bridge) is chosen. Changes are confined to the Python layer (import branching and call adaptation), with no C++ changes and no modification of the vendored package. Once upstream DeepGEMM merges mega_moe_pre_dispatch, vllm bumps the vendored commit and removes the import branches. The env var interface, dispatch logic, and env forwarding are all reused without rewrites.

Implementation details

Import branching: When VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS=1, import external deep_gemm (sgl-deep-gemm). Otherwise, use vllm.third_party.deep_gemm as before. Three branch sites in total.
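
A minimal sketch of one such branch site, assuming the flag is read as a plain "1"/"0" string (vllm's own env parsing may differ). The helper names are hypothetical. The import is deferred so that the default path never touches the external package.

```python
import importlib
import os


def deep_gemm_module_name(environ=os.environ) -> str:
    """Pick which deep_gemm to import based on the FP4-activation flag."""
    use_fp4_acts = environ.get("VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS", "0") == "1"
    # "deep_gemm" is the import name used for the external sgl-deep-gemm
    # package in this issue; the vendored copy lives under vllm.third_party.
    return "deep_gemm" if use_fp4_acts else "vllm.third_party.deep_gemm"


def load_deep_gemm(environ=os.environ):
    """Import the selected module lazily (hypothetical helper)."""
    return importlib.import_module(deep_gemm_module_name(environ))
```

Keeping the selection in one function means the rollback described later only has to delete the external branch rather than touch each call site's logic.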

Buffer allocation: External deep_gemm.get_symm_buffer_for_mega_moe() reads DG_USE_FP4_ACTS to allocate an FP4-layout buffer (hidden_size / 2 bytes per token). The flag is forwarded via os.environ.setdefault, so an explicit DG_USE_* setting still takes precedence.

topk_ids dtype: vllm uses int64 for hash-based routing. mega_moe_pre_dispatch requires int32 for the input topk_idx. Cast before calling.

Buffer tensor type: External deep_gemm returns tvm_ffi.core.Tensor fields which do not support Python slice syntax ([:num_tokens]). Pass the full buffer fields and use the num_tokens parameter to control the valid range. This approach also works with torch.Tensor after upstream merges — no change needed at rollback time.

Env var design: Two new vllm-namespaced env vars (VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS, VLLM_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND), both default off. Forwarded to DG_USE_* via os.environ.setdefault with a module-level once-only guard.
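
The forwarding pattern could look roughly like the sketch below. DG_USE_FP4_ACTS is named in this issue; DG_USE_MXF4_KIND is an assumed name that follows the same DG_USE_* pattern, and the function name and "1"/"0" parsing are illustrative rather than the actual vllm code.

```python
import os

# vllm-side flag -> DeepGEMM-side flag.
# DG_USE_MXF4_KIND is an assumed name, not confirmed by the issue text.
_FORWARDED_FLAGS = {
    "VLLM_DEEPGEMM_MEGA_MOE_USE_FP4_ACTS": "DG_USE_FP4_ACTS",
    "VLLM_DEEPGEMM_MEGA_MOE_USE_MXF4_KIND": "DG_USE_MXF4_KIND",
}
_forwarded_once = False  # module-level once-only guard


def forward_mega_moe_env(environ=os.environ) -> None:
    """Forward enabled vllm flags to DG_USE_* without overriding explicit values."""
    global _forwarded_once
    if _forwarded_once:
        return
    _forwarded_once = True
    for vllm_name, dg_name in _FORWARDED_FLAGS.items():
        if environ.get(vllm_name, "0") == "1":
            # setdefault leaves any value the user already exported untouched.
            environ.setdefault(dg_name, "1")
```

Because setdefault only writes missing keys, a user who exports DG_USE_FP4_ACTS=0 by hand keeps that value even with the vllm flag enabled.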

Zero-impact default path

Both env vars default to False. All FP4 logic is gated behind if branches. Without setting these vars, the code path is identical to the pre-change state — same imports, same buffer allocation, same Triton kernel.

Rollback / Migration Path

Once upstream DeepGEMM merges mega_moe_pre_dispatch and FP4 buffer sizing:

1. Bump vendored commit in cmake/external_projects/deepgemm.cmake
2. Remove 3 import deep_gemm branches so that all paths use vllm.third_party.deep_gemm
3. Drop sgl-deep-gemm dependency
4. Promote env vars from experimental to stable, remove [Experimental] tag

Estimated diff at migration time: ~15 lines deleted, 0 lines changed in dispatch/env logic.

Hardware Requirements

- Blackwell (SM100) or newer
- DeepSeek V4 series (V4-Flash, V4-Pro, etc.) with expert_dtype=fp4
- moe_backend=deep_gemm_mega_moe
- sgl-deep-gemm 0.1.0 installed (pip install sgl-deep-gemm==0.1.0)

Not applicable: non-Blackwell hardware, non-MegaMoE mode, or workloads that require bit-exact results relative to the w4a8 path.

Accuracy Validation

GSM8K (1319 samples, greedy decoding, max_tokens=2048, concurrency=32, 8× B200):

Model	w4a8	w4a4	Delta (percentage points)
V4-Flash (256 experts, hidden=4096, 43L, DP=4)	93.86% (1238/1319)	94.39% (1245/1319)	+0.53
V4-Pro (384 experts, hidden=7168, 61L, TP=8)	91.81% (1211/1319)	91.13% (1202/1319)	-0.68

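Both deltas are under one percentage point (a swing of 7 and 9 questions out of 1319, in opposite directions), which is within noise for a single greedy run. There is no sign of systematic accuracy degradation from FP4 activation packing.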

Note: the +0.53% on V4-Flash does not indicate w4a4 is "better" — greedy decoding accumulates minor numerical differences across quantization paths that can flip borderline cases either way. More benchmarks and multi-sample evaluation would be needed to draw stronger conclusions.
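
As a rough sanity check on the "within noise" reading, the sketch below computes a pooled two-proportion z statistic from the counts in the table. This is an approximation: both runs answer the same 1319 questions, so a paired test such as McNemar's would be more appropriate, but that needs per-question results that are not given here.

```python
import math

N = 1319  # GSM8K samples per run, from the table above


def two_proportion_z(correct_a: int, correct_b: int, n: int = N) -> float:
    """Pooled z statistic for the accuracy difference b - a over n samples each."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (p_b - p_a) / std_err


for model, w4a8, w4a4 in (("V4-Flash", 1238, 1245), ("V4-Pro", 1211, 1202)):
    z = two_proportion_z(w4a8, w4a4)
    delta_pp = 100 * (w4a4 - w4a8) / N
    print(f"{model}: delta {delta_pp:+.2f} pp, z = {z:+.2f}")
```

Both statistics come out well below the conventional 1.96 threshold in absolute value, consistent with the conclusion that neither delta is distinguishable from noise at this sample size.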

Files Changed

- vllm/envs.py: env var declarations and lambda definitions
- vllm/model_executor/models/deepseek_v4.py: imports, env-forwarding helper, conditional import in get_symm_buffer(), FP4 dispatch branch in _run_mega_moe()

