vllm [RFC]: Support Hopper FP8 MegaMoE backend for DeepSeek-V4

Motivation.

DeepSeek-V4's fused MoE expert kernel ("MegaMoE") is wired into vLLM for Blackwell (sm_100) as DeepseekV4MegaMoEExperts, calling deep_gemm.fp8_fp4_mega_moe. On Hopper (sm_90 — H20 / H100 / H800) that path is unusable because Hopper has no FP4 tensor cores.

DeepGEMM PR #323 adds an SM90 sibling, fp8_mega_moe, that fuses the same five steps (EP dispatch, linear 1, SwiGLU, linear 2, EP combine) and accepts block-(128, 128) MN-major float32 weight SF tensors directly.

This RFC proposes wiring deep_gemm.fp8_mega_moe into vLLM as a sibling of the existing FP4 MegaMoE class so DeepSeek-V4 served on Hopper (especially H20) gets the fused kernel benefit Blackwell already enjoys.

Proposed Change.

Adaptation points

One model file changes; the unit tests are extended and a manual bench harness is added:

| Where | What |
| --- | --- |
| `vllm/model_executor/models/deepseek_v4.py` | Add `DeepseekV4HopperMegaMoEExperts` (block-FP8 sibling of `DeepseekV4MegaMoEExperts`) + an SM90 input-staging Triton kernel + a custom op. |
| `DeepseekV4MoE._init_mega_moe_experts` (same file) | Dispatch on `expert_dtype`: `"fp4"` → existing class (sm_100), `"fp8"` → new class (sm_90). |
| `tests/models/test_deepseek_v4_mega_moe.py` | Extend with Hopper unit tests. |
| `tests/models/run_hopper_mega_moe_ep8.py` + `scripts/run_megamoe_bench.sh` | Manual EP8 multi-GPU correctness + bench harness. |

Example launch command:
```
python -m vllm.entrypoints.openai.api_server \
    --model DeepSeek-V4-Flash-Base \
    --enable-expert-parallel \
    --kernel-config '{"moe_backend": "deep_gemm_mega_moe"}'
```

Code sketch

DeepseekV4MoE._init_mega_moe_experts becomes the following (the existing fp4 branch is shown for context; the fp8 branch is new):

```
if self._mega_moe_expert_dtype == "fp4":
    self.experts = DeepseekV4MegaMoEExperts(...)              # sm_100, existing
else:  # "fp8"
    self.experts = DeepseekV4HopperMegaMoEExperts(...)        # sm_90, NEW
```
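Since the `else` branch would also accept an unexpected `expert_dtype`, a stricter dispatch may be worth considering. The following is only a self-contained sketch of that idea, not the vLLM implementation: `select_mega_moe_experts_cls` and the lookup table are hypothetical, and the class names appear as plain strings so the snippet runs without vLLM.

```python
# Hypothetical helper: maps an expert dtype to the name of the experts class.
# The class names come from the RFC; the helper itself is illustrative only.
_EXPERTS_BY_DTYPE = {
    "fp4": "DeepseekV4MegaMoEExperts",        # sm_100, existing
    "fp8": "DeepseekV4HopperMegaMoEExperts",  # sm_90, proposed
}


def select_mega_moe_experts_cls(expert_dtype: str) -> str:
    """Return the experts class name for expert_dtype, or fail loudly."""
    try:
        return _EXPERTS_BY_DTYPE[expert_dtype]
    except KeyError:
        raise ValueError(
            f"unsupported MegaMoE expert_dtype {expert_dtype!r}; "
            f"expected one of {sorted(_EXPERTS_BY_DTYPE)}"
        ) from None
```

An explicit table keeps an unknown dtype from silently falling through to the Hopper class, which matches the fail-loudly intent of the hard guards.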

DeepseekV4HopperMegaMoEExperts.finalize_weights (one-shot weight transform; idempotent):

```
import deep_gemm
# DeepGEMM PR #323: SM90 weight transform = gate/up gran-8 N interleave
# only; block-(128, 128) float32 SF flows through unchanged.
l1, l2 = deep_gemm.transform_weights_for_mega_moe_sm90(
    (self.w13_weight.data, self.w13_weight_scale_inv.data),
    (self.w2_weight.data,  self.w2_weight_scale_inv.data),
)
self._transformed_l1_weights, self._transformed_l2_weights = l1, l2
```
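One way to obtain the "one-shot, idempotent" behaviour is to cache the transformed weights and skip the transform on later calls. The sketch below illustrates that pattern in plain Python; `OneShotWeightTransform` and its `transform` callable are hypothetical stand-ins, and the real class may guard the transform differently.

```python
from typing import Callable, Optional, Tuple


class OneShotWeightTransform:
    """Runs a weight transform at most once and caches the result."""

    def __init__(self, transform: Callable[[object, object], Tuple[object, object]]):
        self._transform = transform
        self._result: Optional[Tuple[object, object]] = None
        self.calls = 0  # counts real transform invocations, for illustration

    def finalize(self, l1_weights, l2_weights):
        # A repeated call returns the cached result instead of transforming
        # weights that were already transformed.
        if self._result is None:
            self._result = self._transform(l1_weights, l2_weights)
            self.calls += 1
        return self._result
```

The guard matters because applying an interleaving transform twice would generally not reproduce the layout the kernel expects.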

DeepseekV4HopperMegaMoEExperts._run_hopper_mega_moe (one fused launch per layer per forward):

```
import deep_gemm
# Stage hidden -> FP8 + per-128-K float SF into the symmetric buffer.
_stage_deepseek_v4_mega_moe_inputs_sm90(
    hidden_states, topk_weights, topk_ids,
    symm_buffer.x[:n], symm_buffer.x_sf[:n],
    symm_buffer.topk_idx[:n], symm_buffer.topk_weights[:n],
)
deep_gemm.fp8_mega_moe(
    y, self._transformed_l1_weights, self._transformed_l2_weights,
    symm_buffer,
    recipe=(128, 128, 128),    # required by the SM90 host-side check
    activation="swiglu",
    activation_clamp=activation_clamp,
    fast_math=fast_math,
)
```
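To make the "per-128-K float SF" staging step concrete, here is a pure-Python reference for one token. It is a sketch under assumptions: the scale is taken as amax / 448 (448 being the FP8 E4M3 maximum), the 1e-4 clamp is illustrative, and the final cast to FP8 is left out. `scale_row_per_128` is a hypothetical name; the actual Triton kernel may compute its scales differently.

```python
from typing import List, Tuple

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
GROUP_SIZE = 128      # K-granularity of the activation scale factors


def scale_row_per_128(row: List[float]) -> Tuple[List[float], List[float]]:
    """Scale one token's hidden vector group by group.

    Returns (scaled_values, scale_factors). The scaled values lie within
    [-448, 448] up to rounding; the final cast to FP8 is omitted here.
    """
    if len(row) % GROUP_SIZE != 0:
        raise ValueError("hidden size must be a multiple of 128")
    scaled: List[float] = []
    scales: List[float] = []
    for start in range(0, len(row), GROUP_SIZE):
        group = row[start:start + GROUP_SIZE]
        # Clamp the absolute maximum so an all-zero group does not divide by 0.
        amax = max(max(abs(v) for v in group), 1e-4)
        sf = amax / FP8_E4M3_MAX
        scales.append(sf)
        scaled.extend(v / sf for v in group)
    return scaled, scales
```

A hidden vector of length K yields K // 128 scale factors per token, which is why the guards require the hidden size to be a multiple of 128.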

Hard guards

The new class fails loudly (NotImplementedError / ValueError) when:

- the device is not sm_90;
- --enable-expert-parallel is missing;
- scoring_func != "sqrtsoftplus";
- hidden % 128 != 0 or intermediate % 128 != 0;
- the SF dtype != float32 or the SF shape != (E, M//128, K//128).
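The guards listed above could be expressed as a single validation function along the following lines. This is a sketch only: `check_hopper_mega_moe_config` and its parameters are hypothetical, dtypes are passed as strings to keep the snippet free of dependencies, and the split between NotImplementedError (unsupported setup) and ValueError (malformed shapes or dtypes) is an assumption, since the RFC does not say which error each guard raises.

```python
from typing import Tuple


def check_hopper_mega_moe_config(
    sm_version: int,
    expert_parallel: bool,
    scoring_func: str,
    hidden: int,
    intermediate: int,
    sf_dtype: str,
    sf_shape: Tuple[int, int, int],
    weight_shape: Tuple[int, int, int],
) -> None:
    """Raise if any precondition from the guard list is violated."""
    # Unsupported environment or configuration.
    if sm_version != 90:
        raise NotImplementedError(f"requires sm_90, got sm_{sm_version}")
    if not expert_parallel:
        raise NotImplementedError("requires --enable-expert-parallel")
    if scoring_func != "sqrtsoftplus":
        raise NotImplementedError(f"unsupported scoring_func {scoring_func!r}")
    # Malformed shapes or dtypes.
    if hidden % 128 != 0 or intermediate % 128 != 0:
        raise ValueError("hidden and intermediate must be multiples of 128")
    if sf_dtype != "float32":
        raise ValueError(f"SF dtype must be float32, got {sf_dtype}")
    num_experts, m, k = weight_shape
    expected = (num_experts, m // 128, k // 128)
    if tuple(sf_shape) != expected:
        raise ValueError(f"SF shape {tuple(sf_shape)} != expected {expected}")
```

Running all checks once at construction time keeps the per-forward path free of validation overhead.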

This RFC depends on upstream DeepGEMM PR #323 (HEAD 192a513a), which adds the SM90 fp8_mega_moe kernel, transform_weights_for_mega_moe_sm90, and get_symm_buffer_for_mega_moe.

The corresponding vLLM implementation is already written and locally verified (10 unit tests passed, 1 skipped-by-design; full EP8 correctness + bench harness in place). However, no further development on the vLLM side will happen until DeepGEMM #323 is merged:

- the implementation PR will not be opened (or will stay in Draft) while the upstream API can still change;
- no review will be requested from vLLM maintainers until #323 lands.

Test Plan

Unit (CI-friendly): tests/models/test_deepseek_v4_mega_moe.py
```
pytest tests/models/test_deepseek_v4_mega_moe.py -v
```
vLLM EP benchmark (8×H20):
```
bash scripts/run_megamoe_bench.sh small full
```
The script spawns 8 ranks via torch.multiprocessing.spawn, then for each shape compares (a) the direct DeepGEMM call against (b) the DeepEP-LL + BatchedDeepGemmExperts pipeline to obtain a latency baseline. Numbers will be pasted into the implementation PR description.

Feedback Period.

Two weeks.

CC List.

@WoosukKwon

