vllm - [RFC]: support Hopper FP8 MegaMoE backend for DeepSeek-V4


Motivation.

DeepSeek-V4's fused MoE expert kernel ("MegaMoE") is wired into vLLM for Blackwell (sm_100) as DeepseekV4MegaMoEExperts, calling deep_gemm.fp8_fp4_mega_moe. On Hopper (sm_90 — H20 / H100 / H800) that path is unusable because Hopper has no FP4 tensor cores.

DeepGEMM PR #323 introduces an SM90 sibling, fp8_mega_moe, that fuses the same five steps and accepts block-(128, 128) MN-major float32 weight scale-factor (SF) tensors directly.

This RFC proposes wiring deep_gemm.fp8_mega_moe into vLLM as a sibling of the existing FP4 MegaMoE class so DeepSeek-V4 served on Hopper (especially H20) gets the fused kernel benefit Blackwell already enjoys.

Proposed Change.

Adaptation points

One model file changed, plus tests and a benchmark harness

| Where | What |
| --- | --- |
| vllm/model_executor/models/deepseek_v4.py | Add DeepseekV4HopperMegaMoEExperts (block-FP8 sibling of DeepseekV4MegaMoEExperts) + an SM90 input-staging Triton kernel + a custom op. |
| DeepseekV4MoE._init_mega_moe_experts (same file) | Dispatch on expert_dtype: "fp4" → existing class (sm_100), "fp8" → new class (sm_90). |
| tests/models/test_deepseek_v4_mega_moe.py | Extend with Hopper unit tests. |
| tests/models/run_hopper_mega_moe_ep8.py + scripts/run_megamoe_bench.sh | Manual EP8 multi-GPU correctness + bench harness. |

Example launch command:

python -m vllm.entrypoints.openai.api_server \
    --model DeepSeek-V4-Flash-Base \
    --enable-expert-parallel \
    --kernel-config '{"moe_backend": "deep_gemm_mega_moe"}'

Code sketch

DeepseekV4MoE._init_mega_moe_experts becomes the following (the existing if/else is shown for context; only the fp8 branch is new):

if self._mega_moe_expert_dtype == "fp4":
    self.experts = DeepseekV4MegaMoEExperts(...)              # sm_100, existing
else:  # "fp8"
    self.experts = DeepseekV4HopperMegaMoEExperts(...)        # sm_90,   NEW
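
The branch above treats every value other than "fp4" as "fp8". A minimal sketch of a stricter dispatch follows. It assumes a hypothetical helper, select_mega_moe_experts, and returns class names as strings so that it runs without vLLM or a GPU. The mapping from dtype to compute capability comes from the table above; everything else is illustrative.

```python
from __future__ import annotations

# Hypothetical lookup table (not part of vLLM): expert dtype -> (class name,
# required CUDA compute capability).
_EXPERTS_BY_DTYPE = {
    "fp4": ("DeepseekV4MegaMoEExperts", (10, 0)),       # sm_100, existing
    "fp8": ("DeepseekV4HopperMegaMoEExperts", (9, 0)),  # sm_90, new
}


def select_mega_moe_experts(expert_dtype: str, capability: tuple[int, int]) -> str:
    """Return the experts class name for this dtype, or fail loudly."""
    try:
        class_name, required = _EXPERTS_BY_DTYPE[expert_dtype]
    except KeyError:
        raise ValueError(
            f"unsupported MegaMoE expert dtype: {expert_dtype!r}"
        ) from None
    if capability != required:
        raise NotImplementedError(
            f"{class_name} requires sm_{required[0]}{required[1]}, "
            f"got sm_{capability[0]}{capability[1]}"
        )
    return class_name
```

With this shape, an unexpected dtype string raises immediately instead of silently selecting the Hopper class.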

DeepseekV4HopperMegaMoEExperts.finalize_weights (one-shot weight transform; idempotent):

import deep_gemm
# DeepGEMM PR #323: SM90 weight transform = gate/up gran-8 N interleave
# only; block-(128, 128) float32 SF flows through unchanged.
l1, l2 = deep_gemm.transform_weights_for_mega_moe_sm90(
    (self.w13_weight.data, self.w13_weight_scale_inv.data),
    (self.w2_weight.data,  self.w2_weight_scale_inv.data),
)
self._transformed_l1_weights, self._transformed_l2_weights = l1, l2
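
To make "gate/up gran-8 N interleave" concrete, the sketch below computes one plausible row permutation. The exact layout produced by transform_weights_for_mega_moe_sm90 is not described in this RFC. The sketch assumes that the fused w13 matrix stores all gate rows followed by all up rows, and that the transform alternates chunks of 8 gate rows with chunks of 8 up rows. The function name gate_up_interleave_order is hypothetical.

```python
from __future__ import annotations


def gate_up_interleave_order(intermediate: int, gran: int = 8) -> list[int]:
    """Source-row order for a fused [gate; up] matrix with 2 * intermediate rows.

    Assumed layout: rows [0, intermediate) are gate, rows
    [intermediate, 2 * intermediate) are up. The output alternates
    `gran`-row chunks: gate chunk 0, up chunk 0, gate chunk 1, up chunk 1, ...
    """
    if intermediate % gran != 0:
        raise ValueError("intermediate must be a multiple of gran")
    order: list[int] = []
    for start in range(0, intermediate, gran):
        order.extend(range(start, start + gran))                # gate chunk
        order.extend(range(intermediate + start,
                           intermediate + start + gran))        # up chunk
    return order
```

Because the result is a pure permutation of rows along N, the block-(128, 128) scale factors would not need to be recomputed, which is consistent with the comment that the SF "flows through unchanged".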

DeepseekV4HopperMegaMoEExperts._run_hopper_mega_moe (one fused launch per layer per forward):

import deep_gemm
# Stage hidden -> FP8 + per-128-K float SF into the symmetric buffer.
_stage_deepseek_v4_mega_moe_inputs_sm90(
    hidden_states, topk_weights, topk_ids,
    symm_buffer.x[:n], symm_buffer.x_sf[:n],
    symm_buffer.topk_idx[:n], symm_buffer.topk_weights[:n],
)
deep_gemm.fp8_mega_moe(
    y, self._transformed_l1_weights, self._transformed_l2_weights,
    symm_buffer,
    recipe=(128, 128, 128),    # required by the SM90 host-side check
    activation="swiglu",
    activation_clamp=activation_clamp,
    fast_math=fast_math,
)
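
The staging step turns each hidden-state row into FP8 values plus one float scale factor per 128 elements along K. The Triton kernel itself is not shown in this RFC, so the following is only a pure-Python sketch of that idea. It assumes the common amax / 448 scaling rule (448 is the largest finite value of the FP8 E4M3 format) and skips the rounding to the FP8 grid. The names per_block_scales and scale_row are hypothetical.

```python
from __future__ import annotations

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_block_scales(row: list[float], block: int = 128) -> list[float]:
    """One scale factor per `block` consecutive elements along K."""
    if len(row) % block != 0:
        raise ValueError("row length must be a multiple of block")
    scales = []
    for start in range(0, len(row), block):
        amax = max(abs(v) for v in row[start:start + block])
        # Floor amax so an all-zero block does not produce a zero scale.
        scales.append(max(amax, 1e-12) / FP8_E4M3_MAX)
    return scales


def scale_row(row: list[float], scales: list[float], block: int = 128) -> list[float]:
    """Divide each element by its block scale and clamp to the FP8 range.

    Rounding to actual FP8 values is intentionally omitted.
    """
    return [
        max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scales[i // block]))
        for i, v in enumerate(row)
    ]
```

For a row of length K this yields K // 128 scale factors, which is the per-row layout that symm_buffer.x_sf[:n] would be expected to hold under these assumptions.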

Hard guards

The new class fails loudly (NotImplementedError / ValueError) when:

  • device is not sm_90;
  • --enable-expert-parallel is missing;
  • scoring_func != "sqrtsoftplus";
  • hidden % 128 != 0 or intermediate % 128 != 0;
  • SF dtype != float32 or shape != (E, M//128, K//128).
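
The guards above can be sketched as a single validation function. This is an illustration rather than the implementation: the function name, its parameters, and the choice of which condition raises NotImplementedError versus ValueError are all assumptions, and dtypes are passed as plain strings so the sketch runs without PyTorch.

```python
from __future__ import annotations


def check_hopper_mega_moe_config(
    *,
    capability: tuple[int, int],
    expert_parallel: bool,
    scoring_func: str,
    hidden: int,
    intermediate: int,
    weight_shape: tuple[int, int, int],  # (E, M, K)
    sf_dtype: str,
    sf_shape: tuple[int, ...],
) -> None:
    """Raise if any precondition of the Hopper FP8 MegaMoE path is violated."""
    if capability != (9, 0):
        raise NotImplementedError(f"requires sm_90, got capability {capability}")
    if not expert_parallel:
        raise ValueError("--enable-expert-parallel is required")
    if scoring_func != "sqrtsoftplus":
        raise NotImplementedError(f"unsupported scoring_func: {scoring_func!r}")
    if hidden % 128 != 0 or intermediate % 128 != 0:
        raise ValueError("hidden and intermediate must be multiples of 128")
    num_experts, m, k = weight_shape
    expected_sf_shape = (num_experts, m // 128, k // 128)
    if sf_dtype != "float32" or tuple(sf_shape) != expected_sf_shape:
        raise ValueError(
            f"SF must be float32 with shape {expected_sf_shape}, "
            f"got {sf_dtype} with shape {tuple(sf_shape)}"
        )
```

Running all checks at construction time keeps misconfiguration errors out of the forward pass.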

This RFC depends on upstream DeepGEMM PR #323 (HEAD 192a513a), which adds the SM90 fp8_mega_moe kernel, transform_weights_for_mega_moe_sm90, and get_symm_buffer_for_mega_moe.

The corresponding vLLM implementation is already written and locally verified (10 unit tests passed, 1 skipped-by-design; full EP8 correctness + bench harness in place). However, no further development on the vLLM side will happen until DeepGEMM #323 is merged:

  • the implementation PR will not be opened (or will stay in Draft) while the upstream API can still change;
  • no review will be requested from vLLM maintainers until #323 lands.

Test Plan

  • Unit (CI-friendly): tests/models/test_deepseek_v4_mega_moe.py

      pytest tests/models/test_deepseek_v4_mega_moe.py -v

  • vLLM EP benchmark (8×H20)

      bash scripts/run_megamoe_bench.sh small full

    Spawns 8 ranks via torch.multiprocessing.spawn, then for each shape compares (a) a direct DeepGEMM call against (b) the DeepEP-LL + BatchedDeepGemmExperts pipeline, which serves as the latency baseline. Numbers will be pasted into the implementation PR description.
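
The per-shape comparison can be pictured as a small timing loop. The contents of run_megamoe_bench.sh are not shown in this RFC, so the sketch below is a generic stand-in: median_latency_ms and compare_paths are hypothetical names, and the two code paths are passed in as plain callables.

```python
from __future__ import annotations

import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], *, warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock latency of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def compare_paths(paths: dict[str, Callable[[], object]]) -> dict[str, float]:
    """Time every named code path with the same warmup and iteration count."""
    return {name: median_latency_ms(fn) for name, fn in paths.items()}
```

When timing GPU kernels, each callable would also need to synchronize the device (for example with torch.cuda.synchronize()) before returning, otherwise only the launch overhead is measured.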

<img width="863" height="157" alt="Image" src="https://github.com/user-attachments/assets/dbf0937b-8be7-413a-a75e-31a80a50bf06" />

Feedback Period.

two weeks

CC List.

@WoosukKwon

