vllm: [Bug][DSV4][NVFP4] `deep_gemm_mega_moe` does not dispatch NVFP4 expert layout (`KeyError: 'layers.0.ffn.experts.w13_input_scale'`)



Your current environment

vLLM mainline @ 39910f2b25 (2026-05-22)
+ PR #42209 (NVFP4 MoE support for DSV4) — now merged
+ 4 local DSV4 patches (#43248, #43288, #43290, #43319)
8× NVIDIA B300 SXM6 AC (sm_103a, 288 GB HBM3e each)
torch 2.11.0+cu130, CUDA 13.0

How would you like to use vllm

Serve an NVFP4-quantized DeepSeek-V4-Pro artifact on --moe-backend deep_gemm_mega_moe. The artifact uses standard NVIDIA ModelOpt NVFP4 layout (group=16, FP8 E4M3 block scales, FP32 per-tensor weight_scale_2, per-expert input_scale).
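For readers unfamiliar with the layout named above, here is a minimal sketch of how the three pieces (4-bit values, one block scale per group of 16, one per-tensor weight_scale_2) are usually combined when dequantizing. It is an illustration under assumptions, not ModelOpt or vLLM code: the function names are hypothetical, the 4-bit codes are passed unpacked, and the FP8 E4M3 block scales are passed as already-decoded Python floats.

```python
# Illustrative NVFP4 dequantization sketch (hypothetical helper names, not a library API).
# Assumption: each 4-bit code is E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # group=16, as stated in the issue


def decode_e2m1(code: int) -> float:
    """Decode one unpacked 4-bit E2M1 code into a float."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def dequantize(codes, block_scales, weight_scale_2):
    """value = e2m1(code) * block_scale(group) * weight_scale_2 (per-tensor)."""
    if len(codes) != GROUP_SIZE * len(block_scales):
        raise ValueError("expected exactly one block scale per group of 16 codes")
    return [
        decode_e2m1(code) * block_scales[i // GROUP_SIZE] * weight_scale_2
        for i, code in enumerate(codes)
    ]


if __name__ == "__main__":
    # One group of 16 identical codes, block scale 2.0, per-tensor scale 0.5.
    print(dequantize([0b0010] * GROUP_SIZE, [2.0], 0.5))
```

The per-expert input_scale mentioned in the issue applies to activations rather than weights, so it does not appear in this weight-only sketch.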

Symptom

Worker load fails with:

ERROR [multiproc_executor.py:870] Traceback (most recent call last):
ERROR [multiproc_executor.py:870] KeyError: 'layers.0.ffn.experts.w13_input_scale'

Repro: serve any NVFP4 ModelOpt-layout DSV4 artifact with --moe-backend deep_gemm_mega_moe (e.g. the publicly available canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP):

vllm serve canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP \
  --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
  --tensor-parallel-size 8 --enable-expert-parallel \
  --moe-backend deep_gemm_mega_moe \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'

Root cause (inferred from reading the loader)

The deep_gemm_mega_moe parameter-registration path expects fused-name MoE parameters: one tensor covering all experts at the layer level, named like experts.w13_input_scale. PR #42209's ModelOptNvFp4FusedMoE instead registers per-expert names like experts.{E}.w1.input_scale (one per expert per gate/up/down projection). The two layouts cannot be resolved through the same key lookup at load time.

When the loader transforms an on-disk per-expert key into the fused name that the deep_gemm path looks up, the params dict has no matching entry, so params_dict[name] raises the KeyError shown in the symptom.
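The mismatch can be reproduced in isolation with a toy dictionary. The sketch below does not import vLLM; the remapping function to_fused_name and the assumption that w1/w3 fold into w13 are hypothetical stand-ins for whatever the real loader does, chosen only to show why the lookup fails.

```python
# Toy reproduction of the name mismatch. No vLLM code is used; all helpers are illustrative.
NUM_EXPERTS = 2

# Per-expert registration, in the style the issue attributes to ModelOptNvFp4FusedMoE.
params_dict = {
    f"layers.0.ffn.experts.{e}.{proj}.input_scale": 1.0
    for e in range(NUM_EXPERTS)
    for proj in ("w1", "w2", "w3")
}


def to_fused_name(per_expert_key: str) -> str:
    """Hypothetical remap: experts.{E}.w1|w3.X -> experts.w13_X, experts.{E}.w2.X -> experts.w2_X."""
    prefix, rest = per_expert_key.split(".experts.")
    _expert_id, proj, suffix = rest.split(".", 2)
    fused_proj = "w13" if proj in ("w1", "w3") else "w2"
    return f"{prefix}.experts.{fused_proj}_{suffix}"


fused = to_fused_name("layers.0.ffn.experts.0.w1.input_scale")
print(fused)

try:
    params_dict[fused]  # the fused name was never registered
except KeyError as err:
    print("KeyError:", err)
```

Only per-expert keys exist in params_dict, so the fused key produced by the remap has nothing to resolve to, which matches the traceback in the symptom.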

What works

--moe-backend flashinfer_trtllm (the path PR #42209 wired up) loads and serves the same NVFP4 artifact cleanly. On 8× B300 with that backend we measure 75.3 tok/s at c=1 (single stream, MTP n=2) and 572.8 tok/s aggregate at c=16 (batched).

What's blocked

Users who want to serve an NVFP4 V4-Pro artifact on the deep_gemm_mega_moe mega-kernel backend cannot route NVFP4 through that path. This is the backend that vllm-project/recipes/.../DeepSeek-V4-Pro.yaml recommends as the Blackwell default for the native MXFP4 source. They have to switch backends, which currently means accepting flashinfer_trtllm's slightly different perf profile (we measured flashinfer_trtllm ~4% slower than deep_gemm_mega_moe on the native MXFP4 source; the comparison flips on NVFP4).

Asks

  1. Add NVFP4 expert layout dispatch to deep_gemm_mega_moe so the mega-kernel backend can handle NVFP4 weights as well as native MXFP4/FP8.
  2. Or document explicitly that NVFP4 artifacts must use --moe-backend flashinfer_trtllm (the recipe YAML doesn't mention this; users will hit the KeyError and not know why).
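Short of either ask, a fail-fast check would at least replace the opaque KeyError with an actionable message. The sketch below is a hypothetical pre-flight helper, not an existing vLLM function; the format label "nvfp4" and the function name check_moe_backend are assumptions, and the table encodes only the single unsupported pairing reported in this issue.

```python
# Hypothetical pre-flight check; not part of vLLM.
# Maps a known-unsupported (quant format, MoE backend) pair to the backend
# that this issue reports as working for the same artifact.
KNOWN_UNSUPPORTED = {
    ("nvfp4", "deep_gemm_mega_moe"): "flashinfer_trtllm",
}


def check_moe_backend(quant_format: str, moe_backend: str) -> None:
    """Raise a descriptive error for combinations known to fail at weight load."""
    suggestion = KNOWN_UNSUPPORTED.get((quant_format.lower(), moe_backend))
    if suggestion is not None:
        raise ValueError(
            f"--moe-backend {moe_backend} does not dispatch the {quant_format} "
            f"expert layout; use --moe-backend {suggestion} instead."
        )


if __name__ == "__main__":
    check_moe_backend("nvfp4", "flashinfer_trtllm")  # passes silently
    try:
        check_moe_backend("nvfp4", "deep_gemm_mega_moe")
    except ValueError as err:
        print(err)
```

Keeping the table explicit means an entry can be deleted once the mega-kernel backend gains NVFP4 dispatch.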

Additional context

Full backend × format matrix and repro evidence at:

Cc'ing @sychen52 @xinli-sw @pavanimajety @zyongye since this is on the PR #42209 dispatch path.
