vllm: [Bug][DSV4][NVFP4] `deep_gemm_mega_moe` does not dispatch NVFP4 expert layout: `KeyError: 'layers.0.ffn.experts.w13_input_scale'`

vllm · 2026-05-22 23:55:55


Error Message

ERROR [multiproc_executor.py:870] Traceback (most recent call last):
ERROR [multiproc_executor.py:870] KeyError: 'layers.0.ffn.experts.w13_input_scale'


Your current environment

vLLM mainline @ 39910f2b25 (2026-05-22)
+ PR #42209 (NVFP4 MoE support for DSV4) — now merged
+ 4 local DSV4 patches (#43248, #43288, #43290, #43319)
8× NVIDIA B300 SXM6 AC (sm_103a, 288 GB HBM3e each)
torch 2.11.0+cu130, CUDA 13.0

How would you like to use vllm

Serve an NVFP4-quantized DeepSeek-V4-Pro artifact with --moe-backend deep_gemm_mega_moe. The artifact uses the standard NVIDIA ModelOpt NVFP4 layout (group size 16, FP8 E4M3 block scales, FP32 per-tensor weight_scale_2, per-expert input_scale).
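For readers unfamiliar with the layout named above, here is a minimal sketch of how one group of 16 NVFP4 codes would be dequantized under the usual convention: each 4-bit E2M1 value is multiplied by its group's block scale and by the per-tensor weight_scale_2. This is an illustration in plain Python, not ModelOpt or vLLM code. The function names are hypothetical, and the block scale is assumed to be already decoded from FP8 E4M3 into a Python float.

```python
# Magnitudes representable by a 4-bit E2M1 code, indexed by its low 3 bits.
# Bit 3 of the code is the sign bit.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # matches "group=16" in the layout description


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) into a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_group(codes, block_scale: float, weight_scale_2: float):
    """Dequantize one group of 16 codes (hypothetical helper).

    block_scale:    the group's scale, assumed already decoded from FP8 E4M3
    weight_scale_2: the FP32 per-tensor scale
    """
    if len(codes) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} codes, got {len(codes)}")
    return [decode_e2m1(c) * block_scale * weight_scale_2 for c in codes]
```

The per-expert input_scale is a separate activation scale and does not enter this weight-side computation; it is the tensor whose fused name the loader fails to find.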

Symptom

Worker load fails with:

ERROR [multiproc_executor.py:870] Traceback (most recent call last):
ERROR [multiproc_executor.py:870] KeyError: 'layers.0.ffn.experts.w13_input_scale'

Repro: serve any NVFP4 ModelOpt-layout DSV4 artifact with --moe-backend deep_gemm_mega_moe (e.g. the publicly available canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP):

vllm serve canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP \
  --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
  --tensor-parallel-size 8 --enable-expert-parallel \
  --moe-backend deep_gemm_mega_moe \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'

Root cause (inferred from reading the loader)

The deep_gemm_mega_moe parameter-registration path expects fused-name MoE parameters: one tensor covering all experts at the layer level, named like experts.w13_input_scale. PR #42209's ModelOptNvFp4FusedMoE instead registers per-expert names like experts.{E}.w1.input_scale (one per expert per gate/up/down projection). The two layouts cannot be resolved by the same key lookup at load time.

When the loader transforms an on-disk per-expert key into the fused name that the deep_gemm path looks up, no such entry exists in the params dict, so params_dict[name] raises the KeyError shown above.
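The mismatch can be reproduced in isolation. The sketch below is not vLLM's loader: the helper names, the two-expert count, and the exact renaming rule (w1 and w3 mapping to w13) are assumptions chosen so that the key names match the ones quoted in this issue.

```python
import re

NUM_EXPERTS = 2  # hypothetical; small on purpose


def per_expert_params(prefix: str, num_experts: int) -> dict:
    """Names as a per-expert registration would create them (illustrative)."""
    return {
        f"{prefix}.experts.{e}.{proj}.input_scale": 1.0
        for e in range(num_experts)
        for proj in ("w1", "w2", "w3")
    }


def to_fused_name(name: str) -> str:
    """Assumed renaming rule: per-expert w1/w3 -> w13, w2 -> w2."""
    m = re.fullmatch(r"(.*\.experts)\.\d+\.(w1|w2|w3)\.(\w+)", name)
    if m is None:
        return name
    fused_proj = "w2" if m.group(2) == "w2" else "w13"
    return f"{m.group(1)}.{fused_proj}_{m.group(3)}"


params_dict = per_expert_params("layers.0.ffn", NUM_EXPERTS)
fused = to_fused_name("layers.0.ffn.experts.0.w1.input_scale")

try:
    params_dict[fused]
except KeyError as exc:
    # Same key as in the worker traceback.
    print(f"KeyError: {exc}")
```

The dict holds only per-expert entries, so any lookup by fused name fails regardless of which expert's key triggered it.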

What works

--moe-backend flashinfer_trtllm (the path PR #42209 wired up) loads and serves the same NVFP4 artifact cleanly. On 8× B300 with that backend we measure 75.3 tok/s at c=1 (single stream, MTP n=2) and 572.8 tok/s aggregate at c=16 (batched).

What's blocked

Users cannot route an NVFP4 V4-Pro artifact through deep_gemm_mega_moe, the mega-kernel backend that the upstream recipe (vllm-project/recipes/.../DeepSeek-V4-Pro.yaml) recommends as the Blackwell default for the native MXFP4 source. They have to switch backends, which currently means accepting flashinfer_trtllm's slightly different perf profile: we measured flashinfer_trtllm ~4% slower than deep_gemm_mega_moe on the native MXFP4 source; the comparison flips on NVFP4.

Asks

Add NVFP4 expert layout dispatch to deep_gemm_mega_moe so the mega-kernel backend can handle NVFP4 weights as well as native MXFP4/FP8.
Or document explicitly that NVFP4 artifacts must use --moe-backend flashinfer_trtllm (the recipe YAML doesn't mention this; users will hit the KeyError and not know why).
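As a stopgap between the two asks, a guard at load time could replace the bare KeyError with an actionable message. The following is only a sketch of that idea: check_backend_layout and the regular expression are hypothetical and not part of vLLM, and the per-expert naming pattern is taken from the description in this issue.

```python
import re

# Per-expert parameter names as described in this issue,
# e.g. "layers.0.ffn.experts.3.w1.input_scale".
PER_EXPERT_NAME = re.compile(r"\.experts\.\d+\.w[123]\.")


def check_backend_layout(moe_backend: str, param_names) -> None:
    """Hypothetical guard: fail early with a hint instead of a KeyError."""
    has_per_expert = any(PER_EXPERT_NAME.search(n) for n in param_names)
    if moe_backend == "deep_gemm_mega_moe" and has_per_expert:
        raise ValueError(
            "Per-expert (NVFP4 ModelOpt-style) MoE parameters were found, "
            "but deep_gemm_mega_moe expects fused names such as "
            "experts.w13_input_scale. "
            "Try --moe-backend flashinfer_trtllm instead."
        )
```

Calling check_backend_layout("deep_gemm_mega_moe", names) with per-expert names raises ValueError carrying the backend hint, while the same names with "flashinfer_trtllm" pass silently.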

Additional context

Full backend × format matrix and repro evidence at:

Source repo: https://github.com/canada-quant/dsv4-pro-nvfp4-fp8-mtp
Findings doc: https://github.com/canada-quant/dsv4-pro-nvfp4-fp8-mtp/blob/main/docs/findings/backend_format_matrix.md
HF artifact: https://huggingface.co/canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP

Cc'ing @sychen52 @xinli-sw @pavanimajety @zyongye since this is on the PR #42209 dispatch path.
