vllm: [Bug]: ModelOpt NVFP4 Qwen3-30B-A3B export fails to load on DGX Spark/GB10 (missing _double_scale key)

vllm-project/vllm#38980 (fetched 2026-04-08 02:44:35)

Error Message

KeyError: layers.28.mlp.experts.w2_weight_quantizer._double_scale

Root Cause

  • vLLM recognizes the export as ModelOpt NVFP4, but the loader looks up w2_weight_quantizer._double_scale, while the exported HF checkpoint stores these tensors under per-expert down_proj/gate_proj/up_proj names.
  • The export itself exists and appears structurally complete; the Qwen3 MoE mapping path for ModelOpt NVFP4 may be expecting a different internal naming contract.
  • TRT-LLM also fails to load the same exported checkpoint, but with a different symptom: weight_scale size mismatches (NVIDIA/TensorRT-LLM#12762).

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
OS                           : Linux (DGX Spark / GB10 host)
Architecture                 : aarch64
Host memory                  : 121 GiB RAM
GPU                          : NVIDIA GB10
Driver                       : 580.142

Alternative validation container:
- Image: nvcr.io/nvidia/vllm:26.02-py3
- vLLM: 0.15.1+nv26.2
- torch: 2.11.0a0+eb65b36914.nv26.2
- transformers: 4.57.5

Export under test:
- Model: Qwen/Qwen3-30B-A3B-Instruct-2507
- Export type: ModelOpt HF NVFP4
- Producer: modelopt 0.37.0
- Quant algo: NVFP4
- KV cache quant algo: FP8
</details>

🐛 Describe the bug

vLLM detects our exported checkpoint as a ModelOpt NVFP4 checkpoint, but fails to load a Qwen3 MoE export on DGX Spark / GB10 before generation starts.

The checkpoint was exported successfully in our lab and is already materialized as a packaged HF export with:

  • config.json
  • hf_quant_config.json
  • model.safetensors.index.json
  • model-00001-of-00004.safetensors ... model-00004-of-00004.safetensors
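
Before looking at the loader, it can help to confirm that the packaged export is complete on disk. The following is a minimal sketch, assuming the standard Hugging Face sharded layout in which model.safetensors.index.json holds a "weight_map" from tensor name to shard file. The function name `missing_export_files` is hypothetical and only for illustration.

```python
import json
from pathlib import Path

# Files the issue lists as part of the packaged HF export.
REQUIRED_FILES = (
    "config.json",
    "hf_quant_config.json",
    "model.safetensors.index.json",
)


def missing_export_files(export_dir):
    """Return the expected files that are absent from export_dir."""
    root = Path(export_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]

    index_path = root / "model.safetensors.index.json"
    if index_path.is_file():
        # Assumption: the index follows the usual {"weight_map": {tensor: shard}} shape.
        weight_map = json.loads(index_path.read_text()).get("weight_map", {})
        for shard in sorted(set(weight_map.values())):
            if not (root / shard).is_file():
                missing.append(shard)
    return missing
```

Calling `missing_export_files("/workspace/export")` inside the container should return an empty list if every file and every shard referenced by the index is present.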

vLLM recognizes it as ModelOpt NVFP4, sees the GPU correctly, and starts the load path, but then fails with:

KeyError: layers.28.mlp.experts.w2_weight_quantizer._double_scale

Important extra evidence from the exported checkpoint:

  • the export DOES contain _double_scale keys
  • however, in the HF index they are named with the HF projection names, e.g.:
    model.layers.28.mlp.experts.0.down_proj.weight_quantizer._double_scale
    model.layers.28.mlp.experts.0.gate_proj.weight_quantizer._double_scale
    model.layers.28.mlp.experts.0.up_proj.weight_quantizer._double_scale
    model.layers.28.mlp.experts.1.down_proj.weight_quantizer._double_scale
    ...

So this does not look like a missing export artifact. It looks more like a naming/loader contract mismatch for Qwen3 MoE ModelOpt NVFP4.
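
The evidence above can be reproduced from the index file alone. This is a small sketch that counts `_double_scale` keys per projection name; the helper names are hypothetical, and `load_index_keys` assumes the usual "weight_map" layout of model.safetensors.index.json.

```python
import json
import re
from collections import Counter

# Matches per-expert quantizer keys such as
# model.layers.28.mlp.experts.0.down_proj.weight_quantizer._double_scale
DOUBLE_SCALE_KEY = re.compile(
    r"\.experts\.(\d+)\.(\w+)\.weight_quantizer\._double_scale$"
)


def load_index_keys(index_path):
    """Read tensor names from a sharded safetensors index (assumed layout)."""
    with open(index_path) as f:
        return list(json.load(f)["weight_map"])


def double_scale_summary(keys):
    """Count _double_scale keys grouped by projection name."""
    counts = Counter()
    for key in keys:
        match = DOUBLE_SCALE_KEY.search(key)
        if match:
            counts[match.group(2)] += 1
    return dict(counts)


# The four keys quoted in this issue.
sample = [
    "model.layers.28.mlp.experts.0.down_proj.weight_quantizer._double_scale",
    "model.layers.28.mlp.experts.0.gate_proj.weight_quantizer._double_scale",
    "model.layers.28.mlp.experts.0.up_proj.weight_quantizer._double_scale",
    "model.layers.28.mlp.experts.1.down_proj.weight_quantizer._double_scale",
]
print(double_scale_summary(sample))
# {'down_proj': 2, 'gate_proj': 1, 'up_proj': 1}
```

On the real export, `double_scale_summary(load_index_keys("/workspace/export/model.safetensors.index.json"))` would show whether all three projections carry the key for every expert.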

Reproduction

Our alternative validation script runs the model inside the official NVIDIA vLLM container and tries to load the already exported checkpoint with quantization=modelopt_fp4.

Equivalent minimal repro is:

from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/export",
    quantization="modelopt_fp4",
    trust_remote_code=True,
    max_model_len=256,
    max_num_seqs=1,
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Hola"], SamplingParams(max_tokens=8, temperature=0.0))
print(outputs[0].outputs[0].text)

We run it with the exported checkpoint mounted at /workspace/export.

Observed result:

KeyError: layers.28.mlp.experts.w2_weight_quantizer._double_scale

Additional runtime facts:

  • GPU is visible inside the container (NVIDIA GB10)
  • vLLM identifies the checkpoint as ModelOpt NVFP4
  • the failure happens during checkpoint load, before generation

Additional context

  • We already opened the corresponding TRT-LLM upstream issue for the same exported checkpoint because TRT-LLM also fails to load it, but with a different symptom:
    • NVIDIA/TensorRT-LLM#12762
  • In TRT-LLM, the export load fails with weight_scale size mismatches.
  • In vLLM, the export is recognized as ModelOpt NVFP4, but the loader looks for w2_weight_quantizer._double_scale while the exported HF checkpoint appears to use down_proj/gate_proj/up_proj naming.
  • This suggests the export itself exists and is structurally rich enough, but the Qwen3 MoE mapping path for ModelOpt NVFP4 may be expecting a different internal naming contract.
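
To make the suspected naming mismatch concrete, here is an illustrative sketch only. It assumes a fused-expert aliasing of gate_proj/down_proj/up_proj to w1/w2/w3 and a dropped "model." prefix; neither assumption is confirmed by this issue or checked against the vLLM source, and `to_fused_style` is a hypothetical helper, not a vLLM function. Under those assumptions, the per-expert HF key rewrites to exactly the key named in the KeyError.

```python
import re

# ASSUMPTION (not confirmed in this issue): fused-expert aliases for HF projection names.
ASSUMED_FUSED_ALIAS = {"gate_proj": "w1", "down_proj": "w2", "up_proj": "w3"}

EXPERT_KEY = re.compile(
    r"^(?:model\.)?(?P<prefix>.*\.experts)\.(?P<expert>\d+)\."
    r"(?P<proj>gate_proj|down_proj|up_proj)\.(?P<rest>.+)$"
)


def to_fused_style(hf_key):
    """Rewrite a per-expert HF key into a fused-style name, or return None."""
    match = EXPERT_KEY.match(hf_key)
    if match is None:
        return None
    alias = ASSUMED_FUSED_ALIAS[match["proj"]]
    # The expert index is dropped and the projection name becomes a wN_ prefix.
    return f"{match['prefix']}.{alias}_{match['rest']}"


hf_key = "model.layers.28.mlp.experts.0.down_proj.weight_quantizer._double_scale"
print(to_fused_style(hf_key))
# layers.28.mlp.experts.w2_weight_quantizer._double_scale
```

If this rewrite is close to what the loader does, the HF names may already be translated and the failure may instead be that no parameter exists under the translated name. That is a hypothesis consistent with the error string, not a verified diagnosis.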

If useful, I can also provide the exact summary.json, validation_summary.json, and the full stderr log from the run.

Extended analysis

TL;DR

The most likely fix is to update the vLLM loader to handle the naming convention used in the exported Qwen3 MoE ModelOpt NVFP4 checkpoint.

Guidance

  • Verify that the exported checkpoint is correctly formatted and contains the required _double_scale keys with the expected naming convention.
  • Investigate the vLLM loader code to determine why it is expecting a different naming convention (w2_weight_quantizer._double_scale) than what is present in the exported checkpoint (down_proj/gate_proj/up_proj).
  • Consider updating the vLLM loader to handle the naming convention used in the exported Qwen3 MoE ModelOpt NVFP4 checkpoint, potentially by adding support for the down_proj/gate_proj/up_proj naming scheme.
  • Review the corresponding TRT-LLM upstream issue (NVIDIA/TensorRT-LLM#12762) to see if there are any insights or fixes that can be applied to the vLLM loader.
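
As a starting point for the investigation steps above, the following sketch takes the key from the KeyError and lists checkpoint keys from the same layer that end in the same final component. It is a triage aid under stated assumptions (keys contain a "layers.<n>." segment), and `candidates_for_missing_key` is a hypothetical helper name.

```python
import re


def candidates_for_missing_key(missing_key, checkpoint_keys):
    """List checkpoint keys in the same layer that share the final name component."""
    layer = re.search(r"layers\.(\d+)\.", missing_key)
    if layer is None:
        return []
    layer_marker = f"layers.{layer.group(1)}."
    suffix = "." + missing_key.rsplit(".", 1)[-1]
    return sorted(
        key
        for key in checkpoint_keys
        if layer_marker in key and key.endswith(suffix)
    )


missing = "layers.28.mlp.experts.w2_weight_quantizer._double_scale"
checkpoint_keys = [
    "model.layers.28.mlp.experts.0.down_proj.weight_quantizer._double_scale",
    "model.layers.28.mlp.experts.0.gate_proj.weight_quantizer._double_scale",
    "model.layers.27.mlp.experts.0.down_proj.weight_quantizer._double_scale",
    "model.layers.28.mlp.experts.0.down_proj.weight",
]
for key in candidates_for_missing_key(missing, checkpoint_keys):
    print(key)
```

With the sample keys, only the two layer-28 `_double_scale` entries are printed, which is the set a loader mapping would need to account for.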

Example

No loader patch is provided here, because the fix depends on the internal implementation of the vLLM loader and on the naming convention used in the exported checkpoint.

Notes

The issue appears to be specific to Qwen3 MoE ModelOpt NVFP4 checkpoints loaded through vLLM. The checkpoint is recognized as ModelOpt NVFP4 but fails during load on a key-name mismatch, which points at the loader's internal mapping rather than at a missing export artifact.

Recommendation

Update the vLLM loader to handle the naming convention used in the exported Qwen3 MoE ModelOpt NVFP4 checkpoint. This will likely require changes to the loader code to support the down_proj/gate_proj/up_proj naming scheme.
