vllm - [Bug]: ModelOpt NVFP4 Qwen3-30B-A3B export fails to load on DGX Spark/GB10 (missing _double_scale key)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

OS                           : Linux (DGX Spark / GB10 host)
Architecture                 : aarch64
Host memory                  : 121 GiB RAM
GPU                          : NVIDIA GB10
Driver                       : 580.142

Alternative validation container:
- Image: nvcr.io/nvidia/vllm:26.02-py3
- vLLM: 0.15.1+nv26.2
- torch: 2.11.0a0+eb65b36914.nv26.2
- transformers: 4.57.5

Export under test:
- Model: Qwen/Qwen3-30B-A3B-Instruct-2507
- Export type: ModelOpt HF NVFP4
- Producer: modelopt 0.37.0
- Quant algo: NVFP4
- KV cache quant algo: FP8

</details>

🐛 Describe the bug

vLLM detects our exported checkpoint as a ModelOpt NVFP4 checkpoint, but fails to load a Qwen3 MoE export on DGX Spark / GB10 before generation starts.

The checkpoint was exported successfully in our lab and is already materialized as a packaged HF export with:

- config.json
- hf_quant_config.json
- model.safetensors.index.json
- model-00001-of-00004.safetensors ... model-00004-of-00004.safetensors

vLLM recognizes it as ModelOpt NVFP4, sees the GPU correctly, and starts the load path, but then fails with:

KeyError: layers.28.mlp.experts.w2_weight_quantizer._double_scale

Additional evidence from the exported checkpoint:

- The export does contain _double_scale keys.
- In the HF index, however, they are named per expert with the HF projection names, e.g.:

model.layers.28.mlp.experts.0.down_proj.weight_quantizer._double_scale
model.layers.28.mlp.experts.0.gate_proj.weight_quantizer._double_scale
model.layers.28.mlp.experts.0.up_proj.weight_quantizer._double_scale
model.layers.28.mlp.experts.1.down_proj.weight_quantizer._double_scale
...

So this does not look like a missing export artifact. It looks more like a naming/loader contract mismatch for Qwen3 MoE ModelOpt NVFP4.
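
To make the naming evidence above easy to reproduce, the following is a minimal sketch that counts the `_double_scale` keys in the export's index by the name component that owns them. It assumes the index follows the usual Hugging Face sharded layout with a top-level `weight_map` object; the function names are hypothetical helpers, not part of vLLM or ModelOpt.

```python
import json
from collections import Counter
from pathlib import Path

# Suffix taken from the key names quoted in this issue.
SUFFIX = "weight_quantizer._double_scale"


def load_weight_map(export_dir):
    """Read the tensor-name -> shard-file map from the HF index (assumed layout)."""
    index_path = Path(export_dir) / "model.safetensors.index.json"
    with index_path.open() as f:
        return json.load(f)["weight_map"]


def summarize_double_scale_keys(weight_map):
    """Count `_double_scale` keys by the name component just before the suffix."""
    counts = Counter()
    for name in weight_map:
        if name.endswith(SUFFIX):
            # "...experts.0.down_proj.weight_quantizer._double_scale" -> "down_proj"
            owner = name[: -len(SUFFIX)].rstrip(".").rsplit(".", 1)[-1]
            counts[owner] += 1
    return dict(counts)
```

Calling `summarize_double_scale_keys(load_weight_map("/workspace/export"))` on an export like the one described here would be expected to show only `down_proj`, `gate_proj`, and `up_proj` owners. A fused-style name would show up as an owner such as `w2_`.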

Reproduction

Our alternative validation script runs the model inside the official NVIDIA vLLM container and tries to load the already exported checkpoint with quantization=modelopt_fp4.

Equivalent minimal repro is:

from vllm import LLM, SamplingParams

llm = LLM(
    model="/workspace/export",
    quantization="modelopt_fp4",
    trust_remote_code=True,
    max_model_len=256,
    max_num_seqs=1,
    gpu_memory_utilization=0.9,
)

outputs = llm.generate(["Hola"], SamplingParams(max_tokens=8, temperature=0.0))
print(outputs[0].outputs[0].text)

We run it with the exported checkpoint mounted at /workspace/export.

Observed result:

KeyError: layers.28.mlp.experts.w2_weight_quantizer._double_scale

Additional runtime facts:

- The GPU is visible inside the container (NVIDIA GB10).
- vLLM identifies the checkpoint as ModelOpt NVFP4.
- The failure happens during checkpoint load, before generation.

Additional context

We already opened the corresponding TRT-LLM upstream issue for the same exported checkpoint because TRT-LLM also fails to load it, but with a different symptom:
- NVIDIA/TensorRT-LLM#12762
In TRT-LLM, the export load fails with weight_scale size mismatches.
In vLLM, the export is recognized as ModelOpt NVFP4, but the loader looks for w2_weight_quantizer._double_scale while the exported HF checkpoint appears to use down_proj/gate_proj/up_proj naming.
This suggests the export itself exists and appears to carry the tensors in question, but the Qwen3 MoE mapping path for ModelOpt NVFP4 may expect a different internal naming contract.
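
One way to read the mismatch: if a loader rewrites per-expert HF names into fused-expert names, the rewritten name for a `down_proj` key would look like the key in the KeyError. The sketch below only illustrates that string rewrite. It is not vLLM's code: the `w2_` prefix is taken from the error message, the `w13_` prefix for gate/up is an assumption, and `to_fused_name` is a hypothetical helper.

```python
import re

# Assumed mapping from HF projection names to fused-parameter prefixes.
# "w2_" appears in the reported KeyError; "w13_" is an assumption for illustration.
HYPOTHETICAL_FUSED_PREFIX = {
    "down_proj": "w2_",
    "gate_proj": "w13_",
    "up_proj": "w13_",
}

_EXPERT_KEY = re.compile(r"experts\.(\d+)\.(down_proj|gate_proj|up_proj)\.")


def to_fused_name(hf_key):
    """Return (fused_name, expert_id), or None if hf_key is not a per-expert key."""
    match = _EXPERT_KEY.search(hf_key)
    if match is None:
        return None
    expert_id = int(match.group(1))
    prefix = HYPOTHETICAL_FUSED_PREFIX[match.group(2)]
    # Drop the expert index and projection name, keep everything after them.
    fused = hf_key[: match.start()] + "experts." + prefix + hf_key[match.end():]
    return fused, expert_id
```

Under these assumptions, `model.layers.28.mlp.experts.0.down_proj.weight_quantizer._double_scale` rewrites to a name ending in `layers.28.mlp.experts.w2_weight_quantizer._double_scale`, which is the key from the error. If that reading holds, the rename itself may be working and the gap may be a missing destination for the `_double_scale` tensors. This is a hypothesis to check against the loader source, not a confirmed diagnosis.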

If useful, I can also provide the exact summary.json, validation_summary.json, and the full stderr log from the run.

extent analysis

TL;DR

The most likely fix is to update the vLLM loader to handle the naming convention used in the exported Qwen3 MoE ModelOpt NVFP4 checkpoint.

Guidance

- Confirm that the exported checkpoint contains the _double_scale keys and note how they are named (per expert, under down_proj/gate_proj/up_proj).
- Check in the vLLM loader code why it looks up w2_weight_quantizer._double_scale when the checkpoint uses down_proj/gate_proj/up_proj names.
- Consider updating the vLLM loader to accept the down_proj/gate_proj/up_proj naming used by the exported Qwen3 MoE ModelOpt NVFP4 checkpoint.
- Review the corresponding TRT-LLM upstream issue (NVIDIA/TensorRT-LLM#12762) for insights that may carry over to the vLLM loader, keeping in mind that the TRT-LLM symptom (weight_scale size mismatches) is different.
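
The first verification step can be partly automated. The sketch below checks that the files named in the issue are present and that every shard referenced by the index exists on disk. It assumes the standard Hugging Face index layout with a `weight_map` object; `check_export_layout` is a hypothetical helper, and passing this check says nothing about whether the tensor contents are valid.

```python
import json
from pathlib import Path

# File names as listed in the issue's description of the export.
REQUIRED_FILES = (
    "config.json",
    "hf_quant_config.json",
    "model.safetensors.index.json",
)


def check_export_layout(export_dir):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(export_dir)
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (root / name).is_file()
    ]
    index_path = root / "model.safetensors.index.json"
    if index_path.is_file():
        weight_map = json.loads(index_path.read_text())["weight_map"]
        # Each distinct shard file referenced by the index should exist.
        for shard in sorted(set(weight_map.values())):
            if not (root / shard).is_file():
                problems.append(f"missing shard: {shard}")
    return problems
```

Running `check_export_layout("/workspace/export")` before the vLLM load helps separate a packaging problem from the loader-side naming mismatch described in this issue.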

Example

A loader-side fix depends on vLLM internals that this issue does not show, so no patch is given here.

Notes

The failure appears specific to loading a Qwen3 MoE ModelOpt NVFP4 checkpoint in vLLM. Because vLLM recognizes the checkpoint format and fails only on a key lookup, the problem is more likely in the loader's name mapping than in the export itself.

Recommendation

Update the vLLM loader to handle the naming convention used in the exported Qwen3 MoE ModelOpt NVFP4 checkpoint. This will likely require loader changes to support the down_proj/gate_proj/up_proj naming scheme.
