vllm - ✅(Solved) Fix [Bug]: Qwen3-VL-MoE NVFP4 checkpoint (un-BMM'd per-expert format) fails load with IndexError: Dimension out of range [2 pull requests, 1 participant]

vllm · 2026-04-25 20:52:03


GitHub stats

Issue: vllm-project/vllm#40885 (fetched 2026-04-26 05:06:11)
Author: Code4me2
Participants: Code4me2
Timeline (top): referenced ×3, cross-referenced ×1

Error Message

(EngineCore_DP0) ERROR [core.py:1104] EngineCore failed to start.
(EngineCore_DP0) ERROR [core.py:1104] Traceback (most recent call last):
(EngineCore_DP0) ERROR [core.py:1104]   File ".../vllm/v1/engine/core.py", line 1094, in run_engine_core
(EngineCore_DP0) ERROR [core.py:1104]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
...
(EngineCore_DP0) ERROR [core.py:1104]   File ".../vllm/model_executor/models/qwen3_vl_moe.py", line 245, in load_weights
(EngineCore_DP0) ERROR [core.py:1104]     loaded_weight = loaded_weight.transpose(-1, -2)  # no bias
(EngineCore_DP0) ERROR [core.py:1104] IndexError: Dimension out of range (expected to be in range of [-1, 0], but got -2)

Root Cause

ModelOpt quantizes Qwen3-VL-MoE by un-BMM'ing the fused experts.gate_up_proj / experts.down_proj parameter tensors into per-expert nn.ModuleLists. The resulting state dict has names like:

model.language_model.layers.0.mlp.experts.gate_proj.0.weight_packed
model.language_model.layers.0.mlp.experts.gate_proj.0.weight_scale
model.language_model.layers.0.mlp.experts.gate_proj.0.weight_scale_2
model.language_model.layers.0.mlp.experts.gate_proj.0.input_scale
model.language_model.layers.0.mlp.experts.down_proj.0.weight_packed
...

vLLM's qwen3_vl_moe.py load path has three issues with this:

1. is_fused_expert detection uses substring matching. The check "experts.down_proj" in name fires for any name containing that substring, including experts.down_proj.0.weight_scale_2. This routes the tensor into the fused-BMM branch, which then calls loaded_weight.transpose(-1, -2) on a 0-D scalar scale and crashes with IndexError: Dimension out of range (expected to be in range of [-1, 0], but got -2).
2. Name convention mismatch. Even after fixing the crash, the loader expects Mixtral-style experts.{N}.gate_proj.* but ModelOpt emits experts.gate_proj.{N}.* (proj before index). Without name remapping, no tensor matches the expert params mapping and the scales are silently dropped, leaving the model in an uninitialized quantization state that produces garbage output.
3. ignore_suffixes does not list NVFP4 scale names (.weight_scale_2, .weight_global_scale, .input_global_scale, .pre_quant_scale). Scales that happen to fall through all matching branches hit the fused path instead of being cleanly skipped.
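
The false positive in the first item can be illustrated without vLLM or PyTorch. The following is a minimal sketch, assuming the detection is a plain substring test as described above. The function name `is_fused_expert_substring` and the `FUSED_MARKERS` tuple are hypothetical stand-ins, not vLLM's actual code, and the `experts.gate_up_proj` marker is an assumption added by analogy with the quoted `experts.down_proj` check.

```python
# Hypothetical stand-in for the substring-based detection described above.
# This is not vLLM's actual code, only the same style of check.
FUSED_MARKERS = ("experts.gate_up_proj", "experts.down_proj")


def is_fused_expert_substring(name: str) -> bool:
    """Return True if any fused-expert marker appears anywhere in the name."""
    return any(marker in name for marker in FUSED_MARKERS)


prefix = "model.language_model.layers.0.mlp."

# A truly fused BMM tensor name: the projection name is the last component.
fused_name = prefix + "experts.down_proj"
# A ModelOpt per-expert scale: an expert index follows the projection name.
per_expert_scale = prefix + "experts.down_proj.0.weight_scale_2"

for name in (fused_name, per_expert_scale):
    print(name, "->", is_fused_expert_substring(name))
```

Both names are classified as fused, which is how a 0-D scale tensor ends up in the branch that transposes its last two dimensions.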

Fix Action

Proposed fix (PR open, not yet merged when this page was fetched): [Bugfix][Model] Qwen3-VL-MoE NVFP4 (ModelOpt) per-expert weight loading (https://github.com/vllm-project/vllm/pull/40888)

PR fix notes

PR #40888: [Bugfix][Model] Qwen3-VL-MoE NVFP4 (ModelOpt) per-expert weight loading

Repository: vllm-project/vllm
Author: Code4me2
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40888

Description (problem / solution / changelog)

Purpose

Fix checkpoint loading for Qwen3-VL-MoE models quantized by NVIDIA TensorRT-Model-Optimizer (nvidia-modelopt >= 0.43, quant_algo: NVFP4).

ModelOpt un-BMMs Qwen3-VL-MoE's fused expert tensors into per-expert nn.ModuleLists, producing names like experts.gate_proj.0.weight_packed instead of vLLM's expected experts.0.gate_proj.weight_packed. The current loader has three issues with this:

1. is_fused_expert uses substring matching ("experts.down_proj" in name) which false-positives on experts.down_proj.0.weight_scale_2. The loader then calls transpose(-1, -2) on a 0-D scale tensor and crashes with IndexError: Dimension out of range.
2. Even with the crash fixed, names need remapping from experts.<proj>.<N>.<suffix> to experts.<N>.<proj>.<suffix> so they flow through make_expert_params_mapping. Without it, scales are silently dropped and the model loads but outputs garbage.
3. ignore_suffixes does not list NVFP4 scale suffixes (.weight_scale_2, .weight_global_scale, .input_global_scale, .pre_quant_scale).

This PR adds, scoped to qwen3_vl_moe.py only:

1. A precise regex-based is_fused_expert check that targets the truly-fused BMM format and excludes per-expert un-BMM'd names.
2. An up-front name remap for the ModelOpt per-expert form.
3. The four NVFP4 scale suffixes added to ignore_suffixes.
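
The first two changes can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the PR's actual regular expressions are not reproduced in this report, so both patterns, the constant names, and the helper functions below are hypothetical. Including `up_proj` in the remap pattern is also an assumption, since only `gate_proj` and `down_proj` names appear in the checkpoint listing.

```python
import re

# Illustrative patterns only; the PR's real regexes are not shown here.

# Fused BMM form: the projection name ends the string, or is followed by a
# single non-numeric suffix such as ".weight" or ".weight_packed".
TRULY_FUSED_EXPERT_RE = re.compile(
    r"\.experts\.(?:gate_up_proj|down_proj)(?:\.[A-Za-z_]\w*)?$"
)

# ModelOpt per-expert form: experts.<proj>.<N>.<suffix>.
# "up_proj" is assumed by analogy with gate_proj and down_proj.
MODELOPT_PER_EXPERT_RE = re.compile(
    r"^(?P<prefix>.*\.experts)\.(?P<proj>gate_proj|up_proj|down_proj)"
    r"\.(?P<idx>\d+)\.(?P<suffix>.+)$"
)


def is_fused_expert(name: str) -> bool:
    """True only for names in the fused BMM format (no expert index)."""
    return TRULY_FUSED_EXPERT_RE.search(name) is not None


def remap_modelopt_name(name: str) -> str:
    """Rewrite experts.<proj>.<N>.<suffix> to experts.<N>.<proj>.<suffix>.

    Names that do not match the per-expert form are returned unchanged.
    """
    m = MODELOPT_PER_EXPERT_RE.match(name)
    if m is None:
        return name
    return f"{m['prefix']}.{m['idx']}.{m['proj']}.{m['suffix']}"


p = "model.language_model.layers.0.mlp.experts."
print(is_fused_expert(p + "down_proj.weight_packed"))
print(is_fused_expert(p + "down_proj.0.weight_scale_2"))
print(remap_modelopt_name(p + "gate_proj.0.weight_packed"))
```

Anchoring the fused pattern at the end of the string is what excludes the per-expert names: an expert index after the projection name can no longer match. The remap is idempotent in this sketch, because a name already in experts.<N>.<proj> order does not match the per-expert pattern.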

The change is additive — Llama4-style BMM-fused NVFP4 checkpoints continue to load unchanged because the new regex matches them precisely. The recently added LoRA base_layer. adapter handling in fused_expert_params_mapping (#37114) is preserved.

Intended as a near-term stop-gap until RFC #40182 ("Unified ModelOpt Quantization in vLLM") lands; the per-model name-remap shim can be removed once the unified ModelOpt loader path is in place. Mirrors the per-model pattern established by #39045 (Gemma 4 quantized MoE, MERGED).

Fixes: #40885

AI assistance disclosure

Per AGENTS.md, this PR was produced with non-trivial assistance from an AI coding assistant (Anthropic Claude, model claude-opus-4-7). The submitter reviewed every changed line, ran pre-commit run --all-files (passes), executed the regex unit tests added under tests/model_executor/, and ran the integration test against a real ModelOpt NVFP4 Qwen3-VL-MoE checkpoint (Code4me2/bu-30b-a3b-preview-NVFP4). The BF16 vs NVFP4 accuracy comparison below confirms the scales are actually applied. The commit uses the Co-authored-by: Claude trailer per the AGENTS.md example.

This PR is not a duplicate: the only adjacent open work is RFC #40182 (long-term unified ModelOpt path) and the per-model fix series #39045/#39256/#39084/#39406 (Gemma 4 only). No open PR addresses Qwen3-VL-MoE specifically.

Test Plan

Unit tests (added in tests/model_executor/test_qwen3_vl_moe_loader.py): cover the two regex patterns. Verify _TRULY_FUSED_EXPERT_RE matches the fused BMM format (BF16 bare Parameter, Linear-wrapped, NVFP4 fused) but does NOT match the un-BMM'd per-expert form, and that _MODELOPT_PEREXP_RE correctly remaps experts.<proj>.<N>.<suffix> → experts.<N>.<proj>.<suffix>.

Integration test (manual, against a published reference checkpoint):

vllm serve Code4me2/bu-30b-a3b-preview-NVFP4 \
    --max-model-len 32768 --dtype bfloat16 \
    --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --max-num-seqs 2

Confirms the loader no longer crashes and the served model produces correct outputs.

Regression check: Llama4-style BMM-fused NVFP4 checkpoints still load (the new fused-detection regex matches them unchanged because their .weight / .weight_packed suffix follows the proj name with no intermediate numeric index).

Test Result

Before the patch (against Code4me2/bu-30b-a3b-preview-NVFP4):

ERROR [core.py:1104]   File ".../vllm/model_executor/models/qwen3_vl_moe.py", line 245, in load_weights
ERROR [core.py:1104]     loaded_weight = loaded_weight.transpose(-1, -2)
ERROR [core.py:1104] IndexError: Dimension out of range (expected to be in range of [-1, 0], but got -2)
RuntimeError: Engine core initialization failed.

After the patch: server starts, generation produces correct outputs ("What is 7 times 8?" → "56").

Mini-eval, BF16 baseline vs NVFP4 (this patch), same vLLM build on Blackwell sm_120:

| Task | n | BF16 | NVFP4 | Δ | Wall-clock, NVFP4 vs BF16 |
|---|---|---|---|---|---|
| MMLU (0-shot generative) | 200 | 80.5 % | 76.5 % | −4.0 pp | 3.6× faster (9.1 s vs 32.9 s) |
| GSM8K (0-shot CoT) | 200 | 89.5 % | 87.0 % | −2.5 pp | 1.2× faster (146 s vs 174 s) |

Accuracy deltas are within typical NVFP4 ranges for an MoE; the meaningful confirmation is that the deltas are small, which means the scales are actually being applied (a broken loader would produce a much larger gap).
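
The Δ and speedup columns follow directly from the reported accuracies and wall-clock times. A short check, using only the figures from the table above (the dictionary layout and the `summarize` helper are illustrative):

```python
# Figures copied from the mini-eval table; times are in seconds.
results = {
    "MMLU (0-shot generative)": {
        "bf16_acc": 80.5, "nvfp4_acc": 76.5, "bf16_s": 32.9, "nvfp4_s": 9.1,
    },
    "GSM8K (0-shot CoT)": {
        "bf16_acc": 89.5, "nvfp4_acc": 87.0, "bf16_s": 174.0, "nvfp4_s": 146.0,
    },
}


def summarize(row: dict) -> tuple[float, float]:
    """Return (accuracy delta in percentage points, NVFP4 speedup factor)."""
    delta_pp = row["nvfp4_acc"] - row["bf16_acc"]
    speedup = row["bf16_s"] / row["nvfp4_s"]
    return delta_pp, speedup


for task, row in results.items():
    delta_pp, speedup = summarize(row)
    print(f"{task}: delta = {delta_pp:+.1f} pp, speedup = {speedup:.1f}x")
```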

Checklist

Purpose described (and linked to issue #40885)
Test plan provided (unit tests + manual integration commands)
Test result provided (MMLU/GSM8K comparison above)
No documentation update needed (supported_models.md already lists Qwen3-VL-MoE)
DCO sign-off present

Changed files

tests/model_executor/test_qwen3_vl_moe_loader.py (added, +110/-0)
vllm/model_executor/models/qwen3_vl_moe.py (modified, +35/-4)


Issue: Qwen3-VL-MoE NVFP4 per-expert checkpoint fails to load

File via: New Issue → Bug Report

Suggested title: [Bug]: Qwen3-VL-MoE NVFP4 checkpoint (un-BMM'd per-expert format) fails load with IndexError: Dimension out of range

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

PyTorch version: 2.10.0+cu130
CUDA available: True
CUDA runtime version: 13.0
cuDNN version: 9.15.1
GPUs: NVIDIA RTX PRO 6000 Blackwell Max-Q (x2), NVIDIA GeForce RTX 5090
Driver version: 590.48.01 (CUDA 13.1)
Python version: 3.13

vLLM version: 0.17.1rc0 (also reproduced against vLLM main @ 60cd878)
vLLM build flags: CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled

nvidia-modelopt: 0.43.0
transformers: 4.57.6
tiktoken: 0.12.0

</details>

🐛 Describe the bug

vLLM's qwen3_vl_moe.py weight loader crashes when loading an NVFP4-quantized Qwen3-VL-MoE checkpoint produced by NVIDIA TensorRT-Model-Optimizer (nvidia-modelopt ≥ 0.43, configs NVFP4_DEFAULT_CFG / NVFP4_AWQ_LITE_CFG).

Reproduction

A concrete reference checkpoint produced by ModelOpt for this model: Code4me2/bu-30b-a3b-preview-NVFP4 (derived from browser-use/bu-30b-a3b-preview, the 30B Qwen3-VL-MoE browser-agent model)

vllm serve Code4me2/bu-30b-a3b-preview-NVFP4 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.85 \
    --kv-cache-dtype fp8 \
    --max-num-seqs 2

Full traceback

(EngineCore_DP0) ERROR [core.py:1104] EngineCore failed to start.
(EngineCore_DP0) ERROR [core.py:1104] Traceback (most recent call last):
(EngineCore_DP0) ERROR [core.py:1104]   File ".../vllm/v1/engine/core.py", line 1094, in run_engine_core
(EngineCore_DP0) ERROR [core.py:1104]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
...
(EngineCore_DP0) ERROR [core.py:1104]   File ".../vllm/model_executor/models/qwen3_vl_moe.py", line 245, in load_weights
(EngineCore_DP0) ERROR [core.py:1104]     loaded_weight = loaded_weight.transpose(-1, -2)  # no bias
(EngineCore_DP0) ERROR [core.py:1104] IndexError: Dimension out of range (expected to be in range of [-1, 0], but got -2)

Related work

PR #39045 (MERGED) — Gemma 4 quantized MoE loader. Addressed the analogous problem for Gemma 4; similar fix pattern applies here.
PR #39256 (OPEN) — NVFP4 per-expert loading for Gemma 4 MoE (downstream of #39045).

Proposed fix

Extend vllm/model_executor/models/qwen3_vl_moe.py:

Precise regex-based is_fused_expert detection that only matches the truly-fused BMM tensor format (no intermediate per-expert index)
Add an up-front name-remap from ModelOpt's experts.<proj>.<N>.<suffix> to vLLM's standard experts.<N>.<proj>.<suffix>
Extend ignore_suffixes with the NVFP4 scale suffixes so the loader cleanly skips them when they are not present in params_dict

A PR implementing these will be filed separately; this issue is for tracking and discussion.
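
The third item, skipping NVFP4 scale tensors that have no matching model parameter, could look roughly like the sketch below. The four suffixes are the ones named in this issue; the tuple name `NVFP4_SCALE_SUFFIXES`, the helper `should_skip`, and the example parameter names are hypothetical, and vLLM's real ignore_suffixes tuple contains further entries not shown here.

```python
# NVFP4 scale suffixes named in this issue. The tuple and helper names are
# hypothetical; they only illustrate the skip condition.
NVFP4_SCALE_SUFFIXES = (
    ".weight_scale_2",
    ".weight_global_scale",
    ".input_global_scale",
    ".pre_quant_scale",
)


def should_skip(name: str, params_dict: dict) -> bool:
    """Skip a checkpoint tensor only if it carries an ignorable scale suffix
    and the model has no parameter registered under that name."""
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return name.endswith(NVFP4_SCALE_SUFFIXES) and name not in params_dict


# Toy parameter table; real models register many more entries.
params_dict = {"layers.0.mlp.gate.weight": None}

print(should_skip("layers.0.mlp.experts.0.down_proj.pre_quant_scale", params_dict))
print(should_skip("layers.0.mlp.gate.weight", params_dict))
```

The second condition matters: a scale that the quantization method does register as a parameter is still loaded, so only scales with no destination are dropped.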

Additional notes

This is broader than a single model: any NVFP4 / NVFP4_AWQ checkpoint produced by nvidia-modelopt for a Qwen3-VL-MoE architecture will hit this.
The NVFP4_AWQ variant has the additional requirement of runtime pre_quant_scale application, which is a separate feature (not addressed in this issue).

extent analysis

TL;DR

The most likely fix for the Qwen3-VL-MoE NVFP4 per-expert checkpoint loading issue is to modify the qwen3_vl_moe.py weight loader to correctly handle the renamed and restructured tensors produced by NVIDIA TensorRT-Model-Optimizer.

Guidance

Update the is_fused_expert detection to use precise regex-based matching to avoid incorrect routing of tensors.
Implement a name-remap to convert ModelOpt's tensor names to vLLM's standard naming convention.
Extend the ignore_suffixes list to include NVFP4 scale suffixes, ensuring the loader skips them when not present in params_dict.
Verify the fix by loading the provided reference checkpoint Code4me2/bu-30b-a3b-preview-NVFP4 and checking for successful model initialization and correct output.

Example

No code snippet is provided as the issue requires modifications to the qwen3_vl_moe.py file, which is not included in the issue body.

Notes

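The proposed fix follows from the issue's own analysis, and the implementing PR (#40888) was still open when this page was fetched. Per the issue, the NVFP4_AWQ variant additionally requires runtime pre_quant_scale application, which this fix does not cover. Any change should be regression-tested against Llama4-style BMM-fused NVFP4 checkpoints, which must continue to load.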

Recommendation

Apply the proposed fix to the qwen3_vl_moe.py weight loader so that it handles the renamed, per-expert tensors produced by NVIDIA TensorRT-Model-Optimizer. With the fix in place, the NVFP4 per-expert checkpoint should load and produce the expected output.
