vllm - ✅(Solved) Fix [Bug]: MiniMax-M2.5-NVFP4: KeyError 'layers.0.self_attn.qkv_proj.k_scale' during load_weights — checkpoint uses split q/k/v, loader expects fused qkv_proj [1 pull requests, 3 comments, 2 participants]

GitHub stats
vllm-project/vllm#39314 · Fetched 2026-04-09 07:51:56
Comments: 3 · Participants: 2 · Timeline: 6 · Reactions: 1
Timeline (top): commented ×3, closed ×1, cross-referenced ×1, labeled ×1

Error Message

File ".../vllm/model_executor/models/minimax_m2.py", line 442, in load_weights
    param = params_dict[name]
KeyError: 'layers.0.self_attn.qkv_proj.k_scale'

Fix Action

Fixed

PR fix notes

PR #37214: Fix minimax m2.5 nvfp4 kv scales weight loading

Description (problem / solution / changelog)

Purpose

Fix kv scale weight loading issue in minimax m2.5 nvfp4:

vllm serve nvidia/MiniMax-M2.5-NVFP4 --trust-remote-code --tensor-parallel-size 2
 [multiproc_executor.py:852]   File "/vllm/vllm/model_executor/models/utils.py", line 268, in _load_module
 [multiproc_executor.py:852]     loaded_params = module_load_weights(weights)
 [multiproc_executor.py:852]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 [multiproc_executor.py:852]   File "/vllm/vllm/model_executor/models/minimax_m2.py", line 442, in load_weights
 [multiproc_executor.py:852]     param = params_dict[name]
 [multiproc_executor.py:852]             ~~~~~~~~~~~^^^^^^
 [multiproc_executor.py:852] KeyError: 'layers.0.self_attn.qkv_proj.k_scale'
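
One plausible way a key such as layers.0.self_attn.qkv_proj.k_scale can arise is a loader that rewrites every q_proj / k_proj / v_proj name to the fused qkv_proj before looking it up, including KV-cache scale entries that do not belong to the fused linear layer. The toy sketch below reproduces that behavior in plain Python. It is not the code from minimax_m2.py or from this PR: the mapping table, both function names, and the attn.k_scale / attn.v_scale target names are assumptions made for illustration only.

```python
# Toy reproduction of a fused-QKV name remap. All names are hypothetical
# and are not copied from vLLM.

# (fused parameter name, checkpoint shard name): assumed mapping table.
STACKED_PARAMS_MAPPING = [
    ("qkv_proj", "q_proj"),
    ("qkv_proj", "k_proj"),
    ("qkv_proj", "v_proj"),
]

# KV-cache scale suffixes as they appear in the checkpoint keys quoted
# in the issue (k_proj.k_scale, v_proj.v_scale).
KV_SCALE_SUFFIXES = (".k_scale", ".v_scale")


def naive_remap(name: str) -> str:
    """Rewrite split q/k/v names to the fused name, with no special cases."""
    for fused, shard in STACKED_PARAMS_MAPPING:
        if shard in name:
            return name.replace(shard, fused)
    return name


def guarded_remap(name: str) -> str:
    """Route KV-cache scales away from the fused projection first.

    The '<layer>.self_attn.attn.<scale>' target is an assumed location
    for the scale parameter, chosen only to make the sketch concrete.
    """
    for suffix in KV_SCALE_SUFFIXES:
        if name.endswith(suffix) and ".self_attn." in name:
            layer_prefix, _, _ = name.rpartition(".self_attn.")
            return f"{layer_prefix}.self_attn.attn{suffix}"
    return naive_remap(name)


# Toy parameter table standing in for the model's params_dict.
params_dict = {
    "layers.0.self_attn.qkv_proj.weight": "fused weight",
    "layers.0.self_attn.attn.k_scale": "k cache scale",
    "layers.0.self_attn.attn.v_scale": "v cache scale",
}

checkpoint_key = "layers.0.self_attn.k_proj.k_scale"
missing = naive_remap(checkpoint_key) not in params_dict   # would raise KeyError
resolved = guarded_remap(checkpoint_key) in params_dict    # lookup succeeds
```

In this sketch the naive remap produces exactly the key named in the traceback, while handling the scale suffixes before the fused mapping lets the lookup succeed. Whether the actual patch takes this route is not stated on this page.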

Test Plan

lm_eval --model local-completions --model_args "base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9310|±  |0.0070|
|     |       |strict-match    |     5|exact_match|↑  |0.9287|±  |0.0071|


Changed files

  • vllm/model_executor/models/minimax_m2.py (modified, +11/-0)


Your current environment

vLLM: 0.18.0 (image: vllm/vllm-audio:v0.18.0 or equivalent)
Model: (full main snapshot)
GPU: NVIDIA H200 NVL, tensor_parallel_size=4
Quantization: ModelOpt NVFP4 (as detected by vLLM: Detected ModelOpt NVFP4 checkpoint)

🐛 Describe the bug

Summary

Loading nvidia/MiniMax-M2.5-NVFP4 from a full Hugging Face–style directory (29 safetensors shards + model.safetensors.index.json) fails during load_weights with:

KeyError: 'layers.0.self_attn.qkv_proj.k_scale'

The index file does not contain any qkv_proj tensors; attention weights use separate q_proj, k_proj, and v_proj (e.g. model.layers.0.self_attn.k_proj.k_scale). So this looks like a loader / checkpoint layout mismatch, not a missing or corrupted file.

python3 -m vllm.entrypoints.openai.api_server \
  --model /path/to/MiniMax-M2.5-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --tool-call-parser minimax_m2 \
  --enable-auto-tool-choice

Actual behavior

Engine startup fails while workers load weights. Stack trace includes:

File ".../vllm/model_executor/models/minimax_m2.py", line 442, in load_weights
    param = params_dict[name]
KeyError: 'layers.0.self_attn.qkv_proj.k_scale'

Expected behavior

Weights load successfully for the official nvidia/MiniMax-M2.5-NVFP4 checkpoint, or documentation clearly states a minimum vLLM commit / image and any required conversion if the public HF layout is intentionally “unfused” q/k/v.

Evidence: checkpoint key layout

In model.safetensors.index.json from the HF repo:

  • There are no keys containing qkv_proj (e.g. grep -c qkv_proj → 0).
  • Layer 0 attention includes entries such as model.layers.0.self_attn.k_proj.k_scale, model.layers.0.self_attn.q_proj.weight, and model.layers.0.self_attn.v_proj.v_scale (and related weights).

So the loader expecting layers.0.self_attn.qkv_proj.k_scale does not match the published safetensors index for this model.
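
The index check described above can also be scripted. The sketch below assumes the standard Hugging Face sharded-checkpoint index layout, where model.safetensors.index.json holds a weight_map object keyed by tensor name. The function name summarize_attention_keys and the keys of the returned dictionary are made up for this example.

```python
import json
from pathlib import Path


def summarize_attention_keys(index_path, layer=0):
    """Count fused vs. split attention keys in a safetensors index file.

    Assumes the index JSON has a top-level "weight_map" mapping tensor
    names to shard file names.
    """
    weight_map = json.loads(Path(index_path).read_text())["weight_map"]
    prefix = f"model.layers.{layer}.self_attn."
    return {
        # Equivalent in spirit to `grep -c qkv_proj` on the index.
        "fused_qkv_keys": sum("qkv_proj" in key for key in weight_map),
        # Keys that live under a separate q_proj, k_proj, or v_proj module.
        "split_qkv_keys": sum(
            any(f".{p}_proj." in key for p in ("q", "k", "v"))
            for key in weight_map
        ),
        # Attention entries for the requested layer, for manual inspection.
        "layer_keys": sorted(k for k in weight_map if k.startswith(prefix)),
    }
```

Calling summarize_attention_keys("/path/to/MiniMax-M2.5-NVFP4/model.safetensors.index.json") on the checkpoint described in this report should give a fused_qkv_keys count of 0 if the index matches what the reporter observed.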

Questions for maintainers

  • Should MiniMaxM2ForCausalLM + ModelOpt NVFP4 use fused qkv_proj in vLLM while the NVIDIA HF checkpoint ships unfused q_proj / k_proj / v_proj? If so, is there a planned weight remapping or fusion step?
  • Is there a known-good vLLM version / nightly for nvidia/MiniMax-M2.5-NVFP4 that we should use instead of 0.18.0?

Additional context

  • Full HF file set present locally (29 shards, index, config.json, hf_quant_config.json, tokenizer files, modeling_minimax_m2.py, etc.), ~130 GiB, consistent with the model’s file list.
  • NVIDIA model card references TensorRT-LLM for the primary NVFP4 recipe; we are specifically trying vLLM and hitting this key mismatch.

Thank you.

serving, bug, model: MiniMax, quantization


extent analysis

TL;DR

The most likely fix is to update the weight loading logic in vllm to match the "unfused" q/k/v layout used in the NVIDIA MiniMax-M2.5-NVFP4 checkpoint.

Guidance

  • Verify that the model.safetensors.index.json file does not contain any keys with qkv_proj by running grep -c qkv_proj on the file.
  • Check the minimax_m2.py file for any hardcoded assumptions about the weight layout and consider updating it to handle both "fused" and "unfused" layouts.
  • Investigate whether a newer version of vllm has already addressed this issue, as the current version (0.18.0) seems to expect a "fused" layout.
  • Consider reaching out to the maintainers to ask about a planned weight remapping or fusion step for the MiniMaxM2ForCausalLM model with ModelOpt NVFP4.

Example

No code snippet is provided as the issue is more related to the weight layout and loading logic rather than a specific code bug.

Notes

The issue seems to be specific to the MiniMax-M2.5-NVFP4 model and the vllm version 0.18.0. The solution may involve updating the vllm code or using a different version of vllm that supports the "unfused" q/k/v layout.

Recommendation

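Use a vLLM build that includes PR #37214 ("Fix minimax m2.5 nvfp4 kv scales weight loading"), which this page lists as the fix and which adds 11 lines to vllm/model_executor/models/minimax_m2.py. If that is not possible, patch the weight loading logic locally so that the k_scale and v_scale entries of the unfused q/k/v checkpoint are handled during load_weights.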
