vllm - [Bug]: v0.22.0 fails to load nvidia/Qwen3.6-35B-A3B-NVFP4: lm_head.input_scale not registered



Description

vllm/vllm-openai:v0.22.0 fails to load nvidia/Qwen3.6-35B-A3B-NVFP4 with an lm_head.input_scale loader error.

The same checkpoint previously loaded on a recent nightly image I had been using locally:

  • previous working image/version: vllm/vllm-openai:nightly, 0.21.1rc1.dev417+g22a58640b
  • previous working image digest: sha256:4cebac8c03f2cd9f5fabe72ac7c2a0b3aaa8450ef8f0e47429425fd1bfb83d42

After moving to the stable v0.22.0 image, the model fails during weight loading before the server starts.

Model

nvidia/Qwen3.6-35B-A3B-NVFP4

Local checkpoint index contains quantized lm_head entries:

lm_head.input_scale
lm_head.weight
lm_head.weight_scale
lm_head.weight_scale_2
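
To reproduce the listing above on another copy of the checkpoint, a minimal sketch along these lines may help. It assumes the checkpoint directory holds a standard Hugging Face sharded index (a JSON file with a "weight_map" key, usually named model.safetensors.index.json); the function name and the example path are illustrative and do not come from vLLM.

```python
import json
from pathlib import Path


def list_tensor_names(index_path, prefix="lm_head."):
    """Return the sorted tensor names in a safetensors index that start with `prefix`.

    Assumes the Hugging Face sharded-checkpoint layout, where the index file
    maps each tensor name to its shard file under the "weight_map" key.
    """
    index = json.loads(Path(index_path).read_text())
    return sorted(name for name in index["weight_map"] if name.startswith(prefix))


# Hypothetical usage (path is an assumption, adjust to your mount point):
#   for name in list_tensor_names("/model/model.safetensors.index.json"):
#       print(name)
```

If the output includes lm_head.input_scale or the weight_scale entries, the checkpoint ships a quantized lm_head and is likely to hit the same loader path as this report.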

Command

docker run --rm \
  --name vllm-qwen35-nvidia-sci \
  --runtime nvidia \
  --gpus all \
  --ipc=host \
  -p 8082:8000 \
  -v /path/to/qwen3.6-35b-a3b-nvidia-nvfp4:/model:ro \
  vllm/vllm-openai:v0.22.0 \
  /model \
  --served-model-name qwen35-nvidia-nvfp4-sci-thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key local \
  --trust-remote-code \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8 \
  --generation-config vllm \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 8192 \
  --default-chat-template-kwargs '{"enable_thinking":true,"preserve_thinking":true}' \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
  --quantization modelopt \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_xml \
  --enable-chunked-prefill \
  --enable-prefix-caching

Error

ValueError: There is no module or parameter named 'lm_head.input_scale' in Qwen3_5MoeForCausalLM.
The available parameters belonging to lm_head (ParallelLMHead) are: {'lm_head.weight'}

Relevant trace section:

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 674, in load_weights
  return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
...
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 337, in _load_module
  raise ValueError(msg)
ValueError: There is no module or parameter named 'lm_head.input_scale' in Qwen3_5MoeForCausalLM. The available parameters belonging to lm_head (ParallelLMHead) are: {'lm_head.weight'}

Expected behavior

The checkpoint should load, or vLLM should clearly indicate that this ModelOpt/NVFP4 checkpoint format with quantized lm_head is unsupported in v0.22.0.

The failure looks like a ParallelLMHead / quantized lm_head loader registration mismatch: the checkpoint provides lm_head.input_scale, lm_head.weight_scale, and lm_head.weight_scale_2, but vLLM only registers lm_head.weight for this model class.
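
The suspected mismatch can be illustrated as a plain set difference. This is only a sketch of the comparison, not vLLM's loader code: the helper name is hypothetical, the checkpoint names are the lm_head entries listed in this report, and the registered set is the one printed in the error message.

```python
def unregistered_tensors(checkpoint_names, registered_names):
    """Return checkpoint tensor names that have no matching registered parameter."""
    return sorted(set(checkpoint_names) - set(registered_names))


# lm_head entries from the checkpoint index in this report.
checkpoint = [
    "lm_head.input_scale",
    "lm_head.weight",
    "lm_head.weight_scale",
    "lm_head.weight_scale_2",
]
# Parameters the error message says ParallelLMHead exposes.
registered = {"lm_head.weight"}

print(unregistered_tensors(checkpoint, registered))
# ['lm_head.input_scale', 'lm_head.weight_scale', 'lm_head.weight_scale_2']
```

The error names lm_head.input_scale rather than either of the other two, which would be consistent with the loader stopping at the first name it cannot place, although the report does not confirm this.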

Environment

vLLM image: vllm/vllm-openai:v0.22.0
vLLM version: 0.22.0
Image digest: vllm/vllm-openai@sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416
GPU: NVIDIA GeForce RTX 5090, 32607 MiB
Driver: 610.47
Docker: Docker version 29.5.2, build 79eb04c
OS: WSL2 Linux x86_64

Notes

This may be related to the broader ParallelLMHead quantization gap described in #40999. This report, however, is specific to the NVIDIA Qwen3.6 35B A3B NVFP4 checkpoint, which fails to load on the stable v0.22.0 Docker image although a recent nightly loaded it.
