vllm - 💡(How to fix) Fix [Bug]: v0.22.0 fails to load nvidia/Qwen3.6-35B-A3B-NVFP4: lm_head.input_scale not registered

vllm · 2026-05-30 22:32:05



Description

vllm/vllm-openai:v0.22.0 fails to load nvidia/Qwen3.6-35B-A3B-NVFP4 with an lm_head.input_scale loader error.

The same checkpoint previously loaded on a recent nightly image I had been using locally:

previous working image/version: vllm/vllm-openai:nightly, 0.21.1rc1.dev417+g22a58640b
previous working image digest: sha256:4cebac8c03f2cd9f5fabe72ac7c2a0b3aaa8450ef8f0e47429425fd1bfb83d42

After moving to the stable v0.22.0 image, the model fails during weight loading before the server starts.

Model

nvidia/Qwen3.6-35B-A3B-NVFP4

The local checkpoint index contains quantized lm_head entries:

lm_head.input_scale
lm_head.weight
lm_head.weight_scale
lm_head.weight_scale_2
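One way to confirm which lm_head tensors a sharded checkpoint ships is to read its safetensors index directly. Below is a minimal sketch. It assumes the standard Hugging Face model.safetensors.index.json layout, where a top-level "weight_map" maps each tensor name to a shard file. The helper name list_prefixed_tensors is hypothetical.

```python
import json
from pathlib import Path


def list_prefixed_tensors(checkpoint_dir: str, prefix: str = "lm_head.") -> list[str]:
    """Hypothetical helper: return sorted tensor names that start with `prefix`.

    Assumes the Hugging Face sharded layout, i.e. a model.safetensors.index.json
    file containing a "weight_map" of {tensor_name: shard_file}.
    """
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    with index_path.open(encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]
    # Only tensor names are inspected; no shard data is read.
    return sorted(name for name in weight_map if name.startswith(prefix))
```

To use it, call list_prefixed_tensors("/path/to/qwen3.6-35b-a3b-nvidia-nvfp4"). For this checkpoint it should return the four lm_head entries listed above.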

Command

docker run --rm \
  --name vllm-qwen35-nvidia-sci \
  --runtime nvidia \
  --gpus all \
  --ipc=host \
  -p 8082:8000 \
  -v /path/to/qwen3.6-35b-a3b-nvidia-nvfp4:/model:ro \
  vllm/vllm-openai:v0.22.0 \
  /model \
  --served-model-name qwen35-nvidia-nvfp4-sci-thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key local \
  --trust-remote-code \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8 \
  --generation-config vllm \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 8192 \
  --default-chat-template-kwargs '{"enable_thinking":true,"preserve_thinking":true}' \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
  --quantization modelopt \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_xml \
  --enable-chunked-prefill \
  --enable-prefix-caching
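The command above selects --quantization modelopt, so a quick pre-flight check can show whether the checkpoint's ModelOpt config leaves lm_head quantized. Below is a minimal sketch with two assumptions. First, the checkpoint ships an hf_quant_config.json whose "quantization" section holds "quant_algo" and an "exclude_modules" list of glob-style patterns. Second, those field names match this checkpoint. The helper names are hypothetical.

```python
import fnmatch
import json
from pathlib import Path


def lm_head_is_quantized(quant_cfg: dict, module: str = "lm_head") -> bool:
    """Hypothetical check: True if `module` is not matched by any exclude pattern.

    Assumes a ModelOpt-style config of the form
    {"quantization": {"quant_algo": ..., "exclude_modules": [...]}},
    where patterns may use shell-style wildcards.
    """
    section = quant_cfg.get("quantization", {})
    if not section.get("quant_algo"):
        return False  # no quantization algorithm declared at all
    excluded = section.get("exclude_modules") or []
    return not any(fnmatch.fnmatchcase(module, pat) for pat in excluded)


def check_checkpoint(checkpoint_dir: str) -> bool:
    """Hypothetical wrapper: read hf_quant_config.json and check lm_head."""
    cfg_path = Path(checkpoint_dir) / "hf_quant_config.json"
    with cfg_path.open(encoding="utf-8") as f:
        return lm_head_is_quantized(json.load(f))
```

A True result means the config does not exclude lm_head from quantization. That is consistent with the lm_head.input_scale, lm_head.weight_scale, and lm_head.weight_scale_2 tensors in the index.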

Error

ValueError: There is no module or parameter named 'lm_head.input_scale' in Qwen3_5MoeForCausalLM.
The available parameters belonging to lm_head (ParallelLMHead) are: {'lm_head.weight'}

Relevant trace section:

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 674, in load_weights
  return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
...
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 337, in _load_module
  raise ValueError(msg)
ValueError: There is no module or parameter named 'lm_head.input_scale' in Qwen3_5MoeForCausalLM. The available parameters belonging to lm_head (ParallelLMHead) are: {'lm_head.weight'}
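The error is raised when the weight loader meets a checkpoint tensor that has no matching registered parameter. The sketch below is a simplified, hypothetical model of that check; it is not vLLM's actual _load_module code. It reports every checkpoint key under a module prefix that the model did not register, then formats a message shaped like the one above.

```python
def unregistered_keys(checkpoint_keys, registered_params, module_prefix):
    """Hypothetical: checkpoint keys under `module_prefix` with no registered parameter.

    Simplified illustration only. vLLM's real loader also applies the
    hf_to_vllm_mapper seen in the trace, plus other logic not modeled here.
    """
    registered = set(registered_params)
    return sorted(
        key
        for key in checkpoint_keys
        if key.startswith(module_prefix + ".") and key not in registered
    )


def describe_mismatch(checkpoint_keys, registered_params, module_prefix, model_cls, module_cls):
    """Format the first mismatch like the loader error, or return None if all keys match."""
    missing = unregistered_keys(checkpoint_keys, registered_params, module_prefix)
    if not missing:
        return None
    available = {p for p in registered_params if p.startswith(module_prefix + ".")}
    return (
        f"There is no module or parameter named '{missing[0]}' in {model_cls}. "
        f"The available parameters belonging to {module_prefix} ({module_cls}) are: {available}"
    )
```

Suppose the checkpoint keys are the four lm_head entries and only lm_head.weight is registered. Then the first unmatched key is lm_head.input_scale, which is exactly the name in the reported error.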

Expected behavior

The checkpoint should load, or vLLM should clearly indicate that this ModelOpt/NVFP4 checkpoint format with quantized lm_head is unsupported in v0.22.0.

The failure looks like a ParallelLMHead / quantized lm_head loader registration mismatch: the checkpoint provides lm_head.input_scale, lm_head.weight_scale, and lm_head.weight_scale_2, but vLLM only registers lm_head.weight for this model class.
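It helps to recall how NVFP4 weights are stored to see why registering only lm_head.weight is not enough. Each weight is a 4-bit E2M1 code, packed two per byte. The code is recovered by multiplying by a per-block scale (weight_scale, one per 16 elements) and a per-tensor scale (weight_scale_2). input_scale is the corresponding activation scale. Below is a minimal pure-Python sketch of dequantizing one row. It assumes low-nibble-first packing and block scales already decoded from FP8 to float. It is an illustration, not vLLM's kernel.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 uses one block scale per 16 consecutive elements


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit sign-magnitude E2M1 code into a float."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def dequantize_row(packed: bytes, block_scales: list[float], global_scale: float) -> list[float]:
    """Dequantize one packed NVFP4 row (illustrative sketch).

    Assumptions: two codes per byte with the low nibble first, and block
    scales already converted from FP8 E4M3 to Python floats.
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)         # first element: low nibble (assumed order)
        codes.append((byte >> 4) & 0x0F)  # second element: high nibble
    if len(codes) != len(block_scales) * BLOCK_SIZE:
        raise ValueError("expected one block scale per 16 elements")
    return [
        decode_e2m1(code) * block_scales[i // BLOCK_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]
```

The presence of weight_scale and weight_scale_2 suggests lm_head.weight is stored in this packed form. If so, ignoring the extra scale tensors would not recover usable weights. The head has to be loaded through a quantized linear path, or rejected with a clear error.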

Environment

vLLM image: vllm/vllm-openai:v0.22.0
vLLM version: 0.22.0
Image digest: vllm/vllm-openai@sha256:0fec7ec5f3e6bc168e54899935fb0557da908a4832a1dbc88e2debcf2f889416
GPU: NVIDIA GeForce RTX 5090, 32607 MiB
Driver: 610.47
Docker: Docker version 29.5.2, build 79eb04c
OS: WSL2 Linux x86_64

Notes

This may be related to the broader ParallelLMHead quantization gap described in #40999. However, this report is specifically about the NVIDIA Qwen3.6 35B A3B NVFP4 checkpoint, which fails to load on the stable v0.22.0 Docker image even though a recent nightly had loaded it.
