vllm - 💡(How to fix) Fix [Bug]: Qwen3-Next NVFP4 quants silently produce garbage when linear_attn weights are missing from quantization_config.ignore [1 participants]

vllm-project/vllm#40252 · Fetched 2026-04-19 15:04:44

[Bug]: Qwen3-Next NVFP4 quants silently produce garbage when linear_attn weights are missing from quantization_config.ignore

Your current environment

<details> <summary>Reproduction environment</summary>
  • vLLM: 0.20.0.dev (also reproduces on 0.16.0rc2 per sglang#20973)
  • Hardware: NVIDIA DGX Spark (GB10, sm_121); reproduces on any GPU
  • Model: RedHatAI/Qwen3.6-35B-A3B-NVFP4 (also Sehyo/Qwen3.5-35B-A3B-NVFP4, AxionML/Qwen3.5-35B-A3B-NVFP4, apolo13x/Qwen3.5-35B-A3B-NVFP4, mmangkad/Qwen3.6-35B-A3B-NVFP4 — every NVFP4 quant of a Qwen3-Next-family model I've tried)
  • Serving: any vLLM config with --enable-prefix-caching, compressed-tensors quantization
</details>

🐛 Describe the bug

Every community NVFP4 quant of a Qwen3-Next-family model (Qwen 3.5, 3.6, and presumably 3.7 going forward) ships a quantization_config.ignore list that names the old split tensor names for the linear-attention block:

model.language_model.layers.{i}.linear_attn.in_proj_qkv
model.language_model.layers.{i}.linear_attn.in_proj_z
model.language_model.layers.{i}.linear_attn.in_proj_b
model.language_model.layers.{i}.linear_attn.in_proj_a

The actual safetensors shards, however, use the new combined names:

model.language_model.layers.{i}.linear_attn.in_proj_qkvz
model.language_model.layers.{i}.linear_attn.in_proj_ba

At load time, vLLM:

  1. Sees the combined in_proj_qkvz / in_proj_ba weights in the checkpoint
  2. Checks quantization_config.ignore for them — they're not there
  3. Attempts to load them as NVFP4-quantized (expecting scale factors that don't exist)
  4. Silently skips them (log message only: Parameter layers.X.linear_attn.in_proj_qkvz.weight not found in params_dict, skip loading)
  5. Leaves those layers effectively zero-valued at inference time
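The name mismatch behind steps 1 and 2 can be illustrated with a small, self-contained sketch. The `is_ignored` helper below is hypothetical and is only a simplified model of ignore-list matching (exact names, plus regular expressions for entries prefixed with `re:`, as in the patterns used in the workaround); it is not vLLM's or compressed-tensors' actual code.

```python
import re


def is_ignored(module_name: str, ignore: list[str]) -> bool:
    """Simplified, hypothetical model of ignore-list matching.

    Entries starting with "re:" are treated as regular expressions;
    every other entry must equal the module name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


prefix = "model.language_model.layers.0.linear_attn."

# Ignore list as shipped: only the old split names.
shipped_ignore = [
    prefix + name
    for name in ("in_proj_qkv", "in_proj_z", "in_proj_b", "in_proj_a")
]

# Module names actually present in the checkpoint: the combined names.
checkpoint_modules = [prefix + "in_proj_qkvz", prefix + "in_proj_ba"]

for name in checkpoint_modules:
    status = "ignored" if is_ignored(name, shipped_ignore) else "NOT ignored"
    print(f"{name}: {status}")

# With the two patterns from the workaround added, both modules are covered.
patched_ignore = shipped_ignore + [
    r"re:.*linear_attn\.in_proj_qkvz$",
    r"re:.*linear_attn\.in_proj_ba$",
]
print(all(is_ignored(name, patched_ignore) for name in checkpoint_modules))
```

Under this model, neither combined name matches any shipped entry, because `in_proj_qkv` and `in_proj_qkvz` are different exact strings.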

Result: the model serves and produces output, but the output is degenerate — every token collapses to the same character (we observed !!!!!!!!... for every prompt with temperature=0).

This is a silent correctness failure — no exception, no WARNING about the nvfp4 path, inference continues with broken weights.

Steps to reproduce

vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --trust-remote-code --tensor-parallel-size 1

Then:

curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"RedHatAI/Qwen3.6-35B-A3B-NVFP4",
       "messages":[{"role":"user","content":"Hello"}],
       "max_tokens":20,"temperature":0.0}' | jq -r '.choices[0].message.content'

Expected: a coherent greeting. Actual: !!!!!!!!!!!!!!!!!!!!.
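A smoke test for this symptom could be as simple as checking whether the reply is one character repeated. The sketch below is an illustration only; `looks_degenerate` and its `min_len` threshold are hypothetical names and values, not part of vLLM.

```python
def looks_degenerate(text: str, min_len: int = 8) -> bool:
    """Return True when the reply is a long run of one repeated character."""
    stripped = text.strip()
    return len(stripped) >= min_len and len(set(stripped)) == 1


print(looks_degenerate("!!!!!!!!!!!!!!!!!!!!"))              # True
print(looks_degenerate("Hello! How can I help you today?"))  # False
```

This only catches the single-character collapse described here; other kinds of garbage output would need a different check.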

During startup, vLLM logs (paraphrased) for every hybrid-attention layer:

WARNING ... Parameter layers.N.linear_attn.in_proj_qkvz.weight not found in params_dict, skip loading
WARNING ... Parameter layers.N.linear_attn.in_proj_ba.weight not found in params_dict, skip loading
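Because these lines are easy to miss in startup noise, one user-side guard is to scan a captured startup log for them before trusting the server. The sketch below assumes the paraphrased wording shown above; the real message text may differ, so the regular expression is an assumption, and `skipped_parameters` and the sample log (with `layers.3` standing in for `layers.N`) are purely illustrative.

```python
import re

# Assumed message shape, based on the paraphrased log lines in this report.
SKIP_RE = re.compile(r"Parameter (\S+) not found in params_dict, skip loading")


def skipped_parameters(log_text: str) -> list[str]:
    """Return the parameter names the loader reported as skipped."""
    return SKIP_RE.findall(log_text)


sample_log = """\
WARNING ... Parameter layers.3.linear_attn.in_proj_qkvz.weight not found in params_dict, skip loading
WARNING ... Parameter layers.3.linear_attn.in_proj_ba.weight not found in params_dict, skip loading
INFO ... Starting server
"""

skipped = skipped_parameters(sample_log)
if skipped:
    print(f"{len(skipped)} parameter(s) skipped; output may be incorrect:")
    for name in skipped:
        print(f"  {name}")
```

In a deployment script, a non-empty result could be treated as a hard failure instead of a printed notice.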

Workaround

Patch the model's config.json to add the combined-name patterns to the ignore list before serving:

import json
import pathlib

# Path to the snapshot's config.json (elided here, as in the original report).
cfg_path = pathlib.Path(".../snapshot/config.json")
cfg = json.loads(cfg_path.read_text())

# Add the combined-name patterns; skip any that are already present.
ignore = cfg["quantization_config"]["ignore"]
for pat in [r"re:.*linear_attn\.in_proj_qkvz$",
            r"re:.*linear_attn\.in_proj_ba$"]:
    if pat not in ignore:
        ignore.append(pat)

cfg_path.write_text(json.dumps(cfg, indent=2))

This tells vLLM to load those layers as unquantized BF16 (which is what the safetensors actually contain), and the model serves correctly.
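Before serving, a pre-flight check along the following lines could flag checkpoints that need this patch. It is a sketch under stated assumptions: a module is taken to be unquantized when it has a plain `.weight` tensor and no `.weight_scale` tensor (an assumption about the checkpoint layout, not something confirmed in this report), matching follows the same simplified exact-or-`re:` model as the workaround patterns, and the function names and the `only_containing` filter are hypothetical. Tensor names could be taken, for example, from the keys of the `weight_map` in a sharded checkpoint's index file.

```python
import re


def _covered(module: str, ignore: list[str]) -> bool:
    """Exact match, or regex match for entries prefixed with "re:"."""
    return any(
        re.match(e[3:], module) is not None if e.startswith("re:") else e == module
        for e in ignore
    )


def unquantized_but_not_ignored(
    tensor_names: list[str],
    ignore: list[str],
    only_containing: str = "linear_attn.in_proj",
) -> list[str]:
    """List modules that look unquantized yet are missing from the ignore list.

    Assumption: a module with `<module>.weight` and no `<module>.weight_scale`
    is stored unquantized.
    """
    names = set(tensor_names)
    flagged = []
    for name in sorted(names):
        if not name.endswith(".weight") or only_containing not in name:
            continue
        module = name[: -len(".weight")]
        if f"{module}.weight_scale" in names:
            continue  # has a scale tensor, so presumably quantized
        if not _covered(module, ignore):
            flagged.append(module)
    return flagged


layer = "model.language_model.layers.0.linear_attn."
tensor_names = [layer + "in_proj_qkvz.weight", layer + "in_proj_ba.weight"]
shipped = [layer + n for n in ("in_proj_qkv", "in_proj_z", "in_proj_b", "in_proj_a")]
patched = shipped + [
    r"re:.*linear_attn\.in_proj_qkvz$",
    r"re:.*linear_attn\.in_proj_ba$",
]

print(unquantized_but_not_ignored(tensor_names, shipped))  # both modules flagged
print(unquantized_but_not_ignored(tensor_names, patched))  # nothing flagged
```

The `only_containing` filter keeps the check focused on the linear-attention projections; without it, tensors that are never listed in `ignore` (norm weights, for instance) would be flagged as well.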

Suggested fix

Two non-mutually-exclusive options:

  1. Loudly warn and refuse to serve when the weight loader skips named parameters for a hybrid-attention layer in a compressed-tensors NVFP4 path. A per-line "skip loading" message at INFO/WARNING level is not enough signal for a correctness failure of this magnitude; it gets lost in startup noise. Suggest raising an exception, or at least logging an ERROR-level message that includes "model will produce incorrect output."

  2. Auto-accept combined-name aliases when the quantization_config.ignore lists the split names. Since the split→combined rename is a known Qwen3-Next migration, vLLM could treat the presence of in_proj_qkv/in_proj_z in the ignore list as equivalent to in_proj_qkvz if the actual tensor has the combined name.

Option 1 is safer and more general. Option 2 fixes the common case automatically.
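Option 2 could look roughly like the following sketch. The mapping and the `expand_ignore_aliases` function are hypothetical and do not correspond to any existing vLLM API; this version adds a combined-name alias only when every split part of it is listed under the same parent module, which is a stricter reading than the suggestion above.

```python
# Combined name -> the split names it replaces (from the rename described above).
COMBINED_PARTS = {
    "in_proj_qkvz": ("in_proj_qkv", "in_proj_z"),
    "in_proj_ba": ("in_proj_b", "in_proj_a"),
}


def expand_ignore_aliases(ignore: list[str]) -> list[str]:
    """Return the ignore list plus combined-name aliases for split names.

    Regex entries (prefixed with "re:") are passed through untouched.
    """
    exact = {e for e in ignore if not e.startswith("re:")}
    expanded = list(ignore)
    parents = sorted({e.rpartition(".")[0] for e in exact})
    for parent in parents:
        for combined, parts in COMBINED_PARTS.items():
            alias = f"{parent}.{combined}"
            all_parts_listed = all(f"{parent}.{p}" in exact for p in parts)
            if all_parts_listed and alias not in exact:
                expanded.append(alias)
    return expanded


parent = "model.language_model.layers.0.linear_attn"
shipped = [f"{parent}.{n}" for n in ("in_proj_qkv", "in_proj_z", "in_proj_b", "in_proj_a")]

for entry in expand_ignore_aliases(shipped):
    print(entry)
```

Requiring all parts avoids treating a combined tensor as unquantized when only one of its halves was excluded from quantization; a more permissive variant would add the alias if any part is listed.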

Also affected

Same bug surfaced for sglang in sgl-project/sglang#20973. The root cause is the quant authors' ignore-list, but the user-facing symptom (silent garbage output) is identical across inference engines — argues for the inference-engine-side fix.

Reference

Negative-finding gist with more context and the MoE kernel tuning trap that bit me while debugging this: https://gist.github.com/cghart/5374e7f749cb02e1ea96282893de64bd

extent analysis

TL;DR

The immediate workaround for the silent correctness failure in Qwen3-Next NVFP4 quants is to patch the model's config.json so that quantization_config.ignore includes the combined-name patterns. The longer-term fix belongs in vLLM: warn loudly and refuse to serve when the weight loader skips named parameters.

Guidance

  • Update the quantization_config.ignore list in the model's config.json to include the combined-name patterns (in_proj_qkvz and in_proj_ba) so that vLLM loads those layers as unquantized BF16.
  • On the vLLM side, consider warning loudly and refusing to serve when the weight loader skips named parameters for a hybrid-attention layer in a compressed-tensors NVFP4 path.
  • After applying the workaround or fix, verify the result: the reproduction request with temperature=0 should return a coherent greeting rather than a run of one repeated character.
  • The same symptom was reported for sglang (sgl-project/sglang#20973), so check other Qwen3-Next-family NVFP4 quants and other inference engines as well.

Notes

  • The root cause is a naming mismatch: quantization_config.ignore lists the old split tensor names, while the safetensors shards use the new combined names.
  • The workaround may need adapting for other affected models and inference engines.

Recommendation

Apply the workaround by patching the model's config.json to add the combined-name patterns to the ignore list. vLLM then loads those layers as unquantized BF16, which should resolve the silent correctness failure.
