vllm: [Bug][DSV4] compressor / indexer.weights_proj / indexer.wq_b hardcoded with quant_config=None; breaks load of artifacts that calibrate these attention sub-modules



Summary

In vllm/models/deepseek_v4/compressor.py and vllm/models/deepseek_v4/nvidia/ops/attention.py, four modules are constructed with quant_config=None, i.e. unconditionally as unquantized BF16 modules: the compressor's fused_wkv_wgate, and the indexer's weights_proj, compressor.fused_wkv_wgate, and wq_b. This breaks loading of any DSv4-Flash artifact whose calibration recipe quantizes these attention sub-modules.

Repro

Artifact: canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (per the predecessor canada-quant/DeepSeek-V4-Flash-W4A16-FP8 recipe, attention path is FP8_BLOCK including the compressor and indexer).

from safetensors import safe_open

# Inspect one compressor weight and its block scale in the first shard.
with safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
    t  = f.get_tensor("layers.10.attn.compressor.wkv.weight")
    ts = f.get_tensor("layers.10.attn.compressor.wkv.weight_scale")
    print(t.dtype, t.shape)    # torch.float8_e4m3fn  torch.Size([1024, 4096])
    print(ts.dtype, ts.shape)  # torch.bfloat16  torch.Size([8, 32])  (block 128×128 scales)
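
The scale shape printed above follows directly from the weight shape and the 128×128 block size. A minimal sketch of that arithmetic, where the helper name block_scale_shape is made up for illustration and is not a vLLM or compressed-tensors function:

```python
import math

def block_scale_shape(weight_shape, block=(128, 128)):
    """Return the (rows, cols) of the per-block scale tensor for a 2-D weight."""
    rows, cols = weight_shape
    block_rows, block_cols = block
    # One scale per block; a partial block at an edge still needs its own scale.
    return (math.ceil(rows / block_rows), math.ceil(cols / block_cols))

print(block_scale_shape((1024, 4096)))  # (8, 32)
```

This matches the torch.Size([8, 32]) reported for the weight_scale tensor in the repro.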

vLLM source (current vllm/models/deepseek_v4/compressor.py:215-227):

self.fused_wkv_wgate = MergedColumnParallelLinear(
    self.hidden_size,
    [self.coff * self.head_dim, self.coff * self.head_dim],
    bias=False,
    return_bias=False,
    quant_config=None,           # ← hardcoded
    disable_tp=True,
    prefix=f"{prefix}.fused_wkv_wgate",
)

Loading the artifact then fails:

File "vllm/models/deepseek_v4/nvidia/model.py", line 1418, in load_weights
    param = params_dict[name]
KeyError: 'layers.10.attn.mla_attn.compressor.fused_wkv_wgate.weight_scale'

The same hardcoded quant_config=None appears in nvidia/ops/attention.py for the indexer's weights_proj and wq_b modules, and the same KeyError is raised for each.
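
The failure mode can be reproduced without vLLM. The sketch below assumes the checkpoint keys have already been remapped to model parameter names, and uses a plain dict in place of the real params_dict; the helper scales_without_slot is hypothetical and only illustrates which keys would trigger the KeyError.

```python
def scales_without_slot(checkpoint_keys, params_dict):
    """List checkpoint scale keys for which the model registered no parameter."""
    return sorted(
        key for key in checkpoint_keys
        if key.endswith(".weight_scale") and key not in params_dict
    )

prefix = "layers.10.attn.mla_attn.compressor.fused_wkv_wgate"
# A linear layer built with quant_config=None registers a weight but no scale.
params_dict = {f"{prefix}.weight": object()}
checkpoint_keys = [f"{prefix}.weight", f"{prefix}.weight_scale"]

missing = scales_without_slot(checkpoint_keys, params_dict)
print(missing)
```

Running a check like this over an artifact's key list before loading would surface every affected module at once, rather than one KeyError at a time.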

Why this matters

Any calibration recipe that follows the predecessor's published guidance (FP8_BLOCK on the entire attention path, including the compressor and indexer) produces an artifact that the current model class cannot load: the artifact's .weight_scale keys have no matching entry in params_dict.

We work around this in our recipe by dequantizing those module weights to BF16 in the artifact at preprocess time (scripts/dequant_compressor.py, 166 weights, ~1.5 min wall-clock). Details are in RECIPE_RTX6000PRO.md §3.4. This is not a good long-term position, though: the load path should consume the artifact's quantization_config.config_groups natively.
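
For readers unfamiliar with block-wise dequantization, here is a rough pure-Python sketch of the idea. It is not the contents of scripts/dequant_compressor.py, which is not shown in this issue; it assumes each dequantized value is the stored value multiplied by the scale of the block containing it, and it uses a tiny 2×2 block on nested lists so the result can be checked by hand.

```python
def dequantize_blockwise(q, scales, block=(2, 2)):
    """Multiply each element of q by the scale of the block that contains it."""
    block_rows, block_cols = block
    return [
        [value * scales[i // block_rows][j // block_cols]
         for j, value in enumerate(row)]
        for i, row in enumerate(q)
    ]

# 2x4 "quantized" weight covered by one row of two 2x2 blocks.
q = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
scales = [[0.5, 2.0]]

w = dequantize_blockwise(q, scales)
print(w)
```

A real preprocessing script would do the same on tensors with a 128×128 block, cast the result to BF16, and drop the weight_scale entries from the artifact.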

Proposed fixes (two options)

Option A — minimal: pass through vllm_config.quant_config

# compressor.py
self.fused_wkv_wgate = MergedColumnParallelLinear(
    ...,
    quant_config=getattr(vllm_config, "quant_config", None),
    ...
)
# nvidia/ops/attention.py:weights_proj construction
self.weights_proj = ReplicatedLinear(
    hidden_size, self.n_head,
    bias=False,
    quant_config=quant_config,   # parameter already in scope
    prefix=f"{prefix}.weights_proj",
)

This lets compressed-tensors' scheme resolver see the module's prefix, match it against the artifact's targets=re:.*attn\.compressor\.(wgate|wkv|fused_wkv_wgate|...)$ regex, and allocate the scale slots correctly. The change is backward-compatible: for artifacts that do not quantize these modules, find_matched_target returns None and the module falls back to UnquantizedLinearMethod, the same as today.
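
The matching behaviour described here can be checked in isolation. The sketch below is not compressed-tensors code: linear_method_for is a hypothetical stand-in, the pattern keeps only the alternatives spelled out in the issue (the elided ones are left out), and it assumes the resolver applies the regex from the start of the module prefix.

```python
import re

# Assumption: only the alternatives named in the issue; the elided part is omitted.
TARGET = r"re:.*attn\.compressor\.(wgate|wkv|fused_wkv_wgate)$"

def linear_method_for(prefix, target=TARGET):
    """Return 'quantized' if the prefix matches the regex target, else 'unquantized'."""
    pattern = target[len("re:"):]
    if re.match(pattern, prefix):
        return "quantized"
    return "unquantized"

print(linear_method_for("layers.10.attn.mla_attn.compressor.fused_wkv_wgate"))
print(linear_method_for("layers.10.mlp.gate_proj"))
```

Note that the vLLM-side prefix contains mla_attn.compressor, which still satisfies the attn\.compressor part of the pattern because the leading .* absorbs everything before it.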

Option B — deeper: dispatch per-module based on artifact's config_groups

The model class inspects config.quantization_config.config_groups at init time and decides, per module, whether to instantiate it with a quant_config. This is the cleaner design if mixed-precision recipes combining NVFP4, FP8, and W4A16 proliferate (they will).
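
One way Option B could look, as a sketch only: the group names, the second target pattern, and the helper group_for_module are all invented for illustration, and the config_groups layout (a mapping from group name to a dict with a "targets" list) is an assumption about the artifact's config rather than a confirmed schema.

```python
import re

def group_for_module(prefix, config_groups):
    """Return the name of the first group whose targets match prefix, else None."""
    for group_name, group in config_groups.items():
        for target in group.get("targets", []):
            if target.startswith("re:"):
                if re.match(target[len("re:"):], prefix):
                    return group_name
            elif target == prefix:
                return group_name
    return None

# Hypothetical mixed-precision recipe with two groups.
config_groups = {
    "attn_fp8": {"targets": [r"re:.*attn\.compressor\.fused_wkv_wgate$"]},
    "experts_w4a16": {"targets": [r"re:.*experts\..*_proj$"]},
}

for name in ("layers.10.attn.mla_attn.compressor.fused_wkv_wgate",
             "layers.3.mlp.experts.7.down_proj",
             "layers.3.input_layernorm"):
    print(name, "->", group_for_module(name, config_groups))
```

A module whose prefix maps to None would be built with quant_config=None, and any other module would receive the quant_config for its group.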

Either option closes the gap. Option A is ~6 lines of changes; happy to file a PR.

Related

  • #43290 — added a weight_scale_inv → weight_scale fallback in this same nvidia/ops/attention.py file. This issue is the calibration-recipe-side companion: the load path also needs to be aware that compressor/indexer can be FP8.
  • #31085 — SM 12.0 NVFP4 backend selector. Different code path but similar shape ("kernels exist but selector hardcoded for SM100").
  • vllm-project/compressed-tensors#712 — ignore= honored at calibration but NOT at save. The artifact-side mirror of this issue.
