vllm - 💡(How to fix) Fix [Bug] FP8 block-quant loader rejects artifacts using 'weight_scale' rather than 'weight_scale_inv' naming [2 pull requests]

[Bug] FP8 block-quant loader rejects artifacts whose safetensors save weight_scale rather than weight_scale_inv

Summary

A class of compressed-tensors-quantized artifacts saves FP8 block-quant scale tensors under the attribute name weight_scale (no _inv suffix), with mathematically identical content to the weight_scale_inv form vLLM's FP8 block-quant loader expects. The loader crashes with:

AttributeError: 'MergedColumnParallelLinear' object has no attribute
'weight_scale_inv'. Did you mean: 'weight_scale'?

The crash sites (vllm/model_executor/kernels/linear/scaled_mm/marlin.py:73 and marlin_utils_fp8.py:106) access layer.weight_scale_inv directly. The DeepseekV4 weight renaming mapper (vllm/models/deepseek_v4/nvidia/model.py:1511) only renames .scale to .weight_scale_inv; it does not handle the case where the artifact already uses the longer name .weight_scale.

A defensive getattr(layer, "weight_scale_inv", getattr(layer, "weight_scale", None)) fallback would accept both naming conventions transparently.
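
One detail worth checking when writing such a fallback: a spelling that passes layer.weight_scale as the default argument of getattr is not equivalent to a nested lookup, because Python evaluates the default before the call. The sketch below uses stand-in objects (SimpleNamespace, not real vLLM layers) and hypothetical helper names to show the difference.

```python
from types import SimpleNamespace

# Stand-ins for layers; real vLLM layers are torch.nn.Module instances.
canonical = SimpleNamespace(weight_scale_inv=[0.5])  # only the canonical name
alternate = SimpleNamespace(weight_scale=[0.5])      # only the llmcompressor name


def eager_lookup(layer):
    # The default expression `layer.weight_scale` is evaluated before getattr
    # runs, so this raises if the layer lacks `weight_scale`.
    return getattr(layer, "weight_scale_inv", layer.weight_scale)


def nested_lookup(layer):
    # The inner getattr falls back to None instead of raising.
    return getattr(layer, "weight_scale_inv", getattr(layer, "weight_scale", None))


def raises_attribute_error(fn, layer):
    """Report whether calling fn(layer) raises AttributeError."""
    try:
        fn(layer)
    except AttributeError:
        return True
    return False
```

With the eager form, a layer that already uses the canonical name would start failing, so the nested lookup is the safer of the two.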

Reproducer

The reproducer artifact is canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (DeepSeek-V4-Flash, W4A16 routed experts + FP8 block 128×128 attention + MTP draft head). Its safetensors were built by llmcompressor's model_free_ptq path, which produces keys named <module>.weight_scale instead of <module>.weight_scale_inv.

from safetensors import safe_open
with safe_open(".../layers.0.attn.wkv.weight", framework="pt") as f:
    t = f.get_tensor("layers.0.attn.wkv.weight")
    print(t.dtype, t.shape)         # torch.float8_e4m3fn, (512, 4096)
    s = f.get_tensor("layers.0.attn.wkv.weight_scale")  # NOT weight_scale_inv
    print(s.dtype, s.shape)         # torch.bfloat16, (4, 32) — FP8 block 128×128 scales

The mathematical content is identical to the canonical form — the block-scaled FP8 weight reconstruction is weight * scale.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1). Only the attribute name differs from the loader's expectation.
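
To make the shape arithmetic concrete, here is a minimal pure-Python sketch of the same block-scaled reconstruction on a toy 4×4 weight with 2×2 blocks. The function names are illustrative, not vLLM or llmcompressor APIs, and the ceiling division for edge blocks is an assumption about shapes that are not multiples of the block size.

```python
def scale_grid_shape(rows, cols, block=128):
    """Shape of the per-block scale tensor for a (rows, cols) weight."""
    # Ceiling division (assumption): partial edge blocks still get a scale.
    return (-(-rows // block), -(-cols // block))


def expand_scales(scales, rows, cols, block):
    """Broadcast each block scale over its block; a pure-Python analogue of
    two repeat_interleave calls, trimmed to (rows, cols)."""
    return [[scales[r // block][c // block] for c in range(cols)]
            for r in range(rows)]


def dequantize(weight, scales, block):
    """Elementwise product of the weight and the expanded scales."""
    rows, cols = len(weight), len(weight[0])
    full = expand_scales(scales, rows, cols, block)
    return [[weight[r][c] * full[r][c] for c in range(cols)]
            for r in range(rows)]


# Toy 4x4 weight with 2x2 blocks (the artifact in this issue uses 128x128).
weight = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
scales = [[0.5, 2.0],
          [1.0, 4.0]]
restored = dequantize(weight, scales, block=2)
```

For the tensor in the reproducer, scale_grid_shape(512, 4096) gives the (4, 32) scale shape printed above.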

Why this matters

llmcompressor's newer model_free_ptq path (which bypasses the PreTrainedModel integration step and writes safetensors directly) emits this naming. Any downstream artifact built with that path hits the loader crash, even though the math is correct and the artifact was previously usable on older vLLM builds that were tolerant of the naming.

This is a class of bug, not a one-off. We've confirmed:

  • canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (post-dequant shipping fix)
  • The original (pre-dequant) FP8 attention in the same artifact also exhibits this — all 33,239 FP8/W4A16 scale tensors use weight_scale (no _inv)
  • canada-quant/DeepSeek-V4-Flash-W4A16-FP8 (Card A) has the same llmcompressor-produced naming
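
A quick way to check which convention an artifact follows is to count tensor names by suffix. The sketch below works on any iterable of key names; the helper name and the sample keys are illustrative (the layers.1 entries are made up), and a real run would feed it the keys of each safetensors shard.

```python
from collections import Counter


def classify_scale_keys(keys):
    """Count tensor names by scale-suffix convention."""
    counts = Counter()
    for key in keys:
        if key.endswith(".weight_scale_inv"):
            counts["weight_scale_inv"] += 1
        elif key.endswith(".weight_scale"):
            counts["weight_scale"] += 1
    return counts


# Illustrative names only; with safetensors installed, the keys of a shard
# can be read via `safe_open(path, framework="pt")` and its `.keys()` method.
sample_keys = [
    "layers.0.attn.wkv.weight",
    "layers.0.attn.wkv.weight_scale",
    "layers.1.attn.wkv.weight",
    "layers.1.attn.wkv.weight_scale",
]
counts = classify_scale_keys(sample_keys)
```

An artifact affected by this issue would show a nonzero weight_scale count and a zero weight_scale_inv count.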

Proposed fix

Defensive patch:

--- a/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
+++ b/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
@@ -70,11 +70,13 @@ class CompressedTensorsW8A8Fp8MarlinScaledMMLinearKernel(...):
     def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         if self.block_quant:
+            # Accept either `weight_scale_inv` (canonical) or `weight_scale`
+            # (llmcompressor model_free_ptq naming). Math is identical.
+            scale = getattr(layer, "weight_scale_inv",
+                            getattr(layer, "weight_scale", None))
             weight, weight_scale_inv = process_fp8_weight_block_strategy(
-                layer.weight, layer.weight_scale_inv
+                layer.weight, scale
             )
             # Update layer with new values
             replace_parameter(layer, "weight", weight.data)
-            replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            # Always register the result under `weight_scale_inv` so downstream
+            # forward-path code (Dynamo, BlockScaledMMLinearKernel) finds it.
+            if hasattr(layer, "weight_scale_inv"):
+                replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            else:
+                layer.register_parameter(
+                    "weight_scale_inv",
+                    torch.nn.Parameter(weight_scale_inv.data, requires_grad=False),
+                )

The same pattern likely applies to marlin_utils_fp8.py:106 (prepare_fp8_layer_for_marlin) and possibly to the DeepseekV4 renaming mapper in vllm/models/deepseek_v4/nvidia/model.py:1511 — we can extend the proposal to those sites in the same PR.
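
If the fallback ends up being needed at several sites, one option is a small shared helper rather than repeating the nested getattr. The following is only a sketch under that assumption: resolve_block_scale and SCALE_NAMES are hypothetical names, not existing vLLM symbols, and the stand-in layers are SimpleNamespace objects. Unlike the patch above, it raises a clear error when neither attribute is present instead of passing None downstream.

```python
from types import SimpleNamespace

SCALE_NAMES = ("weight_scale_inv", "weight_scale")  # canonical name first


def resolve_block_scale(layer):
    """Return (attribute_name, scale) for whichever name the layer carries."""
    for name in SCALE_NAMES:
        scale = getattr(layer, name, None)
        if scale is not None:
            return name, scale
    raise AttributeError(
        f"{type(layer).__name__} has neither of {SCALE_NAMES}"
    )


# Stand-in layer that only carries the llmcompressor-style name.
name, scale = resolve_block_scale(SimpleNamespace(weight_scale=[0.25]))
```

Returning the attribute name alongside the value lets the caller decide whether to replace an existing parameter or register a new one under weight_scale_inv.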

Open question for kylesayrs

Is weight_scale vs weight_scale_inv a deliberate semantic distinction (e.g., weight_scale_inv is the multiplicative inverse used during dequant fastpath, vs weight_scale for divide-by-scale dequant), or are they fully interchangeable in the FP8 block-quant path? The process_fp8_weight_block_strategy function appears to treat them identically, but if there's a subtle difference, the right fix is the source quantization step naming the attribute consistently, not the loader fallback.
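
To illustrate why the distinction would matter if it were semantic, the sketch below contrasts the two hypothetical conventions on scalars. It makes no claim about which one vLLM or compressed-tensors actually uses; infer_convention is an illustrative helper that needs a known reference value for the dequantized weight.

```python
def dequant_multiply(q, s):
    """Convention A: the stored tensor is multiplied into the weight."""
    return q * s


def dequant_divide(q, s):
    """Convention B: the weight is divided by the stored tensor."""
    return q / s


def infer_convention(q, s, reference, tol=1e-6):
    """Guess which convention a stored scale follows, given one known
    reference value for the dequantized weight."""
    if abs(dequant_multiply(q, s) - reference) <= tol:
        return "multiply"
    if abs(dequant_divide(q, s) - reference) <= tol:
        return "divide"
    return "unknown"


q, s = 8.0, 0.25
```

With the same stored number, the two conventions give results that differ by a factor of s squared, so a loader-side rename is only safe if both names carry the multiply-style value described in the reproducer.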

We're happy to extend this PR (or split into a child PR for the model-mapper) once that's clarified.

Cross-references

cc @kylesayrs (compressed-tensors maintainer)
