Some compressed-tensors-quantized artifacts save their FP8 block-quant scale tensors under the attribute name weight_scale (no _inv suffix). The values are mathematically identical to the weight_scale_inv form that vLLM's FP8 block-quant loader expects, but the loader looks the tensor up by name and crashes with:

AttributeError: 'MergedColumnParallelLinear' object has no attribute
'weight_scale_inv'. Did you mean: 'weight_scale'?

The crash site (vllm/model_executor/kernels/linear/scaled_mm/marlin.py:73 and marlin_utils_fp8.py:106) accesses layer.weight_scale_inv directly. The DeepseekV4 weight renaming mapper (vllm/models/deepseek_v4/nvidia/model.py:1511) only renames .scale → .weight_scale_inv; it does not handle the case where the artifact already uses the longer name .weight_scale.

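A defensive fallback that reads weight_scale_inv when present and weight_scale otherwise would accept both naming conventions transparently, for example getattr(layer, "weight_scale_inv", None) followed by a lookup of weight_scale when that returns None. Note that the shorter form getattr(layer, "weight_scale_inv", layer.weight_scale) evaluates its default eagerly, so it would itself raise AttributeError on a layer that only carries weight_scale_inv.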

Root Cause

llmcompressor's newer model_free_ptq path (which bypasses the PreTrainedModel integration step and writes safetensors directly) emits this naming. Any downstream artifact built with that path hits the loader crash, even though the math is correct and the artifact was previously usable on older vLLM builds that were tolerant of the naming.

[Bug] FP8 block-quant loader rejects artifacts whose safetensors save `weight_scale` rather than `weight_scale_inv`

Summary

AttributeError: 'MergedColumnParallelLinear' object has no attribute
'weight_scale_inv'. Did you mean: 'weight_scale'?

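A defensive fallback such as getattr(layer, "weight_scale_inv", getattr(layer, "weight_scale", None)) would accept both naming conventions transparently. The nested form matters: passing layer.weight_scale directly as the default is evaluated eagerly and fails on layers that only have weight_scale_inv.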
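A minimal sketch of such a lookup follows. It uses a hypothetical helper name, resolve_block_scale, and stand-in layer objects instead of vLLM's real classes. It also shows why the default passed to getattr should be a sentinel instead of an attribute access: Python evaluates the default argument before calling getattr.

```python
from types import SimpleNamespace

# Sentinel so that a legitimately stored None is not mistaken for "missing".
_MISSING = object()


def resolve_block_scale(layer):
    """Return (attribute_name, value) for the FP8 block scale on `layer`.

    Hypothetical helper, not part of vLLM. The canonical name is preferred.
    """
    for name in ("weight_scale_inv", "weight_scale"):
        value = getattr(layer, name, _MISSING)
        if value is not _MISSING:
            return name, value
    raise AttributeError(
        f"{type(layer).__name__} has neither 'weight_scale_inv' nor 'weight_scale'"
    )


# Stand-ins for a canonical layer and an llmcompressor-named layer.
canonical = SimpleNamespace(weight_scale_inv=[0.5])
renamed = SimpleNamespace(weight_scale=[0.5])

print(resolve_block_scale(canonical))
print(resolve_block_scale(renamed))

# The eager-default form fails on the canonical layer, because
# `canonical.weight_scale` is evaluated before getattr is even called.
try:
    getattr(canonical, "weight_scale_inv", canonical.weight_scale)
except AttributeError as exc:
    print("eager default raised:", exc)
```

Raising a clear error when neither name exists keeps a truly malformed artifact from failing later with a less obvious message.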

Reproducer

canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (DeepSeek-V4-Flash, W4A16 routed experts + FP8 block 128×128 attention + MTP draft head). The safetensors were built by llmcompressor's model_free_ptq path, which produces keys named <module>.weight_scale instead of <module>.weight_scale_inv.

from safetensors import safe_open
with safe_open(".../layers.0.attn.wkv.weight", framework="pt") as f:
    t = f.get_tensor("layers.0.attn.wkv.weight")
    print(t.dtype, t.shape)         # torch.float8_e4m3fn, (512, 4096)
    s = f.get_tensor("layers.0.attn.wkv.weight_scale")  # NOT weight_scale_inv
    print(s.dtype, s.shape)         # torch.bfloat16, (4, 32) — FP8 block 128×128 scales

The mathematical content is identical to the canonical form — the block-scaled FP8 weight reconstruction is weight * scale.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1). Only the attribute name differs from the loader's expectation.
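The reconstruction above can be illustrated without any dependencies. The sketch below is a pure-Python stand-in for the two repeat_interleave calls, run on a toy 4×4 weight with 2×2 blocks instead of the artifact's 128×128; the function names are made up for illustration.

```python
def repeat_interleave_2d(scale, block):
    """Expand every entry of a 2-D list into a block x block tile.

    Mirrors scale.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1).
    """
    expanded = []
    for row in scale:
        # Repeat each column entry `block` times ...
        wide = [s for s in row for _ in range(block)]
        # ... then repeat the widened row `block` times.
        expanded.extend(list(wide) for _ in range(block))
    return expanded


def dequantize_block(weight, scale, block):
    """Element-wise weight * expanded scale, checking that shapes line up."""
    rows, cols = len(weight), len(weight[0])
    full = repeat_interleave_2d(scale, block)
    if len(full) != rows or len(full[0]) != cols:
        raise ValueError("scale grid does not tile the weight")
    return [[weight[r][c] * full[r][c] for c in range(cols)] for r in range(rows)]


# Toy values: every weight row is [1, 2, 3, 4]; one scale per 2x2 block.
weight = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
scale = [[0.5, 2.0], [1.0, 4.0]]

for row in dequantize_block(weight, scale, block=2):
    print(row)
```

The same tiling arithmetic explains the shapes in the reproducer: a (512, 4096) weight with 128×128 blocks needs 512 / 128 = 4 by 4096 / 128 = 32 scales, which matches the (4, 32) tensor printed there.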

Why this matters

This is a class of bug, not a one-off. We've confirmed:

canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP (post-dequant shipping fix)
The original (pre-dequant) FP8 attention in the same artifact also exhibits this — all 33,239 FP8/W4A16 scale tensors use weight_scale (no _inv)
canada-quant/DeepSeek-V4-Flash-W4A16-FP8 (Card A) has the same llmcompressor-produced naming
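To check whether another artifact falls into the same class, one option is to count scale-key suffixes. The sketch below assumes the artifact ships a sharded-checkpoint index (a JSON file with a weight_map object, as Hugging Face sharded checkpoints commonly do); the helper names and all example keys other than layers.0.attn.wkv.* are hypothetical.

```python
import json
from collections import Counter

# Suffixes of interest; none of them is a suffix of another, so order is irrelevant.
SCALE_SUFFIXES = (".weight_scale_inv", ".weight_scale", ".scale")


def count_scale_suffixes(keys):
    """Count how many tensor names end in each known scale suffix."""
    counts = Counter()
    for key in keys:
        for suffix in SCALE_SUFFIXES:
            if key.endswith(suffix):
                counts[suffix] += 1
                break
    return counts


def count_from_index(index_path):
    """Same count, reading tensor names from a sharded-checkpoint index file."""
    with open(index_path, "r", encoding="utf-8") as fh:
        weight_map = json.load(fh)["weight_map"]
    return count_scale_suffixes(weight_map.keys())


# Illustrative keys only; a real run would go through count_from_index(...).
example_keys = [
    "layers.0.attn.wkv.weight",
    "layers.0.attn.wkv.weight_scale",
    "layers.0.attn.wq.weight_scale",
    "layers.1.attn.wkv.weight_scale_inv",
]
print(dict(count_scale_suffixes(example_keys)))
```

An artifact where every count lands on .weight_scale and none on .weight_scale_inv would be a candidate for the same loader failure.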

Proposed fix

Small defensive patch (two changed lines plus the fallback lookup and registration):

--- a/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
+++ b/vllm/model_executor/kernels/linear/scaled_mm/marlin.py
@@ -70,8 +70,20 @@ class CompressedTensorsW8A8Fp8MarlinScaledMMLinearKernel(...):
     def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
         if self.block_quant:
+            # Accept either `weight_scale_inv` (canonical) or `weight_scale`
+            # (llmcompressor model_free_ptq naming). Math is identical.
+            scale = getattr(layer, "weight_scale_inv",
+                            getattr(layer, "weight_scale", None))
             weight, weight_scale_inv = process_fp8_weight_block_strategy(
-                layer.weight, layer.weight_scale_inv
+                layer.weight, scale
             )
             # Update layer with new values
             replace_parameter(layer, "weight", weight.data)
-            replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            # Always register the result under `weight_scale_inv` so downstream
+            # forward-path code (Dynamo, BlockScaledMMLinearKernel) finds it.
+            if hasattr(layer, "weight_scale_inv"):
+                replace_parameter(layer, "weight_scale_inv", weight_scale_inv.data)
+            else:
+                layer.register_parameter(
+                    "weight_scale_inv",
+                    torch.nn.Parameter(weight_scale_inv.data, requires_grad=False),
+                )

The same pattern likely applies to marlin_utils_fp8.py:106 (prepare_fp8_layer_for_marlin) and possibly to the DeepseekV4 renaming mapper in vllm/models/deepseek_v4/nvidia/model.py:1511 — we can extend the proposal to those sites in the same PR.
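For the mapper side, a key-level normalization might look like the sketch below. It is not the existing mapper code: normalize_scale_key and the is_fp8_block_module predicate are hypothetical. The predicate is there because the same artifact also names its W4A16 scale tensors weight_scale, so a blanket rename of every weight_scale key may be unsafe.

```python
def normalize_scale_key(key, is_fp8_block_module):
    """Map a checkpoint key to the `weight_scale_inv` name where appropriate.

    `is_fp8_block_module` is a caller-supplied predicate on the module path
    (the key without its last component). Keys outside FP8 block-quant
    modules are returned unchanged.
    """
    module, _, leaf = key.rpartition(".")
    if leaf in ("scale", "weight_scale") and is_fp8_block_module(module):
        return f"{module}.weight_scale_inv"
    return key


# Assumed for illustration: only this module is FP8 block-quantized.
fp8_modules = {"layers.0.attn.wkv"}

for key in [
    "layers.0.attn.wkv.weight_scale",           # renamed
    "layers.0.attn.wkv.scale",                  # renamed (existing mapper behaviour)
    "layers.0.attn.wkv.weight_scale_inv",       # already canonical
    "layers.0.attn.wkv.weight",                 # not a scale tensor
    "layers.0.ffn.experts.0.w1.weight_scale",   # made-up non-FP8 key, untouched
]:
    print(key, "->", normalize_scale_key(key, fp8_modules.__contains__))
```

How the real mapper would decide which modules are FP8 block-quantized (for example from the quantization config) is left open here.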

Open question for kylesayrs

Is weight_scale vs weight_scale_inv a deliberate semantic distinction (e.g., weight_scale_inv is the multiplicative inverse used during dequant fastpath, vs weight_scale for divide-by-scale dequant), or are they fully interchangeable in the FP8 block-quant path? The process_fp8_weight_block_strategy function appears to treat them identically, but if there's a subtle difference, the right fix is the source quantization step naming the attribute consistently, not the loader fallback.
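One way to probe this empirically, assuming a reference checkpoint that stores weight_scale_inv for the same quantized weights is available, is to compare the two scale tensors element by element. The sketch below works on flat lists of positive floats and uses a hypothetical helper name; it makes no claim about which convention vLLM or llmcompressor actually uses.

```python
import math


def classify_scale_relation(a, b, rel_tol=1e-3):
    """Classify how two equally long lists of positive scales relate.

    Returns "identical" if the values match, "reciprocal" if each pair
    multiplies to 1, and "unrelated" otherwise.
    """
    if len(a) != len(b):
        raise ValueError("scale lists must have the same length")
    if all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)):
        return "identical"
    if all(math.isclose(x * y, 1.0, rel_tol=rel_tol) for x, y in zip(a, b)):
        return "reciprocal"
    return "unrelated"


# Toy values only.
print(classify_scale_relation([0.5, 0.25], [0.5, 0.25]))
print(classify_scale_relation([0.5, 0.25], [2.0, 4.0]))
print(classify_scale_relation([0.5, 0.25], [0.1, 0.2]))
```

An "identical" result would support the loader fallback; a "reciprocal" result would mean the names carry a real semantic difference and the fix belongs in the quantization step.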

We're happy to extend this PR (or split into a child PR for the model-mapper) once that's clarified.

Cross-references

canada-quant repo audit: https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp/blob/main/docs/findings/cardd_marlin_patches_built_artifact_blocker_2026_05_25.md
This sits alongside our comment on #40923 and our reopen comment on #36889: same artifact, sibling bugs in the same Marlin path on SM 12.0.

cc @kylesayrs (compressed-tensors maintainer)
