vllm - 💡(How to fix) Fix [Bug]: DeepseekV4Attention crashes with KeyError: scale_fmt on non-canonical DSv4 quantizations [2 comments, 2 participants]

vllm · 2026-05-04 04:35:55

vllm-project/vllm#41604•Fetched 2026-05-05 05:44:46

Author

MidasMining

Participants

Dnoob

MidasMining

Timeline (top): commented ×2, mentioned ×1, subscribed ×1

vllm/model_executor/models/deepseek_v4.py:953 does a bare-key lookup:

self.scale_fmt = config.quantization_config["scale_fmt"]

This raises KeyError: 'scale_fmt' on any DSv4 quantization whose quantization_config doesn't include that field — i.e. anything not produced by DeepSeek's own quant pipeline (which uniquely emits scale_fmt). All worker processes die during model init; the engine never finishes booting.

scale_fmt already has a canonical default elsewhere in the codebase (vllm/model_executor/layers/deepseek_v4_attention.py:1006):

self.scale_fmt = "ue8m0"

The fix is one line: use the same default at the missing-field site.
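
The difference between the two lookups can be reproduced with a plain dictionary. This is a minimal sketch rather than vLLM code: `quant_cfg` is a stand-in for `config.quantization_config`, populated with fields from the AutoRound config quoted in this issue.

```python
# Stand-in for config.quantization_config from a non-DeepSeek quant
# (fields taken from the AutoRound config quoted in this issue).
quant_cfg = {"quant_method": "auto-round", "bits": 4, "group_size": 128}

# Bare-key lookup: raises KeyError when the field is absent.
try:
    scale_fmt = quant_cfg["scale_fmt"]
except KeyError as exc:
    bare_lookup_error = repr(exc)

# dict.get with a default: falls back to "ue8m0" when the field is absent.
scale_fmt = quant_cfg.get("scale_fmt", "ue8m0")
```

The second form never raises for a missing key, which is why the one-line change is enough to get past this particular failure.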

Error Message

File "/.../vllm/model_executor/models/deepseek_v4.py", line 953, in __init__
    self.scale_fmt = config.quantization_config["scale_fmt"]
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'scale_fmt'

Fix Action

Fix / Workaround

The fix is a one-line patch: replace the bare-key lookup with config.quantization_config.get("scale_fmt", "ue8m0"), which reuses the default already hard-coded at deepseek_v4_attention.py:1006.

Until the patch lands, the only workaround is editing the model's quantization_config.json to inject a fake scale_fmt.

v0.20.1 was a DSv4 stabilization release but did not touch this line, so the fix would naturally fold into the next round of DSv4 work.

[Bug]: `DeepseekV4Attention.__init__` crashes with `KeyError: 'scale_fmt'` on quantizations that don't carry the field

Reproduction

Any non-canonical DSv4 W4A16 quantization triggers the crash. For example, Intel/DeepSeek-V4-Flash-W4A16-AutoRound (AutoRound 0.13.0):

vllm serve Intel/DeepSeek-V4-Flash-W4A16-AutoRound \
  --tensor-parallel-size 8 --enable-expert-parallel \
  --gpu-memory-utilization 0.88 --cpu-offload-gb 6 \
  --max-num-seqs 2 --max-model-len 16384 \
  --quantization gptq_marlin --kv-cache-dtype fp8 --trust-remote-code

Expected: engine boots (or fails on a different, downstream issue).

Actual: each worker dies during DeepseekV4DecoderLayer.__init__ with:

File "/.../vllm/model_executor/models/deepseek_v4.py", line 953, in __init__
    self.scale_fmt = config.quantization_config["scale_fmt"]
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'scale_fmt'

The crash happens before any GPU memory allocation, so it's deterministic regardless of TP / EP / kv-cache-dtype settings.

The Intel AutoRound quant config (truncated):

{
  "quant_method": "auto-round",
  "packing_format": "auto_round:auto_gptq",
  "bits": 4,
  "group_size": 128,
  "extra_config": { "head": { "bits": 16, "data_type": "float" } }
}

No scale_fmt field — and there's no reason for a non-DeepSeek quant pipeline to emit one. (AutoRound, GPTQ, AWQ, compressed-tensors all omit it.)
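
A checkpoint can be checked for the field before launching the server. The sketch below is an illustration, not part of vLLM: `has_scale_fmt` is a hypothetical helper, and it assumes the quant settings are either nested under a `quantization_config` key or make up the whole JSON document.

```python
import json


def has_scale_fmt(config_text: str) -> bool:
    """Return True if a quantization config (as JSON text) defines scale_fmt."""
    cfg = json.loads(config_text)
    # Assumption: the quant settings are either nested under
    # "quantization_config" or are the whole document.
    quant_cfg = cfg.get("quantization_config", cfg)
    return "scale_fmt" in quant_cfg


# The truncated AutoRound config quoted in this issue.
autoround_cfg_text = """
{
  "quant_method": "auto-round",
  "packing_format": "auto_round:auto_gptq",
  "bits": 4,
  "group_size": 128,
  "extra_config": { "head": { "bits": 16, "data_type": "float" } }
}
"""

# True here: the field is absent, so the unpatched lookup would fail.
needs_workaround = not has_scale_fmt(autoround_cfg_text)
```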

Suggested fix

-        self.scale_fmt = config.quantization_config["scale_fmt"]
+        self.scale_fmt = config.quantization_config.get("scale_fmt", "ue8m0")

Rationale:

Matches the canonical default already hard-coded at deepseek_v4_attention.py:1006.
Preserves behavior for canonical DeepSeek quants (which always set scale_fmt).
Unblocks all non-DeepSeek DSv4 repacks (AutoRound, the upcoming community AWQ/GPTQ variants, anything Intel/QuantTrio/cyankiwi may publish).

Why this is non-trivial

It's a one-line patch but not a busywork-tier change:

It blocks an entire class of community quants from loading at all (no workaround other than editing the model's quantization_config.json to inject a fake scale_fmt).
v0.20.1 was a DSv4 stabilization patch — this fix would naturally fold into the next round of DSv4 work.
scale_fmt is a DeepSeek-specific extension, not a standard quant-config field; assuming it's always present is the bug.
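
For anyone who has to use the config-editing workaround in the meantime, it can be scripted against a local copy of the checkpoint. This is a sketch under stated assumptions: `inject_scale_fmt` is a hypothetical helper, the file layout (nested under `quantization_config`, or a standalone quant config) is assumed, and "ue8m0" is simply the default cited in this issue.

```python
import json
from pathlib import Path


def inject_scale_fmt(config_path: Path, value: str = "ue8m0") -> bool:
    """Add scale_fmt to a local quantization config if it is missing.

    Returns True when the file was modified.
    """
    cfg = json.loads(config_path.read_text())
    # Assumption: the quant settings are either nested under
    # "quantization_config" or are the whole file.
    quant_cfg = cfg.get("quantization_config", cfg)
    if "scale_fmt" in quant_cfg:
        return False  # never overwrite a value the quantizer emitted
    quant_cfg["scale_fmt"] = value
    config_path.write_text(json.dumps(cfg, indent=2) + "\n")
    return True
```

The helper is idempotent: a second run on the same file leaves it unchanged and returns False.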

Environment

vLLM: 0.20.1.dev0+g88d34c640.d20260428 (also reproduces on 0.20.1 GA tag)
PyTorch: 2.11.0+cu130
Driver/CUDA: 580.76.05 / 13.0
Hardware: 8× RTX A4000 (SM86) — but the crash is platform-independent (happens during config parsing, before CUDA touch)
Models that reproduce: Intel/DeepSeek-V4-Flash-W4A16-AutoRound, and presumably any other non-DeepSeek-pipeline DSv4 W4A16 / W8A16 quant.
Models that do NOT reproduce: any DeepSeek-published quant, since their pipeline always emits scale_fmt.

Related

Issue #41565 (workspace regression): an unrelated bug that also surfaced during the v0.20 stabilization push, filed earlier this week.
v0.20.1 release: included DSv4 stabilization patches but didn't touch this line.

— MidasMining, 8× RTX A4000 SM86 / vLLM 0.20.1

extent analysis

TL;DR

The most likely fix is to use the get method to provide a default value for scale_fmt in the DeepseekV4Attention.__init__ method.

Guidance

Use the get method to provide a default value for scale_fmt, as suggested in the issue: self.scale_fmt = config.quantization_config.get("scale_fmt", "ue8m0").
Verify that the fix works by running the reproduction command with the modified code.
Test the fix with different models, including those that do and do not include the scale_fmt field in their quantization config.
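No separate check is needed to protect user-provided values: dict.get returns the default only when the scale_fmt key is absent, so a value present in the config is never overridden. (A key that is present but set to null would be returned as None rather than replaced by the default.)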
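
The two cases worth testing can be written as small pytest-style functions. This is a sketch only: `resolve_scale_fmt` is a hypothetical stand-in for the patched line, and "placeholder-format" is an arbitrary value used to show that an explicit setting wins, not a real scale format.

```python
DEFAULT_SCALE_FMT = "ue8m0"  # default cited in this issue


def resolve_scale_fmt(quant_cfg: dict) -> str:
    """Hypothetical stand-in for the patched lookup."""
    return quant_cfg.get("scale_fmt", DEFAULT_SCALE_FMT)


def test_missing_field_uses_default():
    # Config without scale_fmt, as emitted by a non-DeepSeek pipeline.
    cfg = {"quant_method": "auto-round", "bits": 4, "group_size": 128}
    assert resolve_scale_fmt(cfg) == "ue8m0"


def test_present_field_is_preserved():
    # Any value supplied by the config must be returned unchanged.
    cfg = {"scale_fmt": "placeholder-format"}
    assert resolve_scale_fmt(cfg) == "placeholder-format"
```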

Example

self.scale_fmt = config.quantization_config.get("scale_fmt", "ue8m0")

Notes

This fix assumes that the default value "ue8m0" is correct for all cases where the scale_fmt field is missing. If this is not the case, additional logic may be needed to determine the correct default value.
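
One way to keep that assumption visible is to log whenever the fallback is taken. The following is a sketch of that idea, not what the issue proposes: the helper name and the warning text are hypothetical, and it also treats an explicit null as missing, which plain dict.get would not.

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_SCALE_FMT = "ue8m0"  # default cited in this issue


def resolve_scale_fmt_with_warning(quant_cfg: dict) -> str:
    """Return scale_fmt, warning when the default has to be assumed."""
    scale_fmt = quant_cfg.get("scale_fmt")
    if scale_fmt is None:
        # Covers both a missing key and an explicit null value.
        logger.warning(
            "quantization_config has no 'scale_fmt'; assuming %r",
            DEFAULT_SCALE_FMT,
        )
        return DEFAULT_SCALE_FMT
    return scale_fmt
```

A warning makes it easier to trace a later numerical problem back to the assumed format if the default turns out to be wrong for some quantization.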

Recommendation

Apply the suggested fix, using the get method to provide a default value for scale_fmt. The patch is one line but not trivial in impact: the bug blocks an entire class of community quants from loading at all, and its cause is the assumption that scale_fmt is always present.
