vllm: How to fix [Bug]: DeepseekV4Attention crashes with KeyError: 'scale_fmt' on non-canonical DSv4 quantizations [2 comments, 2 participants]

vllm-project/vllm#41604 (fetched 2026-05-05 05:44:46)

Error Message

File "/.../vllm/model_executor/models/deepseek_v4.py", line 953, in __init__
    self.scale_fmt = config.quantization_config["scale_fmt"]
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'scale_fmt'

[Bug]: DeepseekV4Attention.__init__ crashes with KeyError: 'scale_fmt' on quantizations that don't carry the field

Summary

vllm/model_executor/models/deepseek_v4.py:953 does a bare-key lookup:

self.scale_fmt = config.quantization_config["scale_fmt"]

This raises KeyError: 'scale_fmt' on any DSv4 quantization whose quantization_config doesn't include that field — i.e. anything not produced by DeepSeek's own quant pipeline (which uniquely emits scale_fmt). All worker processes die during model init; the engine never finishes booting.

scale_fmt already has a canonical default elsewhere in the codebase, in a different file (vllm/model_executor/layers/deepseek_v4_attention.py:1006):

self.scale_fmt = "ue8m0"

The fix is one line: use the same default at the missing-field site.
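
The difference between the two lookups can be seen without vLLM installed. The sketch below is only an illustration: the SimpleNamespace object is a stand-in for the real model config (an assumption, not vLLM's class), and the dictionary copies a few fields from the AutoRound config quoted further down.

```python
from types import SimpleNamespace

# Stand-in for the model config object (assumption: only the
# quantization_config attribute matters for this lookup).
config = SimpleNamespace(
    quantization_config={
        "quant_method": "auto-round",
        "bits": 4,
        "group_size": 128,
    }
)

# Current behaviour: subscript access raises when the key is absent.
missing_key = None
try:
    scale_fmt = config.quantization_config["scale_fmt"]
except KeyError as exc:
    missing_key = exc.args[0]

# Proposed behaviour: fall back to the default when the key is absent.
scale_fmt = config.quantization_config.get("scale_fmt", "ue8m0")
```

After this runs, missing_key holds the name of the absent field and scale_fmt holds the fallback value.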

Reproduction

Any non-canonical DSv4 W4A16 quantization triggers the crash. For example, Intel/DeepSeek-V4-Flash-W4A16-AutoRound (AutoRound 0.13.0):

vllm serve Intel/DeepSeek-V4-Flash-W4A16-AutoRound \
  --tensor-parallel-size 8 --enable-expert-parallel \
  --gpu-memory-utilization 0.88 --cpu-offload-gb 6 \
  --max-num-seqs 2 --max-model-len 16384 \
  --quantization gptq_marlin --kv-cache-dtype fp8 --trust-remote-code

Expected: engine boots (or fails on a different, downstream issue).

Actual: each worker dies during DeepseekV4DecoderLayer.__init__ with:

File "/.../vllm/model_executor/models/deepseek_v4.py", line 953, in __init__
    self.scale_fmt = config.quantization_config["scale_fmt"]
                     ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'scale_fmt'

The crash happens before any GPU memory allocation, so it's deterministic regardless of TP / EP / kv-cache-dtype settings.

The Intel AutoRound quant config (truncated):

{
  "quant_method": "auto-round",
  "packing_format": "auto_round:auto_gptq",
  "bits": 4,
  "group_size": 128,
  "extra_config": { "head": { "bits": 16, "data_type": "float" } }
}

No scale_fmt field — and there's no reason for a non-DeepSeek quant pipeline to emit one. (AutoRound, GPTQ, AWQ, compressed-tensors all omit it.)

Suggested fix

-        self.scale_fmt = config.quantization_config["scale_fmt"]
+        self.scale_fmt = config.quantization_config.get("scale_fmt", "ue8m0")

Rationale:

  • Matches the canonical default already hard-coded at deepseek_v4_attention.py:1006.
  • Preserves behavior for canonical DeepSeek quants (which always set scale_fmt).
  • Unblocks all non-DeepSeek DSv4 repacks (AutoRound, the upcoming community AWQ/GPTQ variants, anything Intel/QuantTrio/cyankiwi may publish).

Why this is non-trivial

It's a one-line patch but not a busywork-tier change:

  • It blocks an entire class of community quants from loading at all (no workaround other than editing the model's quantization_config.json to inject a fake scale_fmt).
  • v0.20.1 was a DSv4 stabilization patch — this fix would naturally fold into the next round of DSv4 work.
  • scale_fmt is a DeepSeek-specific extension, not a standard quant-config field; assuming it's always present is the bug.
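
Until a patched build is available, the config-editing workaround mentioned in the first bullet could be scripted roughly as follows. This is a sketch under assumptions: inject_scale_fmt is a hypothetical helper, the file layout (settings at the top level or nested under a "quantization_config" key) is assumed rather than confirmed, and "ue8m0" is simply the default cited in this issue, not a value verified as numerically correct for every quant.

```python
import json
from pathlib import Path


def inject_scale_fmt(path, value="ue8m0"):
    """Add scale_fmt to a quantization config file if it is missing.

    Returns True if the file was rewritten, False if the key already existed.
    """
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))

    # Assumption: the settings live either at the top level of the file
    # or under a nested "quantization_config" object.
    nested = data.get("quantization_config")
    target = nested if isinstance(nested, dict) else data

    if "scale_fmt" in target:
        return False  # never overwrite a value the publisher set

    target["scale_fmt"] = value
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return True
```

Run it against a local copy of the checkpoint and keep a backup of the original file, since the edit changes metadata the publisher shipped.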

Environment

  • vLLM: 0.20.1.dev0+g88d34c640.d20260428 (also reproduces on 0.20.1 GA tag)
  • PyTorch: 2.11.0+cu130
  • Driver/CUDA: 580.76.05 / 13.0
  • Hardware: 8× RTX A4000 (SM86) — but the crash is platform-independent (happens during config parsing, before CUDA touch)
  • Models that reproduce: Intel/DeepSeek-V4-Flash-W4A16-AutoRound, and presumably any other non-DeepSeek-pipeline DSv4 W4A16 / W8A16 quant.
  • Models that do NOT reproduce: any DeepSeek-published quant, since their pipeline always emits scale_fmt.

Related

  • Issue #41565 (workspace regression) — unrelated bug also surfaced during the v0.20 stabilization push, filed earlier this week.
  • v0.20.1 release: included DSv4 stabilization patches but didn't touch this line.

— MidasMining, 8× RTX A4000 SM86 / vLLM 0.20.1

extent analysis

TL;DR

The most likely fix is to use the get method to provide a default value for scale_fmt in the DeepseekV4Attention.__init__ method.

Guidance

  • Use the get method to provide a default value for scale_fmt, as suggested in the issue: self.scale_fmt = config.quantization_config.get("scale_fmt", "ue8m0").
  • Verify that the fix works by running the reproduction command with the modified code.
  • Test the fix with different models, including those that do and do not include the scale_fmt field in their quantization config.
  • No extra check is needed to protect user-provided values: dict.get returns the default only when the scale_fmt key is absent, so a value present in the config is never overridden.
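
The lookup semantics behind the last two bullets can be checked without loading a model. The following is a minimal sketch: resolve_scale_fmt is a hypothetical helper that mirrors the suggested one-liner, and the "fp8" and "custom" values are placeholders chosen for illustration, not real config values.

```python
DEFAULT_SCALE_FMT = "ue8m0"  # default cited in the issue


def resolve_scale_fmt(quantization_config):
    """Hypothetical helper mirroring the suggested one-line fix."""
    return quantization_config.get("scale_fmt", DEFAULT_SCALE_FMT)


# (input config, expected result) pairs.
cases = [
    # Key absent: the default is used.
    ({"quant_method": "auto-round", "bits": 4}, "ue8m0"),
    # Key present: the stored value is kept, never overridden.
    ({"quant_method": "fp8", "scale_fmt": "custom"}, "custom"),
    # Key present but null: dict.get returns None, not the default.
    ({"scale_fmt": None}, None),
]

results = [resolve_scale_fmt(cfg) for cfg, _ in cases]
```

The third case is the one edge worth knowing about: a config that carries an explicit null for scale_fmt would still bypass the default.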

Example

self.scale_fmt = config.quantization_config.get("scale_fmt", "ue8m0")

Notes

This fix assumes that the default value "ue8m0" is correct for all cases where the scale_fmt field is missing. If this is not the case, additional logic may be needed to determine the correct default value.

Recommendation

Apply the suggested fix using the get method to provide a default value for scale_fmt. The change is small but matters: the bug blocks an entire class of community quants from loading at all, and its root cause is the assumption that the scale_fmt field is always present.
