vllm - ✅(Solved) Fix [Bug]: fp8_e5m2 kv-cache gate in _init_kv_cache_quant fires on any quantized checkpoint, not only fp8 checkpoints [1 pull request, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#39137 · Fetched 2026-04-08 03:01:47
Comments: 2
Participants: 2
Timeline: 4 (commented ×2, cross-referenced ×1, referenced ×1)
Reactions: 2

Error Message

(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 725, in <lambda>
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     lambda prefix: Gemma4DecoderLayer(
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]                    ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 462, in __init__
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     self.self_attn = Gemma4Attention(
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]                      ^^^^^^^^^^^^^^^^
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 372, in __init__
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     self.attn = Attention(
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]                 ^^^^^^^^^^
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py", line 381, in __init__
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     _init_kv_cache_quant(self, quant_config, prefix)
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py", line 167, in _init_kv_cache_quant
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871] ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

Fix Action

Fixed

PR fix notes

PR #39195: [Bugfix] Narrow fp8_e5m2 kv-cache gate to only reject actual fp8 checkpoints

Description (problem / solution / changelog)

Summary

Fixes #39137

The fp8_e5m2 kv-cache dtype check in _init_kv_cache_quant incorrectly rejects all quantized checkpoints when combined with --kv-cache-dtype fp8_e5m2, even though the error message says "fp8 checkpoints." The root cause is that should_load_quant_weights(quant_method) returns True for any BaseKVCacheMethod subclass, including CompressedTensorsKVCacheMethod and QuarkKVCacheMethod, which may be used with non-fp8 checkpoints (e.g., INT4, INT8, AWQ, GPTQ).

This PR narrows the gate to only fire when the quant method is specifically Fp8KVCacheMethod or ModelOptFp8KVCacheMethod — the actual fp8-checkpoint KV cache methods where fp8_e5m2 is incompatible due to conflicting scales.

Changes

  • Added _is_fp8_kv_cache_method() helper that checks if the quant method is an fp8-specific KV cache method
  • Updated the fp8_e5m2 guard in _init_kv_cache_quant to use this helper, so non-fp8 quantized checkpoints (INT4, INT8, etc.) can use --kv-cache-dtype fp8_e5m2 without being incorrectly rejected

Before (broken)

vllm serve cyankiwi/gemma-4-31B-it-AWQ-4bit --kv-cache-dtype fp8_e5m2 ...
# ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.
# (but this is an INT4 checkpoint, not fp8!)

After (fixed)

The error only fires for actual fp8 checkpoints (Fp8Config, ModelOpt fp8). Non-fp8 quantized checkpoints proceed to load KV cache scales normally (defaulting to 1.0 when no scales are present in the checkpoint).

Test plan

  • Verified Python syntax compiles without errors
  • The fix is a strict narrowing of an existing gate condition — fp8 checkpoints still get the same error, while non-fp8 checkpoints are no longer incorrectly blocked
  • AI assistance was used (Claude). The change was reviewed line-by-line.
  • No existing PR addresses this issue.

🤖 Generated with Claude Code

Changed files

  • vllm/model_executor/layers/attention/attention.py (modified, +22/-2)


Filing this as a separate issue per suggestion in #39133 so it doesn't get lost in the Gemma 4 KV thread.

In vllm/model_executor/layers/attention/attention.py, _init_kv_cache_quant raises ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.") when the checkpoint being loaded is not in fact an fp8 checkpoint. The gate that leads to the raise is should_load_quant_weights(quant_method), which returns True for any quantization method other than UnquantizedLinearMethod — so an INT4 compressed-tensors checkpoint trips the raise even though there is nothing fp8 about it.

Reproduction

vLLM version on the running container: 0.1.dev1+gd56e95223 (custom build; same code path is present on main).

Model: cyankiwi/gemma-4-31B-it-AWQ-4bit. Its quantization_config from config.json:

"quantization_config": {
  "quant_method": "compressed-tensors",
  "format": "pack-quantized",
  "config_groups": {
    "group_0": {
      "weights": {
        "num_bits": 4,
        "group_size": 32,
        "observer": "mse",
        "strategy": "group",
        "symmetric": true,
        "type": "int"
      }
    }
  }
}

Launched with:

vllm serve /models/cyankiwi/gemma-4-31B-it-AWQ-4bit \
  --tensor-parallel-size 2 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.96 \
  --max-model-len 131072 \
  --reasoning-parser gemma4 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --kv-cache-dtype fp8_e5m2 \
  --enable-prefix-caching \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}' \
  --async-scheduling

Hardware: 2× RTX 3090 (SM 8.6). The choice of fp8_e5m2 is deliberate — on SM 8.6, Triton only supports fp8e4b15 and fp8e5 fp8 dtypes, so fp8_e5m2 is the only fp8 KV cache dtype that could plausibly run on Ampere for this model.
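
To make the hardware constraint concrete, here is a rough sketch of the reasoning as a pure function. It rests on two assumptions that are not stated in this report: that Triton accepts fp8e4nv from compute capability 8.9 upward, and that the fp8_e4m3 and fp8_e5m2 KV cache dtypes lower to Triton's fp8e4nv and fp8e5 respectively. Both function names are hypothetical and nothing here is vLLM or Triton API.

```python
def triton_fp8_dtypes(capability: tuple[int, int]) -> tuple[str, ...]:
    """Triton fp8 dtypes assumed usable on a given CUDA compute capability.

    Assumption: fp8e4nv needs SM 8.9 or newer; older Ampere parts such as
    SM 8.6 only get fp8e4b15 and fp8e5, as described in this report.
    """
    if capability >= (8, 9):
        return ("fp8e4nv", "fp8e4b15", "fp8e5")
    return ("fp8e4b15", "fp8e5")


def usable_fp8_kv_cache_dtypes(capability: tuple[int, int]) -> list[str]:
    """Map vLLM --kv-cache-dtype names to the Triton dtype each one needs."""
    # Assumed mapping from KV cache dtype name to Triton dtype name.
    required = {"fp8_e4m3": "fp8e4nv", "fp8_e5m2": "fp8e5"}
    supported = triton_fp8_dtypes(capability)
    return [name for name, triton_name in required.items() if triton_name in supported]


# An RTX 3090 reports compute capability (8, 6).
print(usable_fp8_kv_cache_dtypes((8, 6)))  # ['fp8_e5m2']
```

Under these assumptions fp8_e5m2 is the only candidate left on an RTX 3090, which is why the gate matters for this setup.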

Traceback from the engine-core worker:

(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 725, in <lambda>
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     lambda prefix: Gemma4DecoderLayer(
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]                    ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 462, in __init__
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     self.self_attn = Gemma4Attention(
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]                      ^^^^^^^^^^^^^^^^
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py", line 372, in __init__
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     self.attn = Attention(
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]                 ^^^^^^^^^^
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py", line 381, in __init__
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     _init_kv_cache_quant(self, quant_config, prefix)
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/attention.py", line 167, in _init_kv_cache_quant
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871]     raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
(Worker_TP0 pid=601) ERROR 04-07 00:57:27 [multiproc_executor.py:871] ValueError: fp8_e5m2 kv-cache is not supported with fp8 checkpoints.

The code

vllm/model_executor/layers/attention/attention.py, the helper and gate (verbatim from the running container):

def should_load_quant_weights(quant_method: QuantizeMethodBase | None) -> bool:
    """Returns whether the quantization method should load quantized weights."""
    return quant_method is not None and not isinstance(
        quant_method, UnquantizedLinearMethod
    )

and the code that raises, in _init_kv_cache_quant:

    quant_method = (
        quant_config.get_quant_method(layer, prefix=prefix) if quant_config else None
    )

    # See [Note: Register q/k/v/prob scales in state dict]
    if should_load_quant_weights(quant_method):
        assert isinstance(quant_method, BaseKVCacheMethod)
        # TODO (mgoin): kv cache dtype should be specified in the FP8
        # checkpoint config and become the "auto" behavior
        if layer.kv_cache_dtype == "fp8_e5m2":
            raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
        # If quantization is enabled, we make "k_scale" and "v_scale"
        # parameters so that it can be loaded from the model checkpoint.
        # The k/v_scale will then be converted back to native float32
        # values after weight loading.
        layer.quant_method = quant_method
        layer.quant_method.create_weights(layer)

should_load_quant_weights returns True for any quant_method that is not None and not UnquantizedLinearMethod. For a compressed-tensors INT4 checkpoint, quant_config.get_quant_method(...) returns a compressed-tensors method (for this model the linear kernel is logged as CompressedTensorsWNA16 → MarlinLinearKernel), not an UnquantizedLinearMethod, so the gate is True. Inside the gate, there is no additional check that the underlying checkpoint is actually fp8; the fp8_e5m2 branch raises unconditionally whenever the gate is True.
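
The behaviour described above can be reproduced in isolation. The sketch below copies should_load_quant_weights from the listing and pairs it with empty stand-in classes; Int4KVCacheMethod is a hypothetical name standing in for whatever non-fp8 method the quant config returns, and old_gate is a reduced imitation of the original branch, not vLLM code.

```python
from typing import Optional


# Empty stand-ins; the real classes are defined in vLLM.
class QuantizeMethodBase:
    pass


class UnquantizedLinearMethod(QuantizeMethodBase):
    pass


class BaseKVCacheMethod(QuantizeMethodBase):
    pass


class Int4KVCacheMethod(BaseKVCacheMethod):
    """Hypothetical KV cache method for a non-fp8 (INT4) checkpoint."""


def should_load_quant_weights(quant_method: Optional[QuantizeMethodBase]) -> bool:
    """Same predicate as in the listing above."""
    return quant_method is not None and not isinstance(
        quant_method, UnquantizedLinearMethod
    )


def old_gate(kv_cache_dtype: str, quant_method: Optional[QuantizeMethodBase]) -> None:
    """Reduced imitation of the original branch in _init_kv_cache_quant."""
    if should_load_quant_weights(quant_method):
        # Nothing here checks whether the checkpoint is actually fp8.
        if kv_cache_dtype == "fp8_e5m2":
            raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")


try:
    old_gate("fp8_e5m2", Int4KVCacheMethod())
except ValueError as exc:
    print(f"INT4 checkpoint rejected: {exc}")
```

The predicate is False only for None and UnquantizedLinearMethod, so every other method, fp8 or not, reaches the raise.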

The result is that any non-unquantized, non-fp8 checkpoint (INT4, INT8, AWQ, GPTQ, etc.) that the user tries to run with --kv-cache-dtype fp8_e5m2 fails with an error message that refers to a checkpoint type the model isn't in fact using.

What I am reporting (and not)

I am reporting the mismatch between the gate and the error message. I am not asserting that running fp8_e5m2 KV cache against an INT4 compressed-tensors checkpoint is guaranteed to work if the gate is relaxed — I haven't tested it end-to-end. I'm reporting:

  1. The gate fires on non-fp8 quantized checkpoints even though the error message says "fp8 checkpoints."
  2. The # TODO (mgoin) comment directly above the raise acknowledges that the intended distinction is based on the FP8 checkpoint config, not the current should_load_quant_weights gate.
  3. On Ampere (SM 8.6), fp8_e5m2 is the only fp8 KV cache dtype that could run at all (fp8_e4m3 fails with ValueError("type fp8e4nv not supported in this architecture") during Inductor codegen, which is a hardware limitation and not a vLLM bug). If this gate is really supposed to catch only fp8 checkpoints, then on Ampere this path is currently unreachable for every quantized-but-not-fp8 model.

Happy to try a candidate fix locally (narrow the gate to an actual fp8-checkpoint predicate, or at minimum make the error message match what the gate actually catches) and report whether the KV cache then initializes successfully on the INT4 model, if that's useful.

Extent analysis

TL;DR

The issue can be resolved by narrowing the gate in _init_kv_cache_quant to only raise an error when the checkpoint is actually an fp8 checkpoint, or by updating the error message to reflect the current gate behavior.

Guidance

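  • Leave should_load_quant_weights as it is: it only separates quantized methods from unquantized ones and was never meant to identify fp8 checkpoints. The fp8-specific check belongs inside _init_kv_cache_quant.
  • If the gate is not narrowed, update the error message in _init_kv_cache_quant to describe what the gate currently catches: every quantized checkpoint, fp8 or not.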
  • Consider adding an additional check to verify if the underlying checkpoint is actually an fp8 checkpoint before raising the error.
  • Test the updated code with an INT4 compressed-tensors checkpoint to ensure the KV cache initializes successfully.

Example

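def _init_kv_cache_quant(layer, quant_config, prefix):
    # ...
    if should_load_quant_weights(quant_method):
        # Keep the outer gate so k_scale/v_scale are still registered for
        # every quantized checkpoint; narrow only the fp8_e5m2 rejection.
        if layer.kv_cache_dtype == "fp8_e5m2" and _is_fp8_kv_cache_method(quant_method):
            raise ValueError("fp8_e5m2 kv-cache is not supported with fp8 checkpoints.")
        # ...

Note: _is_fp8_kv_cache_method is the helper name given in PR #39195. Its body is not shown on this page; it is assumed to return True only for fp8-specific KV cache methods such as Fp8KVCacheMethod and ModelOptFp8KVCacheMethod.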

Notes

The current implementation of the gate in _init_kv_cache_quant may be too broad, catching non-fp8 checkpoints. Narrowing the gate or updating the error message can help resolve the issue. However, further testing is required to ensure the KV cache initializes successfully with non-fp8 checkpoints.

Recommendation

Narrow the gate so that it raises only for fp8-specific KV cache methods, which is the approach taken in PR #39195. Rewording the error message to match the current gate would make the failure less misleading, but non-fp8 quantized checkpoints would still be blocked from using --kv-cache-dtype fp8_e5m2.
