vllm - ✅(Solved) Fix [Bug]: Gemma 4 fails to initialize with per-token-head KV cache quantization [1 pull requests]

Error Message

(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 942, in unify_kv_cache_spec_page_size
(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971]     raise NotImplementedError(
(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971] NotImplementedError: The page size of the layer is not divisible by the maximum page size. Cannot unify by adjusting block_size.

Root Cause

Gemma 4 fails to initialize in vLLM V1 when --kv_cache_dtype int8_per_token_head is enabled. The same issue also affects fp8_per_token_head. The failure is caused by Gemma 4's hybrid attention layout using two different KV head dimensions:

  • sliding/local layers use head_dim = 256
  • global/full-attention layers use head_dim = 512

Fix Action

Fix / Workaround

Potential direction for a fix: I plan to post a temporary workaround PR for reference. In my testing it resolves the problem, but it certainly needs more review and discussion.

PR fix notes

PR #40391: Fix Gemma4 KV cache page-size alignment for per-token-head quantization

Description (problem / solution / changelog)

Purpose

Fix Gemma4 initialization failure in V1 when per-token-head KV cache quantization is enabled (int8_per_token_head / fp8_per_token_head), as reported in #40388.

Root cause: Gemma4 hybrid attention mixes local (head_dim=256) and global (head_dim=512) layers. With per-token-head KV quantization, per-token scale metadata changes page-size factors to:

  • local: (256 * 1) * 2 + 8 = 520
  • global: (512 * 1) * 2 + 8 = 1032

1032 is not divisible by 520, so V1 page-size unification can fail.
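The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names are hypothetical, and reading the "+ 8" term as 8 bytes of scale metadata per token per KV head is an assumption based on the formulas in this description, not something taken from vLLM's code.

```python
import math

# Assumption: the "+ 8" term is scale metadata stored per token per KV head.
SCALE_BYTES = 8


def page_size_factor(head_dim: int, elem_bytes: int = 1) -> int:
    """Bytes per token per KV head: K and V payloads plus scale metadata."""
    return head_dim * elem_bytes * 2 + SCALE_BYTES


def pad_to_multiple(size: int, base: int) -> int:
    """Round size up to the next multiple of base."""
    return math.ceil(size / base) * base


local_factor = page_size_factor(256)    # sliding/local layers
global_factor = page_size_factor(512)   # global/full-attention layers
remainder = global_factor % local_factor
padded_global = pad_to_multiple(global_factor, local_factor)

print(local_factor, global_factor, remainder, padded_global)
# prints: 520 1032 512 1040
```

Padding the global factor from 1032 to 1040 restores an exact 2:1 ratio with the local factor, which matches the "1040-based padded factor" mentioned in this PR.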

Additionally, when page_size_padded is present, block-size scaling and KV cache shape reconstruction paths needed consistent padded-size handling.

Fix:

  • Add AttentionSpec.copy_with_new_block_size handling to correctly scale and validate page_size_padded when block size changes.
  • In Gemma4, set kv_cache_page_size_padded for global-attention layers (1040-based padded factor) under per-token-head cache dtypes, and propagate this through Attention KV spec construction.
  • Add shared helper adjust_kv_cache_shape_for_padded_page_size and use it consistently in:
    • vllm/v1/worker/gpu/attn_utils.py
    • vllm/v1/worker/gpu_model_runner.py
    • vllm/v1/worker/kv_connector_model_runner_mixin.py
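To make the first bullet more concrete, here is a minimal sketch of what scaling and validating a padded page size could look like when the block size changes. `ToyAttentionSpec` is a hypothetical stand-in written for this illustration. It is not vLLM's `AttentionSpec`, and its fields and checks are assumptions rather than the PR's actual implementation.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToyAttentionSpec:
    """Hypothetical, simplified KV cache spec (not vLLM's AttentionSpec)."""

    block_size: int                          # tokens per page
    bytes_per_token: int                     # e.g. 520 (local) or 1032 (global)
    page_size_padded: Optional[int] = None   # padded page size in bytes, if any

    @property
    def real_page_size_bytes(self) -> int:
        return self.block_size * self.bytes_per_token

    @property
    def page_size_bytes(self) -> int:
        # The padded size, when set, is what the allocator sees.
        if self.page_size_padded is not None:
            return self.page_size_padded
        return self.real_page_size_bytes

    def copy_with_new_block_size(self, block_size: int) -> "ToyAttentionSpec":
        if self.page_size_padded is None:
            return replace(self, block_size=block_size)
        # Scale the padded size in proportion to the block size.
        scaled, rem = divmod(self.page_size_padded * block_size, self.block_size)
        if rem != 0:
            raise ValueError("padded page size does not scale evenly")
        new_spec = replace(self, block_size=block_size, page_size_padded=scaled)
        if new_spec.page_size_padded < new_spec.real_page_size_bytes:
            raise ValueError("padded page size is smaller than the real page size")
        return new_spec


local_spec = ToyAttentionSpec(block_size=16, bytes_per_token=520)
global_spec = ToyAttentionSpec(block_size=16, bytes_per_token=1032,
                               page_size_padded=16 * 1040)

# Doubling the local block size now matches the padded global page size.
unified_local = local_spec.copy_with_new_block_size(32)
print(unified_local.page_size_bytes, global_spec.page_size_bytes)
# prints: 16640 16640
```

In this toy model, the padded global page (16 × 1040 bytes) is exactly twice the local page (16 × 520 bytes), so the local layers can be unified by doubling their block size.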

Fixes issue #40388.

Test Plan

Runtime reproduction for issue validation on Gemma4 environment:

vllm serve <gemma4-model> --kv_cache_dtype int8_per_token_head

Because this change was tested on Turing hardware, #39018 should be merged before this PR.

Test Results

(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO:     172.17.0.1:33632 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     172.17.0.1:33640 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-20 17:59:23 [loggers.py:271] Engine 000: Avg prompt throughput: 2.4 tokens/s, Avg generation throughput: 10.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-20 17:59:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     172.17.0.1:40910 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     172.17.0.1:40920 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-20 18:00:33 [loggers.py:271] Engine 000: Avg prompt throughput: 2.4 tokens/s, Avg generation throughput: 9.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

As a relatively new contributor in this area, I truly appreciate detailed review comments and suggestions, and I will actively iterate on this PR based on feedback. Please let me know if there are any questions about this change.



Changed files

  • tests/v1/core/test_kv_cache_utils.py (modified, +28/-0)
  • tests/v1/worker/test_kv_cache_shape_utils.py (added, +75/-0)
  • vllm/model_executor/layers/attention/attention.py (modified, +5/-0)
  • vllm/model_executor/models/gemma4.py (modified, +24/-0)
  • vllm/v1/core/kv_cache_utils.py (modified, +1/-1)
  • vllm/v1/kv_cache_interface.py (modified, +17/-0)
  • vllm/v1/worker/gpu/attn_utils.py (modified, +17/-6)
  • vllm/v1/worker/gpu_model_runner.py (modified, +24/-6)
  • vllm/v1/worker/kv_cache_shape_utils.py (added, +90/-0)
  • vllm/v1/worker/kv_connector_model_runner_mixin.py (modified, +25/-6)

Code Example

Built with the built-in Dockerfile, with PR https://github.com/vllm-project/vllm/pull/39018 manually merged for baseline compatibility with Turing devices.
Main commit hash: 726efe177bf22874743d11dfdfef9247dbfb5ff0

---

vllm serve <gemma4-model> \
  --kv_cache_dtype int8_per_token_head

---

(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 942, in unify_kv_cache_spec_page_size
(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971]     raise NotImplementedError(
(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971] NotImplementedError: The page size of the layer is not divisible by the maximum page size. Cannot unify by adjusting block_size.

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Built with the built-in Dockerfile, with PR https://github.com/vllm-project/vllm/pull/39018 manually merged for baseline compatibility with Turing devices.
Main commit hash: 726efe177bf22874743d11dfdfef9247dbfb5ff0
</details>

🐛 Describe the bug

Gemma 4 fails to initialize in vLLM V1 when --kv_cache_dtype int8_per_token_head is enabled. The same issue also affects fp8_per_token_head. The failure is caused by Gemma 4's hybrid attention layout using two different KV head dimensions:

  • sliding/local layers use head_dim = 256
  • global/full-attention layers use head_dim = 512

With per-token-head KV quantization, each KV head also stores scale metadata per token. That extra scale payload breaks the exact 2:1 page-size ratio between the 512-dim and 256-dim layers, so V1 KV page-size unification fails during initialization.

Minimal reproduction:

vllm serve <gemma4-model> \
  --kv_cache_dtype int8_per_token_head

Observed result:

(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 942, in unify_kv_cache_spec_page_size
(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971]     raise NotImplementedError(
(Worker_TP1 pid=248) ERROR 04-19 14:16:08 [multiproc_executor.py:971] NotImplementedError: The page size of the layer is not divisible by the maximum page size. Cannot unify by adjusting block_size.

Why this happens:

  • local layer page-size factor: (256 * 1 byte * 2) + 8 = 520
  • global layer page-size factor: (512 * 1 byte * 2) + 8 = 1032

1032 is not divisible by 520, so the current page-size unification logic cannot reconcile the two layer types by changing block_size alone.
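The failure mode can be reproduced in isolation with a simplified model of page-size unification. The function below is a hypothetical sketch written for this report, not the body of vLLM's `unify_kv_cache_spec_page_size`. It assumes each layer's page size is `block_size * bytes_per_token` and that unification only ever scales a smaller layer's block size up to match the largest page.

```python
def unify_block_sizes(layers: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Toy unification: layers maps name -> (block_size, bytes_per_token).

    Returns a new block size per layer so that every page has the same size,
    or raises if a layer's page size does not divide the largest one.
    """
    page_sizes = {name: bs * bpt for name, (bs, bpt) in layers.items()}
    max_page = max(page_sizes.values())
    new_block_sizes = {}
    for name, (block_size, _) in layers.items():
        if max_page % page_sizes[name] != 0:
            raise NotImplementedError(
                f"page size {page_sizes[name]} of {name!r} does not divide "
                f"the maximum page size {max_page}"
            )
        new_block_sizes[name] = block_size * (max_page // page_sizes[name])
    return new_block_sizes


# Without scale metadata the factors are 512 and 1024: an exact 2:1 ratio.
plain = unify_block_sizes({"local": (16, 512), "global": (16, 1024)})
print(plain)  # prints: {'local': 32, 'global': 16}

# With per-token-head scales the factors become 520 and 1032.
try:
    unify_block_sizes({"local": (16, 520), "global": (16, 1032)})
    quantized_ok = True
except NotImplementedError as exc:
    quantized_ok = False
    print("unification failed:", exc)
```

In this sketch, the extra 8 bytes per token are enough to turn a clean doubling of the local block size into a case with no integer solution.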

Relevant code paths:

  • vllm/model_executor/models/gemma4.py
  • vllm/model_executor/layers/attention/attention.py
  • vllm/v1/core/kv_cache_utils.py
  • vllm/v1/kv_cache_interface.py
  • vllm/v1/worker/gpu/attn_utils.py
  • vllm/v1/worker/gpu_model_runner.py

Environment

| Item | Value |
| --- | --- |
| Commit | 726efe177bf22874743d11dfdfef9247dbfb5ff0 |
| GPU | 2× NVIDIA GeForce RTX 2080 Ti (22 GB, CC 7.5, Turing) |
| CUDA Driver | 580.126.18, CUDA 13.0 |
| Python | 3.12 (Docker) |
| Model | Gemma4-31B AWQ 4-bit (google/gemma-4-31b-it), also reproduced with cyankiwi/gemma-4-31B-it-AWQ-4bit |

I'm still new to this project, so if this report doesn't match the community's preferred format or I've misunderstood something basic, corrections are very welcome.

Potential direction for a fix: I plan to post a temporary workaround PR for reference. In my testing it resolves the problem, but it certainly needs more review and discussion.


extent analysis

TL;DR

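With per-token-head KV quantization, Gemma 4's local (head_dim = 256) and global (head_dim = 512) layers have page-size factors of 520 and 1032, which unify_kv_cache_spec_page_size in vllm/v1/core/kv_cache_utils.py cannot reconcile by adjusting block_size. PR #40391 addresses this by giving the global-attention layers a padded page size (a 1040-based factor, i.e. 2 × 520) and handling page_size_padded consistently wherever block size or KV cache shape is recomputed.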

Guidance

  • Review the unify_kv_cache_spec_page_size function in vllm/v1/core/kv_cache_utils.py to understand why it fails here: per-token scale metadata breaks the exact 2:1 page-size ratio between the 512-dim and 256-dim layers (1032 vs. 520).
  • Since the current logic cannot reconcile these sizes by changing block_size alone, consider padding the larger page size to a multiple of the smaller one, as PR #40391 does through page_size_padded.
  • Review the implementation of PR #40391, the workaround mentioned in the issue report, to determine its feasibility as a temporary solution.
  • Verify that any modifications or workarounds do not introduce performance regressions or other issues.

Example

No code example is provided due to the complexity of the issue and the need for a thorough review of the relevant code paths.

Notes

The fix may require a deeper understanding of the Gemma 4 model architecture and the page-size unification logic used in the vllm project. Additional discussion and review may be necessary to ensure that any modifications are correct and effective.

Recommendation

Apply a workaround, as a temporary PR has already been tested and shown to solve the problem, but it requires further review and discussion to ensure its correctness and feasibility as a long-term solution.
