vllm - 💡(How to fix) Fix [Bug] Gemma 4 31B crashes on k_eq_v full-attention layers (QKV split shape mismatch) [1 participants]

Cloumeau · 2026-04-29T20:14:50Z

## Description

Loading `google/gemma-4-31b-it` on vLLM 0.20.0 crashes immediately with a tensor shape mismatch during the QKV split.

## Error

```
RuntimeError: split_with_sizes expects split_sizes to sum exactly to the size of dimension -1 (18432), but got split_sizes=[16384, 2048, 2048] summing to 20480
```

## Root Cause

Gemma 4 31B has two attention layer types:

- **Sliding layers**: standard GQA with separate Q, K, V projections
- **Full-attention (`k_eq_v`) layers**: K and V share the same projection; the checkpoint has `q_proj` and `k_proj` on disk but **no `v_proj`**

`Gemma4Attention` routes all layers through `QKVParallelLinear`, then splits the output as `[q_size, kv_size, kv_size]`. For `k_eq_v` full-attention layers the actual packed tensor is only `[q_size + kv_size]` wide (no V column), so the requested split sums to `q_size + 2×kv_size` while the tensor holds only `q_size + kv_size`, which causes the crash.

The `_weight_iterator` attempted to paper over this by duplicating `k_proj` weights as `v_proj`, but `QKVParallelLinear` still expects both slots to be present in the packed layout.

## Affected models

- `google/gemma-4-31b-it` (first Gemma 4 model with mixed sliding/full-attention layers)
- Gemma 4 E4B and other small variants are not affected

## Fix

PR #41253 fixes this by giving `k_eq_v` full-attention layers separate `q_proj` / `k_proj` `ColumnParallelLinear` modules and setting `v = k` directly in the forward pass, matching the checkpoint layout exactly. The `_weight_iterator` duplication hack is also removed.

## Environment

- vLLM 0.20.0
- Bug also present in current `main`
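The numbers in the error message line up with the root cause described in the issue, and the arithmetic can be checked in a few lines. The sketch below is plain Python, not vLLM code: `split_is_valid` is a hypothetical helper that only mirrors the "sizes must sum to the dimension" rule stated in the error, and the constants are copied from the error message.

```python
# Sizes copied from the error message in the issue.
Q_SIZE = 16384
KV_SIZE = 2048
ACTUAL_WIDTH = 18432  # size of dimension -1 reported by split_with_sizes


def split_is_valid(width: int, split_sizes: list[int]) -> bool:
    """Hypothetical helper: the split sizes must sum exactly to the width."""
    return sum(split_sizes) == width


# What the attention code requests for every layer: Q, K and V slots.
standard_split = [Q_SIZE, KV_SIZE, KV_SIZE]
# What a k_eq_v layer actually produces: Q and K only, no V column.
k_eq_v_split = [Q_SIZE, KV_SIZE]

print(sum(standard_split), split_is_valid(ACTUAL_WIDTH, standard_split))
# 20480 False
print(sum(k_eq_v_split), split_is_valid(ACTUAL_WIDTH, k_eq_v_split))
# 18432 True
```

The difference between the two sums is exactly one `kv_size` (2048), which is the missing V slot.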


vllm-project/vllm#41283 • Fetched 2026-04-30 06:19:06


extent analysis

TL;DR

Apply the fix from PR #41253 to handle the tensor shape mismatch in k_eq_v full-attention layers.

Guidance

Check whether your model mixes sliding and full-attention (k_eq_v) layers; only such models are affected.
Confirm the error by comparing the tensor width at the QKV split with the requested split sizes.
Apply the fix from PR #41253 to your vLLM checkout to resolve the issue.
If you run google/gemma-4-31b-it or another affected model, make sure the attention module matches the checkpoint layout (q_proj and k_proj, no v_proj).
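One way to carry out the first check is to look at the checkpoint's weight names and find layers that have `q_proj` and `k_proj` but no `v_proj`. The sketch below is an assumption-heavy illustration: the `layers.N.self_attn.{q,k,v}_proj.weight` naming pattern and the example names are hypothetical, and `find_k_eq_v_layers` is not a vLLM function. In practice the names might come from the `weight_map` keys of a sharded checkpoint's index file.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming pattern for attention projection weights.
PROJ_PATTERN = re.compile(r"layers\.(\d+)\.self_attn\.([qkv])_proj\.weight$")


def find_k_eq_v_layers(weight_names):
    """Return sorted layer indices that have q_proj and k_proj but no v_proj."""
    seen = defaultdict(set)
    for name in weight_names:
        match = PROJ_PATTERN.search(name)
        if match:
            seen[int(match.group(1))].add(match.group(2))
    return sorted(idx for idx, projs in seen.items() if projs == {"q", "k"})


# Made-up example: layer 0 looks like a sliding layer, layer 1 like k_eq_v.
example_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.1.self_attn.k_proj.weight",
]

print(find_k_eq_v_layers(example_names))  # [1]
```

An empty result would suggest the checkpoint has no `k_eq_v` layers under this naming assumption.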

Example

The source issue includes no code example: the fix restructures the Gemma4Attention module so that k_eq_v layers use separate q_proj / k_proj modules instead of QKVParallelLinear, which is a more extensive change than a snippet can show.
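As an illustration only, the core idea of the fix (separate Q and K projections, with V reusing K in the forward pass) can be sketched without any vLLM or PyTorch dependency. Everything here is a toy stand-in: `linear` and `KEqVProjection` are hypothetical names, plain Python lists replace tensors, and nothing reflects the actual code in PR #41253.

```python
def linear(x, weight):
    """Plain matrix-vector product: weight has one row per output feature."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weight]


class KEqVProjection:
    """Toy stand-in for the input projections of a k_eq_v attention layer.

    Holds only q and k weights, as the checkpoint does; there is no v weight.
    """

    def __init__(self, q_weight, k_weight):
        self.q_weight = q_weight
        self.k_weight = k_weight

    def forward(self, hidden):
        q = linear(hidden, self.q_weight)
        k = linear(hidden, self.k_weight)
        v = k  # K and V share one projection, so reuse k; no 3-way split
        return q, k, v


proj = KEqVProjection(
    q_weight=[[1, 0], [0, 1], [1, 1]],  # 3 query features
    k_weight=[[2, 0]],                  # 1 key feature
)
q, k, v = proj.forward([3, 4])
print(q, k, v)  # [3, 4, 7] [6] [6]
```

Because V is never packed into a combined output, there is no `[q_size, kv_size, kv_size]` split that could disagree with the tensor width.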

Notes

This fix is specific to models with mixed sliding and full-attention layers, such as google/gemma-4-31b-it. Other models like Gemma 4 E4B are not affected.

Recommendation

Apply the fix from PR #41253. It addresses the root cause of the tensor shape mismatch rather than working around it.
