vllm: Fix [Bug] Gemma 4 31B crashes on k_eq_v full-attention layers (QKV split shape mismatch)

GitHub stats
vllm-project/vllm#41283 · Fetched 2026-04-30 06:19:06
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Loading google/gemma-4-31b-it on vLLM 0.20.0 crashes immediately with a tensor shape mismatch during the QKV split.

Error Message

RuntimeError: split_with_sizes expects split_sizes to sum exactly to the size of dimension -1 (18432), but got split_sizes=[16384, 2048, 2048] summing to 20480

Root Cause

Gemma 4 31B has two attention layer types:

  • Sliding layers — standard GQA with separate Q, K, V projections
  • Full-attention (k_eq_v) layers — K and V share the same projection; the checkpoint has q_proj and k_proj on disk but no v_proj

Gemma4Attention routes all layers through QKVParallelLinear, then splits the output as [q_size, kv_size, kv_size]. For k_eq_v full-attention layers the actual packed tensor is only [q_size + kv_size] wide (no V slot), so the split sizes sum to q_size + 2×kv_size (16384 + 2×2048 = 20480) while the tensor's last dimension is q_size + kv_size (16384 + 2048 = 18432), which causes the crash.

The _weight_iterator attempted to paper over this by duplicating k_proj weights as v_proj, but QKVParallelLinear still expects both slots to be present in the packed layout.
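
The size mismatch can be reproduced without vLLM or PyTorch. The following is a minimal sketch in plain Python: `split_with_sizes` here is a hypothetical helper that only mirrors the size check named in the error message, and the sizes (16384 and 2048) are taken from that message. It is not vLLM code.

```python
def split_with_sizes(row, split_sizes):
    """Split a flat list into consecutive chunks.

    Hypothetical helper: it mirrors the rule that the split sizes must
    sum exactly to the length of the dimension being split.
    """
    if sum(split_sizes) != len(row):
        raise RuntimeError(
            f"split sizes {split_sizes} sum to {sum(split_sizes)}, "
            f"but the dimension has size {len(row)}"
        )
    chunks, start = [], 0
    for size in split_sizes:
        chunks.append(row[start:start + size])
        start += size
    return chunks


def try_split(row, split_sizes):
    """Return the chunk lengths, or the error text if the split is invalid."""
    try:
        return [len(chunk) for chunk in split_with_sizes(row, split_sizes)]
    except RuntimeError as exc:
        return str(exc)


# Sizes taken from the error message in this issue.
Q_SIZE, KV_SIZE = 16384, 2048

# One output row of a k_eq_v layer: a Q slot and a K slot, no V slot.
packed_row = [0.0] * (Q_SIZE + KV_SIZE)

# The three-way split assumed for every layer does not fit this row.
three_way = try_split(packed_row, [Q_SIZE, KV_SIZE, KV_SIZE])

# A two-way split matches the layout that is actually present.
two_way = try_split(packed_row, [Q_SIZE, KV_SIZE])
```

Only the two-way split fits the row, which is why a layout with a separate V slot cannot describe a k_eq_v layer.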

Fix

PR #41253 fixes this by giving k_eq_v full-attention layers separate q_proj / k_proj ColumnParallelLinear modules and setting v = k directly in the forward pass, matching the checkpoint layout exactly. The _weight_iterator duplication hack is also removed.
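
The idea of projecting Q and K separately and reusing K as V can be illustrated with a toy example. This is a sketch under stated assumptions, not the code from PR #41253: `linear` and `ToyKEqVProjections` are hypothetical names, the weights are made-up small matrices, and normalization, rotary embeddings, and tensor parallelism are all ignored.

```python
def linear(x, weight):
    """Apply a bias-free linear layer: weight has one row per output feature."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


class ToyKEqVProjections:
    """Hypothetical stand-in for a k_eq_v layer with only q_proj and k_proj."""

    def __init__(self, q_weight, k_weight):
        self.q_weight = q_weight
        self.k_weight = k_weight

    def project(self, hidden):
        q = linear(hidden, self.q_weight)
        k = linear(hidden, self.k_weight)
        # K and V share one projection, so no v_proj weight is loaded or applied.
        v = k
        return q, k, v


# Made-up weights: hidden size 2, three Q features, one K feature.
layer = ToyKEqVProjections(
    q_weight=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    k_weight=[[2.0, 0.0]],
)
q, k, v = layer.project([1.0, 2.0])
```

Because no packed QKV tensor is ever built in this arrangement, there is no three-way split that could disagree with the checkpoint layout.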

Affected models

  • google/gemma-4-31b-it (first Gemma 4 model with mixed sliding/full-attention layers)
  • Gemma 4 E4B and other small variants are not affected

Environment

  • vLLM 0.20.0
  • Bug also present in current main

Extent Analysis

TL;DR

Apply the fix from PR #41253 to handle the tensor shape mismatch in k_eq_v full-attention layers.

Guidance

  • Check whether your model mixes sliding layers with k_eq_v full-attention layers; only such models are affected.
  • Confirm the diagnosis by comparing the last dimension of the packed QKV output (18432 in the reported error) with the sum of the split sizes (20480).
  • Apply the fix from PR #41253 to your codebase to resolve the issue.
  • If you run google/gemma-4-31b-it or another affected model, make sure the attention module matches the checkpoint layout: q_proj and k_proj only, with no v_proj, on k_eq_v layers.
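
The first check in the list above can be approximated by scanning checkpoint tensor names for layers that have q_proj and k_proj but no v_proj. The sketch below assumes names of the form `...layers.<index>.self_attn.<q|k|v>_proj.weight`; that pattern and the example names are assumptions for illustration, and a real checkpoint may use a different prefix or naming scheme.

```python
import re
from collections import defaultdict

# Assumed tensor-name pattern; adjust it to the checkpoint being inspected.
_PROJ_RE = re.compile(r"layers\.(\d+)\.self_attn\.([qkv])_proj\.weight$")


def find_k_eq_v_layers(tensor_names):
    """Return sorted layer indices that have q_proj and k_proj but no v_proj."""
    seen = defaultdict(set)
    for name in tensor_names:
        match = _PROJ_RE.search(name)
        if match:
            seen[int(match.group(1))].add(match.group(2))
    return sorted(idx for idx, projs in seen.items() if projs == {"q", "k"})


# Made-up names: layer 0 has Q, K, V; layer 1 has only Q and K.
example_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.1.self_attn.k_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
]
k_eq_v_layers = find_k_eq_v_layers(example_names)
```

A non-empty result suggests the checkpoint contains layers of the kind described in this issue.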

Example

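The patch itself is not reproduced here: the fix changes how Gemma4Attention builds its projections for k_eq_v layers, using separate q_proj / k_proj ColumnParallelLinear modules instead of QKVParallelLinear, and removes the _weight_iterator duplication, which is a larger change than a short snippet can show.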

Notes

This fix is specific to models with mixed sliding and full-attention layers, such as google/gemma-4-31b-it. Other models like Gemma 4 E4B are not affected.

Recommendation

Apply the fix from PR #41253; it is a fix rather than a workaround, because it addresses the root cause of the tensor shape mismatch directly.
