vllm - 💡(How to fix) Fix [Bug]: Gemma4 0% prefix cache hits with hybrid attention + DFlash [4 comments, 3 participants]

vllm • 2026-04-22 14:56:22

vllm-project/vllm#40624•Fetched 2026-04-23 07:23:47

Root Cause

The DFlash draft model (Qwen3-based) adds its own attention layers to the model. get_kv_cache_spec() (gpu_model_runner.py:6887) scans ALL attention layers — both the Gemma4 target and the Qwen3 draft — producing three distinct KVCacheSpec types:

Layer type	Spec	head_size	num_kv_heads	block_size (after page unification)
Gemma4 full attention	`FullAttentionSpec`	`global_head_dim` (e.g. 256)	4	16
Gemma4 sliding window	`SlidingWindowSpec`	`head_dim` (e.g. 128)	4	32
DFlash Qwen3 draft	`FullAttentionSpec`	128	2	64

Gemma4 has different head dimensions per attention type (head_dim vs global_head_dim), so unify_kv_cache_spec_page_size() adjusts block sizes to equalize page sizes. The Qwen3 draft layers have yet another head_size/num_kv_heads combination, creating a third distinct spec.
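As a rough check of the table above, the sketch below computes a per-layer page size. It assumes that a page holds keys and values for every token in a block, so its size scales as 2 × block_size × num_kv_heads × head_size × bytes per element. The function name, the 2-byte element size, and the head sizes (the "e.g." values from the table) are illustrative assumptions, not vLLM's actual code.

```python
def page_size_bytes(block_size: int, num_kv_heads: int, head_size: int,
                    dtype_bytes: int = 2) -> int:
    """Bytes for one KV cache page: K and V for every token in the block."""
    return 2 * block_size * num_kv_heads * head_size * dtype_bytes


# Values copied from the table; head sizes are the table's "e.g." figures.
layers = {
    "Gemma4 full attention": dict(block_size=16, num_kv_heads=4, head_size=256),
    "Gemma4 sliding window": dict(block_size=32, num_kv_heads=4, head_size=128),
    "DFlash Qwen3 draft": dict(block_size=64, num_kv_heads=2, head_size=128),
}

page_sizes = {name: page_size_bytes(**cfg) for name, cfg in layers.items()}
for name, size in page_sizes.items():
    print(f"{name}: {size} bytes per page")
```

Under these assumptions all three layer types end up with the same page size while keeping different block sizes, which is consistent with the specs remaining distinct after unification.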

In HybridKVCacheCoordinator.verify_and_split_kv_cache_groups() (kv_cache_coordinator.py:410), groups are merged by spec == spec equality. Since all three specs differ, this produces 3 attention groups.
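The effect of merging by equality can be illustrated with a minimal sketch. The two dataclasses below are simplified, hypothetical stand-ins for the real spec classes (they only carry the fields from the table), and group_by_spec is an illustrative helper, not the coordinator's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullAttentionSpec:  # hypothetical stand-in, not the vLLM class
    block_size: int
    num_kv_heads: int
    head_size: int


@dataclass(frozen=True)
class SlidingWindowSpec:  # hypothetical stand-in, not the vLLM class
    block_size: int
    num_kv_heads: int
    head_size: int


def group_by_spec(layer_specs):
    """Merge layers whose specs compare equal; return [(spec, [layer names])]."""
    groups = []
    for name, spec in layer_specs.items():
        for existing_spec, names in groups:
            if existing_spec == spec:
                names.append(name)
                break
        else:
            groups.append((spec, [name]))
    return groups


# Layer names are made up for the example.
target_only = {
    "target.full.0": FullAttentionSpec(16, 4, 256),
    "target.full.1": FullAttentionSpec(16, 4, 256),
    "target.swa.0": SlidingWindowSpec(32, 4, 128),
}
with_draft = dict(target_only, **{"draft.full.0": FullAttentionSpec(64, 2, 128)})

print(len(group_by_spec(target_only)))  # target model alone
print(len(group_by_spec(with_draft)))   # target plus draft model
```

The draft layer is a FullAttentionSpec too, but its fields differ from the target's full-attention spec, so it cannot merge into the existing group and a third group appears.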

The is_simple_hybrid guard from #33524 (kv_cache_coordinator.py:493) only triggers when len(self.attention_groups) == 2:

is_simple_hybrid = len(self.attention_groups) == 2 and isinstance(
    self.attention_groups[0][0], FullAttentionSpec
)

With 3 attention groups, is_simple_hybrid = False, and the convergence loop runs without early exit. The EAGLE spiral block-dropping problem from #32802 kicks in:

Gemma4 FA (block_size=16) finds N blocks, EAGLE drops → N-1, alignment pop → N-2
DFlash Qwen3 FA (block_size=64) gets reduced range, does its own EAGLE drop
Gemma4 SW (block_size=32) gets further reduced range, does its own EAGLE drop
curr_hit_length < hit_length → loop restarts
Each iteration drops more blocks → spirals to 0

Even without the spiral, the code has a secondary issue: use_eagle=True is passed to all managers independently, causing each to apply its own EAGLE block drop + alignment pop. The EAGLE drop only needs to happen once (its purpose is to force recomputation of the last block for hidden states), but it's applied 3 times with cascading alignment pops due to the different block sizes.
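The cascade described above can be reproduced with a deliberately simplified toy model. It is not the vLLM convergence loop: each "manager" here only aligns the current hit length down to its own block size and, when use_eagle is set, drops one more block, and the loop repeats until the length stops changing. The function name and the 1000-token starting point are assumptions for illustration.

```python
def toy_hit_length(cached_tokens: int, block_sizes: list[int],
                   use_eagle: bool) -> int:
    """Toy model of per-manager EAGLE drops inside a convergence loop."""
    hit_length = cached_tokens
    while True:
        before = hit_length
        for block_size in block_sizes:
            # Align down to this manager's block size.
            num_blocks = hit_length // block_size
            # Each manager independently drops its last block for EAGLE.
            if use_eagle and num_blocks > 0:
                num_blocks -= 1
            hit_length = num_blocks * block_size
        # Stop only once a full pass leaves the hit length unchanged.
        if hit_length == before:
            return hit_length


# Block sizes in the order used in the walkthrough: FA, draft FA, SW.
sizes = [16, 64, 32]
print(toy_hit_length(1000, sizes, use_eagle=False))
print(toy_hit_length(1000, sizes, use_eagle=True))
```

In this toy model the loop without EAGLE drops settles on a length aligned to every block size, while the loop with per-manager drops shrinks on every pass and can only stop at zero.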

Fix Action

Fix / Workaround

Gemma4 + DFlash speculative decoding produces 0% prefix cache hit rates when the hybrid KV cache manager is enabled. This is the same EAGLE spiral block-dropping bug as #32802, which was partially fixed by #33524 — but the fix's is_simple_hybrid guard does not cover the Gemma4 + DFlash case.

Workaround: disable hybrid KV cache manager

vllm serve RedHatAI/gemma-4-31B-it-speculator.dflash \
-tp 2 \
--attention-backend FLASH_ATTN \
--disable-hybrid-kv-cache-manager \
--max-model-len 40000

Prefix cache hit rate with this workaround: ~88.9%.

The --max-model-len setting is required to get the model to fit on H100.

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

# Eval script used
python tests/evals/gsm8k/gsm8k_eval.py

# With DFlash: 0% prefix cache hit rate
vllm serve RedHatAI/gemma-4-31B-it-speculator.dflash -tp 2 --attention-backend FLASH_ATTN
# Prefix cache hit rate: 0.0%

# Without DFlash: high hit rate, as expected
vllm serve google/gemma-4-31B-it -tp 2 --attention-backend FLASH_ATTN
# Prefix cache hit rate: 91.1%

# Workaround: disable hybrid KV cache manager
vllm serve RedHatAI/gemma-4-31B-it-speculator.dflash \
-tp 2 \
--attention-backend FLASH_ATTN \
--disable-hybrid-kv-cache-manager \
--max-model-len 40000
# Prefix cache hit rate: ~88.9%
# --max-model-len is required to get the model to fit on H100


Why #33524 doesn't cover this case

#33524 added the is_simple_hybrid early exit for models with exactly 2 attention groups (1 full + 1 other). This covers:

GPT-OSS (full + sliding window, same head_size) → 2 groups ✓
Gemma4 without spec decode (full + sliding window, different head_sizes) → 2 groups ✓

But it does not cover:

Gemma4 + DFlash (full + sliding window + draft full attention) → 3 groups ✗
Any hybrid model + EAGLE/DFlash where the draft model introduces a new spec type → 3+ groups ✗

Potential Solutions

Option A: Coordinator-level EAGLE drop (from #32802 Option 1)

Remove use_eagle from individual SingleTypeKVCacheManager.find_longest_cache_hit() calls. Have HybridKVCacheCoordinator apply a single EAGLE drop after all managers converge:

# After convergence loop:
if self.use_eagle and hit_length > 0:
    hit_length = (hit_length // self.lcm_block_size - 1) * self.lcm_block_size
    # truncate all groups to new hit_length

This eliminates both the spiral and the double/triple-drop, regardless of how many attention groups exist.
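A standalone sketch of the single coordinator-level drop, under stated assumptions: coordinator_eagle_drop is a hypothetical free function rather than a method on HybridKVCacheCoordinator, the LCM is computed from a plain list of block sizes, and the result is clamped at zero as a defensive choice that the snippet above does not include.

```python
import math


def coordinator_eagle_drop(hit_length: int, block_sizes: list[int]) -> int:
    """Apply one EAGLE drop of a single LCM-sized block after convergence."""
    lcm_block_size = math.lcm(*block_sizes)
    # Align down so every group can truncate to a whole number of blocks.
    aligned = (hit_length // lcm_block_size) * lcm_block_size
    # Drop exactly one LCM-sized block; never go below zero.
    return max(aligned - lcm_block_size, 0)


sizes = [16, 32, 64]  # block sizes from the table; their LCM is 64
print(coordinator_eagle_drop(960, sizes))
print(coordinator_eagle_drop(40, sizes))
```

Because the drop happens once, in units that every group's block size divides, the number of attention groups no longer affects how many blocks are discarded.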

Option B: Extend is_simple_hybrid to cover draft model groups

Generalize the check to handle N full-attention groups + 1 other type, recognizing that multiple full-attention groups (target + draft) should be treated as "simple":

full_attn_groups = [g for g in self.attention_groups if isinstance(g[0], FullAttentionSpec)]
other_groups = [g for g in self.attention_groups if not isinstance(g[0], FullAttentionSpec)]
is_simple_hybrid = len(other_groups) == 1 and len(full_attn_groups) >= 1

This is a simpler change, but it still papers over the root cause (per-manager EAGLE drops).
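To compare the two checks side by side, the sketch below evaluates both on hand-built group lists. The spec classes are empty hypothetical stand-ins, and each group is modeled as a (spec, layer names) tuple so that g[0] is the spec, mirroring the indexing in the snippets above; the real attention_groups entries may carry more fields.

```python
class FullAttentionSpec:  # hypothetical stand-in, not the vLLM class
    pass


class SlidingWindowSpec:  # hypothetical stand-in, not the vLLM class
    pass


def is_simple_hybrid_current(groups) -> bool:
    """The existing guard: exactly two groups, full attention first."""
    return len(groups) == 2 and isinstance(groups[0][0], FullAttentionSpec)


def is_simple_hybrid_generalized(groups) -> bool:
    """Option B: any number of full-attention groups plus one other type."""
    full = [g for g in groups if isinstance(g[0], FullAttentionSpec)]
    other = [g for g in groups if not isinstance(g[0], FullAttentionSpec)]
    return len(other) == 1 and len(full) >= 1


# Layer names are made up for the example.
gemma4 = [(FullAttentionSpec(), ["target.full"]),
          (SlidingWindowSpec(), ["target.swa"])]
gemma4_dflash = gemma4 + [(FullAttentionSpec(), ["draft.full"])]

for label, groups in [("Gemma4", gemma4), ("Gemma4 + DFlash", gemma4_dflash)]:
    print(label, is_simple_hybrid_current(groups),
          is_simple_hybrid_generalized(groups))
```

The generalized check accepts the three-group case that the current guard rejects, while still agreeing with it on the two-group case.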

Related issues

#32802: Original report of 0% prefix cache hits with hybrid attention + EAGLE (GPT-OSS)
#33524: Partial fix (simple hybrid early-exit), prevents spiral for 2-group models only


extent analysis

TL;DR

The most likely fix for the 0% prefix cache hit rate issue with Gemma4 + DFlash speculative decoding is to apply a coordinator-level EAGLE drop, as described in Option A of the Potential Solutions section.

Guidance

Identify the root cause of the issue, which is the presence of three distinct KVCacheSpec types due to the DFlash draft model introducing a new attention layer.
Consider implementing Option A: Coordinator-level EAGLE drop, which removes use_eagle from individual SingleTypeKVCacheManager.find_longest_cache_hit() calls and applies a single EAGLE drop after all managers converge.
Alternatively, consider extending the is_simple_hybrid check to cover draft model groups, as described in Option B of the Potential Solutions section.
Verify the fix by running the evaluation script with the modified code and checking the prefix cache hit rate.

Example

# After convergence loop:
if self.use_eagle and hit_length > 0:
    hit_length = (hit_length // self.lcm_block_size - 1) * self.lcm_block_size
    # truncate all groups to new hit_length

Notes

The provided solutions assume that the issue is caused by the presence of multiple attention groups and the application of EAGLE drops to each group independently. The effectiveness of the solutions may depend on the specific use case and model configuration.

Recommendation

Apply Option A: Coordinator-level EAGLE drop, as it eliminates both the spiral and the double/triple-drop, regardless of how many attention groups exist. This solution is more comprehensive and addresses the root cause of the issue.
