vllm - 💡(How to fix) Fix [Bug]: Gemma4 0% prefix cache hits with hybrid attention + DFlash [4 comments, 3 participants]

vllm-project/vllm#40624Fetched 2026-04-23 07:23:47
🐛 Describe the bug

Gemma4 + DFlash speculative decoding produces 0% prefix cache hit rates when the hybrid KV cache manager is enabled. This is the same EAGLE spiral block-dropping bug as #32802, which was partially fixed by #33524 — but the fix's is_simple_hybrid guard does not cover the Gemma4 + DFlash case.

# Eval script used
python tests/evals/gsm8k/gsm8k_eval.py

# With DFlash: 0% prefix cache hit rate
vllm serve RedHatAI/gemma-4-31B-it-speculator.dflash -tp 2 --attention-backend FLASH_ATTN
# Prefix cache hit rate: 0.0%

# Without DFlash: ~80% hit rate as expected
vllm serve google/gemma-4-31B-it -tp 2 --attention-backend FLASH_ATTN
# Prefix cache hit rate: 91.1%

# Workaround: disable hybrid KV cache manager
vllm serve RedHatAI/gemma-4-31B-it-speculator.dflash \
-tp 2 \
--attention-backend FLASH_ATTN \
--disable-hybrid-kv-cache-manager \
--max-model-len 40000
# Prefix cache hit rate: ~88.9%
# --max-model-len 40000 is required for the model to fit on H100

Root Cause

The DFlash draft model (Qwen3-based) adds its own attention layers to the model. get_kv_cache_spec() (gpu_model_runner.py:6887) scans ALL attention layers — both the Gemma4 target and the Qwen3 draft — producing three distinct KVCacheSpec types:

| Layer type | Spec | head_size | num_kv_heads | block_size (after page unification) |
|---|---|---|---|---|
| Gemma4 full attention | FullAttentionSpec | global_head_dim (e.g. 256) | 4 | 16 |
| Gemma4 sliding window | SlidingWindowSpec | head_dim (e.g. 128) | 4 | 32 |
| DFlash Qwen3 draft | FullAttentionSpec | 128 | 2 | 64 |

Gemma4 has different head dimensions per attention type (head_dim vs global_head_dim), so unify_kv_cache_spec_page_size() adjusts block sizes to equalize page sizes. The Qwen3 draft layers have yet another head_size/num_kv_heads combination, creating a third distinct spec.
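The block sizes in the table can be sanity-checked with a little arithmetic. The sketch below assumes a page stores K and V for `block_size` tokens, i.e. `2 * block_size * num_kv_heads * head_size * dtype_bytes`, with a 2-byte dtype. Both the formula and the dtype are assumptions for illustration, not taken from the vLLM source; `page_size_bytes` is a hypothetical helper.

```python
def page_size_bytes(block_size, num_kv_heads, head_size, dtype_bytes=2):
    """Assumed page size: K and V tensors for one block of tokens."""
    return 2 * block_size * num_kv_heads * head_size * dtype_bytes


# Values from the table above (head sizes are the "e.g." values).
specs = {
    "gemma4_full": dict(head_size=256, num_kv_heads=4, block_size=16),
    "gemma4_sliding": dict(head_size=128, num_kv_heads=4, block_size=32),
    "dflash_qwen3_draft": dict(head_size=128, num_kv_heads=2, block_size=64),
}

page_sizes = {name: page_size_bytes(**spec) for name, spec in specs.items()}
for name, size in page_sizes.items():
    print(f"{name}: {size} bytes")
```

Under these assumptions all three specs share one page size, even though their block sizes (16, 32, 64) differ. The differing block sizes are what matter for the cache-hit logic described next.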

In HybridKVCacheCoordinator.verify_and_split_kv_cache_groups() (kv_cache_coordinator.py:410), groups are merged by spec == spec equality. Since all three specs differ, this produces 3 attention groups.
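To illustrate why equality-based merging yields three groups here, the following is a minimal sketch. `ToySpec`, `group_by_spec`, and the layer names are hypothetical stand-ins, not vLLM classes; only the merge-by-equality idea comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a KV cache spec; compared field by field."""
    kind: str
    head_size: int
    num_kv_heads: int
    block_size: int


def group_by_spec(layer_specs):
    """Merge layers into groups whenever their specs compare equal."""
    groups = []  # list of (spec, [layer names])
    for name, spec in layer_specs.items():
        for group_spec, names in groups:
            if group_spec == spec:
                names.append(name)
                break
        else:
            groups.append((spec, [name]))
    return groups


# Illustrative layer names; values follow the table above.
layers = {
    "target.layer0": ToySpec("sliding", 128, 4, 32),
    "target.layer1": ToySpec("sliding", 128, 4, 32),
    "target.layer2": ToySpec("full", 256, 4, 16),
    "draft.layer0": ToySpec("full", 128, 2, 64),
}

groups = group_by_spec(layers)
print(len(groups))
```

The two full-attention specs share a type but differ in head_size, num_kv_heads, and block_size, so they never merge.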

The is_simple_hybrid guard from #33524 (kv_cache_coordinator.py:493) only triggers when len(self.attention_groups) == 2:

is_simple_hybrid = len(self.attention_groups) == 2 and isinstance(
    self.attention_groups[0][0], FullAttentionSpec
)

With 3 attention groups, is_simple_hybrid = False, and the convergence loop runs without early exit. The EAGLE spiral block-dropping problem from #32802 kicks in:

  1. Gemma4 FA (block_size=16) finds N blocks, EAGLE drops → N-1, alignment pop → N-2
  2. DFlash Qwen3 FA (block_size=64) gets reduced range, does its own EAGLE drop
  3. Gemma4 SW (block_size=32) gets further reduced range, does its own EAGLE drop
  4. curr_hit_length < hit_length → loop restarts
  5. Each iteration drops more blocks → spirals to 0

Even without the spiral, the code has a secondary issue: use_eagle=True is passed to all managers independently, causing each to apply its own EAGLE block drop + alignment pop. The EAGLE drop only needs to happen once (its purpose is to force recomputation of the last block for hidden states), but it's applied 3 times with cascading alignment pops due to the different block sizes.
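The spiral can be reproduced with a toy model. This is a simplified sketch, not the vLLM implementation: `find_longest_hit` and `converge` are hypothetical, the managers work on token counts rather than block hashes, each manager drops exactly one block (the extra alignment pop is omitted), and the loop re-scans all groups instead of restarting at the first reduction.

```python
def find_longest_hit(max_length, block_size, cached_tokens, use_eagle):
    """Toy single-type manager: longest block-aligned hit within max_length."""
    num_blocks = min(max_length, cached_tokens) // block_size
    if use_eagle and num_blocks > 0:
        num_blocks -= 1  # per-manager EAGLE drop of the last block
    return num_blocks * block_size


def converge(block_sizes, cached_tokens, use_eagle):
    """Toy coordinator loop: repeat until no group shrinks the hit length."""
    hit_length = cached_tokens
    while True:
        previous = hit_length
        for block_size in block_sizes:
            curr_hit_length = find_longest_hit(
                hit_length, block_size, cached_tokens, use_eagle
            )
            if curr_hit_length < hit_length:
                hit_length = curr_hit_length
        if hit_length == previous:  # fixed point reached
            return hit_length


BLOCK_SIZES = [16, 64, 32]  # Gemma4 FA, DFlash Qwen3 FA, Gemma4 SW
with_drops = converge(BLOCK_SIZES, cached_tokens=1024, use_eagle=True)
without_drops = converge(BLOCK_SIZES, cached_tokens=1024, use_eagle=False)
print(with_drops, without_drops)  # 0 1024
```

In this toy model, every manager call with a per-manager drop returns strictly less than its input, so the only fixed point the loop can reach is 0.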

Why #33524 doesn't cover this case

#33524 added the is_simple_hybrid early exit for models with exactly 2 attention groups (1 full + 1 other). This covers:

  • GPT-OSS (full + sliding window, same head_size) → 2 groups ✓
  • Gemma4 without spec decode (full + sliding window, different head_sizes) → 2 groups ✓

But it does not cover:

  • Gemma4 + DFlash (full + sliding window + draft full attention) → 3 groups
  • Any hybrid model + EAGLE/DFlash where the draft model introduces a new spec type → 3+ groups

Potential Solutions

Option A: Coordinator-level EAGLE drop (from #32802 Option 1)

Remove use_eagle from individual SingleTypeKVCacheManager.find_longest_cache_hit() calls. Have HybridKVCacheCoordinator apply a single EAGLE drop after all managers converge:

# After convergence loop:
if self.use_eagle and hit_length > 0:
    hit_length = (hit_length // self.lcm_block_size - 1) * self.lcm_block_size
    # truncate all groups to new hit_length

This eliminates both the spiral and the double/triple-drop, regardless of how many attention groups exist.
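As a worked example of the arithmetic in Option A, here is a standalone sketch. `coordinator_eagle_drop` is a hypothetical helper, and the clamp to zero is a defensive addition that is not part of the snippet above: it guards the case where `hit_length` is positive but shorter than one LCM-sized block.

```python
import math


def coordinator_eagle_drop(hit_length, block_sizes):
    """Apply one EAGLE drop, aligned to the LCM of all group block sizes."""
    if hit_length <= 0:
        return 0
    lcm_block_size = math.lcm(*block_sizes)
    aligned_blocks = hit_length // lcm_block_size
    # Clamp so a hit shorter than one LCM block yields 0, not a negative length.
    return max(aligned_blocks - 1, 0) * lcm_block_size


block_sizes = [16, 32, 64]
print(coordinator_eagle_drop(1024, block_sizes))  # 960
print(coordinator_eagle_drop(32, block_sizes))    # 0
```

With block sizes 16, 32, and 64 the LCM is 64, so a converged 1024-token hit loses exactly one 64-token block instead of spiraling to zero.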

Option B: Extend is_simple_hybrid to cover draft model groups

Generalize the check to handle N full-attention groups + 1 other type, recognizing that multiple full-attention groups (target + draft) should be treated as "simple":

full_attn_groups = [g for g in self.attention_groups if isinstance(g[0], FullAttentionSpec)]
other_groups = [g for g in self.attention_groups if not isinstance(g[0], FullAttentionSpec)]
is_simple_hybrid = len(other_groups) == 1 and len(full_attn_groups) >= 1

This is a simpler change, but it still papers over the root cause (per-manager EAGLE drops).
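The difference between the current guard and the Option B check can be compared side by side. In this sketch, `FullAttentionSpec` and `SlidingWindowSpec` are empty stand-in classes rather than the real vLLM types, and each group is modeled as a `(spec, layer_ids)` tuple, which is an assumption about the group layout.

```python
class FullAttentionSpec:
    """Stand-in for the real spec class (illustration only)."""


class SlidingWindowSpec:
    """Stand-in for the real spec class (illustration only)."""


def is_simple_hybrid_current(attention_groups):
    # Guard as quoted from #33524: exactly two groups, first is full attention.
    return len(attention_groups) == 2 and isinstance(
        attention_groups[0][0], FullAttentionSpec
    )


def is_simple_hybrid_option_b(attention_groups):
    # Option B: any number of full-attention groups plus exactly one other type.
    full = [g for g in attention_groups if isinstance(g[0], FullAttentionSpec)]
    other = [g for g in attention_groups if not isinstance(g[0], FullAttentionSpec)]
    return len(other) == 1 and len(full) >= 1


gemma4 = [(FullAttentionSpec(), [0]), (SlidingWindowSpec(), [1])]
gemma4_dflash = gemma4 + [(FullAttentionSpec(), [2])]  # draft adds a third group

print(is_simple_hybrid_current(gemma4), is_simple_hybrid_current(gemma4_dflash))
print(is_simple_hybrid_option_b(gemma4), is_simple_hybrid_option_b(gemma4_dflash))
```

The current guard accepts only the two-group case, while the Option B check also accepts the three-group Gemma4 + DFlash layout.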

Related issues

  • #32802: Original report of 0% prefix cache hits with hybrid attention + EAGLE (GPT-OSS)
  • #33524: Partial fix (simple hybrid early-exit), prevents spiral for 2-group models only

extent analysis

TL;DR

The most likely fix for the 0% prefix cache hit rate issue with Gemma4 + DFlash speculative decoding is to apply a coordinator-level EAGLE drop, as described in Option A of the Potential Solutions section.

Guidance

  • Identify the root cause of the issue, which is the presence of three distinct KVCacheSpec types due to the DFlash draft model introducing a new attention layer.
  • Consider implementing Option A: Coordinator-level EAGLE drop, which removes use_eagle from individual SingleTypeKVCacheManager.find_longest_cache_hit() calls and applies a single EAGLE drop after all managers converge.
  • Alternatively, consider extending the is_simple_hybrid check to cover draft model groups, as described in Option B of the Potential Solutions section.
  • Verify the fix by running the evaluation script with the modified code and checking the prefix cache hit rate.

Notes

The provided solutions assume that the issue is caused by the presence of multiple attention groups and the application of EAGLE drops to each group independently. The effectiveness of the solutions may depend on the specific use case and model configuration.

Recommendation

Apply Option A: Coordinator-level EAGLE drop, as it eliminates both the spiral and the double/triple-drop, regardless of how many attention groups exist. This solution is more comprehensive and addresses the root cause of the issue.
