vllm - ✅(Solved) Fix [Bug]: Gemma 4 E4B extremely slow on v0.19.0 forced TRITON_ATTN fallback yields ~9 tok/s on RTX 4090 (vs ~100+ tok/s for comparable Llama 3B) [1 pull requests, 4 comments, 5 participants]

GitHub stats
vllm-project/vllm#38887 · Fetched 2026-04-08 02:34:21
Comments: 4 · Participants: 5 · Timeline events: 16 · Reactions: 0
Timeline (top): subscribed ×9, commented ×4, cross-referenced ×1, labeled ×1

Root Cause

In vLLM v0.19.0, the Gemma 4 config reacts to the model's heterogeneous attention head dimensions (head_dim=256 on sliding-window layers, global_head_dim=512 on full-attention layers) by forcing the much slower TRITON_ATTN backend for every layer instead of FlashAttention. The issue also reports that custom_ops is set to ['none'], meaning no vLLM-native CUDA kernels are used.

Fix Action

Fixed

PR fix notes

PR #38891: [Gemma4] Allow per-layer attention backend selection for heterogeneou…

Description (problem / solution / changelog)

Purpose

Fix #38887: Gemma 4 models are extremely slow on vLLM v0.19.0 (~9 tok/s on RTX 4090 for E4B) because Gemma4Config forces all layers to use TRITON_ATTN, even though ~83% of layers (sliding-window, head_dim=256) are fully compatible with FlashAttention.
Root cause: Gemma 4 has heterogeneous head dimensions: sliding-window layers use head_dim=256 and full-attention layers use global_head_dim=512. Since FlashAttention's kernel limit is head_size <= 256, the previous code forced TRITON_ATTN globally to avoid mixed-backend usage. However, vLLM's get_attn_backend() already supports per-layer backend selection via its @cache-decorated selector (distinct head_size arguments produce distinct backend choices). The global forcing was unnecessary and penalized the majority of layers.
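
The caching argument can be illustrated with a small standalone sketch. It does not call vLLM's real get_attn_backend(); `select_backend`, its single parameter, and the backend names as plain strings are hypothetical stand-ins, and the only rule encoded is the head_size <= 256 FlashAttention limit quoted in the PR description.

```python
from functools import cache

FLASH_ATTN_MAX_HEAD_SIZE = 256  # limit quoted in the PR description


@cache
def select_backend(head_size: int) -> str:
    """Hypothetical stand-in for a cached, per-layer backend selector."""
    if head_size <= FLASH_ATTN_MAX_HEAD_SIZE:
        return "FLASH_ATTN"
    return "TRITON_ATTN"


# Gemma 4 E4B counts from the PR: 35 sliding-window and 7 full-attention layers.
# The real interleaving of layer types is not modeled here.
layer_head_sizes = [256] * 35 + [512] * 7
backends = [select_backend(h) for h in layer_head_sizes]

print(backends.count("FLASH_ATTN"), backends.count("TRITON_ATTN"))  # 35 7
print(select_backend.cache_info().misses)  # 2: one cache entry per distinct head_size
```

Because the cache key is the argument itself, the two head sizes resolve to two independent choices, which is why a single global override is not needed in this toy model.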

What this PR does:

  • Removes the forced attention_config.backend = TRITON_ATTN override
  • Adds informational logging showing the sliding/full layer count split
  • Adds an early warning if the user explicitly sets --attention-backend to a backend that cannot handle
    global_head_dim
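
As a rough illustration of the early-warning idea in the last bullet, the sketch below checks an explicitly requested backend against global_head_dim. Everything here is hypothetical: `check_explicit_backend`, the `MAX_HEAD_SIZE` table, and the logger name are invented for the example and are not taken from the PR's code. Only the FlashAttention limit of 256 comes from the PR description.

```python
import logging
from typing import Optional

logger = logging.getLogger("gemma4_backend_check")  # hypothetical logger name

# Assumed limits for illustration. Only the FLASH_ATTN value (256) is from the
# PR description; None means "no limit modeled in this sketch".
MAX_HEAD_SIZE = {"FLASH_ATTN": 256, "TRITON_ATTN": None}


def backend_supports(backend: str, head_size: int) -> bool:
    # Backends missing from the table are treated as unrestricted.
    limit = MAX_HEAD_SIZE.get(backend)
    return limit is None or head_size <= limit


def check_explicit_backend(backend: Optional[str], global_head_dim: int) -> bool:
    """Return True if the setting looks usable; warn and return False otherwise."""
    if backend is None:
        # No explicit choice: per-layer selection can pick a backend for each layer.
        return True
    if not backend_supports(backend, global_head_dim):
        logger.warning(
            "Backend %s cannot handle global_head_dim=%d", backend, global_head_dim
        )
        return False
    return True


print(check_explicit_backend("FLASH_ATTN", 512))  # False
print(check_explicit_backend(None, 512))          # True
```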

Layer breakdown across all Gemma 4 variants:

| Variant | Sliding (FlashAttn) | Full (Triton) | % on FlashAttn |
|---|---|---|---|
| E2B (35 layers) | 28 | 7 | 80% |
| E4B (42 layers) | 35 | 7 | 83% |
| 26B-A4B (30 layers) | 25 | 5 | 83% |
| 31B (60 layers) | 50 | 10 | 83% |
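
The percentages in the layer breakdown follow directly from the layer counts. A short check, using only the numbers listed for each variant (the helper name `flash_share` is made up for this example):

```python
# (sliding-window layers, full-attention layers) per variant, from the PR description
variants = {
    "E2B": (28, 7),
    "E4B": (35, 7),
    "26B-A4B": (25, 5),
    "31B": (50, 10),
}


def flash_share(sliding: int, full: int) -> float:
    """Percentage of layers that are sliding-window (FlashAttention-compatible)."""
    return 100 * sliding / (sliding + full)


for name, (sliding, full) in variants.items():
    print(f"{name}: {sliding + full} layers, {flash_share(sliding, full):.0f}% on FlashAttn")
```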

Not a duplicate: Searched open PRs for 38887 in:body and Gemma4 FlashAttention heterogeneous and found no results. PR #38879 optimizes Gemma4 prefill (YOCO fast prefill) and is complementary; it does not address the decode bottleneck caused by the forced TRITON_ATTN backend.

Test Plan

# Lint (pre-commit)
pre-commit run ruff-check --files vllm/model_executor/models/config.py
pre-commit run ruff-format --files vllm/model_executor/models/config.py
pre-commit run typos --files vllm/model_executor/models/config.py

# Unit tests (requires GPU + model access)
.venv/bin/python -m pytest tests/models/multimodal/processing/test_gemma4.py -v

# Serving benchmark (Gemma4 E4B on RTX 4090)
# Before (forced TRITON_ATTN):
vllm serve google/gemma-4-e4b-it --max-model-len 8192 --dtype bfloat16

# After (per-layer backend selection):
vllm serve google/gemma-4-e4b-it --max-model-len 8192 --dtype bfloat16
# Expected log output:
#   Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512).
#   35 sliding-window layers will use FlashAttention;
#   7 full-attention layers will fall back to a compatible backend.

Test Result

Lint: All passed (ruff-check, ruff-format, typos).

## Changed files

- `vllm/model_executor/models/config.py` (modified, +66/-34)


Your current environment

PyTorch version : 2.10.0+cu129
Python version : 3.12.13 (64-bit runtime)
CUDA runtime version : 12.9.86
vLLM Version : 0.19.0

[pip3] torch==2.10.0+cu129
[pip3] transformers==5.5.0
[pip3] triton==3.6.0
[pip3] flashinfer-python==0.6.6

🐛 Describe the bug

Gemma 4 E4B (google/gemma-4-e4b-it, 4.5B effective parameters) generates at only ~9 tokens/s on an RTX 4090 with vLLM v0.19.0. For comparison, a similarly-sized Llama 3.2 3B model on the same hardware with the same vLLM version generates at 100+ tokens/s.

The root cause is that Gemma 4's heterogeneous attention head dimensions force vLLM to disable FlashAttention and fall back to a much slower Triton attention kernel. Additionally, custom_ops is set to ['none'], meaning no vLLM-native CUDA kernels are used.

From vLLM server logs during inference:

INFO [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512).
  Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.

INFO [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.

INFO [loggers.py:259] Engine 000: Avg prompt throughput: 1.6 tokens/s,
  Avg generation throughput: 9.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs,
  GPU KV cache usage: 1.9%, Prefix cache hit rate: 94.9%
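
To track this number across runs, the generation throughput can be pulled out of a stats line like the one above. This is a minimal sketch that assumes the log format shown in this issue; the function name `generation_throughput` is made up, and the pattern may need adjusting for other vLLM versions.

```python
import re
from typing import Optional

# Stats line copied from the issue, joined into a single string.
LOG_LINE = (
    "INFO [loggers.py:259] Engine 000: Avg prompt throughput: 1.6 tokens/s, "
    "Avg generation throughput: 9.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage: 1.9%, Prefix cache hit rate: 94.9%"
)

PATTERN = re.compile(r"Avg generation throughput: (\d+(?:\.\d+)?) tokens/s")


def generation_throughput(line: str) -> Optional[float]:
    """Return the generation throughput in tokens/s, or None if the line has none."""
    match = PATTERN.search(line)
    return float(match.group(1)) if match else None


print(generation_throughput(LOG_LINE))  # 9.2
```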

Expected behavior

A 4.5B parameter model on an RTX 4090 (24GB VRAM, BF16) should generate in the range of 50-100+ tokens/s, comparable to other models of similar size (e.g., Llama 3.2 3B at ~100-200 tok/s on the same hardware).
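
A back-of-the-envelope bound makes this expectation concrete. The sketch assumes single-request decode is limited by reading every weight once per token, takes the 4.5B effective-parameter figure from the issue, and uses about 1008 GB/s for RTX 4090 memory bandwidth, which is an outside assumption and not stated in the issue. It ignores KV-cache reads, kernel overheads, and any Gemma 4 specifics, so it is only an order-of-magnitude estimate.

```python
def decode_upper_bound_tok_s(
    params: float, bytes_per_param: int, bandwidth_bytes_per_s: float
) -> float:
    """Rough ceiling on decode tokens/s if each token reads all weights once."""
    return bandwidth_bytes_per_s / (params * bytes_per_param)


bound = decode_upper_bound_tok_s(
    params=4.5e9,                   # "4.5B effective parameters" from the issue
    bytes_per_param=2,              # BF16
    bandwidth_bytes_per_s=1008e9,   # assumed RTX 4090 memory bandwidth
)
print(f"~{bound:.0f} tok/s upper bound vs 9.2 tok/s observed")
```

Under these assumptions the ceiling is on the order of 100 tok/s, which is consistent with the range the reporter expected and roughly ten times the observed rate.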


extent analysis

TL;DR

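The issue is resolved by PR #38891, which removes the forced TRITON_ATTN override in the Gemma 4 config so that vLLM selects an attention backend per layer: sliding-window layers (head_dim=256) can use FlashAttention, and only the full-attention layers (global_head_dim=512) fall back to a compatible backend.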

Guidance

  • Confirm that you are hitting this issue by checking the vLLM server logs for the "Forcing TRITON_ATTN backend" message and the "Using AttentionBackendEnum.TRITON_ATTN backend" line.
  • Use a vLLM build that includes PR #38891. Do not change the model's head dimensions: they are part of the pretrained Gemma 4 architecture, and the PR shows that the slowdown comes from the global backend override, not from the heterogeneity itself.
  • If you set --attention-backend explicitly, choose a backend that can handle global_head_dim=512; the PR adds an early warning when the chosen backend cannot.
  • The issue also reports custom_ops set to ['none']. The PR description does not mention this setting, so treat it as a separate item to investigate.
  • After switching builds, compare generation throughput with the ~9 tok/s baseline from the logs and with a model of similar size, such as Llama 3.2 3B, on the same hardware.
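
The log check in the first bullet can be scripted. This sketch looks for the "Forcing TRITON_ATTN backend" text quoted in the issue's logs; the function name `is_forced_triton` is made up for the example, and the marker string is assumed to match only builds that still contain the global override.

```python
from typing import Iterable

# Substring taken from the log excerpt quoted in the issue.
FORCED_MARKER = "Forcing TRITON_ATTN backend"

SAMPLE_LOG = [
    "INFO [config.py:104] Gemma4 model has heterogeneous head dimensions "
    "(head_dim=256, global_head_dim=512).",
    "  Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.",
    "INFO [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.",
]


def is_forced_triton(log_lines: Iterable[str]) -> bool:
    """True if any line shows the global TRITON_ATTN override being applied."""
    return any(FORCED_MARKER in line for line in log_lines)


print(is_forced_triton(SAMPLE_LOG))  # True
```

To scan a saved server log, pass an open file object instead of the sample list, since iterating a file yields its lines.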

Example

The issue itself contains no code change. The fix lives in PR #38891, which modifies `vllm/model_executor/models/config.py` (+66/-34).

Notes

The slowdown was reported for Gemma 4 E4B on vLLM v0.19.0, and the PR's layer breakdown indicates that all listed Gemma 4 variants are affected by the same override. The PR description reports only lint results, so the size of the speedup is not quantified on this page.

Recommendation

Apply the fix: run a vLLM build that contains PR #38891 instead of altering the model configuration, so that the sliding-window layers (80-83% of layers, depending on the variant) use FlashAttention.
