vllm - [Feature]: Port Gemma4 vision encoder to MMEncoderAttention with FlashAttention support


🚀 The feature, motivation and pitch

Gemma4's vision encoder currently runs via HuggingFace AutoModel.from_config in eager mode, using SDPA (scaled dot-product attention) for the encoder's self-attention layers. On this path, attention memory grows quadratically with sequence length, which drives high peak VRAM for large patch counts and limits how many images/frames can be batched in a single encoder call.

Porting the Gemma4 vision encoder to vLLM-native layers using MMEncoderAttention with FlashAttention support would:

  • Reduce peak VRAM per encoder call (for N patches, FA's attention memory is O(N) vs O(N^2) for SDPA), allowing larger encoder batch sizes under the same memory budget
  • Improve throughput by leveraging FA's fused kernels instead of PyTorch's generic SDPA path
  • Enable CUDA graph capture for the encoder path (compile_mm_encoder=True), which is currently not practical with the HF eager model
  • Align with vLLM's direction for other vision models that already use MMEncoderAttention (e.g., Qwen2-VL, Pixtral)

Context: The recently merged batched encoder PR groups encoder calls across images/frames, improving video throughput by 1.7x to 3.8x. However, the encoder itself still uses HF's SDPA attention. The dynamic batch sizing (_encoder_max_batch) exists specifically because SDPA's memory cost makes unbounded batching unsafe. With FA, the memory ceiling would be much higher, enabling larger batches and further throughput gains, especially for high-resolution images (max_soft_tokens=560/1120, which produce 5,040/10,080 patches per image).
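To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes the worst case in which the full N x N score matrix is materialized per head in fp16, and it ignores QKV, MLP activations, and masks; whether SDPA actually materializes the matrix depends on the backend it dispatches to. The function names and the 8 GiB budget are hypothetical and are not taken from vLLM's _encoder_max_batch logic.

```python
def score_matrix_bytes(num_patches, num_heads, bytes_per_elem=2):
    """Bytes for one image's attention scores if an N x N matrix
    is materialized for every head (fp16 by default)."""
    return num_heads * num_patches * num_patches * bytes_per_elem


def softmax_stats_bytes(num_patches, num_heads, bytes_per_elem=4):
    """Bytes for one fp32 softmax statistic per query row per head,
    the kind of O(N) state a tiled (FlashAttention-style) kernel keeps."""
    return num_heads * num_patches * bytes_per_elem


def max_batch_under_budget(budget_bytes, per_image_bytes):
    """Hypothetical batch cap: number of images whose attention state
    fits in the budget, never less than 1."""
    return max(1, budget_bytes // per_image_bytes)


if __name__ == "__main__":
    n, heads = 10_080, 18          # 26B/31B encoder at max_soft_tokens=1120
    budget = 8 * 1024**3           # assumed 8 GiB attention budget
    quad = score_matrix_bytes(n, heads)
    lin = softmax_stats_bytes(n, heads)
    print(f"materialized scores: {quad / 1024**3:.2f} GiB per image")
    print(f"per-row statistics:  {lin / 1024:.0f} KiB per image")
    print("batch cap (quadratic):", max_batch_under_budget(budget, quad))
    print("batch cap (linear):   ", max_batch_under_budget(budget, lin))
```

Under these assumptions the quadratic term alone limits a batch to a couple of high-resolution images, which is consistent with the motivation for a conservative dynamic batch cap.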

Scope: The encoder consists of standard multi-head self-attention + MLP layers (16 layers for E2B/E4B, 27 for 26B/31B) with 2D RoPE via pixel_position_ids and a simple padding mask. The porting work would involve replacing HF attention layers with MMEncoderAttention, adapting the 2D RoPE computation, validating numerical parity, and enabling compile_mm_encoder / cudagraph_mm_encoder support.
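As an illustration of the 2D RoPE step, the sketch below shows one common formulation: the first half of each head's channels is rotated by the patch's x coordinate and the second half by its y coordinate. This layout, the base frequency, and the function names are assumptions for illustration only; the actual Gemma4 channel pairing may differ and should be taken from the HF reference implementation when validating parity.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "channel count must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # frequency decreases with channel index
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def rope_2d(vec, x, y, base=10000.0):
    """Assumed layout: first half of the channels encodes x, second half encodes y."""
    half = len(vec) // 2
    assert half % 2 == 0, "each half must contain whole channel pairs"
    return rope_1d(vec[:half], x, base) + rope_1d(vec[half:], y, base)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
    k = [1.0, 0.3, -0.6, 0.8, -1.2, 0.4, 0.7, -0.2]
    # Same relative offset (dx=2, dy=3) at two different absolute positions.
    s1 = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
    s2 = dot(rope_2d(q, 4, 7), rope_2d(k, 2, 4))
    print(math.isclose(s1, s2, abs_tol=1e-9))
```

The check at the bottom exercises the property that makes RoPE usable inside any attention backend: the query-key score depends only on the relative (x, y) offset, so the rotation can be applied to Q and K before they are handed to the attention kernel.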

Alternatives

  • Keep SDPA with larger batch budget: Not viable for high-resolution images (560/1120 tokens) where SDPA memory per patch is already the bottleneck.
  • torch.nn.functional.scaled_dot_product_attention restricted to the flash backend (e.g., enable_flash=True in the torch.backends.cuda.sdp_kernel context manager): Only works on CUDA with compatible head dims and requires contiguous QKV layout. Not portable across vLLM's supported platforms.
  • xFormers memory-efficient attention: Similar benefits to FA but not integrated into vLLM's MMEncoderAttention abstraction.

Additional context

Vision encoder configs across Gemma4 variants:

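| Variant | hidden_size | num_hidden_layers | num_attention_heads | head_dim |
|---------|-------------|-------------------|---------------------|----------|
| E2B/E4B | 768         | 16                | 12                  | 64       |
| 26B/31B | 1152        | 27                | 18                  | 64       |
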
  • Encoder uses 2D RoPE (not 1D), computed from pixel_position_ids (x, y coordinates per patch)
  • Pooling kernel size is 3 (k^2=9), so patches = max_soft_tokens * 9
  • The _encoder_max_batch dynamic sizing would still serve as a safety net, but the budget could be significantly relaxed with FA
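The arithmetic in this section can be checked with a few lines of Python. The dictionary below simply restates the values listed above; the helper names are hypothetical and not part of vLLM or the HF config classes.

```python
POOL_KERNEL = 3  # pooling kernel size stated above (k^2 = 9)

# Vision encoder settings as listed in this issue.
VISION_CONFIGS = {
    "E2B/E4B": {"hidden_size": 768, "num_hidden_layers": 16,
                "num_attention_heads": 12, "head_dim": 64},
    "26B/31B": {"hidden_size": 1152, "num_hidden_layers": 27,
                "num_attention_heads": 18, "head_dim": 64},
}


def patches_for(max_soft_tokens, kernel=POOL_KERNEL):
    """Patches entering the encoder for a given soft-token budget."""
    return max_soft_tokens * kernel * kernel


def inconsistent_variants(configs):
    """Variants where hidden_size != num_attention_heads * head_dim."""
    return [name for name, c in configs.items()
            if c["hidden_size"] != c["num_attention_heads"] * c["head_dim"]]


if __name__ == "__main__":
    for tokens in (560, 1120):
        print(tokens, "soft tokens ->", patches_for(tokens), "patches")
    print("inconsistent variants:", inconsistent_variants(VISION_CONFIGS))
```

Both variants use a head dimension of 64, so a single attention kernel configuration would cover the whole family.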
