vllm - [Feature]: Port Gemma4 vision encoder to MMEncoderAttention with FlashAttention support

StepCodex · 2026-05-20T02:48:13Z

Root Cause

**Context**: The recently merged batched encoder PR groups encoder calls across images/frames, improving video throughput 1.7-3.8x. However, the encoder itself still uses HF's SDPA attention. The dynamic batch sizing (`_encoder_max_batch`) exists specifically because SDPA's memory cost makes unbounded batching unsafe. With FlashAttention (FA), the memory ceiling would be much higher, enabling larger batches and further throughput gains, especially for high-resolution images (`max_soft_tokens=560/1120`, which produce 5,040/10,080 patches per image).
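To make the batching pressure concrete, here is a rough back-of-the-envelope sketch. It assumes bf16 (2 bytes per element), a fully materialized N x N score matrix for the quadratic case, and only Q, K, V and the output for the linear case. The 8 GiB budget and the helper names are hypothetical; this is not how `_encoder_max_batch` is actually computed, and it ignores weights, MLP activations, and kernel workspace.

```python
def quadratic_bytes(n_patches, n_heads, bytes_per_elem=2):
    """Bytes for one layer's full N x N score matrix across all heads (one image)."""
    return n_heads * n_patches * n_patches * bytes_per_elem


def linear_bytes(n_patches, n_heads, head_dim, bytes_per_elem=2):
    """Bytes for the Q, K, V and output tensors of one layer (one image)."""
    return 4 * n_heads * n_patches * head_dim * bytes_per_elem


def max_batch(budget_bytes, per_image_bytes):
    """Largest number of images whose attention memory fits in the budget (at least 1)."""
    return max(1, budget_bytes // per_image_bytes)


# Hypothetical budget; 26B/31B encoder shape at max_soft_tokens=1120.
budget = 8 * 1024**3
n_patches, n_heads, head_dim = 10080, 18, 64

quad = quadratic_bytes(n_patches, n_heads)
lin = linear_bytes(n_patches, n_heads, head_dim)

print(f"quadratic: {quad / 1024**3:.2f} GiB/image -> batch {max_batch(budget, quad)}")
print(f"linear:    {lin / 1024**2:.1f} MiB/image -> batch {max_batch(budget, lin)}")
```

Under these simplified assumptions the quadratic term alone caps the batch at a couple of images, while the linear term leaves room for dozens. Real numbers will differ, but the gap in scaling is the point.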

Fix / Workaround

Gemma4's vision encoder currently runs via HuggingFace `AutoModel.from_config` in eager mode, using SDPA (scaled dot-product attention) for the encoder's self-attention layers. The memory cost of this SDPA path grows quadratically with sequence length, which drives high peak VRAM for large patch counts and limits how many images/frames can be batched in a single encoder call.

🚀 The feature, motivation and pitch

Porting the Gemma4 vision encoder to vLLM-native layers using `MMEncoderAttention` with FlashAttention support would:

- **Reduce peak VRAM** per encoder call: FA never materializes the N x N score matrix, so attention memory is O(N) in sequence length versus O(N^2) for the current SDPA path, allowing larger encoder batch sizes under the same memory budget
- **Improve throughput** by leveraging FA's fused kernels instead of PyTorch's generic SDPA path
- **Enable CUDA graph capture** for the encoder path (`compile_mm_encoder=True`), which is currently not practical with the HF eager model
- **Align with vLLM's direction** for other vision models that already use `MMEncoderAttention` (e.g., Qwen2-VL, Pixtral)
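The O(N) versus O(N^2) memory claim rests on the online-softmax trick that FlashAttention-style kernels use: keys are consumed block by block while only a running maximum, a running normalizer, and a running output are kept. The following is a minimal pure-Python sketch for a single query vector, meant only to illustrate that the tiled result matches the fully materialized one. It is not vLLM or FlashAttention code, and the function names are made up for this example.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference: compute every score first, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention(q, keys, values, block=2):
    """Tiled variant: only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(7)]
values = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]

ref = naive_attention(q, keys, values)
tiled = streaming_attention(q, keys, values, block=3)
print(max(abs(a - b) for a, b in zip(ref, tiled)))
```

The same comparison, run on real tensors, is essentially what a numerical parity check between the HF encoder and a ported encoder would look like at the attention-layer level.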

**Scope**: The encoder consists of standard multi-head self-attention + MLP layers (16 layers for E2B/E4B, 27 for 26B/31B) with 2D RoPE via `pixel_position_ids` and a simple padding mask. The porting work would involve replacing HF attention layers with `MMEncoderAttention`, adapting the 2D RoPE computation, validating numerical parity, and enabling `compile_mm_encoder` / `cudagraph_mm_encoder` support.
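For readers unfamiliar with 2D RoPE, one common formulation splits each head vector in half and rotates the first half by the patch's x coordinate and the second half by its y coordinate. The sketch below follows that formulation in pure Python. The half/half split, the interleaved pairing, and the base of 10000 are assumptions made for illustration; the actual Gemma4 layout should be taken from the HF implementation when porting.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i / len(vec))."""
    n = len(vec)
    out = []
    for i in range(0, n, 2):
        theta = pos * base ** (-i / n)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def rope_2d(vec, x, y):
    """Assumed layout: first half encodes the x position, second half the y position."""
    half = len(vec) // 2
    return rope_1d(vec[:half], x) + rope_1d(vec[half:], y)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy 8-dim head (a real head_dim of 64 would give two 32-dim halves).
q = [0.1 * (i + 1) for i in range(8)]
k = [0.05 * (8 - i) for i in range(8)]

# Scores depend only on the relative (dx, dy) offset between patches.
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 4, 7), rope_2d(k, 2, 4))
print(score_a, score_b)
```

Because the rotation is applied to Q and K before the attention call, it is independent of which attention backend is used, which is why this step can be adapted separately from the swap to `MMEncoderAttention`.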

Alternatives

- **Keep SDPA with larger batch budget**: Not viable for high-resolution images (560/1120 tokens), where SDPA memory per image is already the bottleneck.
- **`torch.nn.functional.scaled_dot_product_attention` restricted to the flash backend** (`enable_flash=True` is an option of the `torch.backends.cuda.sdp_kernel` context manager rather than of the function itself): Only works on CUDA with compatible head dims and requires contiguous QKV layout. Not portable across vLLM's supported platforms.
- **xFormers memory-efficient attention**: Similar benefits to FA but not integrated into vLLM's `MMEncoderAttention` abstraction.

Additional context

Vision encoder configs across Gemma4 variants:

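| Variant | hidden_size | num_hidden_layers | num_attention_heads | head_dim |
|---------|-------------|-------------------|---------------------|----------|
| E2B/E4B | 768 | 16 | 12 | 64 |
| 26B/31B | 1152 | 27 | 18 | 64 |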

- Encoder uses 2D RoPE (not 1D), computed from `pixel_position_ids` (x, y coordinates per patch)
- Pooling kernel size is 3 (k^2=9), so patches = max_soft_tokens * 9
- The `_encoder_max_batch` dynamic sizing would still serve as a safety net, but the budget could be significantly relaxed with FA
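The figures above can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the values from the table and the pooling rule. The dictionary layout and the `patches_for` helper are illustrative only and do not correspond to a vLLM or HF API.

```python
POOL_KERNEL = 3  # pooling kernel size k, so each soft token covers k * k patches

# Values copied from the table above.
VARIANTS = {
    "E2B/E4B": {"hidden_size": 768, "num_hidden_layers": 16,
                "num_attention_heads": 12, "head_dim": 64},
    "26B/31B": {"hidden_size": 1152, "num_hidden_layers": 27,
                "num_attention_heads": 18, "head_dim": 64},
}


def patches_for(max_soft_tokens, kernel=POOL_KERNEL):
    """Number of encoder patches behind a given soft-token budget."""
    return max_soft_tokens * kernel * kernel


for name, cfg in VARIANTS.items():
    # Standard multi-head attention: heads * head_dim should equal hidden_size.
    assert cfg["num_attention_heads"] * cfg["head_dim"] == cfg["hidden_size"], name

for tokens in (560, 1120):
    print(tokens, "soft tokens ->", patches_for(tokens), "patches")
```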

