vllm - ✅(Solved) Fix [Feature]: [IR] mm_encoder_attn migration on hold pending FlashInfer workspace support [1 pull requests, 1 participants]

vllm · 2026-05-05 10:37:11

GitHub stats

vllm-project/vllm#41724•Fetched 2026-05-06 06:15:15

Author

harshaljanjani

Participants

harshaljanjani

Timeline (top)

cross-referenced ×1, labeled ×1, mentioned ×1, subscribed ×1

PR #41613 migrates mm_encoder_attn to vLLM IR, with three pure backends (flash_attn, triton, native) fully working and tested. That said, as @ProExpertProg rightly pointed out, the migration should wait until IR workspace support exists, so that all backends, including FlashInfer, dispatch uniformly via --kernel-config.ir_op_priority.mm_encoder_attn.
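To make "dispatch uniformly by priority" concrete, here is a minimal sketch in plain Python. It is not vLLM's IR API: REGISTRY, register, and dispatch are hypothetical names chosen for illustration, and the three backends are stubs that only return a string.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: implementation name -> (availability check, callable).
REGISTRY: Dict[str, Tuple[Callable[[], bool], Callable]] = {}


def register(name: str, is_available: Callable[[], bool]):
    """Register an implementation under `name` (illustrative only)."""
    def decorator(fn):
        REGISTRY[name] = (is_available, fn)
        return fn
    return decorator


@register("flash_attn", is_available=lambda: False)  # pretend it is not installed
def _flash_attn_impl(x):
    return f"flash_attn({x})"


@register("triton", is_available=lambda: True)
def _triton_impl(x):
    return f"triton({x})"


@register("native", is_available=lambda: True)
def _native_impl(x):
    return f"native({x})"


def dispatch(priority: List[str], x):
    """Call the first implementation in `priority` that is registered and available."""
    for name in priority:
        entry = REGISTRY.get(name)
        if entry is not None and entry[0]():
            return entry[1](x)
    raise RuntimeError(f"no available implementation among {priority}")
```

With the priority list ["flash_attn", "triton", "native"], this sketch skips the unavailable flash_attn stub and calls the triton stub. A backend that cannot be registered as a plain function, which is FlashInfer's situation as described in this issue, has no entry in such a registry and cannot be selected through the same list.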

Root Cause

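FlashInfer cannot be a pure functional IR implementation today because it requires a persistent workspace buffer and per-instance mutable FP8 quantization state. Both requirements are detailed under "What's blocking: FlashInfer as an IR implementation" below.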

PR fix notes

PR #41613: [On hold] feat(kernels): Migrate mm_encoder_attn to vLLM IR

Repository: vllm-project/vllm
Author: harshaljanjani
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41613

Description (problem / solution / changelog)

Status

On hold on IR workspace support (per maintainer feedback). Will rebase and complete once that lands.

Purpose

This PR migrates the multimodal encoder attention (MMEncoderAttention) to the new vLLM IR framework, as discussed in https://github.com/vllm-project/vllm/issues/32676.
Followed existing IR ops implementation patterns and the spec detailed in https://github.com/vllm-project/vllm/issues/32358.
Added a test file following the IR pattern, and tested inference in native, triton, and default flash_attn modes.
Moved all hardware-specific routing in MMEncoderAttention into _call_ir_op and preserved the FlashInfer path to handle FP8 caching.

Test Plan

A. Existing MHA Tests (Regression)
The following test harness was executed to check that no regressions were introduced in the existing MHA and FlashInfer paths: tests/kernels/attention/test_mha_attn.py

B. New IR Kernel Tests (Added)
The following test harness was added in this PR and executed to check the functional robustness of the new IR op across various data types (FP16, BF16), batch sizes, and sequence lengths: tests/kernels/ir/test_mm_encoder_attn.py

C. IR Meta-Tests (Integration)
The following test harness was added in this PR and executed to check that the new op is correctly registered in the IR framework: tests/kernels/ir/test_ir_ops.py

D. E2E Sanity (Inference)
Wrote a sanity check script and executed inference to check for E2E sanity across all modes: https://gist.github.com/harshaljanjani/432d9fb7aa0bfdc550a63774872ea073

Test Results

D. E2E Sanity (Inference), one screenshot per backend:
Backend: flash_attn (screenshot "2-1"): https://github.com/user-attachments/assets/853e7879-2b65-4aea-ba5d-c37662191594
Backend: triton (screenshot "2-2"): https://github.com/user-attachments/assets/57951cb1-2646-48dc-bc2c-1c3858e94d6f
Backend: native (screenshot "2-3"): https://github.com/user-attachments/assets/1171fbf7-bdfd-4f33-a43b-9e7fea868118

Changed files

tests/kernels/ir/test_mm_encoder_attn.py (added, +174/-0)
vllm/config/kernel.py (modified, +3/-0)
vllm/ir/ops/__init__.py (modified, +2/-1)
vllm/ir/ops/attention.py (added, +80/-0)
vllm/kernels/__init__.py (modified, +2/-2)
vllm/kernels/attention_ops.py (added, +136/-0)
vllm/model_executor/layers/attention/mm_encoder_attention.py (modified, +25/-141)
vllm/platforms/cuda.py (modified, +6/-1)
vllm/platforms/rocm.py (modified, +6/-1)
vllm/platforms/xpu.py (modified, +3/-1)

🚀 The feature, motivation and pitch

Summary

What's done (in #41613)

IR op definition with native semantics
Provider implementations for flash_attn and triton
Per-platform priority lists (CUDA, ROCm, XPU)
IrOpPriorityConfig.mm_encoder_attn field
Test suite: 205 passed, 32 skipped across dtypes, batch sizes, sequence lengths, and head configurations
E2E inference verified on all three backends
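As a reference point for what "native semantics" could mean here, the sketch below implements standard non-causal scaled dot-product attention for a single head in pure Python. This is an assumption for illustration: the actual op schema in #41613 is not reproduced on this page, and native_attention is a hypothetical name.

```python
import math
from typing import List

Matrix = List[List[float]]


def native_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Non-causal scaled dot-product attention for one head.

    q has shape [num_queries, head_dim]; k and v have shape [num_keys, head_dim].
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out: Matrix = []
    for q_row in q:
        # Scaled dot product of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Subtract the maximum before exponentiating for numerical stability.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(len(v[0]))
        ])
    return out
```

A function like this is pure: its output depends only on q, k, and v, which is the property that lets the flash_attn and triton providers be swapped in behind the same op.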

What's blocking: FlashInfer as an IR implementation

FlashInfer cannot be a pure functional IR implementation today because it requires:

A workspace buffer, a 128MB persistent GPU allocation (torch.zeros(128*1024*1024, dtype=uint8, device="cuda")) passed to every call, with no IR mechanism for persistent allocations across calls
FP8 quantization state, including per-instance mutable scale buffers (_fp8_q/k/v_scale), amax circular history, and dynamic scale recomputation (_record_amax_and_update_scales) that mutates on every forward pass
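The workspace requirement can be illustrated with a small sketch. The quoted size works out to 128 * 1024 * 1024 = 134,217,728 bytes. The code below uses a bytearray as a stand-in for the torch.zeros allocation, and get_workspace and attn_with_workspace are hypothetical names; it shows one possible way to keep a buffer alive across calls, not the mechanism vLLM IR will adopt.

```python
from typing import Dict, Tuple

WORKSPACE_BYTES = 128 * 1024 * 1024  # 134_217_728 bytes, the size quoted above

# Hypothetical process-wide cache: (device, size) -> buffer.
_WORKSPACES: Dict[Tuple[str, int], bytearray] = {}


def get_workspace(device: str, nbytes: int = WORKSPACE_BYTES) -> bytearray:
    """Return the same zero-initialized buffer for repeated (device, nbytes) requests."""
    key = (device, nbytes)
    buf = _WORKSPACES.get(key)
    if buf is None:
        # bytearray stands in for torch.zeros(nbytes, dtype=uint8, device=device).
        buf = bytearray(nbytes)
        _WORKSPACES[key] = buf
    return buf


def attn_with_workspace(q, k, v, workspace: bytearray):
    """Stub for a kernel that needs scratch space in addition to its tensor inputs."""
    workspace[0] = 1  # pretend the kernel wrote scratch data
    return q  # placeholder output, not a real attention result
```

The extra workspace argument is the crux: a pure op signature of (q, k, v) has nowhere to carry it, and allocating a fresh buffer of that size on every call would defeat the purpose of a persistent allocation.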

The remaining differences are implementation details that an IR impl could handle internally once the above are resolved (good first issues):

CuDNN-specific bucketing including sequence length and batch size padding (bucket_flashinfer_max_seqlen, add_padding_to_seqlens) for CUDA graph compatibility
An extended signature with sequence_lengths, q_scale/k_scale/v_scale, o_data_type, and workspace_buffer parameters not present in the pure op schema
Post-op head dimension slicing, where the FP8 path pads the head dimension for quantization and then slices the output back
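The bucketing and pad-then-slice steps listed above can be sketched as follows. The policies are assumptions made for illustration: the real bucket_flashinfer_max_seqlen and add_padding_to_seqlens are not shown on this page, so the power-of-two rounding, the zero-length padding entries, and the multiple-of-16 head dimension are hypothetical choices, as are all function names.

```python
from typing import List


def bucket_max_seqlen(max_seqlen: int, minimum: int = 128) -> int:
    """Round up by doubling from `minimum` (assumed policy, not FlashInfer's).

    With a power-of-two `minimum`, the result is the next power of two.
    """
    bucket = minimum
    while bucket < max_seqlen:
        bucket *= 2
    return bucket


def pad_seqlens(seqlens: List[int], batch_bucket: int) -> List[int]:
    """Pad the batch with zero-length sequences up to a fixed batch size."""
    if len(seqlens) > batch_bucket:
        raise ValueError("batch exceeds bucket")
    return seqlens + [0] * (batch_bucket - len(seqlens))


def pad_head_dim(row: List[float], multiple: int = 16) -> List[float]:
    """Zero-pad one head vector so its length is a multiple of `multiple`."""
    padded_len = -(-len(row) // multiple) * multiple  # ceiling to a multiple
    return row + [0.0] * (padded_len - len(row))


def slice_head_dim(row: List[float], head_dim: int) -> List[float]:
    """Drop the padding again after the kernel has run."""
    return row[:head_dim]
```

Fixed bucket sizes keep tensor shapes stable, which is what CUDA graph capture needs. Because both the padding and the slice-back happen around the kernel call, an IR implementation could keep them internal without changing the op schema.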

References

Draft PR: #41613

extent analysis

TL;DR

Implement IR workspace support to enable uniform dispatch of backends, including FlashInfer, via --kernel-config.ir_op_priority.mm_encoder_attn.

Guidance

Implement a mechanism for persistent allocations across calls to support FlashInfer's workspace buffer requirement.
Develop a strategy for handling FP8 quantization state, including mutable scale buffers and dynamic scale recomputation, within the IR implementation.
Review and address the implementation differences between FlashInfer and the pure IR implementation, such as CuDNN-specific bucketing and post-op head dimension slicing.
Verify that the IR workspace support and FlashInfer implementation changes do not introduce any regressions or compatibility issues with other backends.
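To show why the FP8 state in the second guidance item is hard to fit into a pure op, here is a minimal sketch of delayed scaling with a circular amax history. It is an assumption-laden illustration, not the logic of _record_amax_and_update_scales: AmaxScaleTracker, its history length, and the max-over-history rule are hypothetical. The constant 448.0 is the largest finite value of the float8 e4m3fn format.

```python
from collections import deque
from typing import Iterable

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


class AmaxScaleTracker:
    """Delayed-scaling sketch: the scale follows the max of recent amax values."""

    def __init__(self, history_len: int = 16) -> None:
        # deque with maxlen behaves as a circular buffer: old entries fall off.
        self.history = deque(maxlen=history_len)
        self.scale = 1.0

    def record(self, values: Iterable[float]) -> float:
        """Record this call's absolute maximum and recompute the scale."""
        amax = max(abs(v) for v in values)
        self.history.append(amax)
        running = max(self.history)
        if running > 0.0:
            self.scale = running / FP8_E4M3_MAX
        return self.scale
```

In this sketch the scale returned for a given input depends on what earlier calls recorded, so two calls with identical inputs can yield different scales. That call-to-call mutation is what a purely functional IR op has no place to store.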

Notes

FlashInfer's workspace buffer and FP8 scale state both persist across calls, so the IR workspace mechanism has to account for them before FlashInfer can dispatch through the same path as the other backends.

Recommendation

Implement IR workspace support first, then resolve the remaining FlashInfer implementation differences, so that every backend dispatches uniformly. This is a prerequisite rather than a workaround: until it lands, PR #41613 remains on hold.
