vllm - ✅(Solved) Fix [Feature]: [IR] mm_encoder_attn migration on hold pending FlashInfer workspace support [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#41724
Fetched 2026-05-06 06:15:15
View on GitHub
Comments
0
Participants
1
Timeline
4
Reactions
0
Participants
Timeline (top)
cross-referenced ×1, labeled ×1, mentioned ×1, subscribed ×1

PR #41613 migrates mm_encoder_attn to vLLM IR with three pure backends (flash_attn, triton, native) fully working and tested. That said, as @ProExpertProg rightly pointed out, the migration should wait until IR workspace support exists, so that all backends, including FlashInfer, dispatch uniformly via --kernel-config.ir_op_priority.mm_encoder_attn.

Root Cause

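FlashInfer cannot be a pure functional IR implementation today because it requires a persistent 128MB GPU workspace buffer passed to every call, and per-instance FP8 quantization state that mutates on every forward pass. The IR currently has no mechanism for allocations that persist across calls.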

Fix Action

Fix / Workaround

PR #41613 migrates mm_encoder_attn to vLLM IR with three pure backends (flash_attn, triton, native) fully working and tested. That said, as @ProExpertProg rightly pointed out, the migration should wait until IR workspace support exists, so that all backends, including FlashInfer, dispatch uniformly via --kernel-config.ir_op_priority.mm_encoder_attn.

PR fix notes

PR #41613: [On hold] feat(kernels): Migrate mm_encoder_attn to vLLM IR

Description (problem / solution / changelog)

Status

On hold pending IR workspace support (per maintainer feedback). Will rebase and complete once that lands.

Purpose

  • This PR migrates the multimodal encoder attention (MMEncoderAttention) to the new vLLM IR framework, as discussed in https://github.com/vllm-project/vllm/issues/32676
  • Followed existing IR op implementation patterns and the spec detailed in https://github.com/vllm-project/vllm/issues/32358
  • Added a test file following the IR pattern, and tested inference in native, triton, and default flash_attn modes.
  • Moved all hardware-specific routing in MMEncoderAttention into _call_ir_op and preserved the FlashInfer path to handle FP8 caching.

Test Plan

  • A > Existing MHA Tests (Regression): the following test harness was executed to check that no regressions were introduced in the existing MHA and FlashInfer paths: tests/kernels/attention/test_mha_attn.py
  • B > New IR Kernel Tests (Added): the following test harness was added in this PR and executed to check the functional robustness of the new IR op across various data types (FP16, BF16), batch sizes, and sequence lengths: tests/kernels/ir/test_mm_encoder_attn.py
  • C > IR Meta-Tests (Integration): the following test harness was added in this PR and executed to check that the new op is correctly registered in the IR framework: tests/kernels/ir/test_ir_ops.py
  • D > E2E Sanity (Inference): wrote a sanity check script and executed inference to check for E2E sanity across all modes: https://gist.github.com/harshaljanjani/432d9fb7aa0bfdc550a63774872ea073

Test Results

  • A, B and C: screenshot "Untitled design (3)", https://github.com/user-attachments/assets/985b7e78-5a92-4549-99af-c4b1bcb172ad
  • D, backend flash_attn: screenshot "2-1", https://github.com/user-attachments/assets/853e7879-2b65-4aea-ba5d-c37662191594
  • D, backend triton: screenshot "2-2", https://github.com/user-attachments/assets/57951cb1-2646-48dc-bc2c-1c3858e94d6f
  • D, backend native: screenshot "2-3", https://github.com/user-attachments/assets/1171fbf7-bdfd-4f33-a43b-9e7fea868118

Changed files

  • tests/kernels/ir/test_mm_encoder_attn.py (added, +174/-0)
  • vllm/config/kernel.py (modified, +3/-0)
  • vllm/ir/ops/__init__.py (modified, +2/-1)
  • vllm/ir/ops/attention.py (added, +80/-0)
  • vllm/kernels/__init__.py (modified, +2/-2)
  • vllm/kernels/attention_ops.py (added, +136/-0)
  • vllm/model_executor/layers/attention/mm_encoder_attention.py (modified, +25/-141)
  • vllm/platforms/cuda.py (modified, +6/-1)
  • vllm/platforms/rocm.py (modified, +6/-1)
  • vllm/platforms/xpu.py (modified, +3/-1)
RAW_BUFFER

🚀 The feature, motivation and pitch

Summary

PR #41613 migrates mm_encoder_attn to vLLM IR with three pure backends (flash_attn, triton, native) fully working and tested. That said, as @ProExpertProg rightly pointed out, the migration should wait until IR workspace support exists, so that all backends, including FlashInfer, dispatch uniformly via --kernel-config.ir_op_priority.mm_encoder_attn.

What's done (in #41613)

  • IR op definition with native semantics
  • Provider implementations for flash_attn and triton
  • Per-platform priority lists (CUDA, ROCm, XPU)
  • IrOpPriorityConfig.mm_encoder_attn field
  • Test suite: 205 passed, 32 skipped across dtypes, batch sizes, sequence lengths, and head configurations
  • E2E inference verified on all three backends
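As a rough illustration of what a per-op priority list buys, the sketch below picks the first supported provider from an ordered list. It is not vLLM code: `select_provider`, the `supported` mapping, and the example ordering are hypothetical names and values chosen only to show the dispatch idea.

```python
from typing import Dict, List


def select_provider(priority: List[str], supported: Dict[str, bool]) -> str:
    """Return the first provider in `priority` that is marked as supported.

    Hypothetical helper: vLLM's real IR dispatch is not shown in the issue.
    """
    for name in priority:
        if supported.get(name, False):
            return name
    raise RuntimeError(f"no supported provider among {priority}")


# Example ordering (an assumption, not the actual per-platform list).
example_priority = ["flash_attn", "triton", "native"]

# Pretend flash_attn is unavailable on this machine.
example_supported = {"flash_attn": False, "triton": True, "native": True}

chosen = select_provider(example_priority, example_supported)
```

With a scheme like this, a user-supplied priority list only reorders the candidates; the fallback to a later entry still happens when an earlier provider is unavailable.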

What's blocking: FlashInfer as an IR implementation

FlashInfer cannot be a pure functional IR implementation today because it requires:

  • A workspace buffer, a 128MB persistent GPU allocation (torch.zeros(128*1024*1024, dtype=uint8, device="cuda")) passed to every call, with no IR mechanism for persistent allocations across calls
  • FP8 quantization state, including per-instance mutable scale buffers (_fp8_q/k/v_scale), amax circular history, and dynamic scale recomputation (_record_amax_and_update_scales) that mutates on every forward pass
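The workspace requirement above can be sketched as a small cache that allocates one buffer per device and returns the same object on every later call. This is only a minimal sketch of what "persistent allocation across calls" means: `WorkspaceCache` is a hypothetical name, and a `bytearray` stands in for the `torch.zeros(..., dtype=uint8, device="cuda")` tensor quoted in the issue.

```python
from typing import Dict

# Size quoted in the issue: 128 * 1024 * 1024 bytes.
WORKSPACE_BYTES = 128 * 1024 * 1024


class WorkspaceCache:
    """Hypothetical per-device cache of persistent scratch buffers."""

    def __init__(self, nbytes: int = WORKSPACE_BYTES) -> None:
        self._nbytes = nbytes
        self._buffers: Dict[str, bytearray] = {}

    def get(self, device: str) -> bytearray:
        # Allocate lazily on first use, then reuse the same buffer object.
        buf = self._buffers.get(device)
        if buf is None:
            buf = bytearray(self._nbytes)  # stand-in for a zeroed uint8 tensor
            self._buffers[device] = buf
        return buf


# A small size keeps the example cheap to run.
cache = WorkspaceCache(nbytes=1024)
first = cache.get("cuda:0")
second = cache.get("cuda:0")
```

A pure functional op has nowhere to keep such a cache, which is why the issue asks for workspace support in the IR itself rather than in each implementation.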

The remaining differences are implementation details that an IR impl could handle internally once the above are resolved (good first issues):

  • CuDNN-specific bucketing including sequence length and batch size padding (bucket_flashinfer_max_seqlen, add_padding_to_seqlens) for CUDA graph compatibility
  • An extended signature with sequence_lengths, q_scale/k_scale/v_scale, o_data_type, and workspace_buffer parameters not present in the pure op schema
  • Post-op head dimension slicing, where the FP8 path pads the head dimension for quantization and then slices the output back
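The three implementation details above follow a common pad-then-slice pattern, sketched here on plain Python lists. The rules used (power-of-two bucketing, zero-length padding entries, padding the head dimension to a multiple) are assumptions for illustration; the issue does not spell out what `bucket_flashinfer_max_seqlen` or `add_padding_to_seqlens` actually do, and every function name below is hypothetical.

```python
from typing import List


def bucket_max_seqlen(max_seqlen: int) -> int:
    """Round up to the next power of two (assumed bucketing rule)."""
    if max_seqlen < 1:
        raise ValueError("max_seqlen must be positive")
    return 1 << (max_seqlen - 1).bit_length()


def pad_seqlens(seqlens: List[int], padded_batch: int) -> List[int]:
    """Append zero-length sequences until the batch has `padded_batch` entries."""
    if padded_batch < len(seqlens):
        raise ValueError("padded_batch is smaller than the real batch")
    return seqlens + [0] * (padded_batch - len(seqlens))


def pad_head_dim(row: List[float], multiple: int) -> List[float]:
    """Zero-pad one head vector so its length is a multiple of `multiple`."""
    padded_len = -(-len(row) // multiple) * multiple  # ceiling division
    return row + [0.0] * (padded_len - len(row))


def slice_head_dim(row: List[float], head_dim: int) -> List[float]:
    """Drop the padding again after the op has run."""
    return row[:head_dim]


head = [1.0] * 72
padded = pad_head_dim(head, 16)
restored = slice_head_dim(padded, len(head))
```

Bucketing keeps the number of distinct shapes small, which is what makes CUDA graph capture practical; the slice at the end restores the shape the caller expects.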

References

  • Draft PR: #41613

extent analysis

TL;DR

Implement IR workspace support to enable uniform dispatch of backends, including FlashInfer, via --kernel-config.ir_op_priority.mm_encoder_attn.

Guidance

  • Implement a mechanism for persistent allocations across calls to support FlashInfer's workspace buffer requirement.
  • Develop a strategy for handling FP8 quantization state, including mutable scale buffers and dynamic scale recomputation, within the IR implementation.
  • Review and address the implementation differences between FlashInfer and the pure IR implementation, such as CuDNN-specific bucketing and post-op head dimension slicing.
  • Verify that the IR workspace support and FlashInfer implementation changes do not introduce any regressions or compatibility issues with other backends.
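To make the FP8 state in the guidance concrete, the sketch below keeps a fixed-length amax history and recomputes a scale on every call. It assumes a delayed-scaling recipe with the float8 E4M3 maximum of 448; the real `_record_amax_and_update_scales` may work differently, and `AmaxScaleState` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Iterable

FP8_E4M3_MAX = 448.0  # largest finite float8 e4m3fn value


class AmaxScaleState:
    """Hypothetical mutable state: circular amax history plus current scale."""

    def __init__(self, history_len: int = 16) -> None:
        # deque(maxlen=...) drops the oldest entry once the history is full.
        self._history: Deque[float] = deque(maxlen=history_len)
        self.scale = 1.0

    def record(self, values: Iterable[float]) -> float:
        """Record this call's amax and recompute the scale from the history."""
        amax = max((abs(v) for v in values), default=0.0)
        self._history.append(amax)
        running = max(self._history)
        if running > 0.0:
            self.scale = running / FP8_E4M3_MAX
        return self.scale


state = AmaxScaleState(history_len=2)
state.record([1.0, -448.0])  # history: [448.0]
state.record([0.5])          # history: [448.0, 0.5]
state.record([0.25])         # history: [0.5, 0.25], the 448.0 entry is evicted
```

Because `record` changes the object on every forward pass, the same inputs can produce different scales over time, which is exactly what a pure functional op cannot express.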

Notes

PR #41613 already routes the three pure backends (flash_attn, triton, native) through the IR op and preserves the existing FlashInfer path. Uniform dispatch additionally requires that the IR can hold FlashInfer's persistent workspace buffer and its mutable FP8 quantization state.

Recommendation

Implement IR workspace support first, then resolve the remaining FlashInfer implementation differences, so that FlashInfer can be registered as an IR implementation and dispatched like the other backends. PR #41613 can then be rebased and completed.
