vllm - 💡(How to fix) Fix [Feature]: Helion Kernels for MLA (Decode) [1 participants]

GitHub stats
vllm-project/vllm#41278 (fetched 2026-04-30 06:19:08)
Comments: 0
Participants: 1
Timeline: 5
Reactions: 0
Timeline (top): added_to_project_v2 ×1, cross-referenced ×1, labeled ×1, parent_issue_added ×1

🚀 The feature, motivation and pitch

Sub-issue of #25179 to track custom PyTorch Helion kernels for Multi-head Latent Attention (MLA) + fp8 quantization. Helion guarantees that a kernel compiles down to a single Triton kernel, and the Helion autotuner can search for an optimal configuration for that kernel. For complex, fused kernels such as MLA + dynamic fp8 quant, this can be a low-effort, high-reward scenario, because Helion is an abstract, PyTorch-native module.

Implementation Progress

  • Study FlashMLA and other MLA implementations
  • Implement MLA Decode + dynamic fp8 per token quant
  • Implement MLA Decode + dynamic fp8 per block quant
  • Provide autotuned configs for different hardware
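
As a reference for the two quantization items above, the scale arithmetic can be sketched in NumPy. This is a minimal sketch under assumptions, not the Helion kernel tracked by this issue: the function names, the 128-wide default block, and the `eps` floor are hypothetical, and values are only scaled and clipped to the fp8 e4m3 range (largest finite value 448) rather than cast to a real fp8 dtype.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the fp8 e4m3 format


def quant_per_token(x, eps=1e-12):
    """Dynamic quantization with one scale per row (token).

    Returns (q, scale): q is x divided by its row scale and clipped to the
    fp8 range; scale has shape (tokens, 1). No rounding to fp8 is simulated.
    """
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, eps) / FP8_E4M3_MAX  # eps avoids division by zero
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale


def quant_per_block(x, block=128, eps=1e-12):
    """Dynamic quantization with one scale per `block` contiguous elements.

    Returns (q, scale): q has the shape of x; scale has shape
    (tokens, hidden // block).
    """
    tokens, hidden = x.shape
    assert hidden % block == 0, "hidden size must be a multiple of block"
    xb = x.reshape(tokens, hidden // block, block)
    amax = np.abs(xb).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, eps) / FP8_E4M3_MAX
    q = np.clip(xb / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(tokens, hidden), scale.squeeze(-1)
```

Per-token quantization keeps one scale per row, while per-block quantization keeps one scale per `block` contiguous elements within a row, which follows outliers more closely at the cost of storing more scales.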

Alternatives

No response

Additional context

No response


extent analysis

TL;DR

Implementing custom pytorch-helion kernels for MultiHead Latent Attention with fp8 quantization may require studying existing MLA implementations and utilizing Helion's autotuner for optimal configuration.

Guidance

  • Study FlashMLA and other MLA implementations to understand their architecture and quantization strategies.
  • Utilize Helion's autotuner to find optimal configurations for the custom kernels on different hardware.
  • Implement MLA Decode with dynamic fp8 quantization per token and per block to achieve the desired functionality.
  • Provide autotuned configs for different hardware to ensure compatibility and performance.
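
When studying MLA implementations, a small NumPy reference for the decode step can be handy for checking a kernel's output. The sketch below assumes the weight-absorbed formulation used in DeepSeek-style MLA, where every head attends over one shared latent KV cache plus a shared rotary key part. All names and shapes are illustrative, not taken from vLLM, FlashMLA, or Helion, and the output is left in latent space (the value up-projection is omitted).

```python
import numpy as np


def mla_decode_reference(q_latent, q_rope, kv_latent, k_rope, scale):
    """Single-token MLA decode in latent space (illustrative reference).

    q_latent:  (heads, d_latent) query with the key up-projection absorbed
    q_rope:    (heads, d_rope)   rotary part of the query
    kv_latent: (seq, d_latent)   compressed KV cache shared by all heads
    k_rope:    (seq, d_rope)     rotary key part shared by all heads
    scale:     softmax scaling factor, supplied by the caller
    Returns an array of shape (heads, d_latent).
    """
    # Scores combine the latent and rotary dot products: (heads, seq)
    scores = (q_latent @ kv_latent.T + q_rope @ k_rope.T) * scale
    # Numerically stable softmax over the sequence dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    # The latent cache doubles as the value matrix
    return probs @ kv_latent
```

A fused kernel that also quantizes its output would be compared against this result after dequantization, with a tolerance suited to fp8.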

Notes

The implementation checklist indicates that kernel development is still in progress. The autotuner is expected to play a central role in optimizing kernel performance.

Recommendation

Use Helion's autotuner to find optimal configurations for the custom kernels. According to the issue, Helion guarantees compilation down to a single Triton kernel, which makes this a low-effort, high-reward approach.
