vllm - ✅(Solved) Fix [Feature]: fused RMSNorm + fp8 block quantized kernel in Helion [1 pull requests, 1 comments, 1 participants]

aman-coder03 · 2026-03-25T04:58:55Z


GitHub stats

vllm-project/vllm#38071 • Fetched 2026-04-08 01:26:47

Author

aman-coder03

Participants

aman-coder03

Timeline (top)

commented ×1, cross-referenced ×1, labeled ×1, mentioned ×1

PR fix notes

PR #38211: [Feature]: fused RMSNorm + fp8 block quantized kernel in Helion

Repository: vllm-project/vllm
Author: aman-coder03
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38211

Description (problem / solution / changelog)

Purpose

Add a fused RMSNorm + fp8 block (group) quantization kernel in Helion as a portable alternative to the existing CUDA kernel (rms_norm_per_block_quant).

This is a sub-task of #25179 and closes #38071.

Unlike the existing CUDA kernel, this Helion kernel is portable across hardware (Hopper, AMD, etc.) and avoids a combinatorial explosion of platform-specific kernels. Scales are computed dynamically inside the kernel, so no pre-computed scale is needed.
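
For orientation, the fused operation can be written out as follows. This is a sketch of the usual formulation and is not taken from the PR: the symbols (input $x$, weight $w$, hidden size $H$, epsilon $\epsilon$, group $g$ of `group_size` contiguous elements) are assumptions, as is $q_{\max} = 448$, the largest finite value of the float8_e4m3fn format. The kernel's exact clamping and rounding may differ.

```latex
y_i = \frac{x_i}{\sqrt{\frac{1}{H}\sum_{j=1}^{H} x_j^{2} + \epsilon}}\; w_i,
\qquad
s_g = \frac{\max_{i \in g} \lvert y_i \rvert}{q_{\max}},
\qquad
q_i = \operatorname{fp8}\!\left(\frac{y_i}{s_{g(i)}}\right)
```

Under these assumptions, a hidden size of 4096 with group_size=128 yields 4096 / 128 = 32 scales per token, each derived from the normalized values rather than supplied by the caller.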

Test Plan

pytest tests/kernels/helion/test_rms_norm_fp8_block_quant.py -v

Test Result

config picker tests: verify correct config selection for exact match, closest hidden size, num_tokens ceiling, fallback to largest, fallback to default, no configs, and malformed key error
correctness tests: Helion output compared against existing CUDA baseline (rms_norm_per_block_quant) across batch sizes [1, 8, 32, 128], hidden sizes [2048, 4096, 8192], group_size=128, dtypes [float16, bfloat16]
shape tests: verify output and scale shapes across various input shapes
integration tests: verify kernel registration and input generator
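
The config picker cases listed above (exact match, closest hidden size, num_tokens ceiling, fallback to largest, no configs, malformed key) can be illustrated with a small sketch. This is not the PR's code: the key format `hidden_size_<H>_num_tokens_<T>`, the function names, and returning `None` when no configs exist are all hypothetical choices made for illustration.

```python
from typing import Optional


def parse_key(key: str) -> tuple[int, int]:
    """Parse a hypothetical 'hidden_size_<H>_num_tokens_<T>' key."""
    parts = key.split("_")
    # Expected layout: ["hidden", "size", H, "num", "tokens", T]
    if (
        len(parts) != 6
        or parts[:2] != ["hidden", "size"]
        or parts[3:5] != ["num", "tokens"]
    ):
        raise ValueError(f"malformed config key: {key!r}")
    try:
        return int(parts[2]), int(parts[5])
    except ValueError:
        raise ValueError(f"malformed config key: {key!r}") from None


def pick_config(keys: list[str], hidden_size: int, num_tokens: int) -> Optional[str]:
    """Return the best-matching key, or None so the caller can use a default."""
    if not keys:
        return None
    parsed = {key: parse_key(key) for key in keys}
    # 1. Closest hidden size; an exact match has distance 0.
    best_hidden = min(
        {h for h, _ in parsed.values()},
        key=lambda h: (abs(h - hidden_size), h),
    )
    candidates = sorted(
        (t, key) for key, (h, t) in parsed.items() if h == best_hidden
    )
    # 2. Smallest num_tokens that is >= the request (the "ceiling") ...
    for tokens, key in candidates:
        if tokens >= num_tokens:
            return key
    # 3. ... otherwise fall back to the largest available num_tokens.
    return candidates[-1][1]
```

Resolving the hidden size first and the token count second keeps each fallback rule independent, which is what makes the individual cases easy to test in isolation.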

cc @ProExpertProg

Essential Elements of an Effective PR Description Checklist

[x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
[ ] The test plan, such as providing test command.
[ ] The test results, such as pasting the results comparison before and after, or e2e results
[ ] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
[ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc (https://docs.google.com/document/d/1YyVqrgX4gHTtrstbq8oWUImOyPCKSGnJ7xtTpmXzlRs/edit?tab=t.0).

Changed files

tests/kernels/helion/test_rms_norm_fp8_block_quant.py (added, +232/-0)
vllm/kernels/helion/ops/rms_norm_fp8_block_quant.py (added, +189/-0)


🚀 The feature, motivation and pitch

I'm working on expanding fused kernel support as tracked in #25179 and would like to implement a fused RMSNorm + fp8 block quantization kernel in Helion. A CUDA kernel for this already exists, but a Helion version would improve portability across hardware (Hopper, AMD, etc.) and help avoid a combinatorial explosion of platform-specific kernels. This is the RMSNorm counterpart to #36972 (fused SiLU + fp8 block quant in Helion).

Alternatives

The existing CUDA kernel works but is not portable. A Triton kernel is another option, but Helion is preferred per the maintainers for better cross-platform support.

Additional context

Related issues: #25179, #36972, #27847

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

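To implement a fused RMSNorm + fp8 block quantization kernel in Helion, follow these steps:

Create a new Helion kernel function that combines RMSNorm and fp8 block quantization.
Use the existing CUDA kernel (rms_norm_per_block_quant) as the reference for the implementation.
Write the normalization and quantization math directly in the kernel body, computing the scales dynamically so that no pre-computed scale is needed.

Example code snippet (a plain PyTorch reference for the math the kernel fuses, not Helion code; the function name, the eps and group_size defaults, and the float8_e4m3fn target are assumptions):

import torch

def rms_norm_fp8_block_quant_ref(x, weight, eps=1e-6, group_size=128):
    # RMSNorm in float32 over the last (hidden) dimension
    xf = x.float()
    normed = xf * torch.rsqrt(xf.pow(2).mean(dim=-1, keepdim=True) + eps) * weight.float()

    # Dynamic scales: one per group of group_size contiguous elements
    # (the hidden size must be divisible by group_size)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    groups = normed.reshape(*x.shape[:-1], -1, group_size)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / fp8_max).clamp(min=1e-10)

    # Quantize to fp8 and return the values together with their scales
    q = (groups / scales).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return q.reshape(x.shape), scales.squeeze(-1)

Register the new kernel in the existing codebase; the PR adds it as vllm/kernels/helion/ops/rms_norm_fp8_block_quant.py.
Test the new kernel across input shapes and dtypes to ensure correctness and portability.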

Verification

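To verify the fix, run the PR's test file and compare the Helion output against the existing CUDA baseline (rms_norm_per_block_quant). The PR covers batch sizes [1, 8, 32, 128], hidden sizes [2048, 4096, 8192], group_size=128, and dtypes float16 and bfloat16. Because portability is the goal, repeat the run on each target platform (e.g., Hopper, AMD) and check that the outputs and scales stay consistent with the baseline.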
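
One way to make "consistent with the baseline" concrete is to dequantize both results and bound the relative error. The sketch below is framework-free and purely illustrative: the helper names, the nested-list layout, and the idea of comparing after dequantization are assumptions, since the PR description does not say how its tests compare outputs.

```python
def dequantize(q_rows, scale_rows, group_size):
    """Multiply each quantized value by the scale of the group it belongs to."""
    return [
        [q * scales[i // group_size] for i, q in enumerate(row)]
        for row, scales in zip(q_rows, scale_rows)
    ]


def max_rel_error(candidate, baseline, floor=1e-6):
    """Largest element-wise relative error between two nested lists."""
    worst = 0.0
    for cand_row, base_row in zip(candidate, baseline):
        for c, b in zip(cand_row, base_row):
            # The floor avoids dividing by zero for baseline values near 0.
            worst = max(worst, abs(c - b) / max(abs(b), floor))
    return worst


# One token, hidden size 4, group_size 2: two scales for the row.
helion_like = dequantize([[1.0, 2.0, 4.0, 8.0]], [[0.5, 0.25]], group_size=2)
baseline = [[0.5, 1.0, 1.0, 2.1]]
error = max_rel_error(helion_like, baseline)
```

An acceptable threshold for `error` depends on the fp8 format and input dtype, so it is left to the caller here rather than hard-coded.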

Extra Tips

Refer to the Helion documentation for how kernels are written and configured.
Use the chatbot on the vLLM documentation page (https://docs.vllm.ai/en/latest/) for frequently asked questions and troubleshooting.
