vllm - ✅(Solved) Fix [Feature]: fused RMSNorm + fp8 block quantized kernel in Helion [1 pull requests, 1 comments, 1 participants]

GitHub stats
vllm-project/vllm#38071 (fetched 2026-04-08 01:26:47)
Comments: 1
Participants: 1
Timeline events: 5
Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, mentioned ×1

PR fix notes

PR #38211: [Feature]: fused RMSNorm + fp8 block quantized kernel in Helion

Description (problem / solution / changelog)

Purpose

Add a fused RMSNorm + fp8 block (group) quantization kernel in Helion as a portable alternative to the existing CUDA kernel (rms_norm_per_block_quant).

This is a sub-task of #25179 and closes #38071.

Unlike the existing CUDA kernel, the Helion kernel is portable across hardware (Hopper, AMD, etc.) and avoids a combinatorial explosion of platform-specific kernels. Scales are computed dynamically inside the kernel, so no pre-computed scale is needed.
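To make "scales are computed dynamically" concrete, here is a minimal pure-Python sketch of per-group scale computation on one already-normalized row. The constant 448.0 (the float8 e4m3fn maximum), the 1e-10 floor, and the helper name `group_scales` are assumptions for illustration; the PR's kernel may use a different fp8 variant or clamping.

```python
FP8_E4M3_MAX = 448.0  # assumed target format: float8 e4m3fn

def group_scales(normed, group_size):
    """Return one dynamic scale per contiguous group: max(|x|) / FP8_E4M3_MAX."""
    scales = []
    for start in range(0, len(normed), group_size):
        group = normed[start:start + group_size]
        amax = max(abs(v) for v in group)
        # Floor the max so an all-zero group does not produce a zero scale.
        scales.append(max(amax, 1e-10) / FP8_E4M3_MAX)
    return scales

# Hypothetical normalized row with two groups of two elements each.
normed = [3.0, -4.0, 0.5, 0.25]
scales = group_scales(normed, group_size=2)

# Dividing by the group's scale maps each group into the fp8 range.
scaled = [v / scales[i // 2] for i, v in enumerate(normed)]
```

Because each scale is derived from the data in its own group, the caller only passes the input and the group size; nothing has to be calibrated ahead of time.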

Test Plan

pytest tests/kernels/helion/test_rms_norm_fp8_block_quant.py -v

Test Result

  • config picker tests: verify correct config selection for exact match, closest hidden size, num_tokens ceiling, fallback to largest, fallback to default, no configs, and malformed key error
  • correctness tests: Helion output compared against existing CUDA baseline (rms_norm_per_block_quant) across batch sizes [1, 8, 32, 128], hidden sizes [2048, 4096, 8192], group_size=128, dtypes [float16, bfloat16]
  • shape tests: verify output and scale shapes across various input shapes
  • integration tests: verify kernel registration and input generator
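The config picker cases listed above can be sketched as follows. This is a hypothetical illustration only: the key format `"<hidden_size>_<num_tokens>"`, the `"default"` key, and the function name `pick_config` are assumptions, not the PR's actual implementation.

```python
def pick_config(configs, hidden_size, num_tokens):
    """Return the key of the best-matching config, or None if there are none."""
    if not configs:
        return None  # "no configs" case

    parsed = {}
    for key in configs:
        if key == "default":
            continue
        parts = key.split("_")
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            raise ValueError(f"malformed config key: {key!r}")
        parsed[(int(parts[0]), int(parts[1]))] = key

    if not parsed:
        # Only a default entry exists: fall back to it.
        return "default" if "default" in configs else None

    # Exact or closest hidden size (ties resolve to the smaller size).
    hidden_sizes = sorted({h for h, _ in parsed})
    best_hidden = min(hidden_sizes, key=lambda h: abs(h - hidden_size))

    # Smallest configured num_tokens >= requested (ceiling), else the largest.
    tokens = sorted(t for h, t in parsed if h == best_hidden)
    chosen = next((t for t in tokens if t >= num_tokens), tokens[-1])
    return parsed[(best_hidden, chosen)]

configs = {"4096_32": {}, "4096_128": {}, "2048_32": {}, "default": {}}
exact = pick_config(configs, 4096, 32)       # exact match
nearest = pick_config(configs, 4000, 33)     # closest hidden size, token ceiling
largest = pick_config(configs, 4096, 1000)   # falls back to the largest entry
```

Each branch maps to one of the listed test cases, which makes it straightforward to write one assertion per case.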

cc @ProExpertProg



Changed files

  • tests/kernels/helion/test_rms_norm_fp8_block_quant.py (added, +232/-0)
  • vllm/kernels/helion/ops/rms_norm_fp8_block_quant.py (added, +189/-0)

🚀 The feature, motivation and pitch

I'm working on expanding fused kernel support as tracked in #25179 and would like to implement a fused RMSNorm + fp8 block quantization kernel in Helion. A CUDA kernel for this already exists, but a Helion version would improve portability across hardware (Hopper, AMD, etc.) and help avoid a combinatorial explosion of platform-specific kernels. This is the RMSNorm counterpart to #36972 (fused SiLU + fp8 block quant in Helion).

Alternatives

The existing CUDA kernel works but is not portable. A Triton kernel is another option, but Helion is preferred per the maintainers for better cross-platform support.

Additional context

Related issues: #25179, #36972, #27847


extent analysis

Fix Plan

To implement a fused RMSNorm + fp8 block quantization kernel in Helion, follow these steps:

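  • Create a new Helion kernel function that combines RMSNorm and fp8 block quantization in a single pass.
  • Use the existing CUDA kernel (rms_norm_per_block_quant) as the reference for semantics and expected output.
  • Express the normalization and quantization math with ordinary PyTorch operations inside the kernel. Helion is a kernel-authoring DSL, not a library of ready-made RMSNorm or quantization functions.

Example reference in plain PyTorch (a sketch of the math the kernel fuses, not the Helion kernel from the PR; the function name, the eps value, and the float8 e4m3fn output type are assumptions):

import torch

def rms_norm_fp8_block_quant_ref(x, weight, eps=1e-6, group_size=128):
    # RMSNorm in float32 over the hidden (last) dimension
    xf = x.float()
    normed = xf * torch.rsqrt(xf.pow(2).mean(dim=-1, keepdim=True) + eps) * weight.float()

    # One dynamic scale per group of `group_size` contiguous elements
    # (the hidden size must be divisible by group_size)
    groups = normed.view(*normed.shape[:-1], -1, group_size)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10) / fp8_max

    # Quantize to fp8 and return the values plus the per-group scales
    q = (groups / scales).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return q.view_as(normed), scales.squeeze(-1)
  • Integrate the new kernel function into the existing codebase.
  • Test the new kernel function with various input tensors to ensure correctness and portability.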

Verification

To verify the fix, test the new kernel function on different hardware platforms (e.g., Hopper, AMD) and compare the results with the existing CUDA kernel. Ensure that the output is consistent and accurate.
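One way to make that comparison robust is to dequantize both outputs and compare real values within a tolerance, since two kernels can legitimately differ by a rounding step in the scale or in the fp8 value. The sketch below is pure Python with fp8 values represented as plain floats; the helper names, sample data, and the 1e-3 tolerance are illustrative assumptions, not values from the PR.

```python
def dequantize(q_row, scales, group_size):
    """Map quantized values back to real values using one scale per group."""
    return [q * scales[i // group_size] for i, q in enumerate(q_row)]

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length rows."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical outputs from the baseline kernel and the new kernel for one row.
q_ref, s_ref = [448.0, -224.0, 112.0, 448.0], [0.01, 0.002]
q_new, s_new = [448.0, -224.0, 112.0, 448.0], [0.0100001, 0.002]

diff = max_abs_diff(
    dequantize(q_ref, s_ref, group_size=2),
    dequantize(q_new, s_new, group_size=2),
)
within_tolerance = diff < 1e-3  # tolerance chosen for illustration only
```

Running the same check per platform gives a single number to track, rather than a byte-for-byte match that small scale differences would break.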

Extra Tips

  • Refer to the Helion documentation for details on writing and tuning kernels.
  • Use the chatbot on the documentation page for frequently asked questions and troubleshooting.
