vllm - 💡(How to fix) Fix [RFC]: Convert Triton kernels from raw pointers to block pointers [1 participants]

ohadeytan · 2026-04-21T11:24:55Z

[vllm] Motivation. vLLM's core dense-inference Triton kernels currently use raw pointer arithmetic tl.load ptr + offset, mask=... . Triton's recommended and fo… ### Motivation. vLLM's core dense-inference Triton kernels currently use **raw pointer arithmetic** (`tl.load(ptr + offset, mask=...)`). Triton's recommended and forward-looking memory access API is **block pointers** (`tl.make_block_ptr` / `tl.advance`), which: 1. **Enable hardware portability** — Block pointers abstract away raw address arithmetic, allowing Triton backends targeting non-NVIDIA hardware (e.g., Intel XPU, IBM Spyre/AIU) to lower these kernels. Some accelerators with tiled memory architectures fundamentally cannot accept raw pointer arithmetic. 2. **Leverage Hopper TMA** — On NVIDIA Hopper (H100/H200), block pointers can be lowered to Tensor Memory Accelerator (TMA) instructions, potentially improving memory throughput. 3. **Align with Triton's direction** — The Triton project is moving toward structured memory access as the primary API. 4. **Improve readability** — Block pointer code separates memory layout (shape, strides, offsets) from computation (load, compute, store). Some newer vLLM kernels (FLA, Mamba, LoRA) already use block pointers, but the core inference-path kernels do not. ### Proposed Change. Convert the following simple, element-wise kernels to use block pointers: | Kernel | File | Notes | |--------|------|-------| | RMSNorm (`_rms_norm_kernel`) | `model_executor/layers/batch_invariant.py` | Row-wise reduction, straightforward 1D block ptr | | SwiGLU (`_swiglustep_and_mul_kernel`) | `model_executor/layers/activation.py` | Element-wise activation | | Log-softmax (`_topk_log_softmax_kernel`) | `v1/worker/gpu/sample/logprob.py` | Row-wise reduction | | Ranks (`_ranks_kernel`) | `v1/worker/gpu/sample/logprob.py` | Simple comparison kernel | | MRoPE (`_triton_mrope_forward`) | `model_executor/layers/rotary_embedding/mrope.py` | Per-element with sin/cos tables | These are mechanical conversions — replacing `tl.load(ptr + offset, mask=...)` with `tl.load(block_ptr, boundary_check=...)` — and carry low risk of behavioral change. ### Validation approach - Per-kernel numerical equivalence tests comparing raw-pointer and block-pointer outputs across multiple shapes and dtypes (fp16, bf16, fp32). - Performance benchmarking on A100/H100 to confirm no regression. - Integration test via `vllm.LLM.generate()` to verify identical output tokens. ### Prior work We have working block-pointer conversions for some of these kernels with numerical equivalence tests, and have validated the conversion approach across kernels of varying complexity. ### Future work If there is interest, the same conversion could be extended to the more complex kernels on the dense inference path — KV cache reshape, decode softmax+reduceV, merge attention states, and the attention kernels (prefill, decode stage 1, GQA decode, chunked prefill/paged decode). ### Questions for maintainers 1. Is there interest in accepting block-pointer conversions of these kernels? 2. Any preference on `tl.make_block_ptr` vs `tl.make_tensor_descriptor`? The former is more stable; the latter is newer and maps to TMA on Hopper. 3. Should the block-pointer versions replace the raw-pointer versions in-place, or coexist as an alternative backend (e.g., selectable via config)? 4. Any specific performance benchmarking requirements or hardware targets we should cover? ### Feedback Period. _No response_ ### CC List. _No response_ ### Any Other Things. _No response_ ### Before submitting a new issue... - [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

vllm2026-04-21 11:24:55

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#40458•Fetched 2026-04-22 07:45:28

View on GitHub

Comments

Participants

Timeline

Reactions

Author

ohadeytan

Participants

ohadeytan

Timeline (top)

labeled ×1

RAW_BUFFERClick to expand / collapse

Motivation.

vLLM's core dense-inference Triton kernels currently use raw pointer arithmetic (tl.load(ptr + offset, mask=...)). Triton's recommended and forward-looking memory access API is block pointers (tl.make_block_ptr / tl.advance), which:

Enable hardware portability — Block pointers abstract away raw address arithmetic, allowing Triton backends targeting non-NVIDIA hardware (e.g., Intel XPU, IBM Spyre/AIU) to lower these kernels. Some accelerators with tiled memory architectures fundamentally cannot accept raw pointer arithmetic.
Leverage Hopper TMA — On NVIDIA Hopper (H100/H200), block pointers can be lowered to Tensor Memory Accelerator (TMA) instructions, potentially improving memory throughput.
Align with Triton's direction — The Triton project is moving toward structured memory access as the primary API.
Improve readability — Block pointer code separates memory layout (shape, strides, offsets) from computation (load, compute, store).

Some newer vLLM kernels (FLA, Mamba, LoRA) already use block pointers, but the core inference-path kernels do not.

Proposed Change.

Convert the following simple, element-wise kernels to use block pointers:

Kernel	File	Notes
RMSNorm (`_rms_norm_kernel`)	`model_executor/layers/batch_invariant.py`	Row-wise reduction, straightforward 1D block ptr
SwiGLU (`_swiglustep_and_mul_kernel`)	`model_executor/layers/activation.py`	Element-wise activation
Log-softmax (`_topk_log_softmax_kernel`)	`v1/worker/gpu/sample/logprob.py`	Row-wise reduction
Ranks (`_ranks_kernel`)	`v1/worker/gpu/sample/logprob.py`	Simple comparison kernel
MRoPE (`_triton_mrope_forward`)	`model_executor/layers/rotary_embedding/mrope.py`	Per-element with sin/cos tables

These are mechanical conversions — replacing tl.load(ptr + offset, mask=...) with tl.load(block_ptr, boundary_check=...) — and carry low risk of behavioral change.

Validation approach

Per-kernel numerical equivalence tests comparing raw-pointer and block-pointer outputs across multiple shapes and dtypes (fp16, bf16, fp32).
Performance benchmarking on A100/H100 to confirm no regression.
Integration test via vllm.LLM.generate() to verify identical output tokens.

Prior work

We have working block-pointer conversions for some of these kernels with numerical equivalence tests, and have validated the conversion approach across kernels of varying complexity.

Future work

If there is interest, the same conversion could be extended to the more complex kernels on the dense inference path — KV cache reshape, decode softmax+reduceV, merge attention states, and the attention kernels (prefill, decode stage 1, GQA decode, chunked prefill/paged decode).

Questions for maintainers

Is there interest in accepting block-pointer conversions of these kernels?
Any preference on tl.make_block_ptr vs tl.make_tensor_descriptor? The former is more stable; the latter is newer and maps to TMA on Hopper.
Should the block-pointer versions replace the raw-pointer versions in-place, or coexist as an alternative backend (e.g., selectable via config)?
Any specific performance benchmarking requirements or hardware targets we should cover?

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Convert the specified kernels to use Triton's block pointers instead of raw pointer arithmetic to improve hardware portability and leverage Hopper TMA.

Guidance

Identify the kernels to be converted, which include RMSNorm, SwiGLU, Log-softmax, Ranks, and MRoPE, and replace tl.load(ptr + offset, mask=...) with tl.load(block_ptr, boundary_check=...).
Perform numerical equivalence tests to compare raw-pointer and block-pointer outputs across multiple shapes and dtypes.
Conduct performance benchmarking on A100/H100 to confirm no regression.
Consider using tl.make_block_ptr for stability or tl.make_tensor_descriptor for newer hardware support.

Example

No code snippet is provided as the conversion is mechanical and depends on the specific kernel implementation.

Notes

The conversion approach has been validated across kernels of varying complexity, and the block-pointer versions can potentially replace the raw-pointer versions in-place or coexist as an alternative backend.

Recommendation

Apply the block-pointer conversion to the specified kernels, as it aligns with Triton's direction and improves hardware portability and readability.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #prompt formatting #chain error #conversation history #tool integration

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [RFC]: Convert Triton kernels from raw pointers to block pointers [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Motivation.

Proposed Change.

Validation approach

Prior work

Future work

Questions for maintainers

Feedback Period.

CC List.

Any Other Things.

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [RFC]: Convert Triton kernels from raw pointers to block pointers [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Motivation.

Proposed Change.

Validation approach

Prior work

Future work

Questions for maintainers

Feedback Period.

CC List.

Any Other Things.

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING