vllm - 💡(How to fix) Fix [RFC]: Convert Triton kernels from raw pointers to block pointers [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#40458Fetched 2026-04-22 07:45:28
View on GitHub
Comments
0
Participants
1
Timeline
1
Reactions
0
Author
Participants
Timeline (top)
labeled ×1
RAW_BUFFERClick to expand / collapse

Motivation.

vLLM's core dense-inference Triton kernels currently use raw pointer arithmetic (tl.load(ptr + offset, mask=...)). Triton's recommended and forward-looking memory access API is block pointers (tl.make_block_ptr / tl.advance), which:

  1. Enable hardware portability — Block pointers abstract away raw address arithmetic, allowing Triton backends targeting non-NVIDIA hardware (e.g., Intel XPU, IBM Spyre/AIU) to lower these kernels. Some accelerators with tiled memory architectures fundamentally cannot accept raw pointer arithmetic.

  2. Leverage Hopper TMA — On NVIDIA Hopper (H100/H200), block pointers can be lowered to Tensor Memory Accelerator (TMA) instructions, potentially improving memory throughput.

  3. Align with Triton's direction — The Triton project is moving toward structured memory access as the primary API.

  4. Improve readability — Block pointer code separates memory layout (shape, strides, offsets) from computation (load, compute, store).

Some newer vLLM kernels (FLA, Mamba, LoRA) already use block pointers, but the core inference-path kernels do not.

Proposed Change.

Convert the following simple, element-wise kernels to use block pointers:

KernelFileNotes
RMSNorm (_rms_norm_kernel)model_executor/layers/batch_invariant.pyRow-wise reduction, straightforward 1D block ptr
SwiGLU (_swiglustep_and_mul_kernel)model_executor/layers/activation.pyElement-wise activation
Log-softmax (_topk_log_softmax_kernel)v1/worker/gpu/sample/logprob.pyRow-wise reduction
Ranks (_ranks_kernel)v1/worker/gpu/sample/logprob.pySimple comparison kernel
MRoPE (_triton_mrope_forward)model_executor/layers/rotary_embedding/mrope.pyPer-element with sin/cos tables

These are mechanical conversions — replacing tl.load(ptr + offset, mask=...) with tl.load(block_ptr, boundary_check=...) — and carry low risk of behavioral change.

Validation approach

  • Per-kernel numerical equivalence tests comparing raw-pointer and block-pointer outputs across multiple shapes and dtypes (fp16, bf16, fp32).
  • Performance benchmarking on A100/H100 to confirm no regression.
  • Integration test via vllm.LLM.generate() to verify identical output tokens.

Prior work

We have working block-pointer conversions for some of these kernels with numerical equivalence tests, and have validated the conversion approach across kernels of varying complexity.

Future work

If there is interest, the same conversion could be extended to the more complex kernels on the dense inference path — KV cache reshape, decode softmax+reduceV, merge attention states, and the attention kernels (prefill, decode stage 1, GQA decode, chunked prefill/paged decode).

Questions for maintainers

  1. Is there interest in accepting block-pointer conversions of these kernels?
  2. Any preference on tl.make_block_ptr vs tl.make_tensor_descriptor? The former is more stable; the latter is newer and maps to TMA on Hopper.
  3. Should the block-pointer versions replace the raw-pointer versions in-place, or coexist as an alternative backend (e.g., selectable via config)?
  4. Any specific performance benchmarking requirements or hardware targets we should cover?

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Convert the specified kernels to use Triton's block pointers instead of raw pointer arithmetic to improve hardware portability and leverage Hopper TMA.

Guidance

  • Identify the kernels to be converted, which include RMSNorm, SwiGLU, Log-softmax, Ranks, and MRoPE, and replace tl.load(ptr + offset, mask=...) with tl.load(block_ptr, boundary_check=...).
  • Perform numerical equivalence tests to compare raw-pointer and block-pointer outputs across multiple shapes and dtypes.
  • Conduct performance benchmarking on A100/H100 to confirm no regression.
  • Consider using tl.make_block_ptr for stability or tl.make_tensor_descriptor for newer hardware support.

Example

No code snippet is provided as the conversion is mechanical and depends on the specific kernel implementation.

Notes

The conversion approach has been validated across kernels of varying complexity, and the block-pointer versions can potentially replace the raw-pointer versions in-place or coexist as an alternative backend.

Recommendation

Apply the block-pointer conversion to the specified kernels, as it aligns with Triton's direction and improves hardware portability and readability.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [RFC]: Convert Triton kernels from raw pointers to block pointers [1 participants]