🚀 The feature, motivation and pitch

Feature Request: Gluon Backend for Inductor's Triton Codegen

1. Summary

Add a Gluon codegen path to Inductor's Triton backend. Gluon (triton.experimental.gluon) is a lower-level companion to Triton that exposes hardware features normally hidden by the Triton compiler — shared memory placement, async data movement, swizzled layouts, barrier synchronization, and warp-level scheduling. By generating Gluon code from Inductor, we gain fine-grained control over memory hierarchy and data movement that standard Triton cannot express.

2. Motivation and Capabilities

Triton's compiler makes implicit decisions about shared memory, data movement, and register pressure. These defaults leave performance on the table in three cases: when reuse across loop phases is not inferred, when the optimal shared-memory (smem) layout requires explicit swizzling, and when data movement should overlap compute via async copies. Gluon exposes these decisions as explicit primitives within Triton's JIT framework.

| Aspect | Standard Triton | With Gluon |
|---|---|---|
| Shared memory | Compiler-managed, implicit | Explicit allocation with typed layouts, controlled lifetime |
| Data movement | `tl.load` from global, compiler decides caching | TMA async copy with structured descriptors, smem staging |
| Reused inputs | Loaded from DRAM each pass (or hope for L2) | Loaded once into smem, served to all passes |
| Dtype in cache | Promoted to compute dtype in registers | Kept narrow in smem, converted only at point of use |
| Layout | Compiler chooses | Explicit swizzle patterns (`NVMMASharedLayout`), bank-conflict-free |
| Pipelining | `num_stages` hint (coarse) | Explicit multi-buffer + barrier phases |
| Synchronization | `tl.debug_barrier` (full CTA) | MBarrier per-buffer, per-phase, async completion tracking |
| Warp scheduling | Uniform across CTA | Warp-specialized producer/consumer groups |

3. Steps

3.1 MVP: RMSNorm with TMA-Cached Reused Input, see https://github.com/pytorch/pytorch/issues/179711

The first step targets the simplest high-value pattern: a single reused reduction input cached in shared memory via one TMA row copy. RMSNorm is the reference workload — x is read during reduction (x*x accumulation) and again post-reduction (x * rstd * weight). The Gluon kernel TMA-copies x once into smem per CTA; both passes gather from smem and convert to fp32 only at point of compute.
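The two reads of x can be seen in a plain reference implementation. This is a minimal pure-Python sketch of RMSNorm for one row, not the Gluon kernel; the `eps` term and the function name `rmsnorm_row` are assumptions added for illustration.

```python
import math


def rmsnorm_row(x, weight, eps=1e-6):
    """Reference RMSNorm for a single row, written as two explicit passes.

    `eps` is an assumed stabilizer; the issue text does not mention it.
    """
    # Pass 1 (reduction): read x and accumulate x*x.
    sum_sq = 0.0
    for v in x:
        sum_sq += v * v
    rstd = 1.0 / math.sqrt(sum_sq / len(x) + eps)

    # Pass 2 (post-reduction): read x again and apply x * rstd * weight.
    # This second read is the one the smem cache is meant to serve.
    return [v * rstd * w for v, w in zip(x, weight)]
```

In a kernel that does not cache the row, pass 2 re-reads x from global memory (or L2); the MVP replaces that re-read with a gather from shared memory.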

Implementation: GluonTMAKernel as a TritonKernel subclass with backend codegen hooks (codegen_dtype, codegen_full, codegen_zeros, codegen_cast, codegen_program_id, etc.) and a TMA preamble injected before the inherited reduction body. See https://github.com/liqiangxl/pytorch/pull/13
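The hook-override structure can be illustrated with toy classes. This is a sketch only: `ToyTritonKernel` and `ToyGluonTMAKernel` are stand-ins, not Inductor's real classes, `codegen_preamble` and `codegen_body` are hypothetical method names, and every emitted string (including the `gl.` prefix) is a placeholder rather than a verified Triton or Gluon call.

```python
class ToyTritonKernel:
    """Toy emitter: the reduction body is written once in terms of hooks."""

    def codegen_dtype(self, name):
        return f"tl.{name}"

    def codegen_program_id(self, axis):
        return f"tl.program_id({axis})"

    def codegen_zeros(self, shape, dtype):
        return f"tl.zeros({shape}, {dtype})"

    def codegen_preamble(self):
        # Hypothetical hook: the base kernel emits nothing before the body.
        return []

    def codegen_body(self):
        # Shared body: never names a backend directly, only calls hooks.
        fp32 = self.codegen_dtype("float32")
        return [
            f"row = {self.codegen_program_id(0)}",
            f"acc = {self.codegen_zeros('[RBLOCK]', fp32)}",
            "# (inherited reduction body continues here)",
        ]

    def codegen_kernel(self):
        return "\n".join(self.codegen_preamble() + self.codegen_body())


class ToyGluonTMAKernel(ToyTritonKernel):
    """Toy subclass: overrides hooks and injects a preamble; body is inherited."""

    def codegen_dtype(self, name):
        return f"gl.{name}"

    def codegen_program_id(self, axis):
        return f"gl.program_id({axis})"

    def codegen_zeros(self, shape, dtype):
        return f"gl.zeros({shape}, {dtype})"

    def codegen_preamble(self):
        # Placeholder for the TMA copy of the reused row into shared memory.
        return ["# <TMA preamble: copy reused row into smem>"]
```

Calling `ToyGluonTMAKernel().codegen_kernel()` yields the same body as the base class with backend-specific spellings and the preamble line on top, which is the property the subclass design relies on.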

Eligibility: SM90+, static power-of-2 inner reduction (4096–32768), bf16, row-contiguous reused input.
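These conditions could be expressed as a single predicate. A minimal sketch, assuming a hypothetical `ReductionInfo` record; the field names and the function `is_gluon_tma_eligible` are illustrative and do not come from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReductionInfo:
    """Hypothetical summary of a reduction kernel candidate."""
    sm_arch: int                   # e.g. 90 for SM90
    reduction_size: Optional[int]  # None when the inner size is dynamic
    dtype: str                     # e.g. "bf16"
    row_contiguous: bool           # reused input is contiguous along the row


def is_gluon_tma_eligible(info: ReductionInfo) -> bool:
    """Return True only if every MVP eligibility condition holds."""
    if info.sm_arch < 90:
        return False
    n = info.reduction_size
    if n is None:  # static sizes only
        return False
    is_pow2 = n > 0 and (n & (n - 1)) == 0
    if not is_pow2 or not (4096 <= n <= 32768):
        return False
    return info.dtype == "bf16" and info.row_contiguous
```

For example, `is_gluon_tma_eligible(ReductionInfo(90, 8192, "bf16", True))` is `True`, while a reduction size of 6000 or a dynamic size is rejected.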

Performance (GB200, bf16, RMSNorm)

SOL% (achieved bandwidth as a percentage of speed-of-light) is measured against 7928 GB/s peak bandwidth. Each cell shows baseline -> gluon-tma (delta in percentage points).

| M \ N | 4096 | 8192 | 16384 |
|---|---|---|---|
| 4096 | 67.66 -> 64.21 (-3.45) | 71.79 -> 73.80 (+2.01) | 65.65 -> 83.53 (+17.88) |
| 8192 | 73.89 -> 74.10 (+0.21) | 79.38 -> 81.27 (+1.89) | 76.57 -> 85.37 (+8.80) |
| 16384 | 81.40 -> 81.83 (+0.43) | 83.78 -> 85.85 (+2.07) | 81.41 -> 88.75 (+7.34) |

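The benefit grows with row width. At N=4096 the change ranges from -3.45 to +0.43 points, at N=8192 it is about +2 points, and at N=16384 it ranges from +7.34 to +17.88 points, consistent with bandwidth savings outweighing TMA setup cost on wider rows. At M=N=16384 the kernel reaches 88.75% SOL.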
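An SOL% figure is achieved bandwidth divided by peak bandwidth. A minimal sketch of that arithmetic follows. The byte accounting in `rmsnorm_bytes` (one read of x, one read of weight, one write of the output) is an assumption, since the issue does not state how traffic was counted; both function names are illustrative.

```python
PEAK_GBPS = 7928.0  # peak bandwidth quoted in the issue, in GB/s


def rmsnorm_bytes(m, n, elem_bytes=2):
    """Assumed minimum traffic: read x (m*n), read weight (n), write out (m*n)."""
    return (2 * m * n + n) * elem_bytes


def sol_percent(bytes_moved, seconds, peak_gbps=PEAK_GBPS):
    """Achieved bandwidth as a percentage of peak bandwidth."""
    achieved_gbps = bytes_moved / seconds / 1e9
    return 100.0 * achieved_gbps / peak_gbps
```

Under this accounting, a measured kernel time `t` in seconds for an M x N bf16 problem gives `sol_percent(rmsnorm_bytes(M, N), t)`.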

3.2 All Normalization Kernels

> Note: Sections 3.2–3.3 are AI-generated, speculative directions that have not yet been validated for performance.

Extend the TMA-cached input pattern to the full family of normalization reductions:

- LayerNorm: reuses input in both mean/variance reduction and the normalize+affine pass.
- GroupNorm: same structure but with grouped reduction dimensions.
- L2Norm: single reduction + scale by inverse norm.
- InstanceNorm: per-channel reduction with reused input.

This step broadens eligibility to support fp16 inputs, non-power-of-2 reduction sizes (with chunked TMA), multiple reused inputs, and higher-rank tensors. The codegen hooks and TMA preamble are expected to generalize without structural changes; the main work is relaxing the scheduling heuristic and validating correctness across more IR patterns.

3.3 Warp-Specialized Persistent Kernels

Use Gluon's warp-group specialization to generate persistent kernels with producer/consumer separation:

- Producer warp group: issues TMA copies into multi-buffered shared memory, manages barriers.
- Consumer warp group: waits on barriers, computes from smem, stores results.
- Software pipelining: multiple TMA tiles in flight, overlapping the next copy with current compute.

This is the architecture used by high-performance GEMM/attention kernels (e.g., CUTLASS 3.x). Generating it from Inductor enables persistent matmul and fused attention kernels that saturate both memory bandwidth and compute simultaneously, without hand-written CUDA.
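The producer/consumer handshake over multiple buffers can be illustrated with a CPU-side analogy. This is not Gluon code: Python threads stand in for warp groups, a list of slots stands in for multi-buffered smem, and semaphores stand in for the per-buffer barriers. The name `run_pipeline` and the sum-of-squares "compute" step are assumptions for illustration.

```python
import threading


def run_pipeline(tiles, num_buffers=2):
    """Process tiles through a ring of buffers with a producer and a consumer."""
    buffers = [None] * num_buffers
    # empty[i] is signalled when slot i may be overwritten;
    # full[i] is signalled when slot i holds a tile ready to consume.
    empty = [threading.Semaphore(1) for _ in range(num_buffers)]
    full = [threading.Semaphore(0) for _ in range(num_buffers)]
    results = []

    def producer():
        for i, tile in enumerate(tiles):
            slot = i % num_buffers
            empty[slot].acquire()       # wait until the consumer released the slot
            buffers[slot] = list(tile)  # stands in for the async copy into smem
            full[slot].release()        # signal "data arrived" for this slot

    def consumer():
        for i in range(len(tiles)):
            slot = i % num_buffers
            full[slot].acquire()        # wait for the copy into this slot
            results.append(sum(v * v for v in buffers[slot]))
            empty[slot].release()       # hand the slot back to the producer

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With `num_buffers` greater than 1 the producer can fill the next slot while the consumer is still working on the current one, which is the overlap that software pipelining aims for. With `num_buffers=1` the two sides strictly alternate.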

Alternatives

No response

Additional context

No response

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
