pytorch - 💡(How to fix) Fix [Inductor] Codegen Gluon

🚀 The feature, motivation and pitch

Feature Request: Gluon Backend for Inductor's Triton Codegen

1. Summary

Add a Gluon codegen path to Inductor's Triton backend. Gluon (triton.experimental.gluon) is a lower-level companion to Triton that exposes hardware features normally hidden by the Triton compiler — shared memory placement, async data movement, swizzled layouts, barrier synchronization, and warp-level scheduling. By generating Gluon code from Inductor, we gain fine-grained control over memory hierarchy and data movement that standard Triton cannot express.

2. Motivation and Capabilities

Triton's compiler makes implicit decisions about shared memory, data movement, and register pressure. These defaults leave performance on the table when reuse across loop phases is not inferred, when optimal smem layout requires explicit swizzling, or when data movement should overlap compute via async copies. Gluon (triton.experimental.gluon) exposes these decisions as explicit primitives within Triton's JIT framework.

| Aspect | Standard Triton | With Gluon |
| --- | --- | --- |
| Shared memory | Compiler-managed, implicit | Explicit allocation with typed layouts, controlled lifetime |
| Data movement | tl.load from global, compiler decides caching | TMA async copy with structured descriptors, smem staging |
| Reused inputs | Loaded from DRAM each pass (or hope for L2) | Loaded once into smem, served to all passes |
| Dtype in cache | Promoted to compute dtype in registers | Kept narrow in smem, converted only at point of use |
| Layout | Compiler chooses | Explicit swizzle patterns (NVMMASharedLayout), bank-conflict-free |
| Pipelining | num_stages hint (coarse) | Explicit multi-buffer + barrier phases |
| Synchronization | tl.debug_barrier (full CTA) | MBarrier per-buffer, per-phase, async completion tracking |
| Warp scheduling | Uniform across CTA | Warp-specialized producer/consumer groups |

3. Steps

3.1 MVP: RMSNorm with TMA-Cached Reused Input, see https://github.com/pytorch/pytorch/issues/179711

The first step targets the simplest high-value pattern: a single reused reduction input cached in shared memory via one TMA row copy. RMSNorm is the reference workload — x is read during reduction (x*x accumulation) and again post-reduction (x * rstd * weight). The Gluon kernel TMA-copies x once into smem per CTA; both passes gather from smem and convert to fp32 only at point of compute.
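The reuse pattern is easier to see in a scalar reference. The following is a minimal pure-Python sketch of RMSNorm for one row, written as two explicit passes over x. It is not the generated Gluon kernel; the function name and the `eps` parameter are assumptions added for illustration.

```python
import math

def rmsnorm_row(x, weight, eps=1e-6):
    """Reference RMSNorm for a single row, as two passes over x.

    `eps` is an assumed stabilizer; the issue text does not specify one.
    """
    # Pass 1 (reduction): accumulate x*x over the row.
    sum_sq = 0.0
    for v in x:
        sum_sq += v * v
    rstd = 1.0 / math.sqrt(sum_sq / len(x) + eps)

    # Pass 2 (post-reduction): x is read again to form x * rstd * weight.
    return [v * rstd * w for v, w in zip(x, weight)]

out = rmsnorm_row([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

Both loops read the same row. Without caching, the second read goes back to global memory (or L2); the MVP described above serves it from shared memory instead.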

Implementation: GluonTMAKernel as a TritonKernel subclass with backend codegen hooks (codegen_dtype, codegen_full, codegen_zeros, codegen_cast, codegen_program_id, etc.) and a TMA preamble injected before the inherited reduction body. See https://github.com/liqiangxl/pytorch/pull/13

Eligibility: SM90+, static power-of-2 inner reduction (4096–32768), bf16, row-contiguous reused input.
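The eligibility rule can be read as a simple predicate. The sketch below is hypothetical: `ReductionCandidate` and `is_gluon_tma_eligible` are illustrative names, not Inductor APIs, and the actual heuristic in the linked PR may check more than this.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ReductionCandidate:
    """Hypothetical summary of a reduction kernel (not an Inductor class)."""
    compute_capability: Tuple[int, int]  # e.g. (9, 0) for SM90
    rnumel: Optional[int]                # inner reduction size; None if dynamic
    dtype: str                           # e.g. "bf16"
    row_contiguous: bool                 # reused input is contiguous along the row

def is_gluon_tma_eligible(c: ReductionCandidate) -> bool:
    if c.compute_capability < (9, 0):    # SM90+
        return False
    if c.rnumel is None:                 # must be static
        return False
    n = c.rnumel
    if n & (n - 1) != 0:                 # must be a power of 2
        return False
    if not (4096 <= n <= 32768):         # range stated in the issue
        return False
    return c.dtype == "bf16" and c.row_contiguous
```

For example, `is_gluon_tma_eligible(ReductionCandidate((9, 0), 8192, "bf16", True))` passes, while the same candidate with `rnumel=6000` or `dtype="fp16"` does not.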

Performance (GB200, bf16, RMSNorm)

SOL% (percent of speed-of-light) is measured against a peak bandwidth of 7928 GB/s. Each cell shows baseline -> gluon-tma (delta in percentage points).

| M \ N | 4096 | 8192 | 16384 |
| --- | --- | --- | --- |
| 4096 | 67.66 -> 64.21 (-3.45) | 71.79 -> 73.80 (+2.01) | 65.65 -> 83.53 (+17.88) |
| 8192 | 73.89 -> 74.10 (+0.21) | 79.38 -> 81.27 (+1.89) | 76.57 -> 85.37 (+8.80) |
| 16384 | 81.40 -> 81.83 (+0.43) | 83.78 -> 85.85 (+2.07) | 81.41 -> 88.75 (+7.34) |

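The benefit grows with row width. At N=4096 the change ranges from -3.45 to +0.43 points, so TMA setup cost is not yet amortized; at N=8192 the gain is about +2 points; at N=16384 it is +7.34 to +17.88 points, where bandwidth savings dominate. At M=16384, N=16384 the kernel reaches 88.75% SOL.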
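For readers who want to reproduce an SOL% figure from a measured kernel time, here is a minimal sketch. The issue does not say how bytes were counted, so the traffic model (one read of x, one write of the output, one read of weight, all bf16) and the convention 1 GB = 1e9 bytes are assumptions, and `sol_percent` is a hypothetical helper.

```python
def sol_percent(m, n, elapsed_s, peak_gb_per_s=7928.0, bytes_per_elem=2):
    """Achieved bandwidth as a percentage of peak for an M x N RMSNorm.

    Assumed traffic: read x (m*n), write out (m*n), read weight (n).
    """
    total_bytes = (2 * m * n + n) * bytes_per_elem
    achieved_gb_per_s = total_bytes / elapsed_s / 1e9
    return 100.0 * achieved_gb_per_s / peak_gb_per_s

# Time the kernel would take if it ran exactly at peak bandwidth.
ideal_s = (2 * 4096 * 16384 + 16384) * 2 / (7928.0 * 1e9)
half_speed = sol_percent(4096, 16384, 2 * ideal_s)  # kernel taking twice as long
```

Under this model a kernel that takes twice the ideal time scores 50% SOL; a different byte-counting convention would shift the absolute numbers but not the baseline-versus-gluon deltas.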

3.2 All Normalization Kernels

> Note: Sections 3.2–3.3 are AI-generated speculative directions that have not yet been validated for performance.

Extend the TMA-cached input pattern to the full family of normalization reductions:

  • LayerNorm — reuses input in both mean/variance reduction and the normalize+affine pass.
  • GroupNorm — same structure but with grouped reduction dimensions.
  • L2Norm — single reduction + scale by inverse norm.
  • InstanceNorm — per-channel reduction with reused input.

This step broadens eligibility: support fp16 inputs, non-power-of-2 reduction sizes (with chunked TMA), multiple reused inputs, and higher-rank tensors. The codegen hooks and TMA preamble generalize without structural changes — the main work is relaxing the scheduling heuristic and validating correctness across more IR patterns.
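One way to picture the chunked handling of non-power-of-2 reduction sizes is to split each row into fixed-size chunks plus a shorter tail. The sketch below is purely illustrative: `chunk_row`, `chunked_sum_sq`, and the default chunk size of 4096 are assumptions, and nothing here calls a TMA or Gluon API.

```python
def chunk_row(rnumel, chunk=4096):
    """Return (offset, length) pairs that cover [0, rnumel) in order."""
    return [(off, min(chunk, rnumel - off)) for off in range(0, rnumel, chunk)]

def chunked_sum_sq(x, chunk=4096):
    """Accumulate x*x one chunk at a time; each chunk stands in for one copy."""
    total = 0.0
    for off, length in chunk_row(len(x), chunk):
        tile = x[off:off + length]   # stand-in for one chunk staged in smem
        total += sum(v * v for v in tile)
    return total

pieces = chunk_row(10000)  # two full chunks and a tail of 1808 elements
```

The tail chunk is the part that would need masking or padding in a real kernel; how that is handled is left open by the issue.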

3.3 Warp-Specialized Persistent Kernels

Use Gluon's warp-group specialization to generate persistent kernels with producer/consumer separation:

  • Producer warp group — issues TMA copies into multi-buffered shared memory, manages barriers.
  • Consumer warp group — waits on barriers, computes from smem, stores results.
  • Software pipelining — multiple TMA tiles in flight, overlapping the next copy with current compute.

This is the architecture used by high-performance GEMM/attention kernels (e.g., CUTLASS 3.x). Generating it from Inductor enables persistent matmul and fused attention kernels that saturate both memory bandwidth and compute simultaneously, without hand-written CUDA.
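The producer/consumer structure above can be illustrated with a CPU-side analogy. In this sketch Python threads stand in for warp groups, a small ring of slots stands in for multi-buffered shared memory, and semaphores stand in for per-buffer barriers. It is an analogy only, assuming nothing about the Gluon API; `run_pipeline` is a hypothetical name.

```python
import threading

def run_pipeline(tiles, num_buffers=2):
    """Producer fills a ring of buffers; consumer reduces each tile in order."""
    buffers = [None] * num_buffers
    empty = [threading.Semaphore(1) for _ in range(num_buffers)]  # slot is free
    full = [threading.Semaphore(0) for _ in range(num_buffers)]   # slot has data
    results = []

    def producer():
        for i, tile in enumerate(tiles):
            slot = i % num_buffers
            empty[slot].acquire()    # wait until the consumer released this slot
            buffers[slot] = tile     # stand-in for an async copy into smem
            full[slot].release()     # signal that the data has arrived

    def consumer():
        for i in range(len(tiles)):
            slot = i % num_buffers
            full[slot].acquire()     # wait for the copy into this slot
            results.append(sum(buffers[slot]))  # compute from the buffer
            empty[slot].release()    # hand the slot back to the producer

    workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

sums = run_pipeline([[1, 2], [3, 4], [5, 6]], num_buffers=2)
```

With two buffers the producer can fill the next slot while the consumer is still working on the current one, which is the overlap that software pipelining aims for. Results stay in tile order because each slot is handed over through its own pair of semaphores.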

Alternatives

No response

Additional context

No response

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
