pytorch - ✅(Solved) Fix Divisibility constraints are not propagated through symbolic expressions in Inductor [3 pull requests, 6 comments, 4 participants]

pytorch/pytorch#177146 (fetched 2026-04-08 00:21:54)

torch._check(x.shape[-1] % 16 == 0) improves torch.compile(dynamic=True) performance for RMS Norm, see #175755. However, the divisibility is not propagated to other symbolic expressions such as xnumel = shape[0] * shape[-1] in pointwise kernels. This causes significant performance loss when the normalization kernel is split into reduction + pointwise kernels.

Root Cause

statically_known_multiple_of(s27*s77, 16) in torch/_inductor/sizevars.py asks sympy to evaluate Eq(Mod(s27*s77, 16), 0). Even though the ShapeEnv has the axiom Eq(Mod(s27, 16), 0) (from the torch._check), sympy cannot deduce that a product s27*s77 is divisible by 16 when one of its factors s27 is — it does not perform the algebraic step: if a % n == 0 then (a*b) % n == 0.
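The missing step is elementary: if a = n * k, then a * b = n * (k * b). The brute-force check below is a minimal sketch on concrete integers, independent of sympy and Inductor; the helper name product_is_multiple is hypothetical. It also shows that checking factors one at a time is sufficient but not necessary.

```python
def product_is_multiple(a: int, b: int, n: int) -> bool:
    """Check (a * b) % n == 0 directly on concrete integers."""
    return (a * b) % n == 0


n = 16

# a ranges over multiples of n (playing the role of s27); b is arbitrary (s77).
assert all(
    product_is_multiple(n * k, b, n)
    for k in range(1, 50)
    for b in range(1, 200)
)

# The converse does not hold: 4 * 4 is a multiple of 16 although 4 is not,
# so a per-factor test can miss some products that are in fact divisible.
converse_counterexample = product_is_multiple(4, 4, n) and 4 % n != 0
```

A per-factor rule is therefore safe to use for emitting hints: it never claims divisibility that does not hold, it only sometimes fails to prove it.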

Fix Action

Fixed

PR fix notes

PR #3: [inductor] Add structural divisibility analysis to statically_known_m…

Description (problem / solution / changelog)

…ultiple_of (#177146)

Add _is_multiple_of() that recurses over sympy expression structure (Mul, Add, FloorDiv, Mod) to prove divisibility, before falling back to statically_known_true. This is a demand-driven variant of the modular arithmetic analysis used in Halide and TVM.

Unlocks tt.divisibility hints for product expressions like shape[0] * shape[1] when torch._check(shape[1] % 16 == 0).

Fixes https://github.com/pytorch/pytorch/issues/177146

Changed files

  • test/inductor/test_codegen_triton.py (modified, +42/-0)
  • torch/_inductor/sizevars.py (modified, +54/-0)

PR #177156: inductor: improve divisibility propagation for symbolic products in S…

Description (problem / solution / changelog)

…izeVarAllocator

Description

This PR enhances statically_known_multiple_of in torch/_inductor/sizevars.py to better handle symbolic products (sympy.Mul).

Currently, SymPy can fail to deduce that a product (e.g., xnumel = shape[0] * shape[-1]) is divisible by a denominator, even when one of its factors (e.g., shape[-1]) is explicitly known to be divisible through a torch._check() assertion.

Technical Changes

  • Updated statically_known_multiple_of to decompose sympy.Mul objects.
  • Added logic to check each individual factor of a product for divisibility against the denominator.
  • Maintained existing symbol length constraints to ensure no regression in compilation time.

Impact

This fix allows Triton to correctly identify divisibility alignment in pointwise kernels. In cases like RMS Norm followed by a pointwise kernel, this can lead to a significant performance improvement (up to 1.66x on SM 10.0/GB200) by enabling optimized loads and stores.

Verification

  • Verified that the symbolic product is correctly identified as divisible when at least one factor satisfies the constraint.
  • Confirmed that automated checks for existing Inductor tests remain unaffected.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

  • test/inductor/test_utils.py (modified, +47/-0)
  • torch/_inductor/sizevars.py (modified, +22/-1)

PR #177214: [inductor] Add structural divisibility analysis to statically_known_multiple_of

Description (problem / solution / changelog)

Add _is_multiple_of() that recurses over sympy expression structure (Mul, Add, FloorDiv, Mod) to prove divisibility, before falling back to statically_known_true. This is a demand-driven variant of the modular arithmetic analysis used in Halide and TVM.

Unlocks tt.divisibility hints for product expressions like shape[0] * shape[1] when torch._check(shape[1] % 16 == 0).

Fixes https://github.com/pytorch/pytorch/issues/177146

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

  • test/inductor/test_codegen_triton.py (modified, +41/-0)
  • torch/_inductor/sizevars.py (modified, +52/-0)

Code Example

import torch

@torch.compile(dynamic=True, backend="inductor")
def f(x, w):
    torch._check(x.shape[-1] % 16 == 0)  # hidden_dim is always aligned
    rms = torch.sqrt(torch.mean(x.float() ** 2, dim=-1, keepdim=True) + 1e-6)
    return torch.nn.functional.gelu((x.float() / rms) * w.float()).half()

x = torch.randn([127, 262144], device="cuda", dtype=torch.float16)
w = torch.randn([262144], device="cuda", dtype=torch.float16)
f(x, w)

---

Rule 1 - Constant:     is_multiple_of(k, n)  where k and n are concrete ints
                        → k % n == 0

Rule 2 - Product:      is_multiple_of(a * b * ..., n)
                        → any factor f where is_multiple_of(f, n)
                        (sympy.Mul: iterate .args)

Rule 3 - Sum:          is_multiple_of(a + b + ..., n)
                        → all terms t where is_multiple_of(t, n)
                        (sympy.Add: iterate .args)

Rule 4 - FloorDiv:     is_multiple_of(FloorDiv(a, b), n)
                        → is_multiple_of(a, b * n)
                        (only when a is known to be a multiple of b,
                         which we can check recursively)

Rule 5 - Mod:          is_multiple_of(Mod(a, b), n)
                        → is_multiple_of(a, n) and is_multiple_of(b, n)
                        (Mod(a, b) = a - b * floor(a / b), so if both a and b
                         are multiples of n then so is Mod(a, b))

Rule 6 - Axiom:        is_multiple_of(symbol, n)
                        → fall back to statically_known_true(Eq(Mod(symbol, n), 0))
                        (picks up torch._check constraints from ShapeEnv)

🐛 Describe the bug

Summary

torch._check(x.shape[-1] % 16 == 0) improves torch.compile(dynamic=True) performance for RMS Norm, see #175755. However, the divisibility is not propagated to other symbolic expressions such as xnumel = shape[0] * shape[-1] in pointwise kernels. This causes significant performance loss when the normalization kernel is split into reduction + pointwise kernels.

Root Cause

statically_known_multiple_of(s27*s77, 16) in torch/_inductor/sizevars.py asks sympy to evaluate Eq(Mod(s27*s77, 16), 0). Even though the ShapeEnv has the axiom Eq(Mod(s27, 16), 0) (from the torch._check), sympy cannot deduce that a product s27*s77 is divisible by 16 when one of its factors s27 is — it does not perform the algebraic step: if a % n == 0 then (a*b) % n == 0.

Repro (on GB200)

import torch

@torch.compile(dynamic=True, backend="inductor")
def f(x, w):
    torch._check(x.shape[-1] % 16 == 0)  # hidden_dim is always aligned
    rms = torch.sqrt(torch.mean(x.float() ** 2, dim=-1, keepdim=True) + 1e-6)
    return torch.nn.functional.gelu((x.float() / rms) * w.float()).half()

x = torch.randn([127, 262144], device="cuda", dtype=torch.float16)
w = torch.randn([262144], device="cuda", dtype=torch.float16)
f(x, w)

The graph segments into:

  1. Reduction kernels (2 kernels): computes the mean along dim=-1; rnumel = shape[-1] gets tt.divisibility=16 correctly
  2. Pointwise kernel (1 kernel): normalize + scale + GELU — xnumel = shape[0] * shape[-1] does not get tt.divisibility=16

Because the pointwise kernel's xnumel argument lacks tt.divisibility, Triton cannot optimize its loads/stores, which makes the pointwise kernel 1.66x slower.
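One way to see why a divisibility fact on xnumel can matter is the bounds mask (index < xnumel) that a pointwise kernel applies to its accesses. The sketch below is plain Python, not Triton; the helper name and the group size of 16 are illustrative assumptions, and the issue does not state that this is the exact reasoning Triton applies to this kernel. It checks that when xnumel is a multiple of 16, every aligned group of 16 consecutive indices is either fully in bounds or fully out of bounds.

```python
def mask_uniform_per_group(xnumel: int, total: int, group: int = 16) -> bool:
    """Return True if the mask `i < xnumel` is constant in each aligned group.

    `total` is the number of indices covered by the launch grid (hypothetical).
    """
    for start in range(0, total, group):
        bits = {i < xnumel for i in range(start, start + group)}
        if len(bits) != 1:  # group straddles the boundary: mixed True/False
            return False
    return True


aligned = mask_uniform_per_group(127 * 16, 4096)        # xnumel % 16 == 0
unaligned = mask_uniform_per_group(127 * 16 + 5, 4096)  # xnumel % 16 == 5
```

With an aligned xnumel no group straddles the boundary, which is the kind of property a compiler needs before it can treat a masked access as a single wide access.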

Proposed Fix

Add a single new function _is_multiple_of(numerator, denominator) -> bool in sizevars.py that does structural reasoning on sympy expressions before falling back to statically_known_true. Replace the body of statically_known_multiple_of to call it.

No new files, no new classes, no new dependencies. Just one function (~60 lines) that pattern-matches on sympy expression types.

File to modify

/opt/pytorch/pytorch/torch/_inductor/sizevars.py

Rules to implement

All rules follow from one theorem: if a is a multiple of n, then any expression that has a as a multiplicative factor is also a multiple of n.

Rule 1 — Constant:     is_multiple_of(k, n)  where k and n are concrete ints
                        → k % n == 0

Rule 2 — Product:      is_multiple_of(a * b * ..., n)
                        → any factor f where is_multiple_of(f, n)
                        (sympy.Mul — iterate .args)

Rule 3 — Sum:          is_multiple_of(a + b + ..., n)
                        → all terms t where is_multiple_of(t, n)
                        (sympy.Add — iterate .args)

Rule 4 — FloorDiv:     is_multiple_of(FloorDiv(a, b), n)
                        → is_multiple_of(a, b * n)
                        (only when a is known to be a multiple of b,
                         which we can check recursively)

Rule 5 — Mod:          is_multiple_of(Mod(a, b), n)
                        → is_multiple_of(a, n) and is_multiple_of(b, n)
                        (Mod(a, b) = a - b * floor(a / b), so if both a and b
                         are multiples of n then so is Mod(a, b))

Rule 6 — Axiom:        is_multiple_of(symbol, n)
                        → fall back to statically_known_true(Eq(Mod(symbol, n), 0))
                        (picks up torch._check constraints from ShapeEnv)
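The six rules can be sketched as a small recursive prover. To stay self-contained, the example below does not use sympy or Inductor: expressions are modelled as nested tuples, and the axioms dictionary stands in for facts that torch._check would record in the ShapeEnv. All names here (Expr, is_multiple_of, axioms) are hypothetical and chosen for illustration only. Rule 5 is implemented conservatively, requiring both operands to be multiples of n, because a % b = a - b * (a // b).

```python
from typing import Dict, Tuple, Union

# Toy expression model (a stand-in for sympy nodes, not the real API):
#   int                   constant
#   str                   symbol name, e.g. "s27"
#   ("mul", e1, e2, ...)  product
#   ("add", e1, e2, ...)  sum
#   ("floordiv", a, b)    a // b, with b a positive int in this sketch
#   ("mod", a, b)         a % b
Expr = Union[int, str, Tuple]


def is_multiple_of(expr: Expr, n: int, axioms: Dict[str, int]) -> bool:
    """Return True only when expr is provably a multiple of n (n > 0).

    axioms maps a symbol to a value it is known to be divisible by.
    A False result means "could not prove", not "not divisible".
    """
    if isinstance(expr, int):        # Rule 1: constant
        return expr % n == 0
    if isinstance(expr, str):        # Rule 6: look up the recorded axiom
        return axioms.get(expr, 1) % n == 0
    op, *args = expr
    if op == "mul":                  # Rule 2: one divisible factor is enough
        return any(is_multiple_of(a, n, axioms) for a in args)
    if op == "add":                  # Rule 3: every term must be divisible
        return all(is_multiple_of(a, n, axioms) for a in args)
    if op == "floordiv":             # Rule 4: a must be a multiple of b * n
        a, b = args
        return isinstance(b, int) and b > 0 and is_multiple_of(a, b * n, axioms)
    if op == "mod":                  # Rule 5: both operands divisible by n
        a, b = args
        return is_multiple_of(a, n, axioms) and is_multiple_of(b, n, axioms)
    return False                     # unknown node type: cannot prove anything


axioms = {"s27": 16}                 # models torch._check(s27 % 16 == 0)
xnumel = ("mul", "s77", "s27")       # models shape[0] * shape[-1]
print(is_multiple_of(xnumel, 16, axioms))  # True
```

The prover is demand-driven: it only inspects the expression it is asked about, and it returns False whenever no rule applies, so a missing proof costs performance but never correctness.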

Benchmark Results (GB200, SM 10.0)

[127, 262144] fp16, RMSNorm + GELU with torch._check(x.shape[-1] % 16 == 0):

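| Kernel                  | Before fix | After fix | Speedup |
|-------------------------|------------|-----------|---------|
| reduction (triton_red)  | 39.4 us    | 34.3 us   | 1.15x   |
| persistent (triton_per) | 1.9 us     | 1.4 us    | 1.36x   |
| pointwise (triton_poi)  | 108.7 us   | 65.3 us   | 1.66x   |
| Total                   | 150.0 us   | 101.1 us  | 1.48x   |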
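The speedup column is the ratio of the before and after timings. The short calculation below reproduces it from the reported numbers and also computes what share of the total saving comes from the pointwise kernel; the variable names are illustrative.

```python
# (before_us, after_us) per kernel, copied from the benchmark table.
timings = {
    "reduction": (39.4, 34.3),
    "persistent": (1.9, 1.4),
    "pointwise": (108.7, 65.3),
    "total": (150.0, 101.1),
}

# Speedup = time before the fix / time after the fix, rounded to 2 decimals.
speedups = {
    name: round(before / after, 2) for name, (before, after) in timings.items()
}

# Fraction of the total time saved that is due to the pointwise kernel.
saved_pointwise = timings["pointwise"][0] - timings["pointwise"][1]
saved_total = timings["total"][0] - timings["total"][1]
pointwise_share = round(saved_pointwise / saved_total, 2)
```

On these figures the pointwise kernel accounts for roughly 89% of the end-to-end saving, which matches the issue's focus on xnumel.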

Error logs

No response

Versions

PyTorch version: 2.12.0a0+git5710dd2
Is debug build: False
CUDA used to build PyTorch: 13.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (aarch64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-1008-nvidia-64k-aarch64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.2.46

cc @chauhang @penguinwu

extent analysis

Problem Summary

This is a performance issue in torch.compile with the inductor backend. A divisibility fact asserted with torch._check on one dimension is not propagated to derived symbolic expressions such as xnumel = shape[0] * shape[-1], so the pointwise kernel loses its tt.divisibility hint and runs about 1.66x slower.

Root Cause Analysis

SymPy does not apply the step "if a % n == 0 then (a*b) % n == 0", so statically_known_multiple_of cannot prove that s27*s77 is a multiple of 16 from the axiom on s27 alone.

Fix Plan

To fix this issue, we need to add a new function _is_multiple_of(numerator, denominator) -> bool in sizevars.py that performs structural reasoning on SymPy expressions before falling back to statically_known_true. We will replace the body of statically_known_multiple_of to call this new function.

Here are the concrete steps:

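  1. Add a new function _is_multiple_of(numerator, denominator) -> bool in sizevars.py. The version below is a sketch written as a SizeVarAllocator method:
import sympy
from torch.utils._sympy.functions import FloorDiv

# Assumes integer-valued expressions and a positive denominator.
# self.statically_known_true is the existing fallback that consults the ShapeEnv.
def _is_multiple_of(self, numerator, denominator):
    numerator = sympy.sympify(numerator)
    if numerator.is_Integer:  # Rule 1: concrete constant
        return numerator % denominator == 0
    if isinstance(numerator, sympy.Mul):  # Rule 2: any one factor suffices
        return any(self._is_multiple_of(f, denominator) for f in numerator.args)
    if isinstance(numerator, sympy.Add):  # Rule 3: every term must be a multiple
        return all(self._is_multiple_of(t, denominator) for t in numerator.args)
    if isinstance(numerator, FloorDiv):  # Rule 4: a // b with a a multiple of b * n
        a, b = numerator.args
        return self._is_multiple_of(a, b * denominator)
    if isinstance(numerator, sympy.Mod):  # Rule 5: a and b both multiples of n
        a, b = numerator.args
        return self._is_multiple_of(a, denominator) and self._is_multiple_of(b, denominator)
    # Rule 6: fall back to ShapeEnv axioms (for example from torch._check)
    return self.statically_known_true(sympy.Eq(sympy.Mod(numerator, denominator), 0))
  2. Replace the body of statically_known_multiple_of with a call to the new function.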
