pytorch - ✅(Solved) Fix Divisibility constraints are not propagated through symbolic expressions in Inductor [3 pull requests, 6 comments, 4 participants]

pytorch · 2026-03-11 17:04:29

pytorch/pytorch#177146•Fetched 2026-04-08 00:21:54

torch._check(x.shape[-1] % 16 == 0) improves torch.compile(dynamic=True) performance for RMS Norm, see #175755. However, the divisibility is not propagated to other symbolic expressions such as xnumel = shape[0] * shape[-1] in pointwise kernels. This causes a significant performance loss when the normalization kernel is segmented into reduction + pointwise.

Root Cause

statically_known_multiple_of(s27*s77, 16) in torch/_inductor/sizevars.py asks sympy to evaluate Eq(Mod(s27*s77, 16), 0). Even though the ShapeEnv has the axiom Eq(Mod(s27, 16), 0) (from the torch._check), sympy cannot deduce that a product s27*s77 is divisible by 16 when one of its factors s27 is — it does not perform the algebraic step: if a % n == 0 then (a*b) % n == 0.
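As a small illustration of the gap described above, the sketch below uses plain SymPy only, without PyTorch or a ShapeEnv, so no axiom about s27 is attached; that setup is an assumption made for brevity. It shows that the remainder of the product stays an unevaluated Mod expression, and then checks the missing algebraic step on concrete integers.

```python
import sympy

s27, s77 = sympy.symbols("s27 s77", integer=True, positive=True)

# Plain SymPy leaves the remainder unevaluated: nothing ties s27 to 16 here.
unresolved = sympy.Mod(s27 * s77, 16)
print(unresolved)

# The missing step, checked on concrete values:
# if a % n == 0 then (a * b) % n == 0.
n = 16
lemma_holds = all(
    (a * b) % n == 0
    for a in range(0, 10 * n, n)  # multiples of n
    for b in range(1, 200)        # arbitrary second factor
)
print(lemma_holds)
```

The numeric loop is only a sanity check of the lemma; the proposed fix applies the same step structurally to symbolic expressions.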

Fix Action

Fix proposed; none of the linked PRs was merged as of the fetch date

Proposed in PR: [inductor] Add structural divisibility analysis to statically_known_m… (https://github.com/liqiangxl/pytorch/pull/3)
Proposed in PR: inductor: improve divisibility propagation for symbolic products in S… (https://github.com/pytorch/pytorch/pull/177156)
Proposed in PR: [inductor] Add structural divisibility analysis to statically_known_multiple_of (https://github.com/pytorch/pytorch/pull/177214)

PR fix notes

PR #3: [inductor] Add structural divisibility analysis to statically_known_m…

Repository: liqiangxl/pytorch
Author: liqiangxl
State: closed | merged: False
Link: https://github.com/liqiangxl/pytorch/pull/3

Description (problem / solution / changelog)

…ultiple_of (#177146)

Add _is_multiple_of() that recurses over sympy expression structure (Mul, Add, FloorDiv, Mod) to prove divisibility, before falling back to statically_known_true. This is a demand-driven variant of the modular arithmetic analysis used in Halide and TVM.

Unlocks tt.divisibility hints for product expressions like shape[0] * shape[1] when torch._check(shape[1] % 16 == 0).

Fixes https://github.com/pytorch/pytorch/issues/177146

Changed files

test/inductor/test_codegen_triton.py (modified, +42/-0)
torch/_inductor/sizevars.py (modified, +54/-0)

PR #177156: inductor: improve divisibility propagation for symbolic products in S…

Repository: pytorch/pytorch
Author: mhusnain-tech
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/177156

Description (problem / solution / changelog)

…izeVarAllocator

Description

This PR enhances statically_known_multiple_of in torch/_inductor/sizevars.py to better handle symbolic products (sympy.Mul).

Currently, SymPy can fail to deduce that a product (e.g., xnumel = shape[0] * shape[-1]) is divisible by a denominator, even when one of its factors (e.g., shape[-1]) is explicitly known to be divisible through a torch._check() assertion.

Technical Changes

Updated statically_known_multiple_of to decompose sympy.Mul objects.
Added logic to check each individual factor of a product for divisibility against the denominator.
Maintained existing symbol length constraints to ensure no regression in compilation time.

Impact

This fix allows Triton to correctly identify divisibility alignment in pointwise kernels. In cases like RMS Norm followed by a pointwise kernel, this can lead to a significant performance improvement (up to 1.66x on SM 10.0/GB200) by enabling optimized loads and stores.

Verification

Verified that the symbolic product is correctly identified as divisible when at least one factor satisfies the constraint.
Confirmed that automated checks for existing Inductor tests remain unaffected.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

test/inductor/test_utils.py (modified, +47/-0)
torch/_inductor/sizevars.py (modified, +22/-1)

PR #177214: [inductor] Add structural divisibility analysis to statically_known_multiple_of

Repository: pytorch/pytorch
Author: liqiangxl
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/177214

Description (problem / solution / changelog)

Add _is_multiple_of() that recurses over sympy expression structure (Mul, Add, FloorDiv, Mod) to prove divisibility, before falling back to statically_known_true. This is a demand-driven variant of the modular arithmetic analysis used in Halide and TVM.

Unlocks tt.divisibility hints for product expressions like shape[0] * shape[1] when torch._check(shape[1] % 16 == 0).

Fixes https://github.com/pytorch/pytorch/issues/177146

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

test/inductor/test_codegen_triton.py (modified, +41/-0)
torch/_inductor/sizevars.py (modified, +52/-0)

🐛 Describe the bug

Repro (on GB200)

import torch

@torch.compile(dynamic=True, backend="inductor")
def f(x, w):
    torch._check(x.shape[-1] % 16 == 0)  # hidden_dim is always aligned
    rms = torch.sqrt(torch.mean(x.float() ** 2, dim=-1, keepdim=True) + 1e-6)
    return torch.nn.functional.gelu((x.float() / rms) * w.float()).half()

x = torch.randn([127, 262144], device="cuda", dtype=torch.float16)
w = torch.randn([262144], device="cuda", dtype=torch.float16)
f(x, w)

The graph segments into:

Reduction kernels (2 kernels): computes mean along dim=-1 — rnumel = shape[-1] gets tt.divisibility=16 correctly
Pointwise kernel (1 kernel): normalize + scale + GELU — xnumel = shape[0] * shape[-1] does not get tt.divisibility=16

Because the pointwise kernel's xnumel argument lacks tt.divisibility, Triton cannot optimize its loads and stores, which causes a 1.66x slowdown on the pointwise kernel.

Proposed Fix

Add a single new function _is_multiple_of(numerator, denominator) -> bool in sizevars.py that does structural reasoning on sympy expressions before falling back to statically_known_true. Replace the body of statically_known_multiple_of to call it.

No new files, no new classes, no new dependencies. Just one function (~60 lines) that pattern-matches on sympy expression types.

File to modify

/opt/pytorch/pytorch/torch/_inductor/sizevars.py

Rules to implement

The product rule follows from one fact: if a is a multiple of n, then any expression that has a as a multiplicative factor is also a multiple of n. The sum rule additionally uses the fact that multiples of n are closed under addition.

Rule 1 — Constant:     is_multiple_of(k, n)  where k and n are concrete ints
                        → k % n == 0

Rule 2 — Product:      is_multiple_of(a * b * ..., n)
                        → any factor f where is_multiple_of(f, n)
                        (sympy.Mul — iterate .args)

Rule 3 — Sum:          is_multiple_of(a + b + ..., n)
                        → all terms t where is_multiple_of(t, n)
                        (sympy.Add — iterate .args)

Rule 4 — FloorDiv:     is_multiple_of(FloorDiv(a, b), n)
                        → is_multiple_of(a, b * n)
                        (only when a is known to be a multiple of b,
                         which we can check recursively)

Rule 5 — Mod:          is_multiple_of(Mod(a, b), n)
                        → is_multiple_of(a, n) and is_multiple_of(b, n)
                        (Mod(a, b) = a - b * floor(a / b), so it is a multiple
                         of n when both a and b are; b % n == 0 alone is not
                         enough, e.g. Mod(5, 16) = 5)

Rule 6 — Axiom:        is_multiple_of(symbol, n)
                        → fall back to statically_known_true(Eq(Mod(symbol, n), 0))
                        (picks up torch._check constraints from ShapeEnv)
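The rules above can be sketched outside of PyTorch as a small recursive function over SymPy expressions. This is a minimal sketch under stated assumptions, not the code from any of the linked PRs: the FloorDiv class is a hypothetical stand-in for Inductor's own node type, the axiom callback is a hypothetical replacement for the ShapeEnv-backed statically_known_true fallback, and Rule 5 is implemented in its sound form (both operands of Mod must be multiples of n).

```python
import sympy


class FloorDiv(sympy.Function):
    """Hypothetical stand-in for Inductor's FloorDiv node; stays unevaluated."""

    def _eval_is_integer(self):
        return True


def is_multiple_of(expr, n, axiom):
    """Return True only when expr is provably a multiple of n (n > 0).

    axiom(expr, n) is a stand-in for the ShapeEnv-backed fallback.
    """
    expr, n = sympy.sympify(expr), sympy.sympify(n)
    if expr.is_Integer and n.is_Integer:  # Rule 1: concrete integers
        return int(expr) % int(n) == 0
    if isinstance(expr, sympy.Mul) and all(f.is_integer for f in expr.args):
        # Rule 2: one divisible factor is enough, given integer factors.
        if any(is_multiple_of(f, n, axiom) for f in expr.args):
            return True
    elif isinstance(expr, sympy.Add):
        # Rule 3: every term must be divisible.
        if all(is_multiple_of(t, n, axiom) for t in expr.args):
            return True
    elif isinstance(expr, FloorDiv):
        # Rule 4: a // b is a multiple of n when a is a multiple of b * n.
        a, b = expr.args
        if is_multiple_of(a, b * n, axiom):
            return True
    elif isinstance(expr, sympy.Mod):
        # Rule 5: a mod b = a - b * floor(a / b), so both must be divisible.
        a, b = expr.args
        if is_multiple_of(a, n, axiom) and is_multiple_of(b, n, axiom):
            return True
    return bool(axiom(expr, n))  # Rule 6: fall back to recorded facts


s27, s77 = sympy.symbols("s27 s77", integer=True, positive=True)
facts = {(s27, 16)}  # models torch._check(s27 % 16 == 0)


def axiom(expr, n):
    # Toy fallback: a symbol is divisible by n if a recorded divisor is.
    return n.is_Integer and any(
        expr == sym and d % int(n) == 0 for sym, d in facts
    )


results = {
    "constant": is_multiple_of(32, 16, axiom),
    "product": is_multiple_of(s27 * s77, 16, axiom),
    "sum_ok": is_multiple_of(16 * s77 + 3 * s27, 16, axiom),
    "sum_bad": is_multiple_of(s27 + s77, 16, axiom),
    "floordiv": is_multiple_of(FloorDiv(s27, 4), 4, axiom),
    "mod_ok": is_multiple_of(sympy.Mod(s27, 32), 16, axiom),
    "mod_bad": is_multiple_of(sympy.Mod(s77, 16), 16, axiom),
}
print(results)
```

The function is deliberately conservative: any expression shape it does not recognize falls through to the axiom lookup, so an unknown case yields False rather than an unsound True.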

Benchmark Results (GB200, SM 10.0)

[127, 262144] fp16, RMSNorm + GELU with torch._check(x.shape[-1] % 16 == 0):

Kernel                    Before fix   After fix   Speedup
reduction (triton_red)    39.4 us      34.3 us     1.15x
persistent (triton_per)   1.9 us       1.4 us      1.36x
pointwise (triton_poi)    108.7 us     65.3 us     1.66x
Total                     150.0 us     101.1 us    1.48x
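The speedup column can be reproduced from the two timing columns as before / after. The short sketch below recomputes it from the numbers in the table; the dictionary layout and names are illustrative only.

```python
# Timings in microseconds copied from the table above: (before, after).
timings_us = {
    "reduction (triton_red)": (39.4, 34.3),
    "persistent (triton_per)": (1.9, 1.4),
    "pointwise (triton_poi)": (108.7, 65.3),
    "total": (150.0, 101.1),
}

# Speedup is the ratio of the time before the fix to the time after it.
speedups = {
    name: round(before / after, 2)
    for name, (before, after) in timings_us.items()
}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```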

Error logs

No response

Versions

PyTorch version: 2.12.0a0+git5710dd2
Is debug build: False
CUDA used to build PyTorch: 13.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (aarch64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-1008-nvidia-64k-aarch64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.2.46

cc @chauhang @penguinwu

extent analysis

Problem Summary

The problem is a performance issue in torch.compile with the Inductor backend under dynamic shapes. A divisibility constraint declared on one symbol with torch._check is not propagated to derived expressions such as xnumel = shape[0] * shape[-1], so the pointwise kernel loses its tt.divisibility hint and runs about 1.66x slower in the reported benchmark.

Root Cause Analysis

The root cause is that SymPy does not perform the algebraic step "if a % n == 0 then (a*b) % n == 0". As a result, statically_known_multiple_of cannot prove that a product is divisible by n even when the ShapeEnv already knows that one of its factors is.

Fix Plan

To fix this issue, we need to add a new function _is_multiple_of(numerator, denominator) -> bool in sizevars.py that performs structural reasoning on SymPy expressions before falling back to statically_known_true. We will replace the body of statically_known_multiple_of to call this new function.

Here are the concrete steps:

Add a new function _is_multiple_of(numerator, denominator) -> bool in sizevars.py:

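import sympy
from torch.utils._sympy.functions import FloorDiv, Mod

# Sketch of a SizeVarAllocator method (hence self); assumes integer-valued
# sub-expressions. Not the code from any of the PRs above.
def _is_multiple_of(self, numerator, denominator):
    numerator = sympy.sympify(numerator)
    denominator = sympy.sympify(denominator)
    if numerator.is_Integer and denominator.is_Integer:  # Rule 1
        return int(numerator) % int(denominator) == 0
    if isinstance(numerator, sympy.Mul):  # Rule 2: any factor suffices
        return any(self._is_multiple_of(f, denominator) for f in numerator.args)
    if isinstance(numerator, sympy.Add):  # Rule 3: every term must divide
        return all(self._is_multiple_of(t, denominator) for t in numerator.args)
    if isinstance(numerator, FloorDiv):  # Rule 4
        a, b = numerator.args
        return self._is_multiple_of(a, b * denominator)
    if isinstance(numerator, Mod):  # Rule 5: both operands must divide
        a, b = numerator.args
        return self._is_multiple_of(a, denominator) and self._is_multiple_of(b, denominator)
    # Rule 6: fall back to ShapeEnv axioms such as those from torch._check
    return self.statically_known_true(sympy.Eq(Mod(numerator, denominator), 0))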

Replace the body of statically_known_multiple_of so that it delegates to the new function, that is, it returns the result of _is_multiple_of(numerator, denominator).
