pytorch - 💡(How to fix) Fix torch.compile + TMA path: Illegal Memory Access

pytorch · 2026-05-11 16:17:14


🐛 Describe the bug

"""
Minimal reproducer: Inductor TMA codegen IMA at >= ~43M elements.

Environment:
  - PyTorch: built from source (main)
  - GPU: NVIDIA GB200
  - Inductor config: triton.use_tensor_descriptor=True, assume_aligned_inputs=True

Usage:
  python repro_inductor_tma_ima.py
  CUDA_LAUNCH_BLOCKING=1 python repro_inductor_tma_ima.py
"""

import torch
import torch._inductor.config as inductor_config


def test_compile_tma(M, N):
    """Test torch.compile + TMA at a given shape."""
    inductor_config.triton.use_tensor_descriptor = True
    inductor_config.assume_aligned_inputs = True

    torch._dynamo.reset()
    fn = torch.compile(
        lambda x, r: x + r,
        fullgraph=True,
    )

    x = torch.randn(M, N, dtype=torch.bfloat16, device="cuda")
    r = torch.randn(M, N, dtype=torch.bfloat16, device="cuda")
    n_elem = M * N

    try:
        for _ in range(3):
            out = fn(x, r)
        torch.cuda.synchronize()
        ref = (x.float() + r.float()).bfloat16()
        torch.testing.assert_close(out, ref, rtol=1e-2, atol=2e-2)
        print(f"  ({M:>5}, {N:>5}) = {n_elem:>12,} elements: OK")
        return True
    except (torch.AcceleratorError, RuntimeError) as e:
        print(f"  ({M:>5}, {N:>5}) = {n_elem:>12,} elements: FAIL (IMA)", e)
        return False


def main():

    for M, N in [
        (4096, 4096),   # 16M elements -- OK
        (4096, 8192),   # 32M elements -- OK
        (8192, 4096),   # 32M elements -- OK
        (4096, 12288),  # 48M elements -- FAIL
        (8192, 8192),   # 64M elements -- FAIL
    ]:
        ok = test_compile_tma(M, N)
        if not ok:
            break

if __name__ == "__main__":
    main()
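
The reproducer stops at the first failing shape, so it only brackets the threshold between 32M elements (passing) and 4096 x 12288 (failing). One way to narrow the "~43M" figure is to bisect on the element count. The sketch below is a generic bisection helper, not part of the original report: `find_first_failing` and the `passes` callback are hypothetical names, and the helper assumes failures are monotonic in size, which the report does not establish. Because an illegal memory access leaves the CUDA context in an error state, a real `passes` callback would likely need to run each probe in a fresh Python process rather than call `test_compile_tma` repeatedly in one process.

```python
from typing import Callable


def find_first_failing(lo: int, hi: int, passes: Callable[[int], bool]) -> int:
    """Return the smallest n in (lo, hi] for which passes(n) is False.

    Assumptions (not established by the report): passes(lo) is True,
    passes(hi) is False, and every size above the threshold also fails.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid  # mid still works, so the threshold is above it
        else:
            hi = mid  # mid fails, so the threshold is at or below it
    return hi


if __name__ == "__main__":
    # Bounds taken from the reproducer: 8192 * 4096 passes, 4096 * 12288 fails.
    known_ok = 8192 * 4096
    known_bad = 4096 * 12288

    # Stand-in predicate for demonstration only. The 43_000_000 cutoff is a
    # made-up value; a real predicate would run the compiled kernel at size n
    # in a separate process and report whether it completed without error.
    def fake_passes(n: int) -> bool:
        return n < 43_000_000

    print(find_first_failing(known_ok, known_bad, fake_passes))
```

With the two known bounds, the search needs about 24 probes, since the gap between them is roughly 16.8M elements.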

cc: @kiya00

Versions

main (https://github.com/pytorch/pytorch/commit/8f8409cae86d725a75e2ac54ce8f93def107ced7)

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
