pytorch: [TMA] NameError in `sum` when TMA enabled

Error Message

Triton compilation of the generated combo kernel `triton_per_fused_0` fails. The tensor-descriptor load references `R0_BLOCK`, but the kernel only defines `R0_BLOCK_0` (first sub-kernel) and `R0_BLOCK_1` (second sub-kernel):

    triton.compiler.errors.CompilationError: at 20:108:
            tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
    NameError('R0_BLOCK is not defined')

The error surfaces to the caller as `torch._inductor.exc.InductorError: CompilationError: at 20:108`, and the test `test_combo_kernel_per_subkernel_rblock_name` ends with `FAILED (errors=1)`. The complete log, with its original line breaks, follows the code example below.
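To make the mismatch easy to spot in a long generated kernel, the following is a minimal sketch of a name check. `undefined_block_names` and the two regular expressions are hypothetical helpers written for this page, not part of PyTorch or Triton, and `KERNEL_SRC` is a trimmed copy of the first sub-kernel from the log.

```python
import re

# Trimmed copy of the first sub-kernel from the log (illustrative input only).
KERNEL_SRC = """
def triton_per_fused_0(in_ptr0, in_ptr1, out_ptr0, out_ptr1, XBLOCK : tl.constexpr):
    pid = tl.program_id(0)
    if pid % 2 == 0:
        r0_numel = 128
        R0_BLOCK_0: tl.constexpr = 128
        RBLOCK: tl.constexpr = R0_BLOCK_0
        r0_index = tl.arange(0, R0_BLOCK_0)[None, :]
        tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
"""

# Hypothetical patterns: any upper-case identifier ending in BLOCK or BLOCK_<n>.
BLOCK_NAME = re.compile(r"\b[A-Z][A-Z0-9_]*BLOCK(?:_\d+)?\b")
# A name counts as defined when it is annotated as tl.constexpr
# (either as a kernel argument or as a local constant).
DEFINITION = re.compile(r"\b([A-Z][A-Z0-9_]*BLOCK(?:_\d+)?)\s*:\s*tl\.constexpr")


def undefined_block_names(src):
    """Return block-size names that are used in src but never defined."""
    defined = set(DEFINITION.findall(src))
    used = set(BLOCK_NAME.findall(src))
    return sorted(used - defined)


print(undefined_block_names(KERNEL_SRC))
```

On this input the only name left over is `R0_BLOCK`, which matches the `NameError` in the log. The check is purely textual, so it is a debugging aid under the stated assumptions and not a substitute for compiling the kernel.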

Fix Action

Fix / Workaround

class TestTMAComboKernelNameError(TestCase):
    @inductor_config.patch(
        {
            "triton.use_tensor_descriptor": True,
            "assume_aligned_inputs": True,
            "combo_kernels": True,
            "combo_kernel_per_subkernel_blocks": False,
        }
    )
    def test_combo_kernel_per_subkernel_rblock_name(self):
        def fn(a, b):
            return a.sum(dim=-1), b.sum(dim=-1)
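One plausible reading of the log, stated here as an assumption and not as a description of Inductor's actual code, is that the per-sub-kernel suffix is applied where the block size is defined but not where the tensor descriptor's `block_shape` is emitted. The toy model below only illustrates that class of bug; `block_name` and `emit_subkernel` are hypothetical names invented for this sketch.

```python
def block_name(prefix, subkernel_idx=None):
    """Build a block-size name such as 'R0_BLOCK' or 'R0_BLOCK_0'."""
    base = f"{prefix.upper()}BLOCK"
    return base if subkernel_idx is None else f"{base}_{subkernel_idx}"


def emit_subkernel(idx, rnumel, consistent):
    """Emit two lines of a toy sub-kernel.

    consistent=False mimics the failing output in the log: the definition
    carries the sub-kernel suffix, the descriptor's block_shape does not.
    consistent=True uses the same suffixed name in both places.
    """
    defined = block_name("r0_", idx)
    used = defined if consistent else block_name("r0_")
    return [
        f"{defined}: tl.constexpr = {rnumel}",
        f"block_shape=[XBLOCK, {used}]",
    ]


for line in emit_subkernel(0, 128, consistent=False):
    print(line)
for line in emit_subkernel(0, 128, consistent=True):
    print(line)
```

In the inconsistent variant the second line refers to `R0_BLOCK`, which nothing defines, just as in the failing kernel. Whether the real fix renames the definition or the use is not stated in the source.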

Code Example

import torch
import torch._inductor.config as inductor_config
from torch.testing._internal.common_utils import run_tests, TestCase


class TestTMAComboKernelNameError(TestCase):
    @inductor_config.patch(
        {
            "triton.use_tensor_descriptor": True,
            "assume_aligned_inputs": True,
            "combo_kernels": True,
            "combo_kernel_per_subkernel_blocks": False,
        }
    )
    def test_combo_kernel_per_subkernel_rblock_name(self):
        def fn(a, b):
            return a.sum(dim=-1), b.sum(dim=-1)

        inps = [
            torch.randn(1024, 128, device="cuda"),
            torch.randn(1024, 256, device="cuda"),
        ]
        expected = fn(*inps)
        compiled_fn = torch.compile(fn)
        actual = compiled_fn(*inps)
        torch.testing.assert_close(expected, actual, atol=1e-4, rtol=1e-4)


if __name__ == "__main__":
    run_tests()

---

/opt/pytorch/pytorch/torch/jit/_script.py:365: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
  warnings.warn(
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] Triton compilation failed: triton_per_fused_0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] def triton_per_fused_0(in_ptr0, in_ptr1, out_ptr0, out_ptr1, XBLOCK : tl.constexpr):
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     pid = tl.program_id(0)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     if pid % 2 == 0:
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         pid_offset = pid // 2
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xnumel = 1024
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_numel = 128
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         R0_BLOCK_0: tl.constexpr = 128
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rnumel = r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         RBLOCK: tl.constexpr = R0_BLOCK_0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xoffset = pid_offset * XBLOCK
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xmask = xindex < xnumel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_index = tl.arange(0, R0_BLOCK_0)[None, :]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_offset = 0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_mask = r0_index < r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         roffset = r0_offset
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rindex = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_1 = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         x0 = xindex
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK_0])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp3 = tl.where(r0_mask & xmask, tmp1, 0)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp4 = tl.sum(tmp3, 1)[:, None].to(tl.float32)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tl.make_tensor_descriptor(out_ptr0, shape=[1024], strides=[1], block_shape=[XBLOCK]).store([xoffset], tl.reshape(tl.broadcast_to(tmp4, [XBLOCK, 1]), [XBLOCK]).to(tl.float32))
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     elif pid % 2 == 1:
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         pid_offset = pid // 2
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xnumel = 1024
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_numel = 256
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         R0_BLOCK_1: tl.constexpr = 256
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rnumel = r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         RBLOCK: tl.constexpr = R0_BLOCK_1
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xoffset = pid_offset * XBLOCK
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xmask = xindex < xnumel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_index = tl.arange(0, R0_BLOCK_1)[None, :]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_offset = 0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_mask = r0_index < r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         roffset = r0_offset
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rindex = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_3 = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         x2 = xindex
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp5 = tl.make_tensor_descriptor(in_ptr1, shape=[1024, 256], strides=[256, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp6 = tl.broadcast_to(tmp5, [XBLOCK, R0_BLOCK_1])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp8 = tl.where(r0_mask & xmask, tmp6, 0)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp9 = tl.sum(tmp8, 1)[:, None].to(tl.float32)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tl.make_tensor_descriptor(out_ptr1, shape=[1024], strides=[1], block_shape=[XBLOCK]).store([xoffset], tl.reshape(tl.broadcast_to(tmp9, [XBLOCK, 1]), [XBLOCK]).to(tl.float32))
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     else:
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         pass
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] 
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] metadata: {'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'out_ptr0': '*fp32', 'out_ptr1': '*fp32', 'XBLOCK': 'constexpr'}, 'device': 0, 'constants': {'XBLOCK': 4}, 'enable_fp_fusion': True, 'launch_pdl': False, 'disable_ftz': False, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 1, 'num_stages': 1, 'debug': True, 'cc': 90}
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] Traceback (most recent call last):
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]   File "/opt/pytorch/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 1090, in _precompile_config
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     binary = triton.compile(*compile_args, **compile_kwargs)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 307, in compile
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     module = src.make_ir(target, options, codegen_fns, module_map, context)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] triton.compiler.errors.CompilationError: at 20:108:
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         RBLOCK: tl.constexpr = R0_BLOCK_0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xoffset = pid_offset * XBLOCK
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xmask = xindex < xnumel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_index = tl.arange(0, R0_BLOCK_0)[None, :]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_offset = 0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_mask = r0_index < r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         roffset = r0_offset
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rindex = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_1 = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         x0 = xindex
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]                                                                                                             ^
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] NameError('R0_BLOCK is not defined')
E
======================================================================
ERROR: test_combo_kernel_per_subkernel_rblock_name (__main__.TestTMAComboKernelNameError.test_combo_kernel_per_subkernel_rblock_name)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3558, in wrapper
    method(*args, **kwargs)
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/name_repro.py", line 25, in test_combo_kernel_per_subkernel_rblock_name
    actual = compiled_fn(*inps)
             ^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 1183, in compile_wrapper
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1078, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/opt/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1058, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1846, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1607, in codegen_and_compile
    compiled_module = graph.compile_to_module()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/graph.py", line 2661, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/graph.py", line 2671, in _compile_to_module
    mod = self._compile_to_module_lines(wrapper_code)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/graph.py", line 2746, in _compile_to_module_lines
    mod = PyCodeCache.load_by_key_path(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/codecache.py", line 4412, in load_by_key_path
    mod = _reload_python_module(key, path, set_sys_modules=in_toplevel)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 35, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_root/rs/crsvukvqtx6mswi7ikxqj7jkaudkohiaimweuv7cdehmbjwm2v7o.py", line 42, in <module>
    triton_per_fused_0 = async_compile.triton('triton_per_fused_0', '''
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/async_compile.py", line 514, in triton
    kernel.precompile(
  File "/opt/pytorch/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 626, in precompile
    self._precompile_worker()
  File "/opt/pytorch/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 648, in _precompile_worker
    compile_results.append(self._precompile_config(c))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 1090, in _precompile_config
    binary = triton.compile(*compile_args, **compile_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 307, in compile
    module = src.make_ir(target, options, codegen_fns, module_map, context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._inductor.exc.InductorError: CompilationError: at 20:108:
        RBLOCK: tl.constexpr = R0_BLOCK_0
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
        xmask = xindex < xnumel
        r0_index = tl.arange(0, R0_BLOCK_0)[None, :]
        r0_offset = 0
        r0_mask = r0_index < r0_numel
        roffset = r0_offset
        rindex = r0_index
        r0_1 = r0_index
        x0 = xindex
        tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
                                                                                                            ^
NameError('R0_BLOCK is not defined')

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"


To execute this test, run the following from the base repo dir:
    python name_repro.py TestTMAComboKernelNameError.test_combo_kernel_per_subkernel_rblock_name

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 2.638s

FAILED (errors=1)

🐛 Describe the bug

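Compiling two row-wise `sum` reductions with tensor descriptors (TMA) and combo kernels enabled fails during Triton compilation with `NameError('R0_BLOCK is not defined')`. In the generated combo kernel each sub-kernel defines a suffixed constant (`R0_BLOCK_0`, `R0_BLOCK_1`), but the `tl.make_tensor_descriptor(...)` load still passes the unsuffixed `R0_BLOCK` in `block_shape`.

Repro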

import torch
import torch._inductor.config as inductor_config
from torch.testing._internal.common_utils import run_tests, TestCase


class TestTMAComboKernelNameError(TestCase):
    @inductor_config.patch(
        {
            "triton.use_tensor_descriptor": True,
            "assume_aligned_inputs": True,
            "combo_kernels": True,
            "combo_kernel_per_subkernel_blocks": False,
        }
    )
    def test_combo_kernel_per_subkernel_rblock_name(self):
        def fn(a, b):
            return a.sum(dim=-1), b.sum(dim=-1)

        inps = [
            torch.randn(1024, 128, device="cuda"),
            torch.randn(1024, 256, device="cuda"),
        ]
        expected = fn(*inps)
        compiled_fn = torch.compile(fn)
        actual = compiled_fn(*inps)
        torch.testing.assert_close(expected, actual, atol=1e-4, rtol=1e-4)


if __name__ == "__main__":
    run_tests()

Error logs

/opt/pytorch/pytorch/torch/jit/_script.py:365: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
  warnings.warn(
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] Triton compilation failed: triton_per_fused_0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] def triton_per_fused_0(in_ptr0, in_ptr1, out_ptr0, out_ptr1, XBLOCK : tl.constexpr):
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     pid = tl.program_id(0)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     if pid % 2 == 0:
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         pid_offset = pid // 2
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xnumel = 1024
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_numel = 128
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         R0_BLOCK_0: tl.constexpr = 128
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rnumel = r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         RBLOCK: tl.constexpr = R0_BLOCK_0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xoffset = pid_offset * XBLOCK
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xmask = xindex < xnumel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_index = tl.arange(0, R0_BLOCK_0)[None, :]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_offset = 0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_mask = r0_index < r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         roffset = r0_offset
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rindex = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_1 = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         x0 = xindex
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK_0])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp3 = tl.where(r0_mask & xmask, tmp1, 0)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp4 = tl.sum(tmp3, 1)[:, None].to(tl.float32)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tl.make_tensor_descriptor(out_ptr0, shape=[1024], strides=[1], block_shape=[XBLOCK]).store([xoffset], tl.reshape(tl.broadcast_to(tmp4, [XBLOCK, 1]), [XBLOCK]).to(tl.float32))
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     elif pid % 2 == 1:
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         pid_offset = pid // 2
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xnumel = 1024
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_numel = 256
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         R0_BLOCK_1: tl.constexpr = 256
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rnumel = r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         RBLOCK: tl.constexpr = R0_BLOCK_1
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xoffset = pid_offset * XBLOCK
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xmask = xindex < xnumel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_index = tl.arange(0, R0_BLOCK_1)[None, :]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_offset = 0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_mask = r0_index < r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         roffset = r0_offset
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rindex = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_3 = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         x2 = xindex
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp5 = tl.make_tensor_descriptor(in_ptr1, shape=[1024, 256], strides=[256, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp6 = tl.broadcast_to(tmp5, [XBLOCK, R0_BLOCK_1])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp8 = tl.where(r0_mask & xmask, tmp6, 0)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp9 = tl.sum(tmp8, 1)[:, None].to(tl.float32)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tl.make_tensor_descriptor(out_ptr1, shape=[1024], strides=[1], block_shape=[XBLOCK]).store([xoffset], tl.reshape(tl.broadcast_to(tmp9, [XBLOCK, 1]), [XBLOCK]).to(tl.float32))
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     else:
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         pass
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] 
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] metadata: {'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'out_ptr0': '*fp32', 'out_ptr1': '*fp32', 'XBLOCK': 'constexpr'}, 'device': 0, 'constants': {'XBLOCK': 4}, 'enable_fp_fusion': True, 'launch_pdl': False, 'disable_ftz': False, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]]}], 'device_type': 'cuda', 'num_warps': 1, 'num_stages': 1, 'debug': True, 'cc': 90}
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] Traceback (most recent call last):
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]   File "/opt/pytorch/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 1090, in _precompile_config
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     binary = triton.compile(*compile_args, **compile_kwargs)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 307, in compile
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     module = src.make_ir(target, options, codegen_fns, module_map, context)
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] triton.compiler.errors.CompilationError: at 20:108:
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         RBLOCK: tl.constexpr = R0_BLOCK_0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xoffset = pid_offset * XBLOCK
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         xmask = xindex < xnumel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_index = tl.arange(0, R0_BLOCK_0)[None, :]
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_offset = 0
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_mask = r0_index < r0_numel
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         roffset = r0_offset
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         rindex = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         r0_1 = r0_index
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         x0 = xindex
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]         tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0]                                                                                                             ^
E0528 06:26:02.498000 168264 torch/_inductor/runtime/triton_heuristics.py:1092] [0/0] NameError('R0_BLOCK is not defined')
E
======================================================================
ERROR: test_combo_kernel_per_subkernel_rblock_name (__main__.TestTMAComboKernelNameError.test_combo_kernel_per_subkernel_rblock_name)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3558, in wrapper
    method(*args, **kwargs)
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/name_repro.py", line 25, in test_combo_kernel_per_subkernel_rblock_name
    actual = compiled_fn(*inps)
             ^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 1183, in compile_wrapper
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1078, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/opt/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1058, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1846, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1607, in codegen_and_compile
    compiled_module = graph.compile_to_module()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/graph.py", line 2661, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/graph.py", line 2671, in _compile_to_module
    mod = self._compile_to_module_lines(wrapper_code)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/graph.py", line 2746, in _compile_to_module_lines
    mod = PyCodeCache.load_by_key_path(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/codecache.py", line 4412, in load_by_key_path
    mod = _reload_python_module(key, path, set_sys_modules=in_toplevel)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 35, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_root/rs/crsvukvqtx6mswi7ikxqj7jkaudkohiaimweuv7cdehmbjwm2v7o.py", line 42, in <module>
    triton_per_fused_0 = async_compile.triton('triton_per_fused_0', '''
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/async_compile.py", line 514, in triton
    kernel.precompile(
  File "/opt/pytorch/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 626, in precompile
    self._precompile_worker()
  File "/opt/pytorch/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 648, in _precompile_worker
    compile_results.append(self._precompile_config(c))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/torch/_inductor/runtime/triton_heuristics.py", line 1090, in _precompile_config
    binary = triton.compile(*compile_args, **compile_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 307, in compile
    module = src.make_ir(target, options, codegen_fns, module_map, context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch._inductor.exc.InductorError: CompilationError: at 20:108:
        RBLOCK: tl.constexpr = R0_BLOCK_0
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
        xmask = xindex < xnumel
        r0_index = tl.arange(0, R0_BLOCK_0)[None, :]
        r0_offset = 0
        r0_mask = r0_index < r0_numel
        roffset = r0_offset
        rindex = r0_index
        r0_1 = r0_index
        x0 = xindex
        tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
                                                                                                            ^
NameError('R0_BLOCK is not defined')

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"


To execute this test, run the following from the base repo dir:
    python name_repro.py TestTMAComboKernelNameError.test_combo_kernel_per_subkernel_rblock_name

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 2.638s

FAILED (errors=1)
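
For triage, here is a minimal stand-alone sketch (not part of the original report) that confirms the name mismatch without a GPU or a PyTorch build. It takes the first sub-kernel branch from the log above, dedented and trimmed to the lines up to the failing load, and lists names that are read before they are assigned. `KERNEL_BRANCH` and `undefined_names` are hypothetical names introduced for illustration; `tl`, `pid`, `XBLOCK`, and `in_ptr0` are assumed to be supplied by the enclosing kernel.

```python
import ast
import builtins

# First sub-kernel branch, copied from the generated kernel in the log and
# dedented so it parses as plain Python. Nothing is executed, only parsed.
KERNEL_BRANCH = """
pid_offset = pid // 2
xnumel = 1024
r0_numel = 128
R0_BLOCK_0: tl.constexpr = 128
rnumel = r0_numel
RBLOCK: tl.constexpr = R0_BLOCK_0
xoffset = pid_offset * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = xindex < xnumel
r0_index = tl.arange(0, R0_BLOCK_0)[None, :]
r0_offset = 0
r0_mask = r0_index < r0_numel
tmp0 = tl.make_tensor_descriptor(in_ptr0, shape=[1024, 128], strides=[128, 1], block_shape=[XBLOCK, R0_BLOCK]).load([xoffset, r0_offset])
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK_0])
"""


def undefined_names(source, known):
    """Return names read before any assignment, in order of first use.

    Hypothetical helper: handles straight-line code only (no branches,
    loops, or nested scopes), which is enough for this excerpt.
    """
    defined = set(known) | set(dir(builtins))
    missing = []
    for stmt in ast.parse(source).body:
        # Check every name the statement reads against what is defined so far.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id not in defined and node.id not in missing:
                    missing.append(node.id)
        # Then record the names the statement assigns.
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                defined.add(node.id)
    return missing


# Names assumed to come from the kernel signature and the Triton import.
KNOWN = {"tl", "pid", "XBLOCK", "in_ptr0"}
print(undefined_names(KERNEL_BRANCH, KNOWN))  # ['R0_BLOCK']
```

Replacing `R0_BLOCK]` with `R0_BLOCK_0]` in the excerpt makes the check return an empty list, which matches the reading that only the `block_shape` argument of the tensor-descriptor load misses the per-sub-kernel suffix. This says nothing about where in Inductor's codegen the unsuffixed name is emitted.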

Versions

PyTorch main at commit 339e39645a25968c7cd3fb81dd7e48ca3a31f9c2, on an H100 GPU.

cc @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
