pytorch - ✅(Solved) Fix `torch.compile` crashes on `cumsum(x + broadcast_bias)` when scan dim >= 129 [1 pull requests, 1 comments, 2 participants]

pytorch • 2026-04-13 16:14:51


pytorch/pytorch#180221 • Fetched 2026-04-15 06:19:17


Author

wuyii8941

Participants

liqiangxl

wuyii8941

Timeline (top)

mentioned ×36, subscribed ×36, labeled ×4, commented ×1

Error Message

torch._inductor.exc.InductorError: TypeError: list indices must be integers or slices, not NoneType

Root Cause

The crash is in Inductor's SplitScan codegen path. When the scan dimension is <= 128, Inductor uses a single-block scan, which works. At 129 or more, it switches to the multi-block SplitScan path. There, TritonSplitScanKernel.initialize_range_tree sets tensor_dim=None for pointwise range trees, and the codegen later uses that None value to index a list when loading broadcast variables, which raises the TypeError above.
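
The failure mode itself can be reproduced in plain Python, outside Inductor. The sketch below is illustrative only: `block_shape` and `lookup` are hypothetical stand-ins, not Inductor's actual data structures or functions.

```python
# Illustrative stand-in, not Inductor code: a block shape stored as a list
# and indexed by a range tree's tensor_dim.
block_shape = ["YBLOCK", "RBLOCK"]


def lookup(shape, tensor_dim):
    # Fine when tensor_dim is an int; raises TypeError when it is None.
    return shape[tensor_dim]


try:
    lookup(block_shape, None)
except TypeError as exc:
    message = str(exc)
    print(message)
```

The message captured here has the same wording as the TypeError that Inductor wraps in InductorError.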

PR fix notes

PR #180369: [inductor] Fix torch.compile crash on cumsum with broadcast input when scan dim >= 129

Repository: pytorch/pytorch
Author: liqiangxl
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/180369

Description (problem / solution / changelog)

Fixes https://github.com/pytorch/pytorch/issues/180221

SplitScan kernels set no_x_dim=True, so the "x" range tree gets tensor_dim=None. When a broadcast variable (e.g. bias) references this tree, get_block_shape does shape[None] and crashes with TypeError.

Fix: treat tensor_dim=None as scalar shape () in get_block_shape, matching how other scalar symbols (e.g. SymT.SIZE) are handled.
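
As a rough sketch of the idea behind the fix, assuming a simplified model of the lookup: the helper name `block_shape_for` and its signature are hypothetical, and the actual change in torch/_inductor/codegen/triton.py may look different.

```python
from typing import Optional, Sequence, Tuple


def block_shape_for(shape: Sequence[str], tensor_dim: Optional[int]) -> Tuple[str, ...]:
    """Hypothetical helper: the block shape contributed by one range tree."""
    if tensor_dim is None:
        # No tensor dimension was assigned to this tree, so treat the
        # symbol as a scalar with shape () instead of indexing with None.
        return ()
    return (shape[tensor_dim],)


print(block_shape_for(["YBLOCK", "RBLOCK"], None))
print(block_shape_for(["YBLOCK", "RBLOCK"], 1))
```

The guard runs before the index operation, so the None case never reaches `shape[tensor_dim]`.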

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

test/inductor/test_torchinductor.py (modified, +9/-0)
torch/_inductor/codegen/triton.py (modified, +9/-2)


🐛 Describe the bug

torch.compile(backend='inductor') crashes with TypeError: list indices must be integers or slices, not NoneType when compiling torch.cumsum over a tensor that includes a broadcast addition with a bias vector, and the scan dimension size is >= 129.

The same crash affects torch.cumprod, torch.logcumsumexp, and torch._higher_order_ops.associative_scan.

Reproducer

import torch

x = torch.randn(1, 129, 64, device='cuda')
b = torch.randn(64, device='cuda')

# Works
print(torch.cumsum(x + b, dim=1).shape)

# Crashes
print(torch.compile(lambda x, b: torch.cumsum(x + b, dim=1))(x, b).shape)

Error:

torch._inductor.exc.InductorError: TypeError: list indices must be integers or slices, not NoneType

Trigger conditions

The crash requires both:

A scan op (cumsum/cumprod/logcumsumexp) with scan dim size >= 129
A broadcast variable in the fused input (e.g., adding a bias vector)

Without broadcast (standalone cumsum(x)): compiles fine at any size. With broadcast but scan dim <= 128: compiles fine (uses single-block scan, not SplitScan).

Scan dim	`cumsum(x)`	`cumsum(x + bias)`
128	OK	OK
129	OK	CRASH
256	OK	CRASH
512	OK	CRASH
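
The two trigger conditions can be written as a small predicate that reproduces the table above. This is a sketch based only on the behavior reported in this issue: `SINGLE_BLOCK_MAX` and `expected_to_crash` are hypothetical names, and the 128 threshold is taken from the report, not read from Inductor's configuration.

```python
# Threshold as reported in this issue (assumption, not read from Inductor).
SINGLE_BLOCK_MAX = 128


def expected_to_crash(scan_dim_size: int, has_broadcast_input: bool) -> bool:
    """True when both reported trigger conditions hold on an unpatched build."""
    return scan_dim_size > SINGLE_BLOCK_MAX and has_broadcast_input


for size in (128, 129, 256, 512):
    plain = "CRASH" if expected_to_crash(size, False) else "OK"
    fused = "CRASH" if expected_to_crash(size, True) else "OK"
    print(size, plain, fused)
```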

Versions

PyTorch 2.11.0+cu126: CRASH
PyTorch 2.12.0.dev20260410+cu126 (nightly): CRASH
GPU: NVIDIA Tesla T4 (sm_75), but not GPU-specific (codegen crash before kernel launch)

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

Until the fix in PR #180369 is available in your build, work around the crash by not using torch.compile with backend='inductor' for scan operations whose scan dimension size is 129 or more and whose fused input includes a broadcast variable.

Guidance

Identify the affected scan operations (torch.cumsum, torch.cumprod, torch.logcumsumexp, torch._higher_order_ops.associative_scan) and check whether the scan dimension size is 129 or more.
Verify if a broadcast variable is used in the operation, as this is a required condition for the crash to occur.
Consider using a different backend or avoiding compilation for these specific operations until a fix is available.
Test with smaller scan dimension sizes (less than 129) to confirm if the issue is specific to the SplitScan codegen path.

Example

import torch

x = torch.randn(1, 129, 64, device='cuda')
b = torch.randn(64, device='cuda')

# Workaround 1: run the affected scan in eager mode (no torch.compile).
# Eager mode works at any scan dim size, including 129 and above.
print(torch.cumsum(x + b, dim=1).shape)

# Workaround 2: keep the broadcast variable out of the compiled scan.
# A standalone cumsum(x) is reported to compile fine at any size.
print(torch.compile(lambda x: torch.cumsum(x, dim=1))(x).shape)

Notes

The issue is not GPU-specific and reproduces on both PyTorch 2.11.0+cu126 and the 2.12.0.dev20260410+cu126 nightly.
The crash happens before kernel launch, indicating a codegen issue.

Recommendation

Apply workaround: Avoid using torch.compile with backend='inductor' for affected scan operations until a fix is available, as this allows the code to run without crashing.
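
One way to apply this without scattering special cases through a codebase is a wrapper that tries the compiled function and falls back to eager if it raises. This is a minimal sketch under stated assumptions: `with_eager_fallback` is a hypothetical helper, not a PyTorch API, and it relies on the compile error surfacing when the compiled function is called, as in the reproducer.

```python
from typing import Any, Callable


def with_eager_fallback(compiled_fn: Callable[..., Any],
                        eager_fn: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical helper: call compiled_fn, switch to eager_fn if it raises."""
    use_eager = False

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        nonlocal use_eager
        if not use_eager:
            try:
                return compiled_fn(*args, **kwargs)
            except Exception:
                # Broad on purpose for this sketch; narrow it in real code.
                use_eager = True
        return eager_fn(*args, **kwargs)

    return wrapper


# Intended use with PyTorch (not executed here):
#   def scan(x, b):
#       return torch.cumsum(x + b, dim=1)
#   safe_scan = with_eager_fallback(torch.compile(scan), scan)
```

Catching every exception can hide unrelated bugs, so this is best treated as a temporary measure and removed once a build containing the fix is in use.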
