pytorch: Incorrect output for non-power-of-2-sized fusion of scalar broadcast and scatter-add [2 comments, 3 participants]

GitHub stats
pytorch/pytorch#178871, fetched 2026-04-08 01:57:23
Comments: 2
Participants: 3
Timeline events: 159
Reactions: 0
Timeline (top): mentioned ×74, subscribed ×74, labeled ×6, cross-referenced ×3


🐛 Describe the bug

When inductor fuses a scalar broadcast with scatter_reduce(..., reduce="sum"), the generated Triton kernel uses mask=None on tl.atomic_add instead of mask=xmask. XBLOCK is always a power of 2, so when N is not a power of 2 there are XBLOCK - N out-of-bounds threads — they each add the broadcast scalar to the output, giving a wrong result with no error raised.

The error is exactly (next_power_of_2(N) - N) * scalar and is zero whenever N is a power of 2.
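As a worked example of this formula, the sketch below evaluates it for a few sizes using the scalar 0.020368 from the reproducer. The helper names next_power_of_2 and expected_error are illustrative only and are not part of PyTorch.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of 2 that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def expected_error(n: int, scalar: float) -> float:
    """Error predicted by the report: one extra `scalar` per out-of-bounds lane."""
    return (next_power_of_2(n) - n) * scalar


# Sizes 4 and 64 are powers of 2, so no lanes are out of bounds there.
for n in (4, 5, 33, 48, 64):
    extra_lanes = next_power_of_2(n) - n
    print(f"N={n:2d}  extra_lanes={extra_lanes:2d}  error={expected_error(n, 0.020368):+.6f}")
```

For N=5 the block is padded to 8 lanes, so three lanes each contribute the scalar once.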

Minimal reproducer:

import torch


def f(x, bias, idx):
    src = x + bias  # scalar broadcast fused with scatter by inductor
    out = torch.zeros(1, 1, device=x.device, dtype=x.dtype)
    return out.scatter_reduce(
        0, idx.unsqueeze(-1).expand_as(src), src, "sum", include_self=True
    )


for N in range(1, 65):
    device = torch.device("cuda")

    x    = torch.randn(N, 1, device=device)
    bias = torch.tensor([0.020368], device=device)
    idx  = torch.zeros(N, dtype=torch.long, device=device)

    ref = f(x, bias, idx)
    torch.compiler.reset()
    test = torch.compile(f, backend="inductor", fullgraph=True, dynamic=False)(x, bias, idx)

    diff      = test.item() - ref.item()
    next_pow2 = 1 << (N - 1).bit_length()
    expected  = (next_pow2 - N) * bias.item()
    print(f"N={N:2d}  diff={diff:+.6f}  expected={expected:+.6f}")

The generated kernel (trimmed):

tmp0 = tl.load(in_ptr0 + x0, xmask)  # x     — masked, 0.0 for OOB threads
tmp1 = tl.load(in_ptr1 + 0)           # bias  — scalar, no mask needed
tmp2 = tmp0 + tmp1                    # bias for OOB threads (tmp0 == 0.0)

tl.atomic_add(out_ptr0 + ..., tmp2, None,  sem='relaxed')  # BUG: should be xmask

OOB threads add bias unconditionally. Total error: (XBLOCK - N) * bias.
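The mechanism can be emulated in plain Python without a GPU. This is a minimal sketch, not Triton code: simulate_block and mask_atomic are hypothetical names, the masked load is assumed to yield 0.0 for out-of-bounds lanes (as the kernel comment above states), and XBLOCK is assumed to be the next power of 2 at or above N.

```python
def next_power_of_2(n: int) -> int:
    return 1 << (n - 1).bit_length()


def simulate_block(x, bias, xblock, mask_atomic):
    """Emulate one block of `xblock` lanes accumulating into a single output slot."""
    n = len(x)
    out = 0.0  # the reproducer scatters every element to index 0
    for lane in range(xblock):
        in_bounds = lane < n                  # plays the role of xmask
        tmp0 = x[lane] if in_bounds else 0.0  # masked load: 0.0 for OOB lanes
        tmp2 = tmp0 + bias                    # OOB lanes now hold `bias`
        if in_bounds or not mask_atomic:      # mask=None lets every lane add
            out += tmp2
    return out


x = [0.5, -1.25, 2.0, 0.75, -0.5]  # N = 5; values chosen to be exact in binary
bias = 0.25
xblock = next_power_of_2(len(x))   # 8 lanes, 3 of them out of bounds

buggy = simulate_block(x, bias, xblock, mask_atomic=False)
fixed = simulate_block(x, bias, xblock, mask_atomic=True)
print("buggy - fixed     =", buggy - fixed)
print("(XBLOCK - N)*bias =", (xblock - len(x)) * bias)
```

Both printed values agree, which is the relationship the report states for the real kernel.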

Error logs

No exception — the output is silently wrong.

N= 1  diff=+0.000000  expected=+0.000000
N= 2  diff=+0.000000  expected=+0.000000
N= 3  diff=+0.020368  expected=+0.020368
N= 4  diff=+0.000000  expected=+0.000000
N= 5  diff=+0.061104  expected=+0.061104
N= 6  diff=+0.040736  expected=+0.040736
N= 7  diff=+0.020368  expected=+0.020368
N= 8  diff=+0.000000  expected=+0.000000
N= 9  diff=+0.142576  expected=+0.142576
...
N=16  diff=+0.000000  expected=+0.000000
N=17  diff=+0.305520  expected=+0.305520
...
N=32  diff=+0.000000  expected=+0.000000
N=33  diff=+0.631407  expected=+0.631408
...
N=48  diff=+0.325887  expected=+0.325888
...
N=64  diff=+0.000000  expected=+0.000000

Versions

PyTorch version: 2.10.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0
Python version: 3.11.15 | packaged by conda-forge | (main, Mar  5 2026, 16:45:40) [GCC 14.3.0] (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 12.8.93
GPU models and configuration: GPU 0: NVIDIA TITAN V
Nvidia driver version: 570.195.03
triton: 3.6.0

cc @pragupta @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

The likely fix is for inductor to emit mask=xmask instead of mask=None on the tl.atomic_add in the generated Triton kernel, so that out-of-bounds lanes no longer add the broadcast scalar to the output.

Guidance

  • Locate the tl.atomic_add call in the generated Triton kernel and confirm that its mask argument is None. Passing xmask instead restricts the add to the N in-bounds lanes.
  • Re-run the minimal reproducer after the change and check that diff is zero (up to floating-point rounding) for every N from 1 to 64, not only for powers of 2.
  • Restricting N to powers of 2 avoids the symptom in this reproducer, since the error is zero there, but it is only a stopgap: it does not fix the missing mask.
  • The lasting fix belongs in inductor itself: find out why the code generation emits mask=None for this fusion and correct it there. The bug is already tracked as pytorch/pytorch#178871.

Example

tl.atomic_add(out_ptr0 + ..., tmp2, xmask, sem='relaxed')  # Fix: use xmask instead of None
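To check a candidate fix, the eager and compiled results can be compared over the sizes where the bug appears. The following is a sketch under stated assumptions: the helper names non_power_of_two_sizes and check_compiled_matches_eager are hypothetical, the tolerances are a guess, and a CUDA device is required to run the comparison itself.

```python
def non_power_of_two_sizes(max_n):
    """Sizes in [1, max_n] that are not powers of 2, where the report sees errors."""
    return [n for n in range(1, max_n + 1) if (n & (n - 1)) != 0]


def check_compiled_matches_eager(max_n=64):
    """Raise AssertionError if the inductor result differs from eager mode."""
    import torch  # imported lazily so the size helper works without PyTorch

    def f(x, bias, idx):
        # Same computation as the reproducer in this report.
        src = x + bias
        out = torch.zeros(1, 1, device=x.device, dtype=x.dtype)
        return out.scatter_reduce(
            0, idx.unsqueeze(-1).expand_as(src), src, "sum", include_self=True
        )

    device = torch.device("cuda")
    bias = torch.tensor([0.020368], device=device)
    for n in non_power_of_two_sizes(max_n):
        x = torch.randn(n, 1, device=device)
        idx = torch.zeros(n, dtype=torch.long, device=device)
        ref = f(x, bias, idx)
        torch.compiler.reset()
        compiled = torch.compile(f, backend="inductor", fullgraph=True, dynamic=False)
        # Tolerances are an assumption; the smallest reported error (0.020368)
        # is far larger than either of them.
        torch.testing.assert_close(compiled(x, bias, idx), ref, rtol=1e-4, atol=1e-4)
```

Calling check_compiled_matches_eager() on a CUDA machine should raise on an affected build and return silently once the mask is applied.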

Notes

The trimmed kernel already defines xmask, where it guards the load of x, so the same mask can be passed to tl.atomic_add. If other kernels produced by this fusion do not define it, further changes may be needed before the mask can be applied.

Recommendation

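Patch the generated Triton kernel to pass xmask to tl.atomic_add in order to confirm the diagnosis; with the mask in place the output should be correct for every N. Because the kernel is regenerated whenever inductor recompiles, treat this as a diagnostic step and rely on a fix in inductor's code generation for a lasting solution.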
