pytorch - 💡(How to fix) Fix Incorrect output for non-power-of-2-sized fusion of scalar broadcast and scatter-add [2 comments, 3 participants]

pytorch • 2026-03-31 13:53:09


pytorch/pytorch#178871 • Fetched 2026-04-08 01:57:23


Error Message

When inductor fuses a scalar broadcast with scatter_reduce(..., reduce="sum"), the generated Triton kernel passes mask=None to tl.atomic_add instead of mask=xmask. XBLOCK is always a power of 2, so when N is not a power of 2 the block contains XBLOCK - N out-of-bounds threads. Their masked loads of x return 0.0, so each of them adds the broadcast scalar to the output, and the result is wrong with no error raised. The error is exactly (next_power_of_2(N) - N) * scalar, that is (XBLOCK - N) * bias in the kernel's terms, and is zero whenever N is a power of 2.


🐛 Describe the bug

The error is exactly (next_power_of_2(N) - N) * scalar and is zero whenever N is a power of 2.

Minimal reproducer:

import torch


def f(x, bias, idx):
    src = x + bias  # scalar broadcast fused with scatter by inductor
    out = torch.zeros(1, 1, device=x.device, dtype=x.dtype)
    return out.scatter_reduce(
        0, idx.unsqueeze(-1).expand_as(src), src, "sum", include_self=True
    )


for N in range(1, 65):
    device = torch.device("cuda")

    x    = torch.randn(N, 1, device=device)
    bias = torch.tensor([0.020368], device=device)
    idx  = torch.zeros(N, dtype=torch.long, device=device)

    ref = f(x, bias, idx)
    torch.compiler.reset()
    test = torch.compile(f, backend="inductor", fullgraph=True, dynamic=False)(x, bias, idx)

    diff      = test.item() - ref.item()
    next_pow2 = 1 << (N - 1).bit_length()
    expected  = (next_pow2 - N) * bias.item()
    print(f"N={N:2d}  diff={diff:+.6f}  expected={expected:+.6f}")

The generated kernel (trimmed):

tmp0 = tl.load(in_ptr0 + x0, xmask)  # x     — masked, 0.0 for OOB threads
tmp1 = tl.load(in_ptr1 + 0)           # bias  — scalar, no mask needed
tmp2 = tmp0 + tmp1                    # bias for OOB threads (tmp0 == 0.0)

tl.atomic_add(out_ptr0 + ..., tmp2, None,  sem='relaxed')  # BUG: should be xmask

OOB threads add bias unconditionally. Total error: (XBLOCK - N) * bias.
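
To see where the (XBLOCK - N) * bias term comes from, the lane behavior of the trimmed kernel can be modeled in plain Python. This is only a sketch, not Inductor's code: simulate_kernel and its arguments are hypothetical names, a single block with XBLOCK equal to the next power of 2 is assumed, and every lane targets the same output slot, as in the reproducer.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of 2 that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def simulate_kernel(x, bias, use_mask):
    """Model one Triton block that scatter-adds x + bias into a single slot."""
    n = len(x)
    xblock = next_power_of_2(n)  # assumption: one block covers all of x
    out = 0.0
    for lane in range(xblock):
        xmask = lane < n
        tmp0 = x[lane] if xmask else 0.0  # masked load: OOB lanes read 0.0
        tmp2 = tmp0 + bias                # OOB lanes now hold exactly bias
        if xmask or not use_mask:         # use_mask=False models mask=None
            out += tmp2
    return out


x = [0.5, -1.25, 2.0, 0.75, -0.5]  # N = 5, so XBLOCK = 8 and 3 lanes are OOB
bias = 0.020368
buggy = simulate_kernel(x, bias, use_mask=False)
fixed = simulate_kernel(x, bias, use_mask=True)
print(f"error = {buggy - fixed:+.6f}")  # error = +0.061104, i.e. 3 * bias
```

The printed value matches the N=5 row of the reproducer output. With a power-of-2 length there are no OOB lanes and the two variants agree.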

Error logs

No exception — the output is silently wrong.

N= 1  diff=+0.000000  expected=+0.000000
N= 2  diff=+0.000000  expected=+0.000000
N= 3  diff=+0.020368  expected=+0.020368
N= 4  diff=+0.000000  expected=+0.000000
N= 5  diff=+0.061104  expected=+0.061104
N= 6  diff=+0.040736  expected=+0.040736
N= 7  diff=+0.020368  expected=+0.020368
N= 8  diff=+0.000000  expected=+0.000000
N= 9  diff=+0.142576  expected=+0.142576
...
N=16  diff=+0.000000  expected=+0.000000
N=17  diff=+0.305520  expected=+0.305520
...
N=32  diff=+0.000000  expected=+0.000000
N=33  diff=+0.631407  expected=+0.631408
...
N=48  diff=+0.325887  expected=+0.325888
...
N=64  diff=+0.000000  expected=+0.000000
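
The rows above can be checked against the formula without a GPU. The sketch below copies the logged diff values into a dictionary and compares each with (next_power_of_2(N) - N) * bias; predicted_error and the 2e-6 tolerance are illustrative choices, the latter sized to absorb the last-digit rounding seen at N=33 and N=48.

```python
BIAS = 0.020368

# diff values copied from the log above (elided rows are left out)
logged = {
    1: 0.0, 2: 0.0, 3: 0.020368, 4: 0.0, 5: 0.061104, 6: 0.040736,
    7: 0.020368, 8: 0.0, 9: 0.142576, 16: 0.0, 17: 0.305520,
    32: 0.0, 33: 0.631407, 48: 0.325887, 64: 0.0,
}


def predicted_error(n: int, bias: float) -> float:
    """(next_power_of_2(n) - n) * bias, the error claimed in the report."""
    next_pow2 = 1 << (n - 1).bit_length()
    return (next_pow2 - n) * bias


mismatches = [
    n for n, diff in logged.items()
    if abs(diff - predicted_error(n, BIAS)) > 2e-6
]
print(mismatches)  # []
```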

Versions

PyTorch version: 2.10.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0
Python version: 3.11.15 | packaged by conda-forge | (main, Mar  5 2026, 16:45:40) [GCC 14.3.0] (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 12.8.93
GPU models and configuration: GPU 0: NVIDIA TITAN V
Nvidia driver version: 570.195.03
triton: 3.6.0

cc @pragupta @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

The most likely fix is for inductor to emit mask=xmask on tl.atomic_add instead of mask=None, so that out-of-bounds threads no longer add the broadcast scalar to the output. Hand-editing the generated Triton kernel in the same way is only a local workaround.

Guidance

Locate the tl.atomic_add call in the generated Triton kernel and change its mask argument from None to xmask, so that only in-bounds threads contribute to the sum.
Verify the change by re-running the minimal reproducer and checking that diff is zero for every N, not only for powers of 2.
Until a fix is available, treat any non-power-of-2 N as suspect. The error is zero only when N is a power of 2, so comparing the compiled output against eager mode is the reliable way to detect it.
Review the inductor codegen to determine why it emits mask=None instead of mask=xmask for this fusion; the permanent fix belongs there, for example as a pull request against the issue.
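
The verification step can be wrapped in a small helper that reports which sizes disagree between eager and compiled execution. This is a sketch built from the reproducer: find_bad_sizes is a hypothetical name, the atol of 1e-4 is an assumption chosen to sit well below the smallest reported error (one bias, about 0.02), and the helper returns None when PyTorch or CUDA is unavailable instead of failing.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def f(x, bias, idx):
    src = x + bias  # scalar broadcast fused with scatter by inductor
    out = torch.zeros(1, 1, device=x.device, dtype=x.dtype)
    return out.scatter_reduce(
        0, idx.unsqueeze(-1).expand_as(src), src, "sum", include_self=True
    )


def find_bad_sizes(sizes, bias_value=0.020368, atol=1e-4):
    """Return the sizes N where compiled and eager outputs differ.

    Returns None when PyTorch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    device = torch.device("cuda")
    bias = torch.tensor([bias_value], device=device)
    bad = []
    for n in sizes:
        x = torch.randn(n, 1, device=device)
        idx = torch.zeros(n, dtype=torch.long, device=device)
        ref = f(x, bias, idx)  # eager reference
        torch.compiler.reset()  # force a fresh compile for each size
        compiled = torch.compile(f, backend="inductor", fullgraph=True, dynamic=False)
        test = compiled(x, bias, idx)
        if not torch.allclose(test, ref, atol=atol, rtol=0.0):
            bad.append(n)
    return bad


if __name__ == "__main__":
    print(find_bad_sizes(range(1, 65)))
```

On an affected build the returned list would be expected to contain the non-power-of-2 sizes; an empty list suggests the mask is being applied.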

Example

tl.atomic_add(out_ptr0 + ..., tmp2, xmask, sem='relaxed')  # Fix: use xmask instead of None

Notes

The fix assumes that xmask is defined in the generated Triton kernel. In the trimmed kernel above it is, since the first tl.load already uses it. If a kernel does not define it, additional changes would be needed before the mask can be applied.

Recommendation

Apply the workaround by changing mask=None to mask=xmask on tl.atomic_add in the generated Triton kernel, then re-run the reproducer. This should stop out-of-bounds threads from contributing and give the correct output for all values of N.
