pytorch: [Inductor] CPU backend produces incorrect numerical outputs when combining tensor slices and torch.stack

🐛 Describe the bug

When a function selects a slice of a larger tensor (l_c[..., 16], which yields a non-contiguous view), passes the result through several math operations and a mean reduction, and then combines the results with torch.stack(..., dim=-1), torch.compile with the Inductor backend on CPU produces incorrect numerical results.

The eager and compiled outputs differ by a maximum absolute error of about 0.21, far larger than float32 rounding error can explain.

Error logs

(torch-nightly) xyt19@Oasis:/tmp$ python bug.py
PyTorch Inductor CPU Correctness Repro
----------------------------------------
PyTorch version : 2.13.0.dev20260521+cu130
Result shape    : (4, 8, 3)
Max abs error   : 0.211432158947

Mismatch Details:
Flat Index [33]:
  Eager value    :  0.007949173450
  Compiled value :  0.219381332397

[VERDICT] BUG REPRODUCED: torch.compile produces wrong CPU output.
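
The log reports the mismatch as a flat index, which hides which stacked component is wrong. The sketch below decodes flat index 33 against the reported result shape (4, 8, 3), assuming the row-major layout that flatten() uses and the stacking order [centered, erf_term, centered * erf_term] from the reproduction script. The helper name unravel_index is introduced here for illustration only.

```python
def unravel_index(flat_index, shape):
    """Convert a row-major flat index into one index per dimension."""
    indices = []
    # Peel off the fastest-varying (last) dimension first.
    for size in reversed(shape):
        flat_index, position = divmod(flat_index, size)
        indices.append(position)
    return tuple(reversed(indices))


# Values taken from the error log above.
row, col, slot = unravel_index(33, (4, 8, 3))

# Order of the tensors passed to torch.stack in the reproduction script.
component = ("centered", "erf_term", "centered * erf_term")[slot]
print((row, col, slot), component)  # (1, 3, 0) centered
```

Under those assumptions the worst mismatch sits in the centered component. That is consistent with the script's inputs: beta is all zeros, so erf_term and centered * erf_term should be zero in eager mode, leaving centered as the only component with nontrivial values.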

Reproduction Script

import torch
import torch.nn.functional as F

def fn(l_c, y, alpha, beta):
    x = l_c[..., 16]

    positive = F.softplus(x) + 0.0001
    rates = F.softplus(y) + 0.0001

    gammaln_term = torch.special.gammaln(positive + alpha.view(1, -1))
    erf_term = torch.special.erf(x * beta.view(1, -1))

    xlogy_term = torch.special.xlogy(positive, rates)
    logadd = torch.logaddexp(gammaln_term, xlogy_term)

    centered = logadd - logadd.mean(dim=-1, keepdim=True)

    return torch.stack([centered, erf_term, centered * erf_term], dim=-1)

def main():
    torch.manual_seed(20592561)

    l_c = torch.randn([4, 8, 64], dtype=torch.float32) * 0.1
    y = torch.randn([4, 8], dtype=torch.float32) * 0.1
    alpha = torch.randn([8], dtype=torch.float32) * 0.1
    beta = torch.zeros([8], dtype=torch.float32)

    def get_cloned_args():
        return (
            l_c.detach().clone(),
            y.detach().clone(),
            alpha.detach().clone(),
            beta.detach().clone()
        )

    with torch.no_grad():
        eager_out = fn(*get_cloned_args())

    compiled_fn = torch.compile(fn, backend="inductor")
    with torch.no_grad():
        compiled_out = compiled_fn(*get_cloned_args())

    diff = (eager_out - compiled_out).abs()
    max_abs = diff.max().item()

    print("PyTorch Inductor CPU Correctness Repro")
    print("-" * 40)
    print(f"PyTorch version : {torch.__version__}")
    print(f"Result shape    : {tuple(eager_out.shape)}")
    print(f"Max abs error   : {max_abs:.12f}")

    if max_abs > 1e-3:
        max_idx = diff.argmax().item()
        eager_val = eager_out.flatten()[max_idx].item()
        compiled_val = compiled_out.flatten()[max_idx].item()

        print("\nMismatch Details:")
        print(f"Flat Index [{max_idx}]:")
        print(f"  Eager value    :  {eager_val:.12f}")
        print(f"  Compiled value :  {compiled_val:.12f}")
        print("\n[VERDICT] BUG REPRODUCED: torch.compile produces wrong CPU output.")
    else:
        print("\n[VERDICT] BUG NOT REPRODUCED.")

if __name__ == "__main__":
    main()
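
One way to narrow down where the mismatch is introduced is to run the same function under more than one torch.compile backend and compare each against eager mode. The sketch below is a debugging aid under stated assumptions, not part of the original report: the helper name max_abs_error_per_backend is hypothetical, and it relies on the "aot_eager" backend, which traces with Dynamo and AOTAutograd but skips Inductor code generation. No outcome is claimed here; the point is only that a clean "aot_eager" result alongside a failing "inductor" result would point at Inductor's lowering or generated CPU code.

```python
import torch
import torch._dynamo
import torch.nn.functional as F


def fn(l_c, y, alpha, beta):
    # Same computation as the reproduction script above.
    x = l_c[..., 16]
    positive = F.softplus(x) + 0.0001
    rates = F.softplus(y) + 0.0001
    gammaln_term = torch.special.gammaln(positive + alpha.view(1, -1))
    erf_term = torch.special.erf(x * beta.view(1, -1))
    xlogy_term = torch.special.xlogy(positive, rates)
    logadd = torch.logaddexp(gammaln_term, xlogy_term)
    centered = logadd - logadd.mean(dim=-1, keepdim=True)
    return torch.stack([centered, erf_term, centered * erf_term], dim=-1)


def max_abs_error_per_backend(backends=("aot_eager", "inductor")):
    """Return {backend: max abs difference from eager} for each backend."""
    torch.manual_seed(20592561)
    args = (
        torch.randn([4, 8, 64], dtype=torch.float32) * 0.1,
        torch.randn([4, 8], dtype=torch.float32) * 0.1,
        torch.randn([8], dtype=torch.float32) * 0.1,
        torch.zeros([8], dtype=torch.float32),
    )
    with torch.no_grad():
        reference = fn(*[a.clone() for a in args])

    errors = {}
    for backend in backends:
        # Drop cached compilations so each backend compiles from scratch.
        torch._dynamo.reset()
        compiled_fn = torch.compile(fn, backend=backend)
        with torch.no_grad():
            out = compiled_fn(*[a.clone() for a in args])
        errors[backend] = (reference - out).abs().max().item()
    return errors


if __name__ == "__main__":
    for backend, err in max_abs_error_per_backend().items():
        print(f"{backend:10s} max abs error: {err:.12f}")
```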

Expected Behavior

The values returned by the compiled function should match the eager execution outputs within normal floating-point tolerance.
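
To make "normal floating-point tolerance" concrete, the sketch below applies a closeness criterion of the form |actual - expected| <= atol + rtol * |expected| to the two values from the error log. The tolerances rtol=1.3e-6 and atol=1e-5 are assumed here to be the float32 defaults of torch.testing.assert_close. The helper is_close is a plain-Python stand-in written for illustration, not a PyTorch API.

```python
def is_close(actual, expected, rtol=1.3e-6, atol=1e-5):
    """Check |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values reported at flat index 33 in the error log.
eager_value = 0.007949173450
compiled_value = 0.219381332397

abs_error = abs(compiled_value - eager_value)
allowed = 1e-5 + 1.3e-6 * abs(eager_value)

print(f"abs error : {abs_error:.12f}")
print(f"allowed   : {allowed:.12f}")
print("within tolerance:", is_close(compiled_value, eager_value))  # False
```

With these assumed tolerances the allowed error is on the order of 1e-5, while the observed error is about 0.21, roughly four orders of magnitude larger. The script's own threshold of 1e-3 is looser and is still exceeded.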

Versions

PyTorch version: 2.13.0.dev20260521+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 (main, Mar 11 2026, 17:46:40) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @aditew01 @mruberry @kshitij12345 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo
