pytorch: [Inductor] CPU backend produces incorrect numerical outputs when combining tensor slices and torch.stack

🐛 Describe the bug

torch.compile with the Inductor CPU backend produces incorrect numerical results for a function that indexes into a larger tensor (l_c[..., 16]), passes the result through several math operations (softplus, gammaln, erf, xlogy, logaddexp, and a mean over the last dimension), and then combines the results with torch.stack(..., dim=-1).

The eager and compiled outputs differ by a maximum absolute error of about 0.21, far beyond float32 rounding error.

Error logs

(torch-nightly) xyt19@Oasis:/tmp$ python bug.py
PyTorch Inductor CPU Correctness Repro
----------------------------------------
PyTorch version : 2.13.0.dev20260521+cu130
Result shape    : (4, 8, 3)
Max abs error   : 0.211432158947

Mismatch Details:
Flat Index [33]:
  Eager value    :  0.007949173450
  Compiled value :  0.219381332397

[VERDICT] BUG REPRODUCED: torch.compile produces wrong CPU output.
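
The flat index in the log can be mapped back to a position in the (4, 8, 3) output to see which stacked component holds the largest error. The following is a minimal sketch in plain Python; the helper `unravel` and the `COMPONENTS` labels are illustrative names, not part of the reproduction script, and the mapping assumes the row-major order that `flatten()` uses.

```python
from math import prod

# Shape of the stacked output reported in the log.
SHAPE = (4, 8, 3)
# Order of the tensors passed to torch.stack in fn() (labels are illustrative).
COMPONENTS = ("centered", "erf_term", "centered * erf_term")


def unravel(flat_index, shape):
    """Convert a row-major flat index into a tuple of per-dimension indices."""
    if not 0 <= flat_index < prod(shape):
        raise IndexError(f"flat index {flat_index} out of range for shape {shape}")
    coords = []
    # Peel off the fastest-varying (last) dimension first.
    for size in reversed(shape):
        flat_index, pos = divmod(flat_index, size)
        coords.append(pos)
    return tuple(reversed(coords))


row, col, channel = unravel(33, SHAPE)
print((row, col, channel), COMPONENTS[channel])  # (1, 3, 0) centered
```

Under these assumptions, flat index 33 is element [1, 3, 0], so the largest mismatch sits in the `centered` component of the stack.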

Reproduction Script

import torch
import torch.nn.functional as F

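def fn(l_c, y, alpha, beta):
    # Select index 16 along the last dim: (4, 8, 64) -> (4, 8) with the inputs from main().
    x = l_c[..., 16]

    # Offset the softplus outputs so both tensors are strictly positive.
    positive = F.softplus(x) + 0.0001
    rates = F.softplus(y) + 0.0001

    # alpha and beta have shape (8,); view(1, -1) broadcasts them across the rows.
    gammaln_term = torch.special.gammaln(positive + alpha.view(1, -1))
    erf_term = torch.special.erf(x * beta.view(1, -1))

    # xlogy(a, b) computes a * log(b).
    xlogy_term = torch.special.xlogy(positive, rates)
    logadd = torch.logaddexp(gammaln_term, xlogy_term)

    # Center each row by subtracting its mean over the last dim.
    centered = logadd - logadd.mean(dim=-1, keepdim=True)

    # Stack three (4, 8) tensors along a new last dim -> (4, 8, 3).
    return torch.stack([centered, erf_term, centered * erf_term], dim=-1)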

def main():
    torch.manual_seed(20592561)

    l_c = torch.randn([4, 8, 64], dtype=torch.float32) * 0.1
    y = torch.randn([4, 8], dtype=torch.float32) * 0.1
    alpha = torch.randn([8], dtype=torch.float32) * 0.1
    beta = torch.zeros([8], dtype=torch.float32)

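    # Each run gets its own copies of the inputs.
    def get_cloned_args():
        return (
            l_c.detach().clone(),
            y.detach().clone(),
            alpha.detach().clone(),
            beta.detach().clone()
        )

    # Reference result from eager mode.
    with torch.no_grad():
        eager_out = fn(*get_cloned_args())

    # Same function compiled with Inductor; all inputs are CPU tensors.
    compiled_fn = torch.compile(fn, backend="inductor")
    with torch.no_grad():
        compiled_out = compiled_fn(*get_cloned_args())

    # Element-wise absolute error between the eager and compiled outputs.
    diff = (eager_out - compiled_out).abs()
    max_abs = diff.max().item()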

    print("PyTorch Inductor CPU Correctness Repro")
    print("-" * 40)
    print(f"PyTorch version : {torch.__version__}")
    print(f"Result shape    : {tuple(eager_out.shape)}")
    print(f"Max abs error   : {max_abs:.12f}")

    if max_abs > 1e-3:
        max_idx = diff.argmax().item()
        eager_val = eager_out.flatten()[max_idx].item()
        compiled_val = compiled_out.flatten()[max_idx].item()

        print("\nMismatch Details:")
        print(f"Flat Index [{max_idx}]:")
        print(f"  Eager value    :  {eager_val:.12f}")
        print(f"  Compiled value :  {compiled_val:.12f}")
        print("\n[VERDICT] BUG REPRODUCED: torch.compile produces wrong CPU output.")
    else:
        print("\n[VERDICT] BUG NOT REPRODUCED.")

if __name__ == "__main__":
    main()

Expected Behavior

The compiled function should return the same values as eager execution, within normal float32 tolerance.
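
To make "normal tolerance" concrete, here is a minimal sketch of a combined relative and absolute closeness check applied to the two values from the error log. The tolerances are an assumption: they mirror the float32 defaults of `torch.testing.assert_close` (rtol=1.3e-6, atol=1e-5), and `is_close` is a hypothetical helper written in plain Python so the check can be read without PyTorch.

```python
def is_close(actual, expected, rtol=1.3e-6, atol=1e-5):
    """Return True if |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values copied from the error log (flat index 33).
eager_value = 0.007949173450
compiled_value = 0.219381332397

# The absolute error is several orders of magnitude above the allowed bound.
print(is_close(compiled_value, eager_value))  # False
print(abs(compiled_value - eager_value))
```

In the reproduction script itself, the same kind of check could be written as `torch.testing.assert_close(compiled_out, eager_out)`, which would raise on this mismatch.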

Versions

PyTorch version: 2.13.0.dev20260521+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 (main, Mar 11 2026, 17:46:40) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @aditew01 @mruberry @kshitij12345 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo
