pytorch - [Inductor][CUDA][bf16] torch.add with a 0-dim float64 scalar returns nonzero where eager returns zero

🐛 Describe the bug

Eager mode and torch.compile(backend="inductor") return different values when a multi-dimensional bfloat16 CUDA tensor is added to a 0-dimensional float64 tensor.

In Eager mode, adding 0.1 (stored in the bfloat16 tensor) and -0.1 (stored in a 0-dim float64 tensor) yields exactly 0.0. With Inductor, the same element is 9.7752e-05. Both modes return a bfloat16 output and identical gradients, so only the forward value differs. This suggests the two modes diverge in how they apply type promotion or precision casting to 0-dimensional scalar tensors.
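The bfloat16 output dtype seen in the logs follows from eager type-promotion rules, which can be checked in isolation. The snippet below is a minimal sketch, assuming a CPU run is enough to illustrate the rule (the report itself uses CUDA); it inspects dtypes only and makes no claim about the computed values.

```python
import torch

# A dimensioned bfloat16 tensor and a 0-dim float64 tensor, as in the repro.
# Built on CPU here, which is an assumption made to keep the sketch portable.
x = torch.tensor([[0.0, 0.0, 0.1], [-0.2, 1.0, -1.0]], dtype=torch.bfloat16)
s = torch.tensor(-0.1, dtype=torch.float64)

# A 0-dim tensor in the same category (floating point) does not widen a
# dimensioned tensor, so the promoted dtype stays bfloat16.
promoted = torch.result_type(x, s)
out_dtype = torch.add(x, s).dtype

print("result_type:", promoted)
print("add dtype:  ", out_dtype)
```

Because the result dtype is bfloat16 either way, the open question is at what precision the scalar operand enters the addition, not which dtype the output has.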

Expected vs Actual behavior

Expected behavior (Eager): For the element [0, 2], the result of 0.1 + (-0.1) evaluates to 0.0 exactly.

Actual behavior (Inductor): For the element [0, 2], the result evaluates to 9.775161743164062e-05. As a result, torch.equal and torch.allclose(rtol=0, atol=0) both return False when comparing the two outputs.
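Both numbers can be reproduced with plain arithmetic, without a GPU. The sketch below emulates bfloat16 rounding in pure Python and compares two orderings of the rounding step. It is a hypothesis that matches the printed values, not a confirmed root cause: `to_bf16` is an illustrative helper, and the labels "scalar rounded first" and "result rounded last" are assumptions about what each mode might be doing.

```python
import struct


def to_bf16(value: float) -> float:
    """Round to the nearest bfloat16 value (via float32, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 is the upper 16 bits of the float32 encoding.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


x = to_bf16(0.1)  # value actually stored in the bfloat16 input: 0.10009765625
s = -0.1          # the float64 scalar

# Hypothesis A: the scalar is rounded to bfloat16 before the add.
scalar_rounded_first = to_bf16(x + to_bf16(s))

# Hypothesis B: the add runs at higher precision; only the result is rounded.
result_rounded_last = to_bf16(x + s)

print("scalar rounded first:", scalar_rounded_first)
print("result rounded last: ", result_rounded_last)
```

Hypothesis A gives 0.0, matching the eager output, and hypothesis B gives 9.775161743164062e-05, matching the Inductor output. Under this reading, the nonzero value is the gap between 0.1 rounded to bfloat16 and the unrounded scalar, rounded once to bfloat16.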

Repro script

import torch

assert torch.cuda.is_available()

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)

device = "cuda"


def fn(x, s):
    return torch.add(x, s)


def make_inputs():
    x = torch.tensor(
        [
            [0.0, 0.0, 0.1],
            [-0.2, 1.0, -1.0],
        ],
        device=device,
        dtype=torch.bfloat16,
        requires_grad=True,
    )

    s = torch.tensor(-0.1, device=device, dtype=torch.float64)

    return x, s


def run(label, f):
    x, s = make_inputs()

    y = f(x, s)
    loss = y.sum()
    loss.backward()

    print(f"\n=== {label} ===")
    print("output dtype:", y.dtype)
    print("output:")
    print(y)
    print("zero mask:")
    print(y == 0)
    print("grad:")
    print(x.grad)

    return y.detach().clone(), x.grad.detach().clone()


y_eager, g_eager = run("eager", fn)

compiled_inductor = torch.compile(fn, backend="inductor")
y_inductor, g_inductor = run("inductor", compiled_inductor)

print("\n=== comparison ===")
print("torch.equal(output):", torch.equal(y_eager, y_inductor))
print("torch.allclose(output):", torch.allclose(y_eager, y_inductor, rtol=0, atol=0))
print("max abs diff:", (y_eager.float() - y_inductor.float()).abs().max())

print("torch.equal(grad):", torch.equal(g_eager, g_inductor))
print("grad max abs diff:", (g_eager.float() - g_inductor.float()).abs().max())

print("\ninteresting element [0, 2]:")
print("eager:   ", y_eager[0, 2].item(), "zero:", bool(y_eager[0, 2] == 0))
print("inductor:", y_inductor[0, 2].item(), "zero:", bool(y_inductor[0, 2] == 0))

Error logs (Output of the script)

(torch-nightly) xyt19@Oasis:/tmp$ python bug.py
torch: 2.13.0.dev20260521+cu130
cuda: 13.0

=== eager ===
output dtype: torch.bfloat16
output:
tensor([[-0.1001, -0.1001,  0.0000],
        [-0.3008,  0.8984, -1.1016]], device='cuda:0', dtype=torch.bfloat16,
       grad_fn=<AddBackward0>)
zero mask:
tensor([[False, False,  True],
        [False, False, False]], device='cuda:0')
grad:
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0', dtype=torch.bfloat16)

=== inductor ===
output dtype: torch.bfloat16
output:
tensor([[-1.0010e-01, -1.0010e-01,  9.7752e-05],
        [-3.0078e-01,  8.9844e-01, -1.1016e+00]], device='cuda:0',
       dtype=torch.bfloat16, grad_fn=<CompiledFunctionBackward>)
zero mask:
tensor([[False, False, False],
        [False, False, False]], device='cuda:0')
grad:
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0', dtype=torch.bfloat16)

=== comparison ===
torch.equal(output): False
torch.allclose(output): False
max abs diff: tensor(9.7752e-05, device='cuda:0')
torch.equal(grad): True
grad max abs diff: tensor(0., device='cuda:0')

interesting element [0, 2]:
eager:    0.0 zero: True
inductor: 9.775161743164062e-05 zero: False

Versions

PyTorch version: 2.13.0.dev20260521+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 (main, Mar 11 2026, 17:46:40) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.0.140
Nvidia driver version: 596.49
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_tensor_ir.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.21.1
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.13.0.dev20260521+cu130
[pip3] torchaudio==2.11.0.dev20260525+cu130
[pip3] torchvision==0.28.0.dev20260525+cu130
[pip3] triton==3.7.0+git88b227e2
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas 13.1.1.3 pypi_0 pypi
[conda] nvidia-cuda-cupti 13.0.85 pypi_0 pypi
[conda] nvidia-cuda-nvrtc 13.0.88 pypi_0 pypi
[conda] nvidia-cuda-runtime 13.0.96 pypi_0 pypi
[conda] nvidia-cudnn-cu13 9.20.0.48 pypi_0 pypi
[conda] nvidia-cufft 12.0.0.61 pypi_0 pypi
[conda] nvidia-curand 10.4.0.35 pypi_0 pypi
[conda] nvidia-cusolver 12.0.4.66 pypi_0 pypi
[conda] nvidia-cusparse 12.6.3.3 pypi_0 pypi
[conda] nvidia-cusparselt-cu13 0.8.1 pypi_0 pypi
[conda] nvidia-nccl-cu13 2.29.7 pypi_0 pypi
[conda] nvidia-nvjitlink 13.0.88 pypi_0 pypi
[conda] nvidia-nvtx 13.0.85 pypi_0 pypi
[conda] torch 2.13.0.dev20260521+cu130 pypi_0 pypi
[conda] torchaudio 2.11.0.dev20260525+cu130 pypi_0 pypi
[conda] torchvision 0.28.0.dev20260525+cu130 pypi_0 pypi
[conda] triton 3.7.0+git88b227e2 pypi_0 pypi

cc @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @nairbv @mruberry @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
