pytorch: [Inductor][CUDA][bf16] `F.celu` returns `+0.0` where eager returns a negative bf16 subnormal



🐛 Describe the bug

There is a behavior mismatch between Eager mode and torch.compile (Inductor backend) when applying torch.nn.functional.celu to negative subnormal values in bfloat16.

For a negative bfloat16 subnormal input (e.g., -9.183549615799121e-41):

  • Eager mode returns the subnormal value and preserves the negative sign bit.
  • Inductor returns exactly 0.0 and the sign bit becomes positive.

Both modes compute the correct gradient (1.0), but the forward output diverges.
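To make the input concrete, here is a minimal sketch in plain Python (standard library only) that decodes the bfloat16 bit pattern of the value above and evaluates the textbook CELU definition at that point. The helper names `float_to_bf16_bits`, `describe_bf16`, and `celu_reference` are hypothetical and exist only for illustration. The sketch assumes the value is exactly representable in bfloat16, which holds here, and it says nothing about how eager mode or Inductor computes CELU internally.

```python
import math
import struct

def float_to_bf16_bits(value: float) -> int:
    """Bit pattern of a value that is exactly representable in bfloat16."""
    # bfloat16 is the upper 16 bits of the IEEE-754 float32 encoding.
    f32_bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return f32_bits >> 16

def describe_bf16(bits: int) -> dict:
    """Split a bfloat16 pattern into sign (1 bit), exponent (8 bits), mantissa (7 bits)."""
    exponent = (bits >> 7) & 0xFF
    mantissa = bits & 0x7F
    return {
        "sign": bits >> 15,
        "exponent": exponent,
        "mantissa": mantissa,
        # Subnormal: all-zero exponent field with a non-zero mantissa.
        "subnormal": exponent == 0 and mantissa != 0,
    }

def celu_reference(x: float, alpha: float = 1.0) -> float:
    """CELU(x) = max(0, x) + min(0, alpha * (exp(x / alpha) - 1)), in double precision."""
    # expm1 avoids the cancellation that exp(x) - 1 suffers for tiny |x|.
    return max(0.0, x) + min(0.0, alpha * math.expm1(x / alpha))

x_val = -9.183549615799121e-41
bits = float_to_bf16_bits(x_val)
info = describe_bf16(bits)
y_ref = celu_reference(x_val)

print(hex(bits), info)
print("reference CELU:", y_ref)
```

The decoded fields (sign set, exponent field zero, mantissa 1) identify the input as the smallest-magnitude negative bfloat16 subnormal. For such a tiny input the reference value is negative and essentially equal to the input, which is consistent with the eager result reported in this issue.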

To Reproduce

Steps to reproduce the behavior:

import torch
import torch.nn.functional as F

assert torch.cuda.is_available()

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)

def f(x):
    return F.celu(x)  # alpha=1.0

def dump(name, t):
    torch.cuda.synchronize()
    tf = t.detach().float().cpu()
    print(
        f"{name}:",
        "value =", tf.tolist(),
        "zero =", (t.detach() == 0).cpu().tolist(),
        "signbit =", torch.signbit(tf).tolist(),
    )

x_val = -9.183549615799121e-41

x_eager = torch.tensor([x_val], device="cuda", dtype=torch.bfloat16, requires_grad=True)
x_comp = x_eager.detach().clone().requires_grad_(True)

compiled_f = torch.compile(f, backend="inductor", fullgraph=True)

y_eager = f(x_eager)
y_comp = compiled_f(x_comp)

dump("input", x_eager)
dump("eager output", y_eager)
dump("inductor output", y_comp)

y_eager.sum().backward()
y_comp.sum().backward()

dump("eager grad", x_eager.grad)
dump("inductor grad", x_comp.grad)

print("torch.equal(output):", torch.equal(y_eager, y_comp))

assert torch.equal(y_eager, y_comp), "BUG: Inductor changed negative bf16 subnormal CELU output to zero"

Expected behavior

Inductor output should match eager mode output.
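When checking this kind of parity, note that `torch.equal` compares by value, so it treats `-0.0` and `+0.0` as equal. It catches the subnormal-to-zero change reported here, but it would not catch a change that affects only the sign of a zero. The sketch below compares raw bit patterns instead. The helper `bitwise_equal` is a hypothetical name, not a PyTorch API, and the example runs on CPU tensors so that it does not need a GPU.

```python
import torch

def bitwise_equal(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Compare two bfloat16 tensors by raw 16-bit pattern (hypothetical helper)."""
    assert a.dtype == torch.bfloat16 and b.dtype == torch.bfloat16
    # Reinterpret each bf16 element as int16 so the sign bit and
    # subnormal mantissa bits take part in the comparison.
    return torch.equal(a.detach().view(torch.int16), b.detach().view(torch.int16))

pos_zero = torch.tensor([0.0], dtype=torch.bfloat16)
neg_zero = torch.tensor([-0.0], dtype=torch.bfloat16)
# 0x8001 as a signed int16 is -32767: the smallest-magnitude negative bf16 subnormal.
neg_subnormal = torch.tensor([-32767], dtype=torch.int16).view(torch.bfloat16)

print("torch.equal(-0.0, +0.0):  ", torch.equal(neg_zero, pos_zero))
print("bitwise_equal(-0.0, +0.0):", bitwise_equal(neg_zero, pos_zero))
print("bitwise_equal(sub, +0.0): ", bitwise_equal(neg_subnormal, pos_zero))
```

A bitwise check is stricter than value equality: NaNs with different payloads also compare unequal, so it suits targeted edge-case tests like this one more than general tolerance-based comparisons.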

Actual Output

(torch-nightly) xyt19@Oasis:/tmp$ python bug.py
torch: 2.13.0.dev20260521+cu130
cuda: 13.0
input: value = [-9.183549615799121e-41] zero = [False] signbit = [True]
eager output: value = [-9.183549615799121e-41] zero = [False] signbit = [True]
inductor output: value = [0.0] zero = [True] signbit = [False]
eager grad: value = [1.0] zero = [False] signbit = [False]
inductor grad: value = [1.0] zero = [False] signbit = [False]
torch.equal(output): False
Traceback (most recent call last):
  File "/tmp/bug.py", line 46, in <module>
    assert torch.equal(y_eager, y_comp), "BUG: Inductor changed negative bf16 subnormal CELU output to zero"
AssertionError: BUG: Inductor changed negative bf16 subnormal CELU output to zero

Versions

PyTorch version: 2.13.0.dev20260521+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 (main, Mar 11 2026, 17:46:40) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.0.140
Nvidia driver version: 596.49
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_tensor_ir.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.21.1
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.13.0.dev20260521+cu130
[pip3] torchaudio==2.11.0.dev20260525+cu130
[pip3] torchvision==0.28.0.dev20260525+cu130
[pip3] triton==3.7.0+git88b227e2
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas 13.1.1.3 pypi_0 pypi
[conda] nvidia-cuda-cupti 13.0.85 pypi_0 pypi
[conda] nvidia-cuda-nvrtc 13.0.88 pypi_0 pypi
[conda] nvidia-cuda-runtime 13.0.96 pypi_0 pypi
[conda] nvidia-cudnn-cu13 9.20.0.48 pypi_0 pypi
[conda] nvidia-cufft 12.0.0.61 pypi_0 pypi
[conda] nvidia-curand 10.4.0.35 pypi_0 pypi
[conda] nvidia-cusolver 12.0.4.66 pypi_0 pypi
[conda] nvidia-cusparse 12.6.3.3 pypi_0 pypi
[conda] nvidia-cusparselt-cu13 0.8.1 pypi_0 pypi
[conda] nvidia-nccl-cu13 2.29.7 pypi_0 pypi
[conda] nvidia-nvjitlink 13.0.88 pypi_0 pypi
[conda] nvidia-nvtx 13.0.85 pypi_0 pypi
[conda] torch 2.13.0.dev20260521+cu130 pypi_0 pypi
[conda] torchaudio 2.11.0.dev20260525+cu130 pypi_0 pypi
[conda] torchvision 0.28.0.dev20260525+cu130 pypi_0 pypi
[conda] triton 3.7.0+git88b227e2 pypi_0 pypi

cc @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
