pytorch: [Inductor] torch.atan2 yields incorrect result for x=0.0 with float16 on CUDA (loses signed zero)

Error Message

(torch-nightly) xyt19@Oasis:/tmp$ python bug.py
torch: 2.13.0.dev20260521+cu130
cuda: 13.0
x:
tensor([ 0.0000e+00,  5.9605e-08, -5.9605e-08], device='cuda:0',
       dtype=torch.float16)
x bits:
tensor([     0,      1, -32767], device='cuda:0', dtype=torch.int16)

eager:
tensor([ 3.1406,  2.3555, -0.7852], device='cuda:0', dtype=torch.float16)
eager bits:
tensor([ 16968,  16566, -17848], device='cuda:0', dtype=torch.int16)

compiled:
tensor([ 0.0000,  2.3555, -0.7852], device='cuda:0', dtype=torch.float16)
compiled bits:
tensor([     0,  16566, -17848], device='cuda:0', dtype=torch.int16)

max abs diff: tensor(3.1406, device='cuda:0')
Traceback (most recent call last):
  File "/tmp/bug.py", line 38, in <module>
    assert torch.equal(eager, compiled), "Inductor output differs from eager"
AssertionError: Inductor output differs from eager

Code Example

import torch

assert torch.cuda.is_available()

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)

TINY = 5.960464477539063e-08  # 2**-24, the smallest positive float16 subnormal

def f(x):
    return torch.atan2(x, -x)

x = torch.tensor([0.0, TINY, -TINY], device="cuda", dtype=torch.float16)

compiled_f = torch.compile(f, backend="inductor", fullgraph=True)

eager = f(x)
compiled = compiled_f(x)
torch.cuda.synchronize()

print("x:")
print(x)
print("x bits:")
print(x.view(torch.int16))

print("\neager:")
print(eager)
print("eager bits:")
print(eager.view(torch.int16))

print("\ncompiled:")
print(compiled)
print("compiled bits:")
print(compiled.view(torch.int16))

print("\nmax abs diff:", (eager.float() - compiled.float()).abs().max())

assert torch.equal(eager, compiled), "Inductor output differs from eager"

🐛 Describe the bug

For torch.atan2(x, -x) with x = 0.0 in torch.float16 on CUDA, torch.compile (Inductor) returns a different result from eager mode. The two subnormal inputs in the repro (±5.9605e-08) match between the two modes; only the zero element differs.

Eager mode respects the sign of -0.0 and returns 3.1406 (the float16 value nearest $\pi$), which is the expected result of atan2(+0.0, -0.0). torch.compile returns 0.0000, which is what atan2(+0.0, +0.0) would give. This suggests that either the signed zero is lost during compiler optimizations (e.g., -x for x = 0.0 is evaluated as 0.0 instead of -0.0), or the underlying Triton atan2 implementation mishandles signed zeros for float16.
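As an independent cross-check of which output is the expected one, here is a minimal CPU-only sketch using only the Python standard library (no torch, no CUDA). It assumes that math.atan2 follows the usual C99/IEEE 754 signed-zero rules and that rounding the double-precision result to float16 is an acceptable stand-in for a correctly rounded float16 atan2. The helper name f16_bits is hypothetical and is not part of the original repro.

```python
import math
import struct


def f16_bits(value: float) -> int:
    """Round a Python float to float16 and return its bit pattern as a
    signed 16-bit integer, mirroring tensor.view(torch.int16)."""
    return struct.unpack("<h", struct.pack("<e", value))[0]


TINY = 2.0 ** -24  # smallest positive float16 subnormal, as in the repro
inputs = [0.0, TINY, -TINY]

# Reference for atan2(x, -x). Negating 0.0 yields -0.0 in Python, so the
# first element evaluates atan2(+0.0, -0.0), which is pi.
reference = [math.atan2(x, -x) for x in inputs]
reference_bits = [f16_bits(r) for r in reference]

# What the result looks like if the sign of zero is dropped before atan2:
# atan2(+0.0, +0.0) is +0.0.
lost_sign_bits = f16_bits(math.atan2(0.0, 0.0))

print("input bits:         ", [f16_bits(x) for x in inputs])
print("reference bits:     ", reference_bits)
print("lost-sign zero bits:", lost_sign_bits)
```

Under these assumptions the reference bits agree with the eager output in the log, and the lost-sign value agrees with the compiled output for the zero element. That is consistent with the sign of -0.0 being dropped somewhere in the compiled path, but it does not by itself tell the two hypotheses above apart.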

Versions

PyTorch version: 2.13.0.dev20260521+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 (main, Mar 11 2026, 17:46:40) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.0.140
Nvidia driver version: 596.49
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_tensor_ir.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.21.1
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.13.0.dev20260521+cu130
[pip3] torchaudio==2.11.0.dev20260525+cu130
[pip3] torchvision==0.28.0.dev20260525+cu130
[pip3] triton==3.7.0+git88b227e2
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas 13.1.1.3 pypi_0 pypi
[conda] nvidia-cuda-cupti 13.0.85 pypi_0 pypi
[conda] nvidia-cuda-nvrtc 13.0.88 pypi_0 pypi
[conda] nvidia-cuda-runtime 13.0.96 pypi_0 pypi
[conda] nvidia-cudnn-cu13 9.20.0.48 pypi_0 pypi
[conda] nvidia-cufft 12.0.0.61 pypi_0 pypi
[conda] nvidia-curand 10.4.0.35 pypi_0 pypi
[conda] nvidia-cusolver 12.0.4.66 pypi_0 pypi
[conda] nvidia-cusparse 12.6.3.3 pypi_0 pypi
[conda] nvidia-cusparselt-cu13 0.8.1 pypi_0 pypi
[conda] nvidia-nccl-cu13 2.29.7 pypi_0 pypi
[conda] nvidia-nvjitlink 13.0.88 pypi_0 pypi
[conda] nvidia-nvtx 13.0.85 pypi_0 pypi
[conda] torch 2.13.0.dev20260521+cu130 pypi_0 pypi
[conda] torchaudio 2.11.0.dev20260525+cu130 pypi_0 pypi
[conda] torchvision 0.28.0.dev20260525+cu130 pypi_0 pypi
[conda] triton 3.7.0+git88b227e2 pypi_0 pypi

cc @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
