`F.normalize` returns ~1e12 gradient at zero-vector input instead of NaN, while the forward pass already returns NaN

🐛 Describe the bug

torch.nn.functional.normalize computes x / ‖x‖. At x = 0, the norm is 0 and the result is mathematically undefined. PyTorch's own autograd documentation states that gradients at inputs outside a function's domain should be NaN, but F.normalize instead returns a finite gradient of magnitude ~1e12, which can silently corrupt gradient-based training.

Minimal reproducer

import torch
import torch.nn.functional as F

x = torch.zeros(3, requires_grad=True)
y = F.normalize(x, dim=0)
y.sum().backward()

print(f"input:    {x.data}")
print(f"output:   {y.data}")
print(f"gradient: {x.grad}")
print(f"grad finite: {x.grad.isfinite().all().item()}")
print(f"grad max:    {x.grad.abs().max().item():.3e}")

Observed output

input:    tensor([0., 0., 0.])
output:   tensor([nan, nan, nan])
gradient: tensor([1.0000e+12, 1.0000e+12, 1.0000e+12])
grad finite: True
grad max:    1.000e+12

Expected output

According to PyTorch's autograd documentation, the gradient at an undefined input should be NaN:

input:    tensor([0., 0., 0.])
output:   tensor([nan, nan, nan])
gradient: tensor([nan, nan, nan])
grad finite: False

Why this is a bug

1. PyTorch's own autograd documentation specifies NaN for undefined inputs

From the PyTorch autograd documentation:

"If the input is not in the domain of the function, the output is also not meaningful and we return NaN gradients in this case."

F.normalize(zeros, dim=0) computes 0 / ‖0‖ = 0/0, which is undefined (NaN). The forward output is already nan. The gradient should also be nan, not a finite number.
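For comparison, here is a minimal sketch of an operator that does return a NaN gradient outside its domain. Using torch.sqrt at -1 is an illustrative choice of mine, not an example taken from this report; it only shows the kind of behavior the quoted sentence describes.

```python
import torch

# Out-of-domain input: sqrt is undefined for negative reals.
x = torch.tensor(-1.0, requires_grad=True)
y = torch.sqrt(x)
y.backward()

forward_is_nan = torch.isnan(y).item()
grad_is_nan = torch.isnan(x.grad).item()
print(forward_is_nan, grad_is_nan)
```

Here both the forward value and the gradient are NaN, so an ordinary torch.isnan check on either one would flag the problem.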

2. The forward output is NaN, but the gradient is finite — a dangerous inconsistency

The forward pass returns nan, yet the backward pass returns a finite value, so any parameter update that uses this gradient is corrupted without warning. A gradient of 1e12 is effectively a silent numerical explosion: torch.isfinite(grad) returns True, so the usual guards (torch.isnan, torch.isinf) do not fire.

import torch
import torch.nn.functional as F

x = torch.zeros(3, requires_grad=True)
y = F.normalize(x, dim=0)
y.sum().backward()

# Standard NaN/Inf guards: neither one catches the bug
print(torch.isnan(x.grad).any())   # False: misses it
print(torch.isinf(x.grad).any())   # False: misses it
print(x.grad.abs().max())          # 1e12: silent explosion
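A guard that also looks at gradient magnitude would catch this case. The sketch below is one possible shape for such a check. The helper name `find_exploding_grads` and the `max_abs=1e6` threshold are hypothetical choices for illustration, not PyTorch APIs or recommended values. The 1e12 gradient is assigned by hand as a stand-in for the value reported above.

```python
import torch

def find_exploding_grads(named_params, max_abs=1e6):
    """Return (name, peak) for gradients that are non-finite or larger than max_abs.

    Hypothetical helper; `max_abs` is an assumed threshold to tune per model.
    """
    flagged = []
    for name, p in named_params:
        if p.grad is None:
            continue
        g = p.grad.detach()
        peak = g.abs().max().item() if g.numel() > 0 else 0.0
        all_finite = bool(torch.isfinite(g).all())
        if not all_finite or peak > max_abs:
            flagged.append((name, peak))
    return flagged

# Stand-ins: one parameter with a 1e12 gradient, one healthy, one with NaN.
w = torch.zeros(3, requires_grad=True)
w.grad = torch.full((3,), 1e12)
ok = torch.ones(3, requires_grad=True)
ok.grad = torch.ones(3)
bad = torch.ones(3, requires_grad=True)
bad.grad = torch.full((3,), float("nan"))

flagged = find_exploding_grads([("w", w), ("ok", ok), ("bad", bad)])
flagged_names = [name for name, _ in flagged]
print(flagged_names)
```

In a training loop the same function could be called on `model.named_parameters()` after `backward()` and before the optimizer step.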

3. The ~1e12 value is an artifact of epsilon clamping, not a mathematical result

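Internally, F.normalize clamps the norm to eps (default 1e-12) to avoid division by zero. Whenever ‖x‖ < eps the denominator is the constant eps, so each output element is x_i / eps and its derivative with respect to x_i is:

grad ≈ 1 / eps = 1 / 1e-12 = 1e12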

This is an implementation artifact. The mathematically correct answer for an undefined input is NaN, not 1/eps.
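The origin of the 1/eps value can be checked by writing the clamped formula out by hand. This is a minimal sketch that assumes the documented definition x / max(‖x‖, eps), not a copy of the library's internals. Under this formula the forward value at zero is 0 / eps = 0, so the sketch only illustrates where the gradient comes from and does not attempt to reproduce the nan forward output listed above.

```python
import torch

eps = 1e-12  # default eps of F.normalize

x = torch.zeros(3, requires_grad=True)

# Assumed formula: x / max(||x||, eps). With ||x|| = 0 the clamp is active,
# so the denominator is the constant eps and carries no gradient back to x.
denom = x.norm(p=2, dim=0, keepdim=True).clamp_min(eps)
y = x / denom
y.sum().backward()

expected = torch.full_like(x, 1.0 / eps)
matches_one_over_eps = torch.allclose(x.grad, expected)
print(x.grad)
print(matches_one_over_eps)
```

Passing a different eps to the hand-written formula changes the gradient in proportion, which is what one would expect if the value is set by the clamp rather than by the mathematics of normalization.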

4. Reproducible with batch inputs

import torch
import torch.nn.functional as F

# Batch with one zero vector mixed in
x = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],   # zero vector
                  [0.0, 1.0, 0.0]], requires_grad=True)
y = F.normalize(x, dim=1)
y.sum().backward()

for i in range(3):
    g = x.grad[i]
    print(f"row {i}: norm={x.data[i].norm().item():.1f}  grad={g.tolist()}  max_abs={g.abs().max().item():.3e}")

Observed output

row 0: norm=1.0  grad=[0.0, 1.0, 1.0]          max_abs=1.000e+00
row 1: norm=0.0  grad=[1e12, 1e12, 1e12]        max_abs=1.000e+12   ✗
row 2: norm=1.0  grad=[1.0, 0.0, 1.0]           max_abs=1.000e+00

A single zero row silently injects a gradient 12 orders of magnitude larger than all other rows.

5. Practical impact

This pattern arises naturally in:

- Embedding layers at initialization, when an embedding vector is all-zeros
- L2-normalization layers in metric learning, when a representation collapses to zero
- Gradient penalty terms in GANs, when a sample happens to have zero norm

In all these cases, the ~1e12 gradient causes a single large weight update that can destabilize an otherwise healthy training run, and standard NaN/Inf guards will not catch it.
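Until the behavior is settled, one possible user-side mitigation is to mask degenerate rows explicitly. The sketch below assumes that a zero gradient for zero-norm rows is acceptable in your setting, which is a different choice from the NaN argued for above. `normalize_zero_safe` is a hypothetical helper, not part of torch.nn.functional. It uses torch.where on both the input and the output so that no 1/eps factor enters the graph.

```python
import torch

def normalize_zero_safe(x, dim=-1, eps=1e-12):
    """L2-normalize along `dim`; rows with norm <= eps get zero output and zero grad.

    Hypothetical helper for illustration only.
    """
    with torch.no_grad():
        is_zero = x.norm(p=2, dim=dim, keepdim=True) <= eps
    # Replace degenerate rows with ones before taking the norm, so the norm
    # used in the division is never zero and never clamped to eps.
    safe_x = torch.where(is_zero, torch.ones_like(x), x)
    norm = safe_x.norm(p=2, dim=dim, keepdim=True)
    y = x / norm
    # Mask the output so degenerate rows receive no gradient at all.
    return torch.where(is_zero, torch.zeros_like(y), y)

# Same batch as in the example above, with one zero row.
x = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]], requires_grad=True)
y = normalize_zero_safe(x, dim=1)
y.sum().backward()
print(x.grad)
```

With this helper the zero row contributes neither a 1e12 spike nor a NaN. If NaN propagation is preferred, the final torch.where could fill with NaN instead, at the cost of having to handle NaN downstream.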

Summary table

| Behavior | Observed | Expected (per docs) |
| --- | --- | --- |
| Forward output at `x=0` | `nan` (correct) | `nan` |
| Gradient at `x=0` | `1e12` (finite) | `nan` |
| `torch.isnan(grad)` | `False` | `True` |
| `torch.isinf(grad)` | `False` | `False` (NaN is not Inf) |
| Standard NaN guard catches it? | No | Yes |

Versions

PyTorch version: 2.13.0.dev20260512+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-37-generic-x86_64-with-glibc2.39
Is CUDA available: True
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 590.48.01

[pip3] numpy==2.4.4
[pip3] torch==2.13.0.dev20260512+cu130
[pip3] triton==3.7.0+git88b227e2

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93
