pytorch: `F.normalize` returns ~1e12 gradient at zero-vector input instead of NaN, while the forward pass already returns NaN


🐛 Describe the bug

torch.nn.functional.normalize computes x / ‖x‖. At x = 0, the norm is 0 and the result is mathematically undefined. PyTorch's own autograd documentation states that undefined-input gradients should be NaN, but F.normalize instead returns a finite gradient of magnitude ~1e12, which silently corrupts gradient-based training.

Minimal reproducer

import torch
import torch.nn.functional as F

x = torch.zeros(3, requires_grad=True)
y = F.normalize(x, dim=0)
y.sum().backward()

print(f"input:    {x.data}")
print(f"output:   {y.data}")
print(f"gradient: {x.grad}")
print(f"grad finite: {x.grad.isfinite().all().item()}")
print(f"grad max:    {x.grad.abs().max().item():.3e}")

Observed output

input:    tensor([0., 0., 0.])
output:   tensor([nan, nan, nan])
gradient: tensor([1.0000e+12, 1.0000e+12, 1.0000e+12])
grad finite: True
grad max:    1.000e+12

Expected output

According to PyTorch's autograd documentation, the gradient at an undefined input should be NaN:

input:    tensor([0., 0., 0.])
output:   tensor([nan, nan, nan])
gradient: tensor([nan, nan, nan])
grad finite: False

Why this is a bug

1. PyTorch's own autograd documentation specifies NaN for undefined inputs

From the PyTorch autograd documentation:

"If the input is not in the domain of the function, the output is also not meaningful and we return NaN gradients in this case."

F.normalize(zeros, dim=0) computes 0 / ‖0‖ = 0/0, which is undefined (NaN). The forward output is already nan. The gradient should also be nan, not a finite number.
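For comparison, here is a minimal sketch of an op where both the output and the gradient are NaN at an out-of-domain input. It uses `torch.sqrt` at -1 purely as an illustration of the convention quoted above; it says nothing about how `F.normalize` is implemented.

```python
import torch

# torch.sqrt is undefined for negative inputs: the forward result is NaN,
# and the backward pass (grad / (2 * result)) propagates NaN as well.
t = torch.tensor(-1.0, requires_grad=True)
out = torch.sqrt(t)
out.backward()

sqrt_output_is_nan = torch.isnan(out).item()
sqrt_grad_is_nan = torch.isnan(t.grad).item()
print(f"sqrt(-1): output={out.item()}  grad={t.grad.item()}")
```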

2. The forward output is NaN, but the gradient is finite — a dangerous inconsistency

When the forward pass returns nan, any gradient-based update that uses this gradient will be numerically corrupted without warning. A gradient of 1e12 is effectively a silent numerical explosion: torch.isfinite(grad) returns True, so standard NaN-guard checks (torch.isnan, torch.isinf) will not catch it.

x = torch.zeros(3, requires_grad=True)
y = F.normalize(x, dim=0)
y.sum().backward()

# Standard NaN/Inf guard — does NOT catch the bug
print(torch.isnan(x.grad).any())   # False — misses it
print(torch.isinf(x.grad).any())   # False — misses it
print(x.grad.abs().max())          # 1e12  — silent explosion
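A guard that also checks magnitude would flag this case. The sketch below is one possible shape for such a check; `grad_is_suspicious` and its `max_abs` threshold of 1e6 are hypothetical names and values chosen for illustration, not part of PyTorch.

```python
import torch
import torch.nn.functional as F

def grad_is_suspicious(grad: torch.Tensor, max_abs: float = 1e6) -> bool:
    """Hypothetical helper: flag NaN/Inf and finite-but-huge gradients."""
    if not torch.isfinite(grad).all():
        return True
    return bool(grad.abs().max() > max_abs)

# Zero-vector input: the gradient is finite but enormous, so it is flagged.
x = torch.zeros(3, requires_grad=True)
F.normalize(x, dim=0).sum().backward()
flagged_zero = grad_is_suspicious(x.grad)

# Ordinary input: the gradient is of order 1 / norm, so it passes.
x_ok = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
F.normalize(x_ok, dim=0).sum().backward()
flagged_ok = grad_is_suspicious(x_ok.grad)

print(f"zero input flagged: {flagged_zero}, ordinary input flagged: {flagged_ok}")
```

The threshold is model-dependent; a fixed constant like the one above is only a starting point.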

3. The ~1e12 value is an artifact of epsilon clamping, not a mathematical result

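Internally, F.normalize divides by the norm clamped from below to eps (default 1e-12) to avoid division by zero. At x = 0 the denominator is therefore the constant eps, and the backward pass differentiates x / eps: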

grad ≈ 1 / eps = 1 / 1e-12 = 1e12

This is an implementation artifact. The mathematically correct answer for an undefined input is NaN, not 1/eps.
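The 1/eps relationship can be checked by varying the `eps` argument of `F.normalize`. This is a small sketch; `grad_at_zero` is a hypothetical helper, and the eps values are arbitrary.

```python
import torch
import torch.nn.functional as F

def grad_at_zero(eps: float) -> float:
    """Hypothetical helper: largest gradient entry of sum(normalize(0)) for a given eps."""
    x = torch.zeros(3, requires_grad=True)
    F.normalize(x, dim=0, eps=eps).sum().backward()
    return x.grad.abs().max().item()

eps_values = [1e-12, 1e-8, 1e-4]
grads = {eps: grad_at_zero(eps) for eps in eps_values}

for eps, g in grads.items():
    print(f"eps={eps:.0e}  grad max={g:.3e}  1/eps={1 / eps:.3e}")
```

If the gradient tracks 1/eps across these values, it is determined by the clamping constant rather than by the input.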

4. Reproducible with batch inputs

import torch
import torch.nn.functional as F

# Batch with one zero vector mixed in
x = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],   # zero vector
                  [0.0, 1.0, 0.0]], requires_grad=True)
y = F.normalize(x, dim=1)
y.sum().backward()

for i in range(3):
    g = x.grad[i]
    print(f"row {i}: norm={x.data[i].norm().item():.1f}  grad={g.tolist()}  max_abs={g.abs().max().item():.3e}")

Observed output

row 0: norm=1.0  grad=[0.0, 1.0, 1.0]          max_abs=1.000e+00
row 1: norm=0.0  grad=[1e12, 1e12, 1e12]        max_abs=1.000e+12   ✗
row 2: norm=1.0  grad=[1.0, 0.0, 1.0]           max_abs=1.000e+00

A single zero row silently injects a gradient 12 orders of magnitude larger than all other rows.
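Until the behavior changes, one possible user-side workaround is to mask zero-norm rows before normalizing. The sketch below assumes such rows should contribute a zero output and a zero gradient, which is a design choice and not something PyTorch prescribes; `safe_normalize` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

def safe_normalize(x: torch.Tensor, dim: int = -1, eps: float = 1e-12) -> torch.Tensor:
    """Hypothetical helper: zero output and zero gradient for zero-norm slices."""
    is_zero = x.norm(dim=dim, keepdim=True) <= eps
    # Swap zero-norm slices for a placeholder so F.normalize never sees them;
    # torch.where routes no gradient back to x for the replaced entries.
    safe_x = torch.where(is_zero, torch.ones_like(x), x)
    y = F.normalize(safe_x, dim=dim, eps=eps)
    # Overwrite the placeholder results with zeros.
    return torch.where(is_zero, torch.zeros_like(y), y)

x = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],   # zero vector
                  [0.0, 1.0, 0.0]], requires_grad=True)
safe_normalize(x, dim=1).sum().backward()
print(x.grad)
```

Note that this treats any slice with norm at or below eps as zero, so very small but nonzero vectors are also mapped to a zero output.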

5. Practical impact

This pattern arises naturally in:

  • Embedding layers at initialization, when an embedding vector is all-zeros
  • L2-normalization layers in metric learning, when a representation collapses to zero
  • Gradient penalty terms in GANs, when a sample happens to have zero norm

In all these cases, the ~1e12 gradient causes a single large weight update that can destabilize an otherwise healthy training run, and standard NaN/Inf guards will not catch it.
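Gradient-norm clipping is one way to bound the size of such an update. The sketch below uses `torch.nn.utils.clip_grad_norm_` with an arbitrary `max_norm` of 1.0; it limits the magnitude of the step but does not make the direction of the gradient meaningful.

```python
import torch
import torch.nn.functional as F

x = torch.zeros(3, requires_grad=True)
F.normalize(x, dim=0).sum().backward()

# clip_grad_norm_ rescales the gradients in place and returns the total
# norm measured before clipping, which can double as a spike detector.
total_norm = torch.nn.utils.clip_grad_norm_([x], max_norm=1.0)
clipped_norm = x.grad.norm().item()

print(f"norm before clipping: {total_norm.item():.3e}")
print(f"norm after clipping:  {clipped_norm:.3e}")
```

Logging the returned pre-clip norm makes the spike visible even though the NaN/Inf checks stay silent.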


Summary table

| Behavior | Observed | Expected (per docs) |
| --- | --- | --- |
| Forward output at x=0 | nan (correct) | nan |
| Gradient at x=0 | 1e12 (finite) | nan |
| torch.isnan(grad) | False | True |
| torch.isinf(grad) | False | False (NaN is not Inf) |
| Standard NaN guard catches it? | No | Yes |

Versions

PyTorch version: 2.13.0.dev20260512+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-37-generic-x86_64-with-glibc2.39
Is CUDA available: True
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 590.48.01

[pip3] numpy==2.4.4
[pip3] torch==2.13.0.dev20260512+cu130
[pip3] triton==3.7.0+git88b227e2

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93
