pytorch - 💡(How to fix) Fix Numerical inconsistency: nn.Conv2d produces NaN/Inf on CUDA but finite values on CPU for inputs near float32 limits [2 comments, 2 participants]

GitHub stats
pytorch/pytorch#177018 (fetched 2026-04-08 00:22:46)
Comments: 2
Participants: 2
Timeline events: 3 (commented ×2, closed ×1)
Reactions: 0

Error Message

The core issue is that the error state is inconsistent across execution devices. As noted in PyTorch's internal philosophy regarding extremal values (similar to the discussions of intermediate overflow in norm()), precision may vary across backends, but whether an error state occurs at all should be consistent. Here, switching the execution device from CPU to CUDA changes whether NaN/Inf appears, which violates backend parity. The discrepancy goes beyond expected numerical variance: on identical input data, one backend reaches an error state while the other succeeds. This warrants re-evaluating how intermediate accumulations are handled in the CUDA/cuDNN implementation of convolutional layers.
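To make "intermediate accumulation" concrete, here is a minimal sketch of how the grouping of float32 additions alone can decide whether a sum overflows. The values are illustrative and are not taken from the issue's input.pt, and whether this mechanism is what actually happens inside cuDNN is an assumption that the report does not establish.

```python
import torch

# Illustrative float32 value near the limit (the float32 maximum is about 3.4028e38).
big = torch.tensor(3.0e38, dtype=torch.float32)

# Grouping 1: the partial sum big + big exceeds the float32 maximum and becomes inf.
# inf - big stays inf, even though the exact answer (3.0e38) is representable.
overflowed = (big + big) - big

# Grouping 2: big - big is exactly 0, so every partial sum stays in range.
finite = big + (big - big)

# Once two intermediates have overflowed, inf - inf yields NaN.
nan_value = (big + big) - (big + big)

print(overflowed.item(), finite.item(), nan_value.item())
```

The three expressions are mathematically equivalent rearrangements, which is why a backend that reorders or pre-combines terms can, in principle, report Inf or NaN where a different summation order stays finite.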

Code Example

import torch
import torch.nn as nn

def repro():
    # Load the saved input tensor and the convolution parameters onto the CPU.
    inp = torch.load("input.pt", map_location="cpu")
    weights = torch.load("weight_bias.pt", map_location="cpu")

    conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=1, bias=True)
    # Copy the saved parameters into the layer without tracking gradients.
    with torch.no_grad():
        conv.weight.copy_(weights["weight"])
        conv.bias.copy_(weights["bias"])
    conv.eval()

    # Confirm that both devices receive identical input data.
    input_cuda = inp.cuda()
    input_cpu = inp.cpu()
    input_diff = (input_cuda.cpu() - input_cpu).abs().max().item()
    print(f"Input Consistency: {input_diff:.6e}")
    print(f"Input Range: min={inp.min().item():.2e}, max={inp.max().item():.2e}, nan={torch.isnan(inp).any().item()}")

    # Run the convolution on CUDA and bring the result back to the CPU for inspection.
    conv.to("cuda")
    with torch.no_grad():
        out_cuda = conv(input_cuda).cpu()
    print(f"[CUDA] Out: min={out_cuda.min().item():.2e}, max={out_cuda.max().item():.2e}, nan={torch.isnan(out_cuda).any().item()}, inf={torch.isinf(out_cuda).any().item()}")

    # Run the same layer with the same parameters on the CPU.
    conv.to("cpu")
    with torch.no_grad():
        out_cpu = conv(input_cpu)
    print(f"[CPU] Out: min={out_cpu.min().item():.2e}, max={out_cpu.max().item():.2e}, nan={torch.isnan(out_cpu).any().item()}, inf={torch.isinf(out_cpu).any().item()}")

if __name__ == "__main__":
    repro()

Output:

Input Consistency: 0.000000e+00
Input Range: min=0.00e+00, max=3.17e+38, nan=False
[CUDA] Out: min=nan, max=nan, nan=True, inf=True
[CPU] Out: min=-2.89e+38, max=3.15e+38, nan=False, inf=False

🐛 Describe the bug

For the same FP32 input, with values near the representable limit (max ≈ 3.17e+38), we observed a functional divergence: the CPU backend produces a finite result (max $3.15 \times 10^{38}$), while the CUDA backend produces NaN and Inf.

repro.zip

Versions

PyTorch version: 2.6.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.9.23 (main, Jun 5 2025, 13:40:20) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-90-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.6.20
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: QEMU Virtual CPU version 2.5+
CPU family: 15
Model: 107
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 1
Stepping: 1
BogoMIPS: 4190.15
Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c hypervisor lahf_lm abm cpuid_fault pti bmi1 avx2 bmi2 avx512f avx512dq avx512cd avx512bw avx512vl
Hypervisor vendor: KVM
Virtualization type: full

Versions of relevant libraries:
[pip3] numpy==2.0.2
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] torch==2.6.0+cu126
[pip3] torchaudio==2.6.0+cu126
[pip3] torchvision==0.21.0+cu126
[pip3] triton==3.2.0
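
The environment report above lists "cuDNN version: Could not collect" even though the pip list shows nvidia-cudnn-cu12==9.5.1.17. A minimal sketch for reading the version that PyTorch itself loads is shown below; backend_report is a hypothetical helper name, and the values it returns depend on the local build.

```python
import torch

def backend_report():
    """Collect the backend versions as reported by PyTorch itself."""
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,                  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "cudnn_available": torch.backends.cudnn.is_available(),
        "cudnn_version": torch.backends.cudnn.version(),   # an integer, or None if cuDNN is absent
    }

report = backend_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Including this output in a bug report removes the ambiguity about which cuDNN build actually ran the convolution.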

extent analysis

Fix Plan

Intermediate Accumulation Fix

Step 1: Update cuDNN Implementation

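cuDNN is a closed-source NVIDIA library, so any change to how its convolution kernels handle intermediate accumulations has to come from NVIDIA. On the user side, the practical options are to upgrade to a newer PyTorch/cuDNN build and re-run the repro, or to steer PyTorch away from the affected code path.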

Step 2: Modify CUDA Backend

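If the overflow originates in PyTorch's own CUDA code rather than in cuDNN, the CUDA path would need to change so that, like the CPU path in this repro, it returns finite values when the exact result is representable in float32. Which layer is responsible has not been established in this report.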
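
One way to narrow down which layer is responsible is to run the same convolution under different backend settings and compare the error states. The sketch below assumes that toggling cuDNN and TF32 through torch.backends.cudnn.flags is a useful diagnostic; the issue does not confirm that either setting changes the outcome. diagnose and has_nan_or_inf are hypothetical helper names, and the CUDA variants are skipped when no GPU is present.

```python
import torch
import torch.nn as nn

def has_nan_or_inf(t):
    """Summarize the error state of a tensor."""
    return {"nan": bool(torch.isnan(t).any()), "inf": bool(torch.isinf(t).any())}

def diagnose(conv, x):
    """Run one convolution on CPU and, if available, on CUDA under several settings."""
    results = {}
    with torch.no_grad():
        results["cpu"] = has_nan_or_inf(conv.cpu()(x.cpu()))
        if torch.cuda.is_available():
            conv_gpu, x_gpu = conv.cuda(), x.cuda()
            # flags() defaults to enabled=False, so enabled is passed explicitly.
            settings = {
                "cudnn_tf32": dict(enabled=True, allow_tf32=True),
                "cudnn_no_tf32": dict(enabled=True, allow_tf32=False),
                "no_cudnn": dict(enabled=False),  # falls back to PyTorch's native CUDA kernels
            }
            for name, kwargs in settings.items():
                with torch.backends.cudnn.flags(**kwargs):
                    results[name] = has_nan_or_inf(conv_gpu(x_gpu).cpu())
            conv.cpu()  # leave the module where it started
    return results

# Small stand-in layer and input; substitute the tensors from input.pt and weight_bias.pt.
demo_conv = nn.Conv2d(2, 2, kernel_size=3, padding=1)
demo_results = diagnose(demo_conv, torch.randn(1, 2, 4, 4))
print(demo_results)
```

If the "no_cudnn" variant is finite while the cuDNN variants are not, that points at cuDNN's kernel selection; if all CUDA variants fail, the cause lies elsewhere.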

Step 3: Test and Verify

Re-run the provided repro script against the updated build and confirm that the [CUDA] line reports nan=False, inf=False, with a range comparable to the [CPU] line.
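
When the two backends disagree, a higher-precision reference helps decide which one is right. The sketch below recomputes the convolution in float64 on the CPU and checks whether every exact result fits in float32. float64_reference is a hypothetical helper, and the 3x3 tensors are illustrative stand-ins rather than the issue's data.

```python
import torch
import torch.nn.functional as F

def float64_reference(x, weight, bias=None, stride=1, padding=0):
    """Convolve in float64 on CPU and report whether float32 can hold every result."""
    ref = F.conv2d(
        x.double().cpu(),
        weight.double().cpu(),
        None if bias is None else bias.double().cpu(),
        stride=stride,
        padding=padding,
    )
    fmax = torch.finfo(torch.float32).max
    representable = bool((ref.abs() <= fmax).all())
    return ref, representable

x = torch.full((1, 1, 3, 3), 3.0e38)            # float32 inputs near the limit
averaging = torch.full((1, 1, 3, 3), 1.0 / 9)   # exact result is about 3e38
summing = torch.ones(1, 1, 3, 3)                # exact result is about 2.7e39

ref_avg, avg_fits = float64_reference(x, averaging)
ref_sum, sum_fits = float64_reference(x, summing)
print(avg_fits, sum_fits)
```

If the reference is representable, a finite float32 answer is the correct one and NaN/Inf signals an intermediate overflow; if it is not, Inf is the expected result on every backend.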

Code Snippet

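import torch
import torch.nn as nn

class ClampedConv2d(nn.Module):
    """Workaround sketch: clamp the input before convolving.

    This only narrows the input range. It does not change how the backend
    accumulates, and sums of clamped values can still overflow float32.
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, bias):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, bias=bias)

    def forward(self, x):
        # Clamp inputs to [-1e38, 1e38]; values beyond that range are altered.
        x = torch.clamp(x, -1e38, 1e38)
        return self.conv(x)

# Example usage: the module and the input must be on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
conv = ClampedConv2d(64, 64, 3, 1, 1, True).to(device)
x = torch.randn(1, 64, 224, 224, device=device)
with torch.no_grad():
    out = conv(x)
print(out.shape, torch.isfinite(out).all().item())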

Verification

To verify the fix, re-run the repro script and check that the CPU and CUDA backends agree on the error state: either both outputs are finite, or both contain NaN/Inf.
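
The check described above can be sketched as a small helper that compares the error state first and the values second. compare_outputs is a hypothetical name, and the two tensors are stand-ins shaped like the [CPU] and [CUDA] lines in the reported output, not the actual outputs.

```python
import torch

def compare_outputs(out_a, out_b):
    """Compare two outputs by error state, then by value where both are finite."""
    state_a = (bool(torch.isnan(out_a).any()), bool(torch.isinf(out_a).any()))
    state_b = (bool(torch.isnan(out_b).any()), bool(torch.isinf(out_b).any()))
    both_finite = torch.isfinite(out_a) & torch.isfinite(out_b)
    max_rel_diff = None
    if both_finite.any():
        a = out_a[both_finite].double()
        b = out_b[both_finite].double()
        # Guard the denominator so exact zeros do not divide by zero.
        max_rel_diff = ((a - b).abs() / b.abs().clamp_min(1e-30)).max().item()
    return {"same_error_state": state_a == state_b, "max_rel_diff": max_rel_diff}

cpu_like = torch.tensor([-2.89e38, 3.15e38])
cuda_like = torch.tensor([float("nan"), float("inf")])

agreeing = compare_outputs(cpu_like, cpu_like)
diverged = compare_outputs(cpu_like, cuda_like)
print(agreeing)
print(diverged)
```

Checking the error state separately from the numeric difference keeps the two questions apart: ordinary precision variance shows up in max_rel_diff, while the functional divergence described in the issue shows up in same_error_state.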

Extra Tips

  • Keep cuDNN and CUDA reasonably up to date so that you pick up upstream bug fixes, and re-run the repro after each upgrade.
  • Clamping or rescaling inputs with torch.clamp keeps individual values away from the float32 limit, but it does not bound intermediate sums, so treat it as a mitigation rather than a fix.
  • Test edge cases explicitly, including inputs near the float32 limit (about ±3.4e38), on every backend you deploy to.
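
As a caution on the clamping tip, here is a minimal sketch showing that clamped inputs can still overflow once they are summed. The all-ones kernel and the 1e38 limit are illustrative assumptions, not values from the issue.

```python
import torch
import torch.nn.functional as F

limit = 1e38
# Every input starts at 3e38 and is clamped down to the limit.
x = torch.full((1, 1, 3, 3), 3.0e38).clamp(-limit, limit)
weight = torch.ones(1, 1, 3, 3)  # hypothetical kernel that sums its 3x3 window

# The exact result is 9e38, which is above the float32 maximum (about 3.4e38).
out = F.conv2d(x, weight)
print(x.max().item(), out)
```

Here the overflow is the mathematically expected outcome, so every backend should report Inf; a safe clamp limit would have to account for the kernel size, the channel count, and the weight magnitudes.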
