Error Message

nn.Conv2d returns NaN and Inf on CUDA but finite values on CPU for the same FP32 input. No exception is raised: the error state shows up only in the output values, and only on one backend.

Code Example

import torch
import torch.nn as nn

def repro():
    # Load the attached input and parameters onto the CPU first.
    inp = torch.load("input.pt", map_location="cpu")
    weights = torch.load("weight_bias.pt", map_location="cpu")

    conv = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=1, bias=True)
    with torch.no_grad():
        conv.weight.copy_(weights["weight"])
        conv.bias.copy_(weights["bias"])
    conv.eval()

    # Confirm that both devices see identical input values.
    input_cuda = inp.cuda()
    input_cpu = inp.cpu()
    input_diff = (input_cuda.cpu() - input_cpu).abs().max().item()
    print(f"Input Consistency: {input_diff:.6e}")
    print(f"Input Range: min={inp.min().item():.2e}, max={inp.max().item():.2e}, nan={torch.isnan(inp).any().item()}")

    # Run the convolution on CUDA.
    conv.to("cuda")
    with torch.no_grad():
        out_cuda = conv(input_cuda).cpu()
    print(f"[CUDA] Out: min={out_cuda.min().item():.2e}, max={out_cuda.max().item():.2e}, nan={torch.isnan(out_cuda).any().item()}, inf={torch.isinf(out_cuda).any().item()}")

    # Run the same module on CPU.
    conv.to("cpu")
    with torch.no_grad():
        out_cpu = conv(input_cpu)
    print(f"[CPU] Out: min={out_cpu.min().item():.2e}, max={out_cpu.max().item():.2e}, nan={torch.isnan(out_cpu).any().item()}, inf={torch.isinf(out_cpu).any().item()}")

if __name__ == "__main__":
    repro()

Output:

Input Consistency: 0.000000e+00
Input Range: min=0.00e+00, max=3.17e+38, nan=False
[CUDA] Out: min=nan, max=nan, nan=True, inf=True
[CPU] Out: min=-2.89e+38, max=3.15e+38, nan=False, inf=False

🐛 Describe the bug

The core issue is that the error state is inconsistent across execution devices. For the same FP32 input, with values close to the representable limit (max 3.17e+38), the CPU backend produces finite output (min -2.89e+38, max 3.15e+38), while the CUDA backend produces NaN and Inf.

As noted in PyTorch's internal philosophy regarding extremal values (similar to discussions on intermediate overflows in norm()), precision may vary across backends, but whether an error state occurs at all should not. Here, switching the device between CPU and CUDA changes whether the output is finite, which violates backend parity. This goes beyond expected numerical variance: one backend fails and the other succeeds on identical input data. It warrants a re-evaluation of how intermediate accumulations are handled in the CUDA/cuDNN implementation of convolutional layers.
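To illustrate how intermediate accumulation alone can decide whether an error state appears, here is a minimal sketch in plain Python that emulates float32 addition. It does not model what cuDNN or the CPU kernel actually does, which the report does not identify. The helper names (`to_f32`, `add_f32`, `sum_sequential`, `sum_pairwise`) and the value 3.0e38 are illustrative assumptions.

```python
import math
import struct


def to_f32(x):
    """Round a Python float to the nearest float32, overflowing to +/-inf."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # struct refuses finite values outside the float32 range.
        return math.copysign(math.inf, x)


def add_f32(a, b):
    """Emulate a single float32 addition."""
    return to_f32(to_f32(a) + to_f32(b))


def sum_sequential(terms):
    """Accumulate left to right, rounding to float32 after every step."""
    total = 0.0
    for t in terms:
        total = add_f32(total, t)
    return total


def sum_pairwise(terms):
    """Tree reduction: sum each half, then combine the two partial sums."""
    if len(terms) == 1:
        return to_f32(terms[0])
    mid = len(terms) // 2
    return add_f32(sum_pairwise(terms[:mid]), sum_pairwise(terms[mid:]))


big = 3.0e38  # hypothetical value close to the float32 maximum (about 3.4e38)

print(sum_sequential([big, -big, big, -big]))  # finite: partial sums stay in range
print(sum_sequential([big, big, -big, -big]))  # inf: the second addition overflows
print(sum_pairwise([big, big, -big, -big]))    # nan: inf + (-inf)
```

The exact sum of all three inputs is zero, yet the result is finite, Inf, or NaN depending only on the order and grouping of the additions. A convolution kernel that groups or transforms its terms differently from another kernel could diverge in the same way.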

repro.zip

Versions

PyTorch version: 2.6.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.9.23 (main, Jun 5 2025, 13:40:20) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-90-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.6.20
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: QEMU Virtual CPU version 2.5+
CPU family: 15
Model: 107
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 1
Stepping: 1
BogoMIPS: 4190.15
Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c hypervisor lahf_lm abm cpuid_fault pti bmi1 avx2 bmi2 avx512f avx512dq avx512cd avx512bw avx512vl
Hypervisor vendor: KVM
Virtualization type: full

Versions of relevant libraries:
[pip3] numpy==2.0.2
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] torch==2.6.0+cu126
[pip3] torchaudio==2.6.0+cu126
[pip3] torchvision==0.21.0+cu126
[pip3] triton==3.2.0

extent analysis

Fix Plan

Intermediate Accumulation Fix

Step 1: Update cuDNN Implementation

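Intermediate values in the cuDNN convolution path should stay within FP32 range whenever the final result is representable. cuDNN is a closed-source NVIDIA library, so a change at this level would have to come from NVIDIA or from how PyTorch selects and configures cuDNN convolution algorithms. The report does not identify which algorithm produced the overflow.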

Step 2: Modify CUDA Backend

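Modify the CUDA backend so that, for inputs whose result is representable in FP32, it returns finite values as the CPU backend does. This is a goal rather than a concrete patch: the report does not say where in the CUDA path the NaN and Inf values first appear.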

Step 3: Test and Verify

Re-run the provided repro script against the changed build and confirm that the CUDA output reports nan=False and inf=False, with min and max close to the CPU values (min=-2.89e+38, max=3.15e+38).

Code Snippet

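import torch
import torch.nn as nn

class ClampedConv2d(nn.Module):
    """Workaround, not a backend fix: clamps inputs before the convolution."""

    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, bias):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, bias=bias)

    def forward(self, x):
        # Clamping changes the input values, so results differ from the
        # unclamped CPU output. It bounds each input, not the accumulated sum.
        x = torch.clamp(x, -1e38, 1e38)
        return self.conv(x)

# Example usage (module and input must be on the same device)
device = "cuda" if torch.cuda.is_available() else "cpu"
conv = ClampedConv2d(64, 64, 3, 1, 1, True).to(device)
x = torch.randn(1, 64, 224, 224, device=device)
with torch.no_grad():
    out = conv(x)
print(out.shape, torch.isfinite(out).all().item())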

Verification

To verify that the fix worked, run the repro script with the updated implementation and check that the output is consistent across CPU and CUDA backends.
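One way to script this check is sketched below. It runs the same convolution on CPU and, when a GPU is present, on CUDA under three settings: the current defaults, TF32 disabled for cuDNN convolutions, and cuDNN disabled. Whether either toggle affects this particular bug is an untested assumption; the report does not say. The function names `status` and `compare_backends` are hypothetical helpers, not PyTorch APIs.

```python
import copy

import torch
import torch.nn as nn


def status(t):
    """Report whether a tensor contains NaN or Inf values."""
    return {
        "nan": bool(torch.isnan(t).any()),
        "inf": bool(torch.isinf(t).any()),
    }


def compare_backends(conv, x):
    """Run conv(x) on CPU and, if available, on CUDA under several settings."""
    results = {}
    with torch.no_grad():
        results["cpu"] = status(conv(x))
        if not torch.cuda.is_available():
            return results

        conv_cuda = copy.deepcopy(conv).to("cuda")  # leave the CPU module untouched
        x_cuda = x.to("cuda")
        saved = (torch.backends.cudnn.enabled, torch.backends.cudnn.allow_tf32)
        settings = [
            ("cuda_default", saved[0], saved[1]),
            ("cuda_no_tf32", saved[0], False),
            ("cuda_no_cudnn", False, saved[1]),
        ]
        try:
            for name, enabled, tf32 in settings:
                torch.backends.cudnn.enabled = enabled
                torch.backends.cudnn.allow_tf32 = tf32
                results[name] = status(conv_cuda(x_cuda).cpu())
        finally:
            # Restore the global flags whatever happens.
            torch.backends.cudnn.enabled, torch.backends.cudnn.allow_tf32 = saved
    return results


if __name__ == "__main__":
    # Placeholder data; substitute the tensors loaded in the repro script.
    conv = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1).eval()
    x = torch.randn(1, 64, 16, 16)
    for name, result in compare_backends(conv, x).items():
        print(name, result)
```

With the attached input.pt and weight_bias.pt loaded as in the repro script, the output is consistent when every entry reports the same nan and inf flags as the cpu entry.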

Extra Tips

Regularly update cuDNN and CUDA versions to ensure you have the latest bug fixes and performance improvements.
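Clamping the inputs with torch.clamp can reduce the risk of overflow, but it alters the data and bounds only individual values: each output of this layer sums 64 x 3 x 3 = 576 products, and that sum can still exceed the FP32 limit.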
Test your implementation thoroughly to ensure it handles edge cases correctly.
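The clamping tip can be made concrete with a rough worst-case bound. The sketch below assumes a plain multiply-accumulate, where every partial sum is at most max|x| times the sum of |w| plus |b|. Algorithms that transform inputs or weights first may have different intermediates, so treat this as an estimate only. The function names and the weight value 0.01 are hypothetical.

```python
FLT_MAX = 3.4028234663852886e38  # largest finite float32 value


def worst_case_output(x_bound, weights, bias=0.0):
    """Upper bound on |output| when every |input| <= x_bound."""
    return x_bound * sum(abs(w) for w in weights) + abs(bias)


def safe_input_bound(weights, bias=0.0, margin=0.5):
    """Clamp value that keeps the worst case below margin * FLT_MAX."""
    l1 = sum(abs(w) for w in weights)
    if l1 == 0.0:
        return FLT_MAX  # all-zero weights cannot overflow
    return max(0.0, margin * FLT_MAX - abs(bias)) / l1


# One output element of the layer in the report sums 64 * 3 * 3 = 576 products.
taps = 64 * 3 * 3
weights = [0.01] * taps  # hypothetical weights for illustration

# A clamp at 1e38 keeps every input finite, but the worst-case sum still overflows.
print(worst_case_output(1e38, weights) > FLT_MAX)

# A clamp derived from the weights keeps the worst case in range.
bound = safe_input_bound(weights)
print(bound, worst_case_output(bound, weights) < FLT_MAX)
```

A bound derived this way depends on the layer's weights, so a fixed constant such as 1e38 is safe only for layers whose absolute weights sum to a small enough value.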

pytorch - 💡(How to fix) Fix Numerical inconsistency: nn.Conv2d produces NaN/Inf on CUDA but finite values on CPU for inputs near float32 limits [2 comments, 2 participants]
