pytorch - 💡(How to fix) torch.compile produces Inf/NaN on nn.Conv2d with valid large inputs (near FLT_MAX) while eager mode works correctly [3 comments, 3 participants]

GitHub stats
pytorch/pytorch#177657 (fetched 2026-04-08 00:52:40)
Comments: 3 · Participants: 3 · Timeline: 6 · Reactions: 0
Timeline (top): commented ×3, labeled ×2, closed ×1


🐛 Describe the bug

I found a critical numerical stability issue: torch.compile (with the Inductor backend) produces Inf and NaN outputs for a standard nn.Conv2d layer, while eager mode produces finite results from exactly the same input and weights.

Although the input contains very large values (up to 3.402823e+38, essentially the float32 maximum FLT_MAX), every value is a finite float32 number, so the input is valid. The fact that the two execution modes yield fundamentally different results (finite vs. Inf/NaN) indicates a serious problem in the compiler's numerical handling.
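Float32 addition is not associative near FLT_MAX, which is one general way two implementations of the same convolution can disagree: whether an intermediate sum overflows depends on the order in which terms are accumulated, and once partial sums overflow to +inf and -inf, combining them yields NaN. The snippet below is a minimal illustration with scalar tensors; it is not a diagnosis of what Inductor actually does in this issue.

```python
import torch

fmax = torch.finfo(torch.float32).max  # FLT_MAX, about 3.4028235e+38
a = torch.tensor(fmax, dtype=torch.float32)

# Same three terms, two summation orders.
left = (a + a) + (-a)   # a + a overflows to inf; inf - FLT_MAX stays inf
right = a + (a + (-a))  # a + (-a) is exactly 0, so the result is FLT_MAX

# Overflowed partial sums of opposite sign combine into NaN.
nan_result = (a + a) + (-a - a)  # inf + (-inf)

print(left.item(), right.item(), nan_result.item())
```

A convolution over a 7×7×3 window accumulates 147 products per output element, so the accumulation order chosen by a given kernel can decide whether an output ends up finite, Inf, or NaN for inputs this large.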

Severity: this issue is severe because:

  • Silent corruption: it introduces Inf/NaN into the network, which leads to incorrect label predictions.
  • Error accumulation: in deep networks (like DenseNet), these anomalies propagate rapidly across layers, eventually breaking the entire model's output.
  • Inconsistency: the lack of parity between eager and compiled modes makes debugging and deployment extremely difficult.
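Because Inf/NaN values propagate through every later layer, it helps to find the first layer that produces them. Below is a minimal eager-mode sketch using forward hooks; find_nonfinite_layers is a hypothetical helper, not a PyTorch API, and hooks may not behave identically inside a torch.compile region, so it is meant for eager debugging (or for layers compiled one at a time).

```python
import torch
import torch.nn as nn


def find_nonfinite_layers(model: nn.Module, x: torch.Tensor) -> list:
    """Hypothetical helper: run `model` on `x` and return, in execution order,
    the names of leaf modules whose output contains Inf or NaN."""
    hits = []
    handles = []
    for name, module in model.named_modules():
        if any(True for _ in module.children()):
            continue  # only hook leaf modules
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                hits.append(name)
        handles.append(module.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(x)
    finally:
        for handle in handles:
            handle.remove()  # leave the model unmodified
    return hits


# Usage: a Linear layer with weight 2.0 overflows on an input of FLT_MAX,
# and the ReLU after it simply passes the inf along.
fmax = torch.finfo(torch.float32).max
model = nn.Sequential(nn.Linear(1, 1, bias=False), nn.ReLU())
with torch.no_grad():
    model[0].weight.fill_(2.0)
print(find_nonfinite_layers(model, torch.tensor([[fmax]])))  # ['0', '1']
```

The first name in the returned list is the layer where the non-finite values originate; every later entry may only be inheriting them.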

import torch
import torch.nn as nn
import contextlib
import torch._inductor.config as inductor_config

DEVICE = 'cuda'
DTYPE = torch.float32

@contextlib.contextmanager
def set_config(mode):
    orig = inductor_config.fallback_random
    try:
        inductor_config.fallback_random = (mode == "1")
        yield
    finally:
        inductor_config.fallback_random = orig

def check_nan_inf(x):
    has_nan = torch.isnan(x).any().item()
    has_inf = torch.isinf(x).any().item()
    return has_nan, has_inf

input_data = torch.load("input.pt")
print(f"Input stats: min={input_data.min():.6e}, max={input_data.max():.6e}\n")

input_cuda = input_data.to(DEVICE).to(DTYPE)

weight_dict = torch.load("sd.pt", map_location='cpu')

has_bias = weight_dict['bias'] is not None
layer = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=has_bias)
layer.weight.data = weight_dict['weight']
if has_bias:
    layer.bias.data = weight_dict['bias']
layer = layer.to(DEVICE).to(DTYPE).eval()

print("Config | NaN | Inf | Execution")
print("-" * 50)

with torch.no_grad():
    x0 = layer(input_cuda)
    nan0, inf0 = check_nan_inf(x0)
    print(f"0      | {'YES' if nan0 else 'NO':3} | {'YES' if inf0 else 'NO':3} | cuda fp32 eager")

with set_config("1"):
    with torch.no_grad():
        compiled = torch.compile(layer, dynamic=False, backend='inductor')
        x1 = compiled(input_cuda)
        nan1, inf1 = check_nan_inf(x1)
        print(f"1      | {'YES' if nan1 else 'NO':3} | {'YES' if inf1 else 'NO':3} | cuda fp32 compile")

print("\n" + "="*50)

print("="*50)

print()
torch.cuda.empty_cache()

Output:

Input stats: min=1.334895e+29, max=3.402823e+38

Config | NaN | Inf | Execution
--------------------------------------------------
0      | NO  | NO  | cuda fp32 eager
1      | YES | YES | cuda fp32 compile

==================================================
==================================================

demo.zip REDACTED

Versions

PyTorch version: 2.10.0+cu126 Is debug build: False CUDA used to build PyTorch: 12.6 ROCm used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64) GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Clang version: Could not collect CMake version: Could not collect Libc version: glibc-2.39

Python version: 3.10.19 | packaged by conda-forge | (main, Jan 26 2026, 23:45:08) [GCC 14.3.0] (64-bit runtime) Python platform: Linux-6.8.0-90-generic-x86_64-with-glibc2.39 Is CUDA available: True CUDA runtime version: 12.6.20 CUDA_MODULE_LOADING set to: GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090 GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 560.35.03 cuDNN version: Could not collect Is XPU available: False HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True Caching allocator config: N/A

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 40 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 48 On-line CPU(s) list: 0-47 Vendor ID: GenuineIntel Model name: QEMU Virtual CPU version 2.5+ CPU family: 15 Model: 107 Thread(s) per core: 1 Core(s) per socket: 48 Socket(s): 1 Stepping: 1 BogoMIPS: 4190.15 Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c hypervisor lahf_lm abm cpuid_fault pti bmi1 avx2 bmi2 avx512f avx512dq avx512cd avx512bw avx512vl Hypervisor vendor: KVM Virtualization type: full L1d cache: 1.5 MiB (48 instances) L1i cache: 1.5 MiB (48 instances) L2 cache: 192 MiB (48 instances) L3 cache: 16 MiB (1 instance) NUMA node(s): 1 NUMA node0 CPU(s): 0-47 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported Vulnerability L1tf: Mitigation; PTE Inversion Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Unknown: No mitigations Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Vulnerable Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected

Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.6.4.1 [pip3] nvidia-cuda-cupti-cu12==12.6.80 [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 [pip3] nvidia-cuda-runtime-cu12==12.6.77 [pip3] nvidia-cudnn-cu12==9.10.2.21 [pip3] nvidia-cufft-cu12==11.3.0.4 [pip3] nvidia-curand-cu12==10.3.7.77 [pip3] nvidia-cusolver-cu12==11.7.1.2 [pip3] nvidia-cusparse-cu12==12.5.4.2 [pip3] nvidia-cusparselt-cu12==0.7.1 [pip3] nvidia-nccl-cu12==2.27.5 [pip3] nvidia-nvjitlink-cu12==12.6.85 [pip3] nvidia-nvtx-cu12==12.6.77 [pip3] onnxruntime-gpu==1.23.2 [pip3] optree==0.18.0 [pip3] pytorch-triton==3.2.0+git4b3bb1f8 [pip3] torch==2.10.0+cu126 [pip3] torchaudio==2.11.0.dev20260127+cu126 [pip3] torchvision==0.25.0+cu126 [pip3] triton==3.6.0+git9844da95 [conda] numpy 1.26.4 pypi_0 pypi [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.10.2.21 pypi_0 pypi [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi [conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi [conda] nvidia-nccl-cu12 2.27.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi [conda] optree 0.18.0 pypi_0 pypi [conda] pytorch-triton 3.2.0+git4b3bb1f8 pypi_0 pypi [conda] torch 2.10.0+cu126 pypi_0 pypi [conda] torchaudio 2.11.0.dev20260127+cu126 pypi_0 pypi [conda] torchvision 0.25.0+cu126 pypi_0 pypi [conda] triton 3.6.0+git9844da95 pypi_0 pypi


Fix Plan

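To work around the overflow with torch.compile and the Inductor backend, you can try the following:

  • Input scaling: Divide the input by a large constant before the convolution so that intermediate products and partial sums stay within the float32 range. Because Conv2d is linear in its input, the bias must be divided by the same constant, and the output comes out divided by it as well.
  • Weight scaling: Scaling the weights has the same effect, but it is usually unnecessary here because the overflow comes from the input magnitude. If both input and weights are scaled, the bias and the output scale by the product of the two factors.
  • Use float64: Switch both the layer and the input to torch.float64. What helps here is float64's much wider exponent range (max ≈ 1.8e308 vs. ≈ 3.4e38 for float32), not its extra precision.
  • Bypass Inductor: Compile with backend='eager' (or 'aot_eager') to check whether the Inf/NaN comes from Inductor itself. Note that setting inductor_config.fallback_random does not disable Inductor; it only controls whether random ops fall back to the eager (ATen) implementations.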
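The float64 suggestion works because of range rather than precision. This small illustration with scalar tensors shows that an intermediate value of 2 × FLT_MAX overflows in float32 but is representable in float64.

```python
import torch

fmax32 = torch.finfo(torch.float32).max
fmax64 = torch.finfo(torch.float64).max

a32 = torch.tensor(fmax32, dtype=torch.float32)
a64 = torch.tensor(fmax32, dtype=torch.float64)  # same value, wider type

doubled32 = a32 + a32  # exceeds float32's exponent range -> inf
doubled64 = a64 + a64  # about 6.8e+38, far below float64's max of about 1.8e+308

print(doubled32.item(), doubled64.item(), fmax64)
```

Casting such a float64 result back to float32 does not make it representable, so any output above FLT_MAX still has to be handled downstream (for example by staying in float64).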

Here's an example of how you can modify your code to implement these steps:

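import copy
import torch

# ... (input_cuda, layer and check_nan_inf are defined as in the repro above)

SCALE = 1e30

with torch.no_grad():
    # 1) Input scaling. Conv2d is linear in its input, so running it on x / SCALE
    #    with the bias also divided by SCALE yields (conv(x) + b) / SCALE.
    #    The weights are left unchanged. Multiplying the result back by SCALE
    #    in float32 overflows again wherever the true output exceeds FLT_MAX.
    scaled_layer = copy.deepcopy(layer)
    if scaled_layer.bias is not None:
        scaled_layer.bias.div_(SCALE)
    compiled = torch.compile(scaled_layer, dynamic=False, backend='inductor')
    nan1, inf1 = check_nan_inf(compiled(input_cuda / SCALE))
    print(f"1      | {'YES' if nan1 else 'NO':3} | {'YES' if inf1 else 'NO':3} | cuda fp32 compile (scaled input)")

    # 2) float64. The larger exponent range (max ~1.8e308) should keep the
    #    intermediate products and partial sums finite; cast layer AND input.
    layer64 = copy.deepcopy(layer).to(torch.float64)
    compiled64 = torch.compile(layer64, dynamic=False, backend='inductor')
    nan2, inf2 = check_nan_inf(compiled64(input_cuda.to(torch.float64)))
    print(f"2      | {'YES' if nan2 else 'NO':3} | {'YES' if inf2 else 'NO':3} | cuda fp64 compile")

    # 3) Bypass Inductor. backend='eager' still captures the graph with Dynamo
    #    but runs it with eager kernels, which helps tell whether Inductor is
    #    responsible. (Toggling inductor_config.fallback_random does NOT disable Inductor.)
    compiled_eager = torch.compile(layer, dynamic=False, backend='eager')
    nan3, inf3 = check_nan_inf(compiled_eager(input_cuda))
    print(f"3      | {'YES' if nan3 else 'NO':3} | {'YES' if inf3 else 'NO':3} | cuda fp32 compile (eager backend)")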
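The input-scaling step relies on Conv2d being linear in its input: conv(x / s, w, b / s) · s = conv(x, w, b). Here is a small CPU check of that identity with torch.nn.functional.conv2d, using the same kernel size, stride and padding as the layer in the repro but random float64 tensors (an assumption made to keep rounding noise out of the comparison).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 16, 16, dtype=torch.float64) * 1e3
w = torch.randn(8, 3, 7, 7, dtype=torch.float64)
b = torch.randn(8, dtype=torch.float64)
s = 2.0 ** 100  # power of two, so dividing and multiplying by s is exact

reference = F.conv2d(x, w, b, stride=2, padding=3)
# Divide the input and the bias by s, keep the weights, then undo the scaling.
rescaled = F.conv2d(x / s, w, b / s, stride=2, padding=3) * s

print(torch.allclose(reference, rescaled, rtol=1e-9, atol=1e-9))  # True
```

Choosing s as a power of two such as 2.0 ** 100 (about 1.27e30) makes the scaling itself exact in binary floating point, whereas 1e30 adds one rounding step; either choice brings the repro's inputs (1.3e29 to 3.4e38) well inside the float32 range.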

Verification

To verify the fix, check the output of check_nan_inf for the compiled model: if the issue is resolved, it reports no NaN or Inf values. The absence of Inf/NaN alone does not prove the result is correct, so also compare the compiled output against the eager output (for example with torch.allclose).
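One way to do that comparison is sketched below. compare_outputs is a hypothetical helper, and its default tolerances are arbitrary assumptions to tune for your model; it separates newly introduced Inf/NaN from ordinary numerical differences on the elements that are finite in both outputs.

```python
import torch


def compare_outputs(eager: torch.Tensor, compiled: torch.Tensor,
                    rtol: float = 1e-4, atol: float = 1e-5) -> dict:
    """Hypothetical helper: summarize how a compiled output differs from eager."""
    eager_finite = torch.isfinite(eager)
    compiled_finite = torch.isfinite(compiled)
    both_finite = eager_finite & compiled_finite
    return {
        # finite in eager, Inf/NaN after compilation (the failure in this issue)
        "new_nonfinite": int((eager_finite & ~compiled_finite).sum()),
        # Inf/NaN in eager, finite after compilation
        "lost_nonfinite": int((~eager_finite & compiled_finite).sum()),
        # numerical agreement on elements that are finite in both outputs
        "finite_allclose": torch.allclose(eager[both_finite], compiled[both_finite],
                                          rtol=rtol, atol=atol),
    }


# Usage with small synthetic tensors; in the repro these would be x0 and x1.
eager_out = torch.tensor([1.0, 2.0, 3.0])
compiled_out = torch.tensor([1.0, float("inf"), 3.00001])
print(compare_outputs(eager_out, compiled_out))
# {'new_nonfinite': 1, 'lost_nonfinite': 0, 'finite_allclose': True}
```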

Extra Tips

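  • Test the compiled model with a range of input magnitudes, not just this one input, to make sure the issue is fully resolved.
  • For models you train yourself, consider normalizing inputs (or using normalization layers such as batch normalization or layer normalization) so activations stay far from the float32 limit; note that adding normalization layers to an existing pretrained model changes its outputs.
  • If the issue persists, try other backends (for example backend='aot_eager') to narrow down which compilation stage introduces the Inf/NaN, and consult the torch.compile troubleshooting documentation.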
