pytorch - 💡(How to fix) Fix torch.compile(dynamic=True) on CUDA gives large output mismatch vs eager for BatchNorm2d + Conv2d [1 comments, 2 participants]

pytorch · 2026-03-22 16:11:32


GitHub stats

pytorch/pytorch#178094 • Fetched 2026-04-08 01:16:42

Author: hiahu329
Participants: github-actions[bot], hiahu329
Timeline (top): labeled ×2, closed ×1, commented ×1


Environment: CPU (from the reporter's environment report)

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: QEMU Virtual CPU version 2.5+
CPU family: 15
Model: 107
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 1
Stepping: 1
BogoMIPS: 4190.15
Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c hypervisor lahf_lm abm cpuid_fault pti bmi1 avx2 bmi2 avx512f avx512dq avx512cd avx512bw avx512vl
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.5 MiB (48 instances)
L1i cache: 1.5 MiB (48 instances)
L2 cache: 192 MiB (48 instances)
L3 cache: 16 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected


🐛 Describe the bug

Summary: For a small module (BatchNorm2d(96) → Conv2d(96→256, k=5, p=2)), torch.compile(..., dynamic=True) on CUDA produces outputs that differ massively from eager mode when using the same weights and input. Max per-element relative error vs eager is ~1.7, while float-level agreement would be ~1e-7 or smaller.

Repro: Load the fixed input.pt and a state_dict subset (bn1.*, conv2.*), call .eval(), and compare the eager forward pass to torch.compile(..., dynamic=True) on the conv2 output (same x, same sd, two separate module instances).

Expected: Compiled forward matches eager within tight float tolerance.

Actual: Large relative error (~O(1)) between compiled and eager outputs.
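The repro script measures a pure relative error, dividing by |o| + 1e-12. With that metric, any element whose eager value is close to zero can report a large relative error even when the absolute difference is tiny. The following is a minimal sketch with synthetic values (not the tensors from the issue) that contrasts this metric with torch.testing.assert_close, which combines an absolute and a relative tolerance. The helper name max_rel_err and the tolerances are illustrative assumptions, and the sketch does not establish whether this effect explains the reported mismatch.

```python
import torch

def max_rel_err(ref: torch.Tensor, out: torch.Tensor, eps: float = 1e-12) -> float:
    """Same pure relative-error metric as the repro script (hypothetical helper)."""
    return float(((out - ref).abs() / (ref.abs() + eps)).max())

# Synthetic stand-ins for eager and compiled outputs (NOT the issue's tensors).
ref = torch.tensor([10.0, -3.0, 1e-6])
out = ref + torch.tensor([1e-5, -1e-5, 2e-6])  # tiny absolute perturbation

rel = max_rel_err(ref, out)
abs_err = float((out - ref).abs().max())
print(f"max_rel={rel:.3e}  max_abs={abs_err:.3e}")

# Mixed tolerance: passes when |out - ref| <= atol + rtol * |ref| element-wise.
torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-4)
```

Reporting the maximum absolute difference alongside the relative one makes it easier to tell a genuine miscompile from a near-zero denominator.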

import os
import torch
import torch.nn as nn
_DIR = os.path.dirname(os.path.abspath(__file__))
INPUT_PT = os.path.join(_DIR, "input.pt")
SD_PT = os.path.join(_DIR, "sd.pt")
THRESHOLD = 1.19e-7
DEVICE = torch.device("cuda")

class Bn1Conv2(nn.Module):
    def __init__(self, sd: dict):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(96)
        self.conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2, bias=True)
        self.load_state_dict(sd, strict=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn1(x)
        return self.conv2(x)

def main():
    x = torch.load(INPUT_PT, map_location="cpu").float()
    print(f"input min={x.min().item():.6g}  max={x.max().item():.6g}")
    x = x.to(DEVICE)
    sd = torch.load(SD_PT, map_location="cpu")
    wmin = float("inf")
    wmax = float("-inf")
    for v in sd.values():
        if torch.is_tensor(v):
            t = v.float().cpu()
            wmin = min(wmin, float(t.min().item()))
            wmax = max(wmax, float(t.max().item()))
    print(f"weights min={wmin:.6g}  max={wmax:.6g}")

    m0 = Bn1Conv2(sd).to(DEVICE).eval()
    with torch.no_grad():
        y0 = m0(x)

    m1 = torch.compile(Bn1Conv2(sd).to(DEVICE).eval(), dynamic=True)
    with torch.no_grad():
        y1 = m1(x)

    o = y0.detach().float().cpu().reshape(y0.shape[0], -1)
    s = y1.detach().float().cpu().reshape(y1.shape[0], -1)
    eps = 1e-12
    valid = torch.isfinite(o) & torch.isfinite(s)
    diff = torch.where(valid, (s - o).abs() / (o.abs() + eps), torch.zeros_like(o))
    per = diff.max(dim=1).values
    gmax = float(per.max().item())

    print("modes: eager  vs  torch.compile(dynamic=True)")
    print(
        f"relative error: max_rel={gmax:.6e}  "
        f"threshold={THRESHOLD:.2e}"
    )

if __name__ == "__main__":
    main()

input min=-6.4548  max=29.5996
weights min=-0.0204123  max=1
modes: eager  vs  torch.compile(dynamic=True)
relative error: max_rel=1.677530e+00  threshold=1.19e-07

test4.zip

Versions

PyTorch version: 2.6.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.9.23 (main, Jun 5 2025, 13:40:20) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-90-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.6.20
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==2.0.2
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] onnx==1.19.1
[pip3] onnxruntime==1.19.2
[pip3] open_clip_torch==3.2.0
[pip3] pytorch-lightning==0.7.1
[pip3] torch==2.6.0+cu126
[pip3] torch-geometric==2.6.1
[pip3] torchaudio==2.6.0+cu126
[pip3] torchvision==0.21.0+cu126
[pip3] triton==3.2.0
[conda] numpy 2.0.2 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
[conda] open-clip-torch 3.2.0 pypi_0 pypi
[conda] pytorch-lightning 0.7.1 pypi_0 pypi
[conda] torch 2.6.0+cu126 pypi_0 pypi
[conda] torch-geometric 2.6.1 pypi_0 pypi
[conda] torchaudio 2.6.0+cu126 pypi_0 pypi
[conda] torchvision 0.21.0+cu126 pypi_0 pypi
[conda] triton 3.2.0 pypi_0 pypi

extent analysis

Fix Plan

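The following steps can help narrow down the large relative error between compiled and eager outputs. None of them is confirmed as a fix in the issue thread.

Retest on a newer PyTorch: The report was filed against torch 2.6.0+cu126 with triton 3.2.0. Rerun the repro on a current release to see whether the mismatch still reproduces.
Compare against static shapes: Set dynamic=False in torch.compile() and check whether the mismatch is specific to dynamic=True.
Keep both paths in float32: Do not enable autocast (torch.cuda.amp) for this comparison. Mixed precision lowers precision, so it widens the gap to a float32 eager baseline rather than closing it.
Check for non-finite values: Confirm that neither output contains NaNs or Infs, since the repro's metric silently masks those elements to zero.

Here's an example code snippet that covers the static-shape comparison and the non-finite check:

import os
import torch
import torch.nn as nn

_DIR = os.path.dirname(os.path.abspath(__file__))
DEVICE = torch.device("cuda")

class Bn1Conv2(nn.Module):
    def __init__(self, sd: dict):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(96)
        self.conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2, bias=True)
        self.load_state_dict(sd, strict=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn1(x)
        return self.conv2(x)

def main():
    # Same input and weights as the repro script.
    x = torch.load(os.path.join(_DIR, "input.pt"), map_location="cpu").float().to(DEVICE)
    sd = torch.load(os.path.join(_DIR, "sd.pt"), map_location="cpu")

    m0 = Bn1Conv2(sd).to(DEVICE).eval()
    # Step under test: static shapes instead of dynamic=True.
    m1 = torch.compile(Bn1Conv2(sd).to(DEVICE).eval(), dynamic=False)
    with torch.no_grad():  # no autocast, so both paths stay in float32
        y0 = m0(x)
        y1 = m1(x)

    # Check for non-finite values in both outputs.
    assert torch.isfinite(y0).all()
    assert torch.isfinite(y1).all()
    print(f"max_abs_diff={(y1 - y0).abs().max().item():.6e}")

if __name__ == "__main__":
    main()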
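One further variable worth controlling is TF32. On Ampere GPUs such as the reporter's RTX 3090, PyTorch allows TF32 in cuDNN convolutions by default, which reduces the effective mantissa precision of float32 convolutions. The issue does not establish whether TF32 contributes to this mismatch, so the sketch below is only a way to rule it out: it switches TF32 off so that both the eager and the compiled path run convolutions in full float32 before they are compared. The helper name disable_tf32 is an assumption for illustration.

```python
import torch

def disable_tf32() -> None:
    """Force full-precision float32 for cuDNN convolutions and CUDA matmuls."""
    torch.backends.cudnn.allow_tf32 = False
    torch.backends.cuda.matmul.allow_tf32 = False

# Call this once, before building and running either model.
disable_tf32()
print("cudnn TF32 allowed:", torch.backends.cudnn.allow_tf32)
print("matmul TF32 allowed:", torch.backends.cuda.matmul.allow_tf32)
```

If the mismatch shrinks noticeably with TF32 disabled, the difference is more likely a precision or kernel-selection effect than a wrong result.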

Verification

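To verify that a change helped, rerun the script and check whether the relative error between compiled and eager outputs falls within the threshold. The repro uses 1.19e-7, which is roughly float32 machine epsilon. That is a very tight bound for a convolution that sums 96 × 5 × 5 = 2,400 products per output element, so two correct implementations with different reduction orders may still exceed it.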
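When two float32 results disagree, a float64 run of the same module gives a neutral reference for judging which one is closer. The sketch below assumes a hypothetical helper, err_vs_fp64, and uses a small stand-in module with the same layer types; the channel counts and input shape are illustrative and are not the ones from the issue. It runs on CPU so that it stays self-contained.

```python
import copy
import torch
import torch.nn as nn

def err_vs_fp64(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    """Max abs error of y against a float64 CPU run of the same (uncompiled) model."""
    ref_model = copy.deepcopy(model).to("cpu").double().eval()
    with torch.no_grad():
        ref = ref_model(x.detach().to("cpu").double())
    return float((y.detach().to("cpu").double() - ref).abs().max())

# Small stand-in with the same layer types (sizes are illustrative only).
torch.manual_seed(0)
model = nn.Sequential(
    nn.BatchNorm2d(4),
    nn.Conv2d(4, 8, kernel_size=5, padding=2, bias=True),
).eval()
x = torch.randn(2, 4, 16, 16)
with torch.no_grad():
    y_eager = model(x)

eager_err = err_vs_fp64(model, x, y_eager)
print(f"eager vs float64 reference: max_abs_err={eager_err:.3e}")
```

In the repro, the same helper could be called twice with the uncompiled module m0, once with y0 and once with y1, to see whether the eager or the compiled output sits further from the float64 reference.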

Extra Tips

Ensure that the input data does not contain NaNs or Infs.
torch.float is an alias for torch.float32, so switching between the two has no effect on numerical error.
Lower-precision dtypes such as torch.half or torch.bfloat16 increase rounding error rather than reduce it. To judge which output is closer to the true result, compare both against a float64 reference instead.
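For reference on the dtypes named in these tips, torch.finfo reports the machine epsilon of each floating-point type. This short sketch prints them side by side; it shows that the repro's THRESHOLD of 1.19e-7 corresponds to float32 epsilon and that half and bfloat16 are several orders of magnitude coarser.

```python
import torch

# Machine epsilon per dtype: the gap between 1.0 and the next representable value.
for dtype in (torch.float64, torch.float32, torch.float16, torch.bfloat16):
    print(f"{str(dtype):15s} eps={torch.finfo(dtype).eps:.3e}")

# torch.float is the same object as torch.float32, not a different precision.
assert torch.float is torch.float32
```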
