pytorch - 💡(How to fix) Fix Large numerical discrepancy in torch.renorm between CPU and CUDA

pytorch · 2026-04-09 03:21:29

🐛 Describe the bug

import torch
import torch.nn as nn

torch.manual_seed(0)

fc1 = nn.Linear(8, 8)
fc2 = nn.Linear(8, 8)

def forward(model_fc1, model_fc2, device):
    x = torch.randn(4, 8, device=device)
    y = model_fc1(x)
    z = torch.sin(y) * torch.cos(y)
    w = torch.log1p(z.abs())
    t = torch.renorm(w, p=2, dim=0, maxnorm=10.0)
    s = torch.sin(t)
    return model_fc2(s.detach())

# CPU
cpu_out = forward(fc1, fc2, 'cpu')

# GPU
fc1_g = nn.Linear(8, 8).cuda(); fc1_g.load_state_dict(fc1.state_dict())
fc2_g = nn.Linear(8, 8).cuda(); fc2_g.load_state_dict(fc2.state_dict())

gpu_out = forward(fc1_g, fc2_g, 'cuda')

diff = (cpu_out - gpu_out.cpu()).abs()
print("max diff:", diff.max())

Versions

2.9.1+cu128 (PyTorch 2.9.1, CUDA 12.8)

While small numerical differences between CPU and CUDA are expected, the magnitude of discrepancy here (~1e-1 after a simple pipeline) appears unusually large.

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia

extent analysis

TL;DR

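The ~1e-1 gap is most likely an artifact of the reproduction script rather than of torch.renorm. forward() draws its input with torch.randn(4, 8, device=device), and the CPU and CUDA random number generators do not produce the same values for the same seed, so the two runs start from different inputs. Rerunning the comparison with a single shared input should show whether any real CPU/CUDA discrepancy remains.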

Guidance

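Feed both devices the same input. forward() calls torch.randn(4, 8, device=device), so the CPU and CUDA runs sample x from different generators; create x once on the CPU and copy it to the GPU with x.to('cuda').
Once the inputs match, compare the intermediate tensors (y, w, t) stage by stage, or remove torch.sin, torch.cos, and torch.log1p, to see whether torch.renorm itself contributes any difference.
If a discrepancy remains with identical inputs, compare against other PyTorch or CUDA builds to see whether it is specific to 2.9.1+cu128. That build is an official PyTorch and CUDA pairing, so a version incompatibility is an unlikely explanation on its own.
torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False only affect cuDNN-backed operations such as convolutions. They make repeated GPU runs reproducible but are not expected to change CPU vs CUDA agreement for this script.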
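
A minimal sketch of the comparison with the input held fixed across devices. The helper name pipeline, the use of copy.deepcopy to clone the layers, and the fallback to a second CPU copy when no GPU is present are assumptions made for illustration; they are not part of the original report.

```python
import copy

import torch
import torch.nn as nn


def pipeline(fc1, fc2, x):
    # Same operations as the report's forward(), but the input is passed in
    # instead of being sampled inside the function.
    y = fc1(x)
    z = torch.sin(y) * torch.cos(y)
    w = torch.log1p(z.abs())
    t = torch.renorm(w, p=2, dim=0, maxnorm=10.0)
    s = torch.sin(t)
    return fc2(s.detach())


torch.manual_seed(0)
fc1 = nn.Linear(8, 8)
fc2 = nn.Linear(8, 8)
x_cpu = torch.randn(4, 8)  # drawn once, from the CPU generator

# Assumption: fall back to a second CPU copy when CUDA is unavailable,
# so the script still runs end to end.
other = "cuda" if torch.cuda.is_available() else "cpu"
fc1_o = copy.deepcopy(fc1).to(other)
fc2_o = copy.deepcopy(fc2).to(other)

with torch.no_grad():
    ref = pipeline(fc1, fc2, x_cpu)
    out = pipeline(fc1_o, fc2_o, x_cpu.to(other)).cpu()

max_diff = (ref - out).abs().max().item()
print("max diff with shared input:", max_diff)
```

If the difference shrinks to float32 rounding level with a shared input, the original ~1e-1 gap came from the differing random inputs and not from any operator.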

Example

The reproduction script in the bug report is the relevant example. The discrepancy concerns the numerical outputs on CPU and CUDA, not a syntax or API error in the code.

Notes

The discrepancy may not come from the mathematical operations themselves, and further investigation is needed to confirm the root cause. The detach() call in forward() only removes the tensor from the autograd graph; it does not change forward values, so it is unnecessary for this comparison but cannot explain the difference.
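
One way to check whether torch.renorm is involved at all is to look at the norms it operates on. Since |sin(y) * cos(y)| = |sin(2y)| / 2 is at most 0.5, every element of w is at most log1p(0.5), and each 8-element row has an L2 norm of at most sqrt(8) * log1p(0.5), roughly 1.15, which is well below maxnorm=10.0. Under that bound renorm should leave w unchanged. The sketch below uses a random y as a hypothetical stand-in for the output of fc1.

```python
import math

import torch

torch.manual_seed(0)
y = torch.randn(4, 8) * 3.0  # hypothetical stand-in for fc1(x)
w = torch.log1p((torch.sin(y) * torch.cos(y)).abs())

# With dim=0, renorm treats each w[i] (a row of 8 values) as one sub-tensor.
row_norms = w.norm(p=2, dim=1)
bound = math.sqrt(8) * math.log1p(0.5)

t = torch.renorm(w, p=2, dim=0, maxnorm=10.0)

print("largest row norm:", row_norms.max().item())
print("analytic upper bound:", bound)
print("renorm left w unchanged:", torch.allclose(t, w))
```

If renorm is a no-op for these inputs, it is unlikely to be the source of a CPU vs CUDA difference in this pipeline.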

Recommendation

Before treating this as a torch.renorm bug, rerun the comparison with one input tensor shared by both devices, then narrow the pipeline one operation at a time. The root cause has not been confirmed.
