pytorch - ✅(Solved) Fix GPU two-pass variance: intermediate squared values overflow float32. CPU Welford online algorithm avoids overflow. [1 pull request, 1 comment, 2 participants]

pytorch · 2026-04-12 00:23:19

GitHub stats

pytorch/pytorch#180156 • Fetched 2026-04-12 13:23:35

Author

beanduan22

Participants

beanduan22

sakshar2303

Timeline (top)

mentioned ×6, subscribed ×6, labeled ×5, commented ×1

PR fix notes

PR #3337: Prevent float32 `torch.std`/`torch.var` overflow on XPU for large-magnitude inputs

Repository: intel/torch-xpu-ops
Author: Copilot
State: open | merged: False
Link: https://github.com/intel/torch-xpu-ops/pull/3337

Description (problem / solution / changelog)

torch.std on XPU could return inf for large float32 values while CPU remained finite, due to overflow in reduction intermediates. This change updates the XPU std/var reduction path to preserve numerical stability for large-magnitude float32 inputs.

Kernel accumulation precision update (XPU std/var path)
- In src/ATen/native/xpu/sycl/ReduceMomentKernels.cpp, the Welford reduction template was generalized to allow an explicit accumulator type.
- Added a dedicated float32 dispatch branch that uses double-precision accumulation (std_var_template<float, double>(...)) while keeping tensor/result dtype behavior unchanged.

Regression coverage for overflow scenario
- In test/regressions/test_rand.py, added a targeted regression test for large-magnitude float32 input on XPU.
- The test verifies both torch.std and torch.var stay finite and numerically match CPU for the reproducer pattern.

x = torch.randn(1000, dtype=torch.float32) * 1e19 + 1e20
cpu_std = torch.std(x)
xpu_std = torch.std(x.to("xpu")).cpu()
cpu_var = torch.var(x)
xpu_var = torch.var(x.to("xpu")).cpu()
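
The effect of widening the accumulator, as the PR description outlines, can be illustrated without any GPU. The following is a minimal pure-Python sketch, not the SYCL kernel from the PR: float32 arithmetic is emulated by rounding through ctypes.c_float, and the names to_f32 and welford_variance are hypothetical helpers introduced only for this illustration.

```python
import ctypes
import math
import random


def to_f32(value):
    """Round a Python float (float64) to float32; out-of-range values become inf."""
    return ctypes.c_float(value).value


def welford_variance(values, cast):
    """Sample variance via Welford's update; `cast` sets the accumulator width."""
    mean, m2 = 0.0, 0.0
    for count, v in enumerate(values, start=1):
        delta = cast(v - mean)
        mean = cast(mean + cast(delta / count))
        # m2 is the running sum of squared deviations; this is what overflows.
        m2 = cast(m2 + cast(delta * cast(v - mean)))
    return cast(m2 / (len(values) - 1))


# Same shape of data as the reproducer: float32 values near 1e20, spread about 1e19.
rng = random.Random(0)
data = [to_f32(rng.gauss(0.0, 1.0) * 1e19 + 1e20) for _ in range(1000)]

var_f32_acc = welford_variance(data, cast=to_f32)  # float32 accumulators
var_f64_acc = welford_variance(data, cast=float)   # float64 accumulators
var_f64_acc_as_f32 = to_f32(var_f64_acc)           # result stored back as float32

print("float32 accumulators:", var_f32_acc)
print("float64 accumulators, float32 result:", var_f64_acc_as_f32)
```

With float32 accumulators the running sum of squared deviations exceeds the float32 range and the variance becomes inf. With float64 accumulators the sum stays finite, and the final variance still fits when stored back as float32, which matches the PR's note that the tensor/result dtype is unchanged.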

Changed files

src/ATen/native/xpu/sycl/ReduceMomentKernels.cpp (modified, +6/-2)
test/regressions/test_rand.py (modified, +16/-0)
test/xpu/skip_list_common.py (modified, +10/-2)

🐛 Describe the bug

import math

import torch

torch.manual_seed(0)
# float32 values centred near 1e20 with a spread of about 1e19
x = torch.randn(1000, dtype=torch.float32) * 1e19 + 1e20

ref = torch.std(x.double()).item()      # float64 reference
cpu = torch.std(x).item()               # float32 on CPU
gpu = torch.std(x.cuda()).cpu().item()  # float32 on GPU

# The bug is present when the CPU result is not inf but the GPU result is.
status = "BUG" if not math.isinf(cpu) and math.isinf(gpu) else "ok"

print(f"Reference (float64): {ref:.4e}")
print(f"CPU (float32):       {cpu:.4e}")
print(f"GPU (float32):       {gpu}   <-- {status}")

Versions

Reference (float64): 1.0287e+19
CPU (float32):       1.0287e+19
GPU (float32):       inf   <-- BUG
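
The reported output is consistent with an overflow in an intermediate sum rather than in the final result. A back-of-the-envelope range check, using only the reproducer's parameters (1000 elements, spread of about 1e19) and the float32 maximum, is sketched below. The variable names are illustrative and the figures are order-of-magnitude estimates, not measurements.

```python
# Rough range check for the reproducer (pure Python, no GPU needed).
FLOAT32_MAX = (2 - 2**-23) * 2.0**127  # largest finite float32, about 3.4028e38

n = 1000       # number of elements in the reproducer
sigma = 1e19   # approximate standard deviation of the reproducer data

variance = sigma**2              # about 1e38: the value torch.var should return
sum_sq_dev = (n - 1) * variance  # about 1e41: the running sum before division

result_fits = variance < FLOAT32_MAX
sum_fits = sum_sq_dev < FLOAT32_MAX

print(f"float32 max:               {FLOAT32_MAX:.4e}")
print(f"expected variance:         {variance:.4e}  fits in float32: {result_fits}")
print(f"sum of squared deviations: {sum_sq_dev:.4e}  fits in float32: {sum_fits}")
```

The variance itself is representable in float32, but the sum of squared deviations that precedes the division is not, so any reduction that holds that sum in float32 can return inf.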

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia

extent analysis

TL;DR

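As a workaround, compute the standard deviation in torch.float64 on the GPU; its wider exponent range keeps the squared intermediates finite. The linked PR applies the same idea inside the XPU kernel by accumulating float32 inputs in double precision.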

Guidance

The likely cause is overflow rather than loss of precision: the reproducer's variance (about 1e38) fits in float32, whose largest finite value is about 3.4e38, but the accumulated squared deviations (about 1e41 for 1000 elements) do not, so a reduction that accumulates in float32 returns inf.
To verify, compare torch.std(x.cuda()) with torch.std(x.cuda().double()); if only the float32 result is inf, the overflow is in the float32 intermediates.
To mitigate, upcast to torch.float64 before the reduction, or divide the tensor by a scale factor, take the standard deviation, and multiply the result back by the same factor.
Record the GPU model and CUDA version when reporting, since the behaviour may differ across backends and releases.
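
The two mitigations in the guidance can be sketched as small helpers. This is a minimal sketch under stated assumptions: upcast_std and rescaled_std are hypothetical names (not PyTorch APIs), the scale factor is chosen as the largest absolute value in the tensor, and the input is assumed to be non-empty and free of inf/nan. The code falls back to CPU when CUDA is unavailable.

```python
import math

import torch


def upcast_std(x):
    """Workaround 1: reduce in float64, then return the result in the input dtype."""
    return torch.std(x.double()).to(x.dtype)


def rescaled_std(x):
    """Workaround 2: use std(x) == s * std(x / s) so the squared terms stay small."""
    # Scale by the largest magnitude; clamp so an all-zero tensor does not divide by 0.
    scale = x.abs().max().clamp_min(torch.finfo(x.dtype).tiny)
    return torch.std(x / scale) * scale


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
x = torch.randn(1000, dtype=torch.float32) * 1e19 + 1e20

ref = torch.std(x.double()).item()  # float64 reference on CPU
xd = x.to(device)

print(f"reference (float64): {ref:.4e}")
print(f"plain float32 std:   {torch.std(xd).item()}")
print(f"upcast_std:          {upcast_std(xd).item():.4e}")
print(f"rescaled_std:        {rescaled_std(xd).item():.4e}")
```

Upcasting costs extra memory and bandwidth for the temporary float64 copy, while rescaling stays in float32 at the cost of one additional pass to find the scale. Which is preferable depends on tensor size and on how well the hardware handles float64.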

Example

import torch

# Requires a CUDA device; the whole computation runs in float64.
x = torch.randn(1000, dtype=torch.float64) * 1e19 + 1e20
gpu = torch.std(x.cuda()).cpu().item()
print(f"GPU (float64): {gpu:.4e}")

Notes

This issue may not apply to all GPU architectures or CUDA versions, and the fix may depend on the specific hardware and software configuration.

Recommendation

Apply the workaround: upcast to torch.float64 for std/var reductions over large-magnitude values. Its wider exponent range keeps the squared intermediates finite, and the result can be cast back to float32 afterwards.
