pytorch - ✅(Solved) Fix GPU two-pass variance: intermediate squared values overflow float32. CPU Welford online algorithm avoids overflow. [1 pull requests, 1 comments, 2 participants]

GitHub stats
pytorch/pytorch#180156 · Fetched 2026-04-12 13:23:35
Comments: 1
Participants: 2
Timeline events: 18
Reactions: 0
Timeline (top): mentioned ×6, subscribed ×6, labeled ×5, commented ×1

PR fix notes

PR #3337: Prevent float32 torch.std/torch.var overflow on XPU for large-magnitude inputs

Description (problem / solution / changelog)

torch.std on XPU could return inf for large float32 values while CPU remained finite, due to overflow in reduction intermediates. This change updates the XPU std/var reduction path to preserve numerical stability for large-magnitude float32 inputs.

  • Kernel accumulation precision update (XPU std/var path)

    • In src/ATen/native/xpu/sycl/ReduceMomentKernels.cpp, the Welford reduction template was generalized to allow an explicit accumulator type.
    • Added a dedicated float32 dispatch branch that uses double-precision accumulation (std_var_template<float, double>(...)) while keeping tensor/result dtype behavior unchanged.
  • Regression coverage for overflow scenario

    • In test/regressions/test_rand.py, added a targeted regression test for large-magnitude float32 input on XPU.
    • The test verifies both torch.std and torch.var stay finite and numerically match CPU for the reproducer pattern.

import torch

# Reproducer pattern: float32 values near 1e20 with a spread of about 1e19.
x = torch.randn(1000, dtype=torch.float32) * 1e19 + 1e20
cpu_std = torch.std(x)
xpu_std = torch.std(x.to("xpu")).cpu()
cpu_var = torch.var(x)
xpu_var = torch.var(x.to("xpu")).cpu()
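
The effect of the accumulator type described in the PR notes can be illustrated without any GPU. The following is a minimal sketch, not the SYCL kernel: it runs a Welford update in pure Python and emulates a float32 accumulator by rounding every intermediate through `struct`. The helper names `to_f32` and `welford_std` are hypothetical and exist only for this illustration.

```python
import math
import random
import struct


def to_f32(v: float) -> float:
    """Round a Python float (float64) to float32; out-of-range values become +/-inf."""
    try:
        return struct.unpack("f", struct.pack("f", v))[0]
    except OverflowError:
        return math.copysign(math.inf, v)


def welford_std(values, rnd):
    """Welford sample std; `rnd` rounds each accumulator update to the accumulator type."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = rnd(x - mean)
        mean = rnd(mean + rnd(delta / count))
        m2 = rnd(m2 + rnd(delta * rnd(x - mean)))  # running sum of squared deviations
    return math.sqrt(m2 / (count - 1))


rng = random.Random(0)
# Same shape of data as the reproducer: float32 values near 1e20, spread about 1e19.
xs = [to_f32(rng.gauss(0.0, 1.0) * 1e19 + 1e20) for _ in range(1000)]

std_f32_acc = welford_std(xs, to_f32)        # emulated float32 accumulator
std_f64_acc = welford_std(xs, lambda v: v)   # float64 accumulator (plain Python floats)

print(f"float32 accumulator: {std_f32_acc}")
print(f"float64 accumulator: {std_f64_acc:.4e}")
```

In this emulation the float32 accumulator ends at inf because the running sum of squared deviations leaves the float32 range, while the float64 accumulator stays finite for the same float32 inputs. That matches the direction of the change described above (float32 tensors, double-precision accumulation), though the real kernel may differ in detail.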

Changed files

  • src/ATen/native/xpu/sycl/ReduceMomentKernels.cpp (modified, +6/-2)
  • test/regressions/test_rand.py (modified, +16/-0)
  • test/xpu/skip_list_common.py (modified, +10/-2)


🐛 Describe the bug

import math

import torch

torch.manual_seed(0)
x = torch.randn(1000, dtype=torch.float32) * 1e19 + 1e20

ref = torch.std(x.double()).item()       # float64 reference
cpu = torch.std(x).item()                # float32 on CPU
gpu = torch.std(x.cuda()).cpu().item()   # float32 on GPU

# Flag the bug when the CPU result is not inf but the GPU result is.
status = "BUG" if not math.isinf(cpu) and math.isinf(gpu) else "ok"

print(f"Reference (float64): {ref:.4e}")
print(f"CPU (float32):       {cpu:.4e}")
print(f"GPU (float32):       {gpu}   <-- {status}")

Versions

Reference (float64): 1.0287e+19
CPU (float32):       1.0287e+19
GPU (float32):       inf   <-- BUG
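
A quick back-of-the-envelope check shows why a float32 reduction can return inf here even though the final answer fits in float32. The sketch below uses only the numbers from the report (1000 elements, offset 1e20, reference std 1.0287e+19); the variable names are illustrative.

```python
# Largest finite float32 value, about 3.4e38.
FLT_MAX = (2 - 2**-23) * 2.0**127

n = 1000
mean_mag = 1e20      # offset used in the reproducer
std = 1.0287e19      # reference standard deviation from the report

x_squared = mean_mag**2            # a typical squared input value
sum_sq_dev = (n - 1) * std**2      # sum of squared deviations a variance reduction accumulates
variance = std**2                  # final variance

print(f"FLT_MAX             = {FLT_MAX:.4e}")
print(f"x**2                = {x_squared:.4e}  exceeds FLT_MAX: {x_squared > FLT_MAX}")
print(f"sum of sq. devs     = {sum_sq_dev:.4e}  exceeds FLT_MAX: {sum_sq_dev > FLT_MAX}")
print(f"variance            = {variance:.4e}  exceeds FLT_MAX: {variance > FLT_MAX}")
```

Both the squared inputs and the accumulated sum of squared deviations exceed the float32 range, while the variance and the standard deviation themselves do not. The problem is therefore in the intermediates, not in the result type.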

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia

extent analysis

TL;DR

torch.std on the GPU returns inf for this input because intermediate squared values overflow float32, even though the final result is representable. A workaround is to compute the standard deviation in torch.float64 on the GPU.

Guidance

  • The likely cause is overflow rather than loss of precision: with values near 1e20, intermediate squared values (about 1e40) exceed the float32 maximum of roughly 3.4e38.
  • To verify, compare torch.std(x.cuda()) with torch.std(x.double().cuda()); the float64 result should be finite and match the CPU value.
  • To mitigate, upcast to torch.float64 before reducing, or divide the input by a scale factor, compute the standard deviation, and multiply the result by the same factor.
  • Check whether the behavior differs across PyTorch versions and GPU backends (CUDA, XPU), since the reduction kernels are backend-specific.
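
The scaling mitigation in the list above relies on the identity std(x) = s * std(x / s). The following is a rough sketch of it in pure Python, with float32 arithmetic emulated through `struct`; the helpers `to_f32` and `two_pass_std_f32` and the choice of `SCALE` are assumptions made for illustration, not part of PyTorch. A power of two is used as the scale so that the division itself introduces no rounding error.

```python
import math
import random
import struct


def to_f32(v: float) -> float:
    """Round a Python float (float64) to float32; out-of-range values become +/-inf."""
    try:
        return struct.unpack("f", struct.pack("f", v))[0]
    except OverflowError:
        return math.copysign(math.inf, v)


def two_pass_std_f32(values):
    """Two-pass sample std with every intermediate rounded to float32."""
    n = len(values)
    total = 0.0
    for x in values:
        total = to_f32(total + x)
    mean = to_f32(total / n)
    sq = 0.0
    for x in values:
        d = to_f32(x - mean)
        sq = to_f32(sq + to_f32(d * d))  # squared deviations accumulate here
    return to_f32(math.sqrt(to_f32(sq / (n - 1))))


rng = random.Random(0)
xs = [to_f32(rng.gauss(0.0, 1.0) * 1e19 + 1e20) for _ in range(1000)]

SCALE = 2.0**64  # hypothetical scale, about 1.8e19; a power of two keeps x / SCALE exact

raw_std = two_pass_std_f32(xs)                               # overflows in float32
scaled_std = to_f32(two_pass_std_f32([x / SCALE for x in xs]) * SCALE)

print(f"unscaled float32: {raw_std}")
print(f"scaled float32:   {scaled_std:.4e}")
```

Scaling keeps the computation in float32 but requires knowing the rough magnitude of the data in advance; upcasting to float64 avoids that choice at the cost of extra memory and bandwidth.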

Example

import torch

x = torch.randn(1000, dtype=torch.float32) * 1e19 + 1e20
# Upcast on the device so the reduction runs in float64.
gpu = torch.std(x.cuda().double()).item()
print(f"GPU (float64): {gpu:.4e}")

Notes

This issue may not apply to all GPU architectures or CUDA versions, and the fix may depend on the specific hardware and software configuration.

Recommendation

Apply workaround: upcast to torch.float64 for reductions over large-magnitude values. Its wider exponent range keeps the intermediate squared values finite.
