pytorch: torch.fmax with uint8 out produces inconsistent overflow casting on CPU and CUDA

GitHub stats
pytorch/pytorch#181805 · Fetched 2026-04-29 06:10:55
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

torch.fmax produces inconsistent results between CPU and CUDA when the inputs are int64 tensors and the out tensor has dtype uint8.

For the same operation, CPU writes 160 to the uint8 output, while CUDA writes 255.

This suggests that CPU and CUDA use different overflow conversion behavior when writing the int64 result into a narrower unsigned integer out tensor.

Code Example

import torch

lhs = torch.tensor([100000], dtype=torch.int64)
rhs = torch.tensor([-1], dtype=torch.int64)

out_cpu = torch.empty(1, dtype=torch.uint8)
out_cuda = torch.empty(1, dtype=torch.uint8, device="cuda")

torch.fmax(lhs, rhs, out=out_cpu)
torch.fmax(lhs.cuda(), rhs.cuda(), out=out_cuda)

print("cpu:", out_cpu)
print("cuda:", out_cuda.cpu())

Output:

cpu: tensor([160], dtype=torch.uint8)
cuda: tensor([255], dtype=torch.uint8)

Interpretation:

CPU result : 160  # consistent with wraparound / modulo 256 (100000 % 256 == 160)
CUDA result: 255  # consistent with saturation to uint8 max
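The two observed values match two common narrowing rules. The sketch below reproduces the arithmetic in plain Python. The helper names `wrap_to_uint8` and `saturate_to_uint8` are hypothetical and only illustrate the two rules; they are not a claim about how the CPU or CUDA kernels are implemented.

```python
UINT8_MAX = 255


def wrap_to_uint8(value: int) -> int:
    """Keep only the low 8 bits of the value (reduce modulo 256)."""
    return value % 256


def saturate_to_uint8(value: int) -> int:
    """Clamp the value into the representable range [0, 255]."""
    return max(0, min(UINT8_MAX, value))


# fmax of the two integer inputs from the reproduction selects 100000.
selected = max(100000, -1)

print("wraparound:", wrap_to_uint8(selected))      # 160, the CPU result
print("saturation:", saturate_to_uint8(selected))  # 255, the CUDA result
```

Since 100000 = 390 * 256 + 160, wraparound yields 160, while saturation pins any value above 255 to 255.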
🐛 Describe the bug

Expected behavior

CPU and CUDA should use consistent casting behavior when writing the result of torch.fmax into an out tensor with dtype uint8.

The mathematically selected value is 100000, but this value cannot be represented by uint8. The operation should either:

  1. produce the same converted value on both backends, or
  2. document that overflow behavior for integer out tensors is backend-dependent.

Why this seems like a bug

This is not a floating-point tolerance issue. The inputs are integer tensors, and the selected maximum value is exactly 100000.

The discrepancy appears only when the result is written into a narrower uint8 output tensor:

CPU result : 160  # consistent with wraparound / modulo 256
CUDA result: 255  # consistent with saturation to uint8 max

This indicates inconsistent integer overflow casting semantics between CPU and CUDA for torch.fmax(..., out=...).
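To check whether a particular build shows the same disagreement, a small harness along these lines may help. It is a sketch under stated assumptions: `fmax_out_by_device` is a hypothetical helper name, `lhs` and `rhs` are assumed to have the same shape, and CUDA is only included when `torch.cuda.is_available()` reports a device.

```python
import torch


def fmax_out_by_device(lhs, rhs, out_dtype=torch.uint8):
    """Run torch.fmax(..., out=...) on each available device and collect results."""
    devices = ["cpu"]
    if torch.cuda.is_available():
        devices.append("cuda")

    results = {}
    for device in devices:
        # Allocate the narrow output on the same device as the inputs.
        out = torch.empty(lhs.shape, dtype=out_dtype, device=device)
        torch.fmax(lhs.to(device), rhs.to(device), out=out)
        results[device] = out.cpu()
    return results


lhs = torch.tensor([100000], dtype=torch.int64)
rhs = torch.tensor([-1], dtype=torch.int64)

results = fmax_out_by_device(lhs, rhs)
for device, value in results.items():
    print(device, value.tolist())

# True when every available backend produced the same values as the CPU.
consistent = all(torch.equal(v, results["cpu"]) for v in results.values())
print("consistent:", consistent)
```

On a machine without CUDA only the CPU entry is produced, so `consistent` is trivially true there; the comparison is only meaningful when both backends run.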



### Versions

PyTorch 2.11.0+cu128, CUDA 12.8

extent analysis

TL;DR

torch.fmax with int64 inputs and a uint8 out tensor gives different results on CPU (160) and CUDA (255) when the selected value does not fit in uint8. Until the backends agree, write into an output dtype wide enough to hold the result, or treat the overflow behavior as backend-dependent.

Guidance

  • Run the reproduction code to confirm that CPU and CUDA disagree in your environment.
  • Prefer an output dtype that can represent every possible result, such as torch.int32 or torch.int64. Note that torch.uint16 (maximum 65535) cannot hold 100000.
  • If a uint8 result is required, compute in the wider dtype and convert explicitly (clamp to [0, 255] or reduce modulo 256) before casting, so that the outcome does not depend on the backend.
  • Check the PyTorch documentation and release notes for changes to integer overflow handling in torch.fmax.
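For cases where a uint8 result is genuinely required, one backend-independent option is to let torch.fmax produce its int64 result and then narrow it with an explicitly chosen rule, so the final cast only ever sees in-range values. This is a minimal sketch, not a fix from the PyTorch project; `fmax_to_uint8` and its `mode` parameter are hypothetical names introduced here.

```python
import torch


def fmax_to_uint8(lhs, rhs, mode="saturate"):
    """Compute fmax in the input dtype, then narrow to uint8 with an explicit rule."""
    result = torch.fmax(lhs, rhs)  # stays int64 for int64 inputs

    if mode == "saturate":
        # Clamp into [0, 255] so the cast below is always in range.
        narrowed = result.clamp(0, 255)
    elif mode == "wrap":
        # Reduce modulo 256; torch.remainder is non-negative for a positive divisor.
        narrowed = torch.remainder(result, 256)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return narrowed.to(torch.uint8)


lhs = torch.tensor([100000], dtype=torch.int64)
rhs = torch.tensor([-1], dtype=torch.int64)

print(fmax_to_uint8(lhs, rhs, mode="saturate"))  # tensor([255], dtype=torch.uint8)
print(fmax_to_uint8(lhs, rhs, mode="wrap"))      # tensor([160], dtype=torch.uint8)
```

Because the clamp or remainder runs before the cast, the conversion to uint8 never has to handle an out-of-range value, which is the step where the two backends were observed to differ.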

Example

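# Using an output dtype wide enough to hold the result (100000 does not fit in uint16)
import torch

lhs = torch.tensor([100000], dtype=torch.int64)
rhs = torch.tensor([-1], dtype=torch.int64)

out_cpu = torch.empty(1, dtype=torch.int64)
out_cuda = torch.empty(1, dtype=torch.int64, device="cuda")
torch.fmax(lhs, rhs, out=out_cpu)
torch.fmax(lhs.cuda(), rhs.cuda(), out=out_cuda)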

Notes

The issue was reported against PyTorch 2.11.0+cu128 with CUDA 12.8. It is unclear whether the discrepancy also occurs in other versions.
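When testing other versions, it may help to record the environment next to each result. The snippet below is a small sketch; `describe_environment` is a hypothetical helper name, and it only reads version attributes that PyTorch exposes.

```python
import torch


def describe_environment():
    """Collect the version details relevant to comparing CPU and CUDA results."""
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_environment())
```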

Recommendation

Apply the workaround: use an output dtype wide enough to hold the result, and otherwise treat out-of-range conversion as backend-dependent. The issue does not mention a release that fixes this, so upgrading is not a known remedy.
