pytorch - torch.fmax with uint8 out produces inconsistent overflow casting on CPU and CUDA [1 participant]

pytorch • 2026-04-28 23:59:22

pytorch/pytorch#181805 • Fetched 2026-04-29 06:10:55

Author

beanduan22

Participants

beanduan22

🐛 Describe the bug

Summary

torch.fmax produces inconsistent results between CPU and CUDA when the inputs are int64 tensors and the out tensor has dtype uint8.

For the same operation, CPU writes 160 to the uint8 output, while CUDA writes 255.

This suggests that CPU and CUDA use different overflow conversion behavior when writing the int64 result into a narrower unsigned integer out tensor.

Reproduction

import torch

lhs = torch.tensor([100000], dtype=torch.int64)
rhs = torch.tensor([-1], dtype=torch.int64)

out_cpu = torch.empty(1, dtype=torch.uint8)
out_cuda = torch.empty(1, dtype=torch.uint8, device="cuda")

torch.fmax(lhs, rhs, out=out_cpu)
torch.fmax(lhs.cuda(), rhs.cuda(), out=out_cuda)

print("cpu:", out_cpu)
print("cuda:", out_cuda.cpu())

Actual output

cpu: tensor([160], dtype=torch.uint8)
cuda: tensor([255], dtype=torch.uint8)

Expected behavior

CPU and CUDA should use consistent casting behavior when writing the result of torch.fmax into an out tensor with dtype uint8.

The mathematically selected value is 100000, but this value cannot be represented by uint8. The operation should either:

- produce the same converted value on both backends, or
- document that overflow behavior for integer out tensors is backend-dependent.

Why this seems like a bug

This is not a floating-point tolerance issue. The inputs are integer tensors, and the selected maximum value is exactly 100000.

The discrepancy appears only when the result is written into a narrower uint8 output tensor:

CPU result : 160  # consistent with wraparound / modulo 256
CUDA result: 255  # consistent with saturation to uint8 max

This indicates inconsistent integer overflow casting semantics between CPU and CUDA for torch.fmax(..., out=...).
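The two observed values can be checked by hand. The following is a minimal pure-Python sketch of the two candidate conversions; the helper names `wrap_to_uint8` and `saturate_to_uint8` are hypothetical and only illustrate the arithmetic, not how either backend is implemented.

```python
def wrap_to_uint8(value: int) -> int:
    """Keep only the low 8 bits (modulo 256), as in two's-complement truncation."""
    return value % 256


def saturate_to_uint8(value: int) -> int:
    """Clamp the value into the representable uint8 range [0, 255]."""
    return max(0, min(255, value))


# fmax on integers simply selects the larger operand.
selected = max(100000, -1)

print("wraparound:", wrap_to_uint8(selected))      # 100000 - 390 * 256 = 160
print("saturation:", saturate_to_uint8(selected))  # 100000 > 255, so 255
```

The wraparound result matches the CPU output and the saturation result matches the CUDA output reported above.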



Versions

PyTorch 2.11.0+cu128, CUDA 12.8

extent analysis

TL;DR

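CPU and CUDA disagree on how torch.fmax narrows an out-of-range int64 result into a uint8 out tensor: CPU wraps to 160, CUDA saturates to 255. Until the backends agree or the behavior is documented, avoid relying on the implicit conversion and write into an out tensor whose dtype can represent the result, such as int64 to match the inputs.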

Guidance

- Run the reproduction on your own PyTorch build to confirm that CPU and CUDA disagree.
- Prefer an out dtype that can represent every possible result. For int64 inputs that means int64. torch.uint16 is not enough here, because its maximum is 65535 and the selected value is 100000; unsigned types other than uint8 are also documented as having limited operator support in PyTorch.
- If a uint8 result is required, compute the maximum in int64 and make the narrowing explicit (clamp or mask the value into 0..255 before casting) so that the outcome does not depend on the backend.
- Check the PyTorch release notes and the upstream issue (pytorch/pytorch#181805) for changes to out= casting behavior in torch.fmax.
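One way to catch this class of problem before it reaches an out tensor is to test whether the values fit the target dtype. The sketch below uses `torch.iinfo` for the range; the helper name `fits_in_dtype` is hypothetical and not part of PyTorch.

```python
import torch


def fits_in_dtype(values: torch.Tensor, dtype: torch.dtype) -> bool:
    """Return True if every element of an integer tensor is representable in `dtype`."""
    info = torch.iinfo(dtype)  # integer dtypes only
    in_range = (values >= info.min) & (values <= info.max)
    return bool(in_range.all())


lhs = torch.tensor([100000], dtype=torch.int64)
rhs = torch.tensor([-1], dtype=torch.int64)
result = torch.fmax(lhs, rhs)  # computed in int64, no narrowing yet

print(fits_in_dtype(result, torch.uint8))  # False: 100000 is outside [0, 255]
print(fits_in_dtype(result, torch.int64))  # True
```

A check like this could guard a call that passes a narrower out tensor, raising or falling back to a wider dtype when it returns False.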

Example

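# Use an out dtype that can hold the result; uint16 (max 65535) would still overflow for 100000
import torch

lhs = torch.tensor([100000], dtype=torch.int64)
rhs = torch.tensor([-1], dtype=torch.int64)
out_cpu = torch.empty(1, dtype=torch.int64)
out_cuda = torch.empty(1, dtype=torch.int64, device="cuda")
torch.fmax(lhs, rhs, out=out_cpu)
torch.fmax(lhs.cuda(), rhs.cuda(), out=out_cuda)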

Notes

The issue was reported against PyTorch 2.11.0+cu128 with CUDA 12.8. It is unclear whether other versions behave the same way.

Recommendation

The issue does not point to a release that fixes the discrepancy, so apply a workaround: write into an int64 out tensor, or make the narrowing to uint8 explicit in your own code so that both backends produce the same value.
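If a uint8 result is genuinely needed, the narrowing can be done explicitly after computing the maximum in int64. This is a minimal sketch under that assumption; the helper `fmax_to_uint8` and its `mode` parameter are hypothetical names, and the choice between saturating and wrapping is left to the caller because the issue does not say which one PyTorch intends.

```python
import torch


def fmax_to_uint8(lhs: torch.Tensor, rhs: torch.Tensor, mode: str = "saturate") -> torch.Tensor:
    """Compute fmax in the inputs' dtype, then narrow to uint8 with an explicit rule."""
    result = torch.fmax(lhs, rhs)  # stays int64 for int64 inputs
    if mode == "saturate":
        result = result.clamp(0, 255)   # out-of-range values pin to 0 or 255
    elif mode == "wrap":
        result = result.remainder(256)  # modulo 256, always in [0, 255]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Every value is now representable, so the cast itself cannot overflow.
    return result.to(torch.uint8)


lhs = torch.tensor([100000], dtype=torch.int64)
rhs = torch.tensor([-1], dtype=torch.int64)

print(fmax_to_uint8(lhs, rhs, mode="saturate"))  # value 255
print(fmax_to_uint8(lhs, rhs, mode="wrap"))      # value 160
```

Because the values are brought into range before the cast, the same rule applies wherever the tensors live; running the helper on `.cuda()` inputs is a reasonable way to confirm that on a given build.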
