pytorch - ✅(Solved) Fix `torch.compile` produces different uint8 output from `ceil(log2(...))` pipeline due to float precision amplification [1 pull requests, 1 comments, 2 participants]

pytorch · 2026-03-21 08:02:12


GitHub stats

pytorch/pytorch#178045 • Fetched 2026-04-08 01:07:27


Author

himi1008

Participants

himi1008

williamwen42

Timeline (top)

mentioned ×36, subscribed ×36, labeled ×9, referenced ×2

Error Message

Error logs

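No error is raised: both modes succeed but produce different uint8 outputs (classification max_diff 4.619360e-07, encoded features max_diff 6.0, mean_diff 1.847267e-03).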

Root Cause

The float32 classification output matches to within 4.6e-7, confirming that Inductor produces nearly identical float values. However, the ceil(log2(...)) pipeline acts as a precision amplifier: any float difference that straddles a power of two, where log2 crosses an integer, becomes a discrete jump of 1 in the uint8 output. The issue attributes the observed max_diff of 6 to multiple such boundary crossings across hundreds of channels.

The issue describes this as an inherent limit on bitwise reproducibility when Inductor fuses the GELU operations differently from eager mode.
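The amplification effect can be illustrated without PyTorch. The following is a minimal sketch in plain Python (double precision, not the float32 CUDA kernels from the issue); `e8m0_encode` is a hypothetical scalar helper that mirrors the reproducer's clamp(ceil(log2(x)), -127, 127) + 127 pipeline.

```python
import math

def e8m0_encode(value: float) -> int:
    """Scalar stand-in for the issue's pipeline (hypothetical helper name)."""
    exponent = math.ceil(math.log2(value))
    return max(-127, min(127, exponent)) + 127

below = 15.999999  # just under 2**4
above = 16.000001  # just over 2**4

# The two inputs differ by roughly 1e-7 in relative terms...
rel_diff = (above - below) / below
print(f"relative difference: {rel_diff:.2e}")

# ...yet they land on opposite sides of the integer boundary of log2.
print(e8m0_encode(below), e8m0_encode(above))  # 131 132
```

A relative input difference at the 1e-7 level is enough to move the encoded byte by one whenever the value sits next to a power of two.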

PR fix notes

PR #178698: [Inductor] Fix e8m0_rceil_log2 pattern not registering on any CUDA device (gh-178045)

Repository: pytorch/pytorch
Author: saifmb0
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/178698

Description (problem / solution / changelog)

Summary

Fixes two stacked bugs that caused torch.compile to produce wrong uint8 values from ceil(log2(...)) pipelines (e.g. torchao MX format scaling factors). Fixes #178045.

Bug 1 — device string mismatch (root cause, affects all CUDA hardware)

joint_graph.py calls _misc_patterns_init(torch.device("cuda:0")). str(torch.device("cuda:0")) == "cuda:0", not "cuda". The old guard if device == "cuda": was therefore always False — the e8m0_rceil_log2 pattern was never registered on any real CUDA compilation path.

Fix: device == "cuda" → device.startswith("cuda").
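The guard change can be sketched with plain strings standing in for the device string that, according to the PR description, reaches the check. The function names `old_guard` and `new_guard` are hypothetical and only illustrate the comparison, not the actual code in misc_patterns.py.

```python
def old_guard(device: str) -> bool:
    # Exact match: only the bare string "cuda" passes.
    return device == "cuda"

def new_guard(device: str) -> bool:
    # Prefix match: "cuda", "cuda:0", "cuda:1", ... all pass.
    return device.startswith("cuda")

for device in ("cuda", "cuda:0", "cuda:1", "cpu"):
    print(f"{device!r}: old={old_guard(device)} new={new_guard(device)}")
```

An alternative that the PR does not use would be to compare the `type` attribute of a `torch.device`, which is "cuda" regardless of the device index.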

Bug 2 — pre-SM100 hardware has no PTX replacement

Once Bug 1 is fixed, the original code would call inductor_prims.cvt_e8m0_rceil unconditionally, which requires the SM100 PTX instruction cvt.rp.satfinite.ue8m0x2.f32 and crashes on earlier GPUs.

On pre-SM100 hardware the log2+ceil pattern is now replaced with an exact IEEE 754 bit-manipulation:

inp_bits   = inp.view(torch.int32)
biased_exp = (inp_bits >> 23) & 0xFF
needs_up   = (inp_bits & 0x7FFFFF) != 0   # non-zero mantissa → not exact 2^e
result     = clamp(biased_exp + needs_up, 0, 254).to(torch.uint8)

This is mathematically equivalent to clamp(ceil(log2(inp)), -127, 127) + 127 for all positive normal float32 values, but is immune to the ~1 ULP rounding error in CUDA's log2f: for inputs like 2^e + 1 ULP, software log2f can round DOWN to exactly e, making ceil return e instead of e+1.
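The equivalence and the boundary failure can be checked on scalars. This is a minimal sketch under stated assumptions: it uses `struct` to reinterpret float32 bits, and it emulates a float32 log2 by rounding Python's double-precision `math.log2` to float32, which is not CUDA's log2f. All helper names are hypothetical.

```python
import math
import struct

def f32_bits(value: float) -> int:
    """Reinterpret a float32 value as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def round_to_f32(value: float) -> float:
    """Round a Python float (double) to the nearest float32."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def e8m0_bits(value: float) -> int:
    """Bit-manipulation path from the PR notes (positive normal inputs)."""
    bits = f32_bits(value)
    biased_exp = (bits >> 23) & 0xFF
    needs_up = 1 if (bits & 0x7FFFFF) != 0 else 0  # non-zero mantissa
    return max(0, min(254, biased_exp + needs_up))

def e8m0_log2_f32(value: float) -> int:
    """log2 + ceil path, with log2 rounded to float32 (emulation only)."""
    log2_val = round_to_f32(math.log2(value))
    return max(-127, min(127, math.ceil(log2_val))) + 127

# 16 plus one float32 ULP; exactly representable in float32.
one_ulp_above_16 = 16.0 * (1.0 + 2.0 ** -23)

for x in (0.5, 1.0, 3.0, 16.0, one_ulp_above_16):
    print(f"{x!r}: bits={e8m0_bits(x)} log2_f32={e8m0_log2_f32(x)}")
```

For the first four inputs both paths agree. For the last one the bit path returns 132, while the emulated float32 log2 rounds to exactly 4.0 and yields 131, which is the off-by-one the PR describes.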

On SM100+, both patterns continue to use inductor_prims.cvt_e8m0_rceil (the hardware PTX instruction).
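The hardware split can be sketched as a check on the compute capability. The PR description does not show how the check is written, so `uses_ptx_path` is a hypothetical helper; in PyTorch the (major, minor) tuple would typically come from `torch.cuda.get_device_capability()`.

```python
from typing import Tuple

def uses_ptx_path(capability: Tuple[int, int]) -> bool:
    """Return True when the (major, minor) capability is SM100 or newer."""
    major, _minor = capability
    return major >= 10

print(uses_ptx_path((7, 5)))   # GTX 1650 used to verify the PR
print(uses_ptx_path((8, 6)))   # RTX 3060 Laptop GPU from the issue report
print(uses_ptx_path((10, 0)))  # SM100
```

Under this sketch, both GPUs named on this page would take the bit-manipulation path rather than the PTX instruction.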

Tests

TestCvtE8M0Rceil.test_log2_pattern_near_power_of_two (SM100+): verifies the PTX path returns the correct ceiling for values 1 ULP above a power of 2.
TestE8M0Log2PatternBitManip (pre-SM100, 3 tests): pattern fires and is correct, boundary correctness (1 ULP above 2^e), and a named regression test for gh-178045.

Verified on NVIDIA GeForce GTX 1650 (SM 7.5) with PyTorch 2.11.0+cu128.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela @azahed98 @mlazos @ezyang @jansel @shunting314

Changed files

test/inductor/test_fp8.py (modified, +144/-0)
torch/_inductor/config.py (modified, +6/-1)
torch/_inductor/fx_passes/misc_patterns.py (modified, +66/-33)
torch/accelerator/memory.py (modified, +24/-0)



🐛 Describe the bug

torch.compile with inductor backend produces significantly different results for a model that includes a float-to-uint8 encoding pipeline (abs → log2 → ceil → clamp → add → to(uint8)). While the float32 classification output matches within 4.6e-7, the uint8-encoded intermediate features differ by up to 6 units.

The root cause is that the Inductor's GELU fusion (0.5 * x * (1 + erf(x / √2))) produces intermediate float values that differ at the ~1e-7 level from eager mode. This tiny float difference is amplified by the torch.ceil(torch.log2(...)) pipeline: when a value is near an integer boundary (e.g., log2(x) ≈ 3.9999... vs 4.0001...), ceil rounds in opposite directions, producing uint8 values that differ by 1 per boundary crossing — accumulated over 6+ crossings in a 512-channel feature map.
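The mechanism described above, where the same formula evaluated in a different order gives a slightly different float, can be shown in isolation. This is a minimal sketch in double precision with a hypothetical `e8m0_encode` helper; it is not the actual fused GELU kernel, only an illustration of how reordering changes rounding.

```python
import math

def e8m0_encode(value: float) -> int:
    """Scalar stand-in for the issue's pipeline (hypothetical helper name)."""
    exponent = math.ceil(math.log2(value))
    return max(-127, min(127, exponent)) + 127

a, b, c = 1.0, 2.0 ** -53, 2.0 ** -53

left_to_right = (a + b) + c  # each tiny addend is rounded away: result is 1.0
right_to_left = a + (b + c)  # addends combine first: result is 1.0 + 2**-52

print(left_to_right == right_to_left)  # False
print(e8m0_encode(left_to_right), e8m0_encode(right_to_left))  # 127 128
```

The two sums are algebraically identical and differ by one ULP, yet the encoded bytes differ because one result is exactly a power of two and the other is just above it.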

This was discovered via a fuzzer-generated vision model that implements E8M0 (8-bit exponent-only) floating-point encoding, targeting the e8m0_rceil_log2 Inductor pattern.

Minimal reproducer

import torch
import torch.nn as nn

class E8M0Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.pw1 = nn.Conv2d(32, 64, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.pw2 = nn.Conv2d(64, 128, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(128)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.pw1(x)
        x = self.bn1(x)
        # GELU via erf decomposition
        x = 0.5 * x * (1 + torch.erf(x * 0.7071067811865476))
        x = self.pw2(x)
        x = self.bn2(x)
        x = 0.5 * x * (1 + torch.erf(x * 0.7071067811865476))

        # E8M0 encoding pipeline (sensitive to float precision)
        feat = torch.abs(x) + 1e-7
        log2_val = torch.log2(feat)
        ceil_val = torch.ceil(log2_val)
        clamped = torch.clamp(ceil_val, min=-127, max=127)
        encoded = (clamped + 127).to(torch.uint8)

        # Classification path
        cls = self.pool(x).flatten(1)
        cls = self.fc(cls)
        return cls, encoded

device = "cuda"
model = E8M0Model().to(device).eval()
x = torch.randn(4, 3, 32, 32, device=device)

# Eager
with torch.no_grad():
    ref_cls, ref_enc = model(x)

# Compiled
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    out_cls, out_enc = compiled(x)

cls_diff = (ref_cls.float() - out_cls.float()).abs().max().item()
enc_diff = (ref_enc.float() - out_enc.float()).abs().max().item()

print(f"Classification max_diff: {cls_diff:.6e}")  # ~4.6e-7 (OK)
print(f"Encoded features max_diff: {enc_diff:.1f}")  # ~6.0 (WRONG)
print(f"Encoded features mean_diff: {(ref_enc.float() - out_enc.float()).abs().mean().item():.6e}")

Behavior summary

Output	Eager	`torch.compile`	max_diff	Assessment
Classification (float32)	Reference	Matches	4.6e-7	OK
E8M0 encoded features (uint8)	Reference	Differs	6.0	Wrong

Analysis

The issue describes this as an inherent limit on bitwise reproducibility when Inductor fuses the GELU operations differently from eager mode.

Error logs

No error — both modes succeed, but produce different uint8 outputs:

Classification max_diff: 4.619360e-07
Encoded features max_diff: 6.0
Encoded features mean_diff: 1.847267e-03
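When triaging numbers like these, it can help to separate mismatches that a boundary flip explains from those it does not. The following is a rough diagnostic sketch, not part of the issue or the PR: the helper names and the 1e-6 tolerance are assumptions chosen for illustration.

```python
import math

def near_power_of_two(value: float, rel_tol: float = 1e-6) -> bool:
    """True if value lies within rel_tol (relative) of a power of two."""
    mantissa, _exp = math.frexp(value)  # value == mantissa * 2**_exp, 0.5 <= mantissa < 1
    just_above = (mantissa - 0.5) / 0.5 <= rel_tol
    just_below = (1.0 - mantissa) <= rel_tol
    return just_above or just_below

def classify(feat: float, ref_code: int, out_code: int, rel_tol: float = 1e-6) -> str:
    """Label one element of the eager vs compiled comparison."""
    diff = abs(ref_code - out_code)
    if diff == 0:
        return "match"
    if diff == 1 and near_power_of_two(feat, rel_tol):
        return "boundary flip"
    return "unexplained"

print(classify(16.000001, 131, 132))  # boundary flip
print(classify(3.0, 129, 129))        # match
print(classify(3.0, 129, 135))        # unexplained
```

Under this sketch, a per-element difference of 6 like the reported max_diff would fall into the "unexplained" bucket, since a single boundary crossing changes the code by only 1.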

Versions

PyTorch version: 2.12.0.dev20260315+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

Fix Plan

To reduce the mismatch between eager mode and torch.compile with the inductor backend in the E8M0 encoding pipeline, try the following steps:

Stabilize floating-point operations: the mismatch comes from eager and compiled kernels rounding the same formula differently, so limit the places where the two paths can diverge before the ceil(log2(...)) step.
Use a consistent GELU implementation: call torch.nn.functional.gelu instead of the hand-written erf decomposition so that both modes start from the same operator. This is a mitigation rather than a guarantee, because Inductor may still lower F.gelu into a generated kernel that rounds differently from eager.

Here is an example of how you could modify the reproducer along these lines:

import torch
import torch.nn as nn
import torch.nn.functional as F

class E8M0Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.pw1 = nn.Conv2d(32, 64, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.pw2 = nn.Conv2d(64, 128, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(128)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 10)

    def gelu(self, x):
        # Use torch.nn.functional.gelu for consistency
        return F.gelu(x)

    def forward(self, x):
        x = self.conv1(x)
        x = self.pw1(x)
        x = self.bn1(x)
        x = self.gelu(x)  # Use consistent GELU implementation
        x = self.pw2(x)
        x = self.bn2(x)
        x = self.gelu(x)  # Use consistent GELU implementation

        # E8M0 encoding pipeline (sensitive to float precision)
        feat = torch.abs(x) + 1e-7
        log2_val = torch.log2(feat)
        ceil_val = torch.ceil(log2_val)
        clamped = torch.clamp(ceil_val, min=-127, max=127)
        encoded = (clamped + 127).to(torch.uint8)

        # Classification path
        cls = self.pool(x).flatten(1)
        cls = self.fc(cls)
        return cls, encoded

device = "cuda"
model = E8M0Model().to(device).eval()
x = torch.randn(4, 3, 32, 32, device=device)

# Eager
with torch.no_grad():
    ref_cls, ref_enc = model(x)

# Compiled
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor", fullgraph=True)
with torch.no_grad():
    out_cls, out_enc = compiled(x)

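# Same comparison as in the minimal reproducer
cls_diff = (ref_cls.float() - out_cls.float()).abs().max().item()
enc_diff = (ref_enc.float() - out_enc.float()).abs().max().item()

print(f"Classification max_diff: {cls_diff:.6e}")
print(f"Encoded features max_diff: {enc_diff:.1f}")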
