pytorch - ✅(Solved) Fix `torch.compile` produces different uint8 output from `ceil(log2(...))` pipeline due to float precision amplification [1 pull requests, 1 comments, 2 participants]

GitHub stats
pytorch/pytorch#178045 (fetched 2026-04-08 01:07:27)
Comments: 1 · Participants: 2 · Timeline events: 87 · Reactions: 0
Timeline (top): mentioned ×36, subscribed ×36, labeled ×9, referenced ×2

Error Message

Error logs

No error is raised: both modes run to completion but produce different uint8 outputs.

Root Cause

The float32 classification output matches to within 4.6e-7, confirming that Inductor produces nearly-correct float values. However, the ceil(log2(...)) pipeline acts as a precision amplifier: any float difference near an integer boundary results in a discrete jump of 1 in the uint8 output. With hundreds of channels, multiple such boundary crossings accumulate to a max_diff of 6.

The issue report treats this as an inherent limit on bit-exact reproducibility when Inductor fuses the GELU operations differently from eager mode. The fix in PR #178698 instead traces the wrong values to two bugs in the e8m0_rceil_log2 pattern (see PR fix notes).
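The amplification effect can be illustrated without a GPU. The following is a minimal sketch, assuming a pure-Python stand-in for the encoding pipeline (the helper names `encode_e8m0` and `next_float32_up` are hypothetical, and `math.log2` works in float64 here, not in CUDA's float32 `log2f`). It shows two inputs that differ by a single float32 ULP landing on different uint8 codes.

```python
import math
import struct


def encode_e8m0(x: float) -> int:
    """ceil(log2(x)), clamped to [-127, 127], then biased by 127."""
    ceil_val = math.ceil(math.log2(x))
    return min(max(ceil_val, -127), 127) + 127


def next_float32_up(x: float) -> float:
    """Return the next representable float32 above a positive float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


a = 16.0                   # exactly 2**4
b = next_float32_up(a)     # 2**4 plus one float32 ULP

relative_gap = (b - a) / a
code_a = encode_e8m0(a)
code_b = encode_e8m0(b)

print(f"relative input gap: {relative_gap:.3e}")
print(f"codes: {code_a} vs {code_b}")
```

A relative input gap of roughly 1e-7 is enough to move the code by one, which matches the order of magnitude of the float differences reported in the issue.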

PR fix notes

PR #178698: [Inductor] Fix e8m0_rceil_log2 pattern not registering on any CUDA device (gh-178045)

Description (problem / solution / changelog)

Summary

Fixes two stacked bugs that caused torch.compile to produce wrong uint8 values from ceil(log2(...)) pipelines (e.g. torchao MX format scaling factors). Fixes #178045.

Bug 1 — device string mismatch (root cause, affects all CUDA hardware)

joint_graph.py calls _misc_patterns_init(torch.device("cuda:0")). str(torch.device("cuda:0")) == "cuda:0", not "cuda". The old guard if device == "cuda": was therefore always False, so the e8m0_rceil_log2 pattern was never registered on any real CUDA compilation path.

Fix: device == "cuda" → device.startswith("cuda").
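The string mismatch is easy to confirm in isolation. This sketch is not the code from misc_patterns.py; the variable names are made up for illustration, and the `device.type` comparison is an alternative not mentioned in the PR. Constructing a `torch.device` does not touch the GPU, so this should also run on a CPU-only install.

```python
import torch

dev = torch.device("cuda:0")
as_str = str(dev)                          # includes the device index

old_guard = as_str == "cuda"               # the comparison the PR describes as broken
new_guard = as_str.startswith("cuda")      # the comparison the PR switches to
type_guard = dev.type == "cuda"            # alternative (assumption): compare the device type

print(as_str, old_guard, new_guard, type_guard)
```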

Bug 2 — pre-SM100 hardware has no PTX replacement

Once Bug 1 is fixed, the original code would call inductor_prims.cvt_e8m0_rceil unconditionally, which requires the SM100 PTX instruction cvt.rp.satfinite.ue8m0x2.f32 and crashes on earlier GPUs.

On pre-SM100 hardware the log2+ceil pattern is now replaced with an exact IEEE 754 bit-manipulation:

inp_bits   = inp.view(torch.int32)
biased_exp = (inp_bits >> 23) & 0xFF
needs_up   = (inp_bits & 0x7FFFFF) != 0   # non-zero mantissa → not exact 2^e
result     = clamp(biased_exp + needs_up, 0, 254).to(torch.uint8)

This is mathematically equivalent to clamp(ceil(log2(inp)), -127, 127) + 127 for all positive normal float32 values, but is immune to the ~1 ULP rounding error in CUDA's log2f: for inputs like 2^e + 1 ULP, software log2f can round DOWN to exactly e, making ceil return e instead of e+1.
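A runnable version of the bit-manipulation snippet might look like the sketch below. It follows the four lines quoted from the PR description, but the function names are hypothetical and the exact reference built on `math.frexp` is an addition for checking, not part of the PR. It runs on CPU tensors.

```python
import math

import torch


def e8m0_rceil_bits(inp: torch.Tensor) -> torch.Tensor:
    """E8M0 code for positive normal float32 inputs via exponent/mantissa bits."""
    bits = inp.contiguous().view(torch.int32)         # reinterpret the float32 bits
    biased_exp = (bits >> 23) & 0xFF                  # 8-bit biased exponent
    needs_up = ((bits & 0x7FFFFF) != 0).to(torch.int32)  # non-zero mantissa: round up
    return torch.clamp(biased_exp + needs_up, 0, 254).to(torch.uint8)


def e8m0_rceil_exact(value: float) -> int:
    """Exact reference: clamp(ceil(log2(value)), -127, 127) + 127 without calling log2."""
    mantissa, exponent = math.frexp(value)  # value == mantissa * 2**exponent, 0.5 <= mantissa < 1
    ceil_log2 = exponent - 1 if mantissa == 0.5 else exponent
    return min(max(ceil_log2, -127), 127) + 127


# 1.0 plus one float32 ULP: the boundary case described in the PR.
one_ulp_above_one = torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)).item()
values = [0.5, 1.0, one_ulp_above_one, 2.0, 3.0, 1024.0]

encoded = e8m0_rceil_bits(torch.tensor(values, dtype=torch.float32)).tolist()
expected = [e8m0_rceil_exact(v) for v in values]
print(encoded, expected)
```

Because the result depends only on the stored exponent and on whether the mantissa is zero, no transcendental function is evaluated and there is nothing to round.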

On SM100+, both patterns continue to use inductor_prims.cvt_e8m0_rceil (the hardware PTX instruction).
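To see which of the two paths a given GPU would be expected to take, the compute capability can be inspected. The helper below is a hypothetical sketch: the PR description does not show how Inductor performs this check, and the only assumption made here is that "SM100+" means a major compute capability of 10 or higher.

```python
import torch


def supports_e8m0_ptx(capability: tuple) -> bool:
    """Hypothetical helper: True when (major, minor) is SM100 or newer."""
    major, _minor = capability
    return major >= 10


if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability()
    path = "hardware PTX" if supports_e8m0_ptx(cap) else "bit-manipulation fallback"
    print(f"SM {cap[0]}.{cap[1]}: expected path is {path}")
else:
    print("No CUDA device visible; nothing to check.")
```

Under that assumption, the GTX 1650 (SM 7.5) used to verify the PR would take the fallback path.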

Tests

  • TestCvtE8M0Rceil.test_log2_pattern_near_power_of_two (SM100+): verifies the PTX path returns the correct ceiling for values 1 ULP above a power of 2.
  • TestE8M0Log2PatternBitManip (pre-SM100, 3 tests): pattern fires and is correct, boundary correctness (1 ULP above 2^e), and a named regression test for gh-178045.

Verified on NVIDIA GeForce GTX 1650 (SM 7.5) with PyTorch 2.11.0+cu128.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela @azahed98 @mlazos @ezyang @jansel @shunting314

Changed files

  • test/inductor/test_fp8.py (modified, +144/-0)
  • torch/_inductor/config.py (modified, +6/-1)
  • torch/_inductor/fx_passes/misc_patterns.py (modified, +66/-33)
  • torch/accelerator/memory.py (modified, +24/-0)

Code Example

import torch
import torch.nn as nn

class E8M0Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.pw1 = nn.Conv2d(32, 64, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.pw2 = nn.Conv2d(64, 128, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(128)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.pw1(x)
        x = self.bn1(x)
        # GELU via erf decomposition
        x = 0.5 * x * (1 + torch.erf(x * 0.7071067811865476))
        x = self.pw2(x)
        x = self.bn2(x)
        x = 0.5 * x * (1 + torch.erf(x * 0.7071067811865476))

        # E8M0 encoding pipeline (sensitive to float precision)
        feat = torch.abs(x) + 1e-7
        log2_val = torch.log2(feat)
        ceil_val = torch.ceil(log2_val)
        clamped = torch.clamp(ceil_val, min=-127, max=127)
        encoded = (clamped + 127).to(torch.uint8)

        # Classification path
        cls = self.pool(x).flatten(1)
        cls = self.fc(cls)
        return cls, encoded

device = "cuda"
model = E8M0Model().to(device).eval()
x = torch.randn(4, 3, 32, 32, device=device)

# Eager
with torch.no_grad():
    ref_cls, ref_enc = model(x)

# Compiled
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
with torch.no_grad():
    out_cls, out_enc = compiled(x)

cls_diff = (ref_cls.float() - out_cls.float()).abs().max().item()
enc_diff = (ref_enc.float() - out_enc.float()).abs().max().item()

print(f"Classification max_diff: {cls_diff:.6e}")  # ~4.6e-7 (OK)
print(f"Encoded features max_diff: {enc_diff:.1f}")  # ~6.0 (WRONG)
print(f"Encoded features mean_diff: {(ref_enc.float() - out_enc.float()).abs().mean().item():.6e}")

Output:

Classification max_diff: 4.619360e-07
Encoded features max_diff: 6.0
Encoded features mean_diff: 1.847267e-03

Environment:

PyTorch version: 2.12.0.dev20260315+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

🐛 Describe the bug

torch.compile with inductor backend produces significantly different results for a model that includes a float-to-uint8 encoding pipeline (abs → log2 → ceil → clamp → add → to(uint8)). While the float32 classification output matches within 4.6e-7, the uint8-encoded intermediate features differ by up to 6 units.

The root cause is that the Inductor's GELU fusion (0.5 * x * (1 + erf(x / √2))) produces intermediate float values that differ at the ~1e-7 level from eager mode. This tiny float difference is amplified by the torch.ceil(torch.log2(...)) pipeline: when a value is near an integer boundary (e.g., log2(x) ≈ 3.9999... vs 4.0001...), ceil rounds in opposite directions, producing uint8 values that differ by 1 per boundary crossing — accumulated over 6+ crossings in a 512-channel feature map.

This was discovered via a fuzzer-generated vision model that implements E8M0 (8-bit exponent-only) floating-point encoding, targeting the e8m0_rceil_log2 Inductor pattern.

Behavior summary

Output | Eager | torch.compile | max_diff | Assessment
Classification (float32) | Reference | Matches | 4.6e-7 | OK
E8M0 encoded features (uint8) | Reference | Differs | 6.0 | Wrong
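A single max_diff value hides how many elements disagree and by how much. The sketch below is one way to summarize a uint8 mismatch; the helper name `summarize_uint8_mismatch` and the sample tensors are made up for illustration and are not taken from the issue. It widens to int16 before subtracting so that uint8 arithmetic cannot wrap around.

```python
import torch


def summarize_uint8_mismatch(ref: torch.Tensor, out: torch.Tensor) -> dict:
    """Summarize element-wise differences between two uint8 tensors."""
    diff = (ref.to(torch.int16) - out.to(torch.int16)).abs()
    return {
        "max_diff": int(diff.max()),
        "mismatch_fraction": float((diff != 0).float().mean()),
        # histogram[k] counts the elements whose absolute difference is k
        "histogram": torch.bincount(diff.flatten().to(torch.int64)).tolist(),
    }


# Synthetic data standing in for ref_enc / out_enc from the reproducer.
ref = torch.tensor([127, 128, 130, 120], dtype=torch.uint8)
out = torch.tensor([127, 129, 130, 126], dtype=torch.uint8)

stats = summarize_uint8_mismatch(ref, out)
print(stats)
```

Applied to the reproducer's `ref_enc` and `out_enc`, the histogram would show whether the mismatch is dominated by off-by-one boundary crossings or by larger jumps.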

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

Fix Plan

To reduce the difference between eager mode and torch.compile with the Inductor backend for the E8M0 encoding pipeline, the following workarounds can be tried. They target the float differences described in the issue, not the pattern bugs fixed by PR #178698:

  • Stabilize floating-point operations: eager and compiled modes evaluate the same float expressions in different ways, so reduce the number of places where their arithmetic can diverge ahead of the encoding step.
  • Use a consistent GELU implementation: call the same GELU function in both eager and compiled modes to minimize differences in the intermediate float values.

Here's an example of how you could modify your code to achieve this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class E8M0Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.pw1 = nn.Conv2d(32, 64, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.pw2 = nn.Conv2d(64, 128, kernel_size=1)
        self.bn2 = nn.BatchNorm2d(128)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 10)

    def gelu(self, x):
        # Use torch.nn.functional.gelu for consistency
        return F.gelu(x)

    def forward(self, x):
        x = self.conv1(x)
        x = self.pw1(x)
        x = self.bn1(x)
        x = self.gelu(x)  # Use consistent GELU implementation
        x = self.pw2(x)
        x = self.bn2(x)
        x = self.gelu(x)  # Use consistent GELU implementation

        # E8M0 encoding pipeline (sensitive to float precision)
        feat = torch.abs(x) + 1e-7
        log2_val = torch.log2(feat)
        ceil_val = torch.ceil(log2_val)
        clamped = torch.clamp(ceil_val, min=-127, max=127)
        encoded = (clamped + 127).to(torch.uint8)

        # Classification path
        cls = self.pool(x).flatten(1)
        cls = self.fc(cls)
        return cls, encoded

device = "cuda"
model = E8M0Model().to(device).eval()
x = torch.randn(4, 3, 32, 32, device=device)

# Eager
with torch.no_grad():
    ref_cls, ref_enc = model(x)

# Compiled
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor", fullgraph=True)
with torch.no_grad():
    out_cls, out_enc = compiled(x)

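# Compare the two runs, as in the original reproducer
cls_diff = (ref_cls.float() - out_cls.float()).abs().max().item()
enc_diff = (ref_enc.float() - out_enc.float()).abs().max().item()

print(f"Classification max_diff: {cls_diff:.6e}")
print(f"Encoded features max_diff: {enc_diff:.1f}")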
