
pytorch: `torch.compile` crashes with `CompilationError` when casting to `float8_e4m3fn` on GPUs without native FP8 support

pytorch, 2026-05-17 07:45:13


Error Message

InductorError
CompilationError: at 1:0:
def triton_poi_fused__to_copy_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Root Cause

Inductor's codegen unconditionally maps torch.float8_e4m3fn to tl.float8e4nv without checking the GPU's compute capability. The Triton compiler then rejects this type on architectures below SM89 (compute capability 8.9), where, per the error message, only fp8e4b15 and fp8e5 are accepted.

Eager mode works because it dispatches to a CUDA kernel (aten::_to_copy) that implements the FP8 conversion in software without requiring hardware FP8 ALU instructions.
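
The missing capability check could look roughly like the sketch below. It is not Inductor's code: `has_native_e4m3` and `MIN_E4M3_CAPABILITY` are hypothetical names, the (8, 9) threshold is taken from the SM89 boundary stated above, and the check applies to NVIDIA GPUs only. `torch.cuda.get_device_capability` is the standard PyTorch call that returns a `(major, minor)` tuple.

```python
from typing import Optional, Tuple

# Threshold taken from the issue: tl.float8e4nv is rejected below SM89.
MIN_E4M3_CAPABILITY: Tuple[int, int] = (8, 9)


def has_native_e4m3(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is SM89 or newer."""
    return tuple(capability) >= MIN_E4M3_CAPABILITY


def current_device_has_native_e4m3() -> Optional[bool]:
    """Check the active CUDA device; return None when no GPU is visible."""
    import torch  # imported lazily so the pure check above works without PyTorch

    if not torch.cuda.is_available():
        return None
    return has_native_e4m3(torch.cuda.get_device_capability())


# Capabilities of GPUs named in the issue, plus the SM89 boundary itself.
for name, cap in [("T4", (7, 5)), ("A100", (8, 0)), ("SM89", (8, 9))]:
    print(f"{name}: native e4m3 = {has_native_e4m3(cap)}")
```

Keeping the comparison in a pure function makes the threshold easy to test without a GPU; only the thin wrapper touches the device.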

Fix / Workaround

The issue does not include a patch. Because eager mode dispatches to a CUDA kernel (aten::_to_copy) that performs the FP8 conversion in software, one possible workaround on GPUs below SM89 is to keep the FP8 cast outside the compiled region so that it runs eagerly.

Code Example

import torch

x = torch.randn(4, 4, device="cuda")

# Eager mode: works fine
eager_out = x.to(torch.float8_e4m3fn)
print(f"Eager: OK, shape={eager_out.shape}, dtype={eager_out.dtype}")

# Compiled mode: crashes
@torch.compile(backend="inductor")
def cast_to_fp8(x):
    return x.to(torch.float8_e4m3fn)

compiled_out = cast_to_fp8(x)  # CompilationError!

---

def triton_poi_fused__to_copy_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tmp0.to(tl.float8e4nv)  # <-- unsupported on SM < 89
    tl.store(out_ptr0 + (x0), tmp1, xmask)


🐛 Describe the bug

Bug description

torch.compile with the Inductor backend crashes when compiling a function that casts a tensor to torch.float8_e4m3fn (or float8_e4m3fnuz / float8_e5m2fnuz) on GPUs with compute capability < 8.9 (e.g., T4, V100, A100).

The eager mode handles these casts correctly via CUDA kernels that don't require hardware FP8 support, but Inductor generates Triton code using tl.float8e4nv without checking whether the target GPU architecture supports this Triton dtype.

Affected dtypes

| dtype | Triton type | Behavior on SM75 (T4) |
|---|---|---|
| `torch.float8_e4m3fn` | `tl.float8e4nv` | CRASH |
| `torch.float8_e4m3fnuz` | unsupported | CRASH |
| `torch.float8_e5m2fnuz` | unsupported | CRASH |
| `torch.float8_e5m2` | `tl.float8e5` | OK |

Expected behavior

Either:

1. Compile successfully by emitting a software implementation (as eager mode does), or
2. Raise a clear `UnsupportedHardwareError` at graph compilation time (before Triton codegen)
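
The second option could be sketched as a pre-flight check that runs before any Triton code is generated. Everything here is illustrative: `UnsupportedHardwareError` is the name the issue suggests, defined locally because it is not assumed to exist in PyTorch, and `check_fp8_cast` is a hypothetical helper. The dtype set comes from the "Affected dtypes" table, which only reports SM75; applying the SM89 threshold to the fnuz variants is an assumption.

```python
from typing import Tuple


class UnsupportedHardwareError(RuntimeError):
    """Hypothetical exception type, named after the suggestion in the issue."""


# dtype names reported as crashing on SM75 in the "Affected dtypes" table.
DTYPES_FAILING_BELOW_SM89 = frozenset(
    {"float8_e4m3fn", "float8_e4m3fnuz", "float8_e5m2fnuz"}
)


def check_fp8_cast(dtype_name: str, capability: Tuple[int, int]) -> None:
    """Raise before codegen if the target dtype is known to fail on this GPU."""
    if dtype_name in DTYPES_FAILING_BELOW_SM89 and tuple(capability) < (8, 9):
        major, minor = capability
        raise UnsupportedHardwareError(
            f"torch.{dtype_name} cannot be compiled for compute capability "
            f"{major}.{minor}; run the cast in eager mode instead."
        )


# float8_e5m2 is listed as OK on a T4, so this call passes silently.
check_fp8_cast("float8_e5m2", (7, 5))

try:
    check_fp8_cast("float8_e4m3fn", (7, 5))
except UnsupportedHardwareError as err:
    print(f"Rejected early: {err}")
```

Failing at this point would give the user a message that names the dtype and the GPU, rather than a Triton `ValueError` pointing at generated kernel source.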

Versions

Environment

PyTorch: 2.13.0.dev20260513+cu126
Triton: 3.7.0
GPU: Tesla T4 (compute capability 7.5)
CUDA: 12.6
OS: Linux (Ubuntu)
Python: 3.11

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
