pytorch - `torch.compile` crashes with `CompilationError` when casting to `float8_e4m3fn` on GPUs without native FP8 support

🐛 Describe the bug

Bug description

torch.compile with the Inductor backend crashes when compiling a function that casts a tensor to torch.float8_e4m3fn (or float8_e4m3fnuz / float8_e5m2fnuz) on GPUs with compute capability < 8.9 (e.g., T4, V100, A100).

Eager mode handles these casts correctly via CUDA kernels that don't require hardware FP8 support. Inductor, however, generates Triton code that uses tl.float8e4nv without checking whether the target GPU architecture supports that Triton dtype.

Minimal reproducer

import torch

x = torch.randn(4, 4, device="cuda")

# Eager mode: works fine
eager_out = x.to(torch.float8_e4m3fn)
print(f"Eager: OK, shape={eager_out.shape}, dtype={eager_out.dtype}")

# Compiled mode: crashes
@torch.compile(backend="inductor")
def cast_to_fp8(x):
    return x.to(torch.float8_e4m3fn)

compiled_out = cast_to_fp8(x)  # CompilationError!

Error message

InductorError
CompilationError: at 1:0:
def triton_poi_fused__to_copy_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Generated Triton kernel (from error log)

def triton_poi_fused__to_copy_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tmp0.to(tl.float8e4nv)  # <-- unsupported on SM < 89
    tl.store(out_ptr0 + (x0), tmp1, xmask)

Affected dtypes

| dtype | Triton type | Behavior on SM75 (T4) |
|---|---|---|
| torch.float8_e4m3fn | tl.float8e4nv | CRASH |
| torch.float8_e4m3fnuz | unsupported | CRASH |
| torch.float8_e5m2fnuz | unsupported | CRASH |
| torch.float8_e5m2 | tl.float8e5 | OK |
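
A table like the one above could be regenerated on another GPU with a small probe. The following is a minimal sketch, not part of the original report: `probe_casts` and `compiled_cast_probe` are hypothetical helper names, and the probe assumes a CUDA device is present when it is actually run.

```python
from typing import Callable, Dict, Iterable

# dtype names taken from the table in this issue
FP8_DTYPE_NAMES = (
    "float8_e4m3fn",
    "float8_e4m3fnuz",
    "float8_e5m2fnuz",
    "float8_e5m2",
)


def probe_casts(dtype_names: Iterable[str],
                try_cast: Callable[[str], None]) -> Dict[str, str]:
    """Call try_cast for each dtype name and record OK or CRASH."""
    results = {}
    for name in dtype_names:
        try:
            try_cast(name)
            results[name] = "OK"
        except Exception as exc:  # compilation errors surface as exceptions
            results[name] = f"CRASH ({type(exc).__name__})"
    return results


def compiled_cast_probe(name: str) -> None:
    """Hypothetical probe: compile and run a cast to torch.<name> on CUDA."""
    import torch
    import torch._dynamo

    dtype = getattr(torch, name)
    x = torch.randn(4, 4, device="cuda")
    torch._dynamo.reset()  # drop cached graphs so each dtype compiles fresh
    torch.compile(lambda t: t.to(dtype), backend="inductor")(x)
```

On a machine with a GPU, `probe_casts(FP8_DTYPE_NAMES, compiled_cast_probe)` should return one entry per dtype; the exact exception names depend on the PyTorch and Triton versions installed.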

Expected behavior

Either:

  1. Compile successfully by emitting a software implementation (as eager mode does), or
  2. Raise a clear UnsupportedHardwareError at graph compilation time (before Triton codegen)
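
Option 2 can be illustrated with a short sketch. This is an assumption-laden illustration rather than Inductor's actual code: `UnsupportedHardwareError` is only the name proposed in this issue (it is not an existing PyTorch class), `check_cast_supported` is a hypothetical helper, and the SM89 threshold comes from the root-cause analysis in this report.

```python
from typing import Dict, Tuple


class UnsupportedHardwareError(RuntimeError):
    """Hypothetical error type named in this issue; not a PyTorch class."""


# Assumption from this issue: float8_e4m3fn maps to tl.float8e4nv,
# which Triton rejects on architectures below SM89.
_MIN_CAPABILITY: Dict[str, Tuple[int, int]] = {
    "float8_e4m3fn": (8, 9),
}


def check_cast_supported(dtype_name: str, capability: Tuple[int, int]) -> None:
    """Raise a descriptive error before any Triton code is generated."""
    required = _MIN_CAPABILITY.get(dtype_name)
    if required is not None and tuple(capability) < required:
        raise UnsupportedHardwareError(
            f"Casting to torch.{dtype_name} in compiled code needs compute "
            f"capability >= {required[0]}.{required[1]}, but this device "
            f"reports {capability[0]}.{capability[1]}."
        )
```

Calling `check_cast_supported("float8_e4m3fn", (7, 5))` raises with a message that names the dtype and the device capability, instead of the Triton `ValueError` shown in the error log.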

Root cause

Inductor's codegen unconditionally maps torch.float8_e4m3fn to tl.float8e4nv without checking the GPU's compute capability. The Triton compiler then rejects this type for architectures < SM89.

Eager mode works because it dispatches to a CUDA kernel (aten::_to_copy) that implements the FP8 conversion in software without requiring hardware FP8 ALU instructions.
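
Until Inductor handles this case, user code could apply the same capability check itself and fall back to eager mode. The sketch below is one possible workaround under the assumptions stated in this issue (SM89 as the threshold for tl.float8e4nv); `pick_cast_impl` and `build_fp8_cast` are hypothetical names, not PyTorch APIs.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Assumption from this issue: tl.float8e4nv is accepted from SM89 upward.
MIN_FP8_E4M3_CAPABILITY: Tuple[int, int] = (8, 9)


def pick_cast_impl(capability: Tuple[int, int], compiled_impl: T, eager_impl: T) -> T:
    """Return compiled_impl on SM89 or newer, eager_impl otherwise."""
    if tuple(capability) >= MIN_FP8_E4M3_CAPABILITY:
        return compiled_impl
    return eager_impl


def build_fp8_cast() -> Callable:
    """Build a float8_e4m3fn cast that only uses torch.compile where it works."""
    import torch  # imported lazily so pick_cast_impl stays usable without torch

    def eager_cast(x):
        return x.to(torch.float8_e4m3fn)

    # torch.compile is lazy: nothing is compiled until the first call.
    compiled_cast = torch.compile(eager_cast, backend="inductor")

    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability()
    else:
        capability = (0, 0)  # no CUDA device: stay on the eager path
    return pick_cast_impl(capability, compiled_cast, eager_cast)
```

On the Tesla T4 from this report (compute capability 7.5), `build_fp8_cast()` would return the eager function, so the cast never reaches Triton codegen.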

Versions

Environment

  • PyTorch: 2.13.0.dev20260513+cu126
  • Triton: 3.7.0
  • GPU: Tesla T4 (compute capability 7.5)
  • CUDA: 12.6
  • OS: Linux (Ubuntu)
  • Python: 3.11

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
