pytorch - ✅(Solved) Fix [inductor] Triton codegen emits pow(float32, int64) for symbolic integer exponents despite #173685 fix [2 pull requests, 1 comments, 1 participants]

GitHub stats
pytorch/pytorch#177131 · Fetched 2026-04-08 00:22:03
Comments: 1 · Participants: 1 · Timeline events: 61 · Reactions: 0
Timeline (top): mentioned ×23, subscribed ×23, labeled ×5, referenced ×5

Error Message

triton.compiler.errors.CompilationError:
    tmp6 = (tl.full([], -1.0, tl.float64)) + ((libdevice.pow(2.0, ks0)) / 2)
                                                ^
(triton.language.float32, triton.language.int64)

Fix Action

Fixed

PR fix notes

PR #627: fix: use float base in pow to avoid Inductor dtype mismatch

Description (problem / solution / changelog)

When torch.compile traces through calculate_range with capture_scalar_outputs=True, the integer base in 2**num_bits causes num_bits to enter the Inductor graph as a symbolic int64. Inductor then emits libdevice.pow(float32, int64) in Triton codegen, which fails on the type mismatch.

Changing the expression to 2.0**num_bits makes Python produce a float result directly, so the Inductor graph stays type-consistent and Triton compiles cleanly.
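
The difference between the two bases can be seen in plain eager Python. The sketch below is only an illustration: the helper name calculate_range_values is hypothetical, it drops the torch.tensor wrapping, and it does not model how Dynamo or Inductor trace a symbolic num_bits.

```python
def calculate_range_values(num_bits, base):
    """Mirror the arithmetic of calculate_range without tensors (hypothetical helper)."""
    bit_range = base ** num_bits   # int ** int stays int; float ** int becomes float
    q_max = bit_range / 2 - 1      # true division always yields a float
    q_min = -bit_range / 2
    return bit_range, q_min, q_max


int_range, int_min, int_max = calculate_range_values(8, 2)
float_range, float_min, float_max = calculate_range_values(8, 2.0)

print(type(int_range).__name__, int_range)      # integer power
print(type(float_range).__name__, float_range)  # float power
print(int_min, int_max, float_min, float_max)   # same quantization bounds either way
```

The quantization bounds are identical in both cases; only the type of the intermediate power differs, which is the property the PR relies on.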

Related PyTorch issue: https://github.com/pytorch/pytorch/issues/177131
Needed by: https://github.com/vllm-project/llm-compressor/pull/2384

Changed files

  • src/compressed_tensors/quantization/utils/helpers.py (modified, +1/-1)

PR #2384: perf: make MSE observer compatible with torch.compile

Description (problem / solution / changelog)

Makes the MSE observer's inner loop compatible with torch.compile by extracting _compute_candidate_error as a standalone function compiled with torch.compile(dynamic=True). Early stopping is preserved in the outer loop. The compile flag is exposed as a oneshot argument (enable_observer_compile).
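
The structure described here (a standalone inner function, with early stopping kept in the outer loop) can be sketched in plain Python. This is not the llm-compressor implementation: compute_candidate_error, search_scale, the patience parameter, and the compile_fn hook are all hypothetical names, and the hook simply stands in for where a wrapper such as torch.compile(dynamic=True) would be applied.

```python
def compute_candidate_error(values, scale, q_min, q_max):
    """Quantize-dequantize `values` with `scale` and return the summed squared error."""
    total = 0.0
    for v in values:
        q = min(max(round(v / scale), q_min), q_max)  # round, then clamp to the range
        total += (q * scale - v) ** 2
    return total


def search_scale(values, candidates, q_min=-128, q_max=127, patience=2, compile_fn=None):
    """Try candidate scales in order; stop after `patience` non-improving candidates."""
    # Only the inner function is wrapped; the data-dependent loop stays in Python.
    inner = compile_fn(compute_candidate_error) if compile_fn else compute_candidate_error
    best_scale, best_err, stale = None, float("inf"), 0
    for scale in candidates:
        err = inner(values, scale, q_min, q_max)
        if err < best_err:
            best_scale, best_err, stale = scale, err, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping lives outside the wrapped function
    return best_scale, best_err


values = [0.5, -1.0, 2.0]
candidates = [4.0, 1.0, 0.5, 3.0, 5.0, 0.25]
print(search_scale(values, candidates))
```

Keeping the break in the outer loop means the wrapped function contains no data-dependent control flow, which is the usual reason for splitting the code this way.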

End-to-end benchmark (TinyLlama-1.1B, INT8 W8A8, MSE observer, 64 calibration samples, RTX 4060 Ti):

  • Eager: 4.9s, 4265 MB
  • Compiled warm: 3.9s, 4199 MB
  • Speedup: 1.26x
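
As a quick check of the reported figures, the speedup and memory difference follow directly from the two measurements above. The variable names are chosen here for illustration only.

```python
eager_s, compiled_s = 4.9, 3.9      # wall-clock seconds from the benchmark
eager_mb, compiled_mb = 4265, 4199  # memory in MB from the benchmark

speedup = round(eager_s / compiled_s, 2)           # ratio of eager to compiled time
saved_mb = eager_mb - compiled_mb                  # absolute memory difference
saved_pct = round(100 * saved_mb / eager_mb, 1)    # relative memory difference

print(f"speedup: {speedup}x, memory saved: {saved_mb} MB ({saved_pct}%)")
```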

Requires: https://github.com/vllm-project/compressed-tensors/pull/627
Related: https://github.com/pytorch/pytorch/issues/177131
Partial fix for #1485

Changed files

  • src/llmcompressor/entrypoints/oneshot.py (modified, +7/-1)
  • src/llmcompressor/observers/compile_config.py (added, +16/-0)
  • src/llmcompressor/observers/mse.py (modified, +284/-74)
  • tests/llmcompressor/observers/test_mse.py (modified, +33/-0)


🐛 Describe the bug

The fix in #173685 / #173684 addressed libdevice.pow type mismatches but doesn't cover the case where the exponent is a symbolic int64 kernel scalar produced with capture_scalar_outputs=True.

Inductor still emits libdevice.pow(2.0, ks0) where ks0 is i64, and Triton rejects the (float32, int64) combination.

Looking at the codegen, _print_FloatPow and the general pow() helper don't cast the exponent when it's an int64 scalar. _print_PowByNatural casts the base but not the exponent.

repro:

import torch
torch._dynamo.config.capture_scalar_outputs = True

def calculate_range(num_bits, device):
    bit_range = 2 ** num_bits
    q_max = torch.tensor(bit_range / 2 - 1, device=device)
    q_min = torch.tensor(-bit_range / 2, device=device)
    return q_min, q_max

@torch.compile(dynamic=True, backend="inductor")
def fn(x, num_bits):
    q_min, q_max = calculate_range(num_bits, x.device)
    bit_range = q_max - q_min
    max_val = torch.max(torch.abs(x))
    scale = max_val / (float(bit_range) / 2)
    scale = torch.where(scale == 0, torch.tensor(1e-10, device=x.device, dtype=x.dtype), scale)
    x_q = torch.clamp(torch.round(x / scale), q_min, q_max)
    x_dq = x_q * scale
    error = (x_dq - x).abs().pow(2.4).sum()
    return error, scale, scale

x = torch.randn(1, 1, 128, device="cuda", dtype=torch.float16)
result = fn(x, 8)

error:

triton.compiler.errors.CompilationError:
    tmp6 = (tl.full([], -1.0, tl.float64)) + ((libdevice.pow(2.0, ks0)) / 2)
                                                ^
(triton.language.float32, triton.language.int64)

backend isolation:

  • eager — pass
  • aot_eager — pass
  • inductor — fail

Versions

  • pytorch 2.10.0+cu128
  • python 3.12.3
  • triton (bundled with pytorch)
  • rtx 4060 ti (sm_89)

Versions

Collecting environment information...
PyTorch version: 2.10.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.1.115
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4060 Ti
Nvidia driver version: 581.42
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] torch==2.10.0
[pip3] torchvision==0.25.0
[pip3] triton==3.6.0

cc @chauhang @penguinwu @ezyang @bobrenjc93 @aditvenk @laithsakka @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

Fix Plan

Cast Exponent to Float

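To fix the issue on the Inductor side, the exponent needs to be cast to a float when it is an int64 scalar. Because _print_FloatPow and the general pow() helper emit Triton source as strings, the cast has to appear in the generated expression; calling torch.as_tensor on the exponent at codegen time would not work, since the exponent there is an expression string rather than a value. The sketch below is illustrative only: the signatures are simplified and may not match the actual Inductor code.

Code Changes

# Illustrative sketch; signatures are simplified and may not match Inductor's real code.
# Both helpers return Triton source text, so the cast must be part of the emitted string.

# _print_FloatPow
def _print_FloatPow(self, base, exponent):
    # base and exponent are already-rendered Triton expressions (strings);
    # cast the exponent to float inside the generated code
    return f"libdevice.pow({base}, ({exponent}).to(tl.float32))"

# pow() helper
def pow(self, base, exponent):
    # Same cast for the general helper
    return f"libdevice.pow({base}, ({exponent}).to(tl.float32))"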
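
Since these helpers produce Triton source text, one way to picture an exponent cast is as a string-level rewrite. The following is a self-contained toy, not Inductor's API: the Expr class and emit_pow function are hypothetical, and the dtype tracking is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """A rendered Triton expression plus the dtype it is assumed to have (toy model)."""
    text: str
    dtype: str  # e.g. "float32", "float64", "int64"


def emit_pow(base: Expr, exponent: Expr) -> str:
    """Return source text for libdevice.pow, casting integer exponents to the base dtype."""
    exp_text = exponent.text
    if exponent.dtype.startswith("int"):
        # The cast is written into the generated code, not applied to a Python value.
        exp_text = f"({exp_text}).to(tl.{base.dtype})"
    return f"libdevice.pow({base.text}, {exp_text})"


print(emit_pow(Expr("2.0", "float32"), Expr("ks0", "int64")))
print(emit_pow(Expr("tmp0", "float32"), Expr("tmp1", "float32")))
```

With this toy, the symbolic int64 exponent from the error message is rendered with an explicit cast, while an exponent that is already a float is passed through unchanged.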

Additional Changes

On the caller side, the type mismatch can be avoided without touching Inductor. As described in PR #627, calculate_range can use a float base so that the power is already a float. The return values stay the same.

def calculate_range(num_bits, device):
    # Float base (2.0 instead of 2), the change described in PR #627
    bit_range = 2.0 ** num_bits
    q_max = torch.tensor(bit_range / 2 - 1, device=device)
    q_min = torch.tensor(-bit_range / 2, device=device)
    return q_min, q_max

Updated Repro

fn from the repro needs no changes; it still unpacks two values from calculate_range.

@torch.compile(dynamic=True, backend="inductor")
def fn(x, num_bits):
    q_min, q_max = calculate_range(num_bits, x.device)
    bit_range = q_max - q_min
    max_val = torch.max(torch.abs(x))
    scale = max_val / (float(bit_range) / 2)
    scale = torch.where(scale == 0, torch.tensor(1e-10, device=x.device, dtype=x.dtype), scale)
    x_q = torch.clamp(torch.round(x / scale), q_min, q_max)
    x_dq = x_q * scale
    error = (x_dq - x).abs().pow(2.4).sum()
    return error, scale, scale
