pytorch - ✅(Solved) Fix arange CPU kernel has different precisions for scalar and vectorized executions [1 pull requests, 1 comments, 2 participants]

GitHub stats
pytorch/pytorch#177599 · Fetched 2026-04-08 00:47:20
Comments: 1 · Participants: 2 · Timeline events: 48 · Reactions: 0
Timeline (top): mentioned ×19, subscribed ×19, labeled ×7, commented ×1

Fix Action

Fixed

PR fix notes

PR #178334: Fix arange half-precision scalar/vectorized inconsistency

Description (problem / solution / changelog)

Fixes #177599

Problem:

The CPU arange kernel uses two code paths: a scalar lambda and a vectorized lambda. For Half and BFloat16, the vectorized path passes a float32 base value into Vectorized<T>::arange(T base, ...), which implicitly truncates it to fp16/bf16 before the per-element arithmetic. As a result, ~30% of elements can differ from the scalar path, depending on tensor size and chunk alignment.
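To make the implicit narrowing concrete, here is a minimal sketch that computes each chunk's base in fp32 and then rounds it to fp16, as a Half-typed base parameter would. The helper names, the chunk width of 16, and the start/step values (borrowed from the demo config in the linked issue) are assumptions for illustration, not the kernel's actual code.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def chunk_base_errors(start, step, n, vec_width=16):
    """Return (chunk_start, base_fp32, base_fp16, error) for every chunk."""
    start, step = to_fp32(start), to_fp32(step)
    rows = []
    for chunk_start in range(0, n, vec_width):
        base_f32 = to_fp32(start + step * chunk_start)
        base_f16 = to_fp16(base_f32)  # narrowing at the call boundary
        rows.append((chunk_start, base_f32, base_f16, base_f16 - base_f32))
    return rows


for chunk_start, b32, b16, err in chunk_base_errors(-912.2, 4.64, 20):
    print(f"chunk@{chunk_start:2d}: fp32 base={b32:.4f}  fp16 base={b16:.1f}  error={err:+.4f}")
```

Near 900 in magnitude, adjacent fp16 values are 0.5 apart, so a narrowed base can be off by up to 0.25, and every element of that chunk inherits the offset.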

Solution:

Remove the vectorized path for Half and BFloat16 using if constexpr, keeping it for other types where accscalar_t == scalar_t and no truncation occurs. This is similar to the approach in linspace_kernel in the same file, and benchmarks show it also improves fp16/bf16 performance since the vectorized path was slower due to conversion overhead. Added test_arange_lowp_precision regression test with exact equality checks.
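The kernel change itself is C++, but the dispatch idea can be sketched in Python: take the scalar-only route when the output type is narrower than the accumulation type, and keep the chunked route otherwise. Everything below (the arange_sim name, the dtype strings, the chunk width) is hypothetical and only models the behaviour described in the PR.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Hypothetical tag set; bf16 would be routed the same way.
REDUCED_PRECISION = {"half"}


def arange_sim(start, step, n, dtype, vec_width=16):
    """Simulate arange with a scalar-only route for reduced-precision output."""
    start, step = to_fp32(start), to_fp32(step)
    if dtype in REDUCED_PRECISION:
        # Scalar-only: each element is computed in fp32 from its own index
        # and narrowed exactly once.
        return [to_fp16(to_fp32(start + step * i)) for i in range(n)]
    # float32 output: the base has the accumulation type, so nothing is narrowed.
    out = []
    for chunk_start in range(0, n, vec_width):
        base = to_fp32(start + step * chunk_start)
        for j in range(min(vec_width, n - chunk_start)):
            out.append(to_fp32(base + step * j))
    return out


half_a = arange_sim(-912.2, 4.64, 20, "half", vec_width=16)
half_b = arange_sim(-912.2, 4.64, 20, "half", vec_width=4)
print(half_a == half_b)  # the half result no longer depends on chunking
```

The design point is that the reduced-precision result becomes a function of the element index alone, so it cannot vary with chunk boundaries.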

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

Changed files

  • aten/src/ATen/native/cpu/RangeFactoriesKernel.cpp (modified, +19/-11)
  • test/test_tensor_creation_ops.py (modified, +18/-0)

Code Example

output[i] = float(start) + float(step) * float(i)   ->  cast to fp16

---

base = fp16(float(start) + float(step) * chunk_start)   <-- truncation here
    output[j] = base + float(step) * j                      ->  cast to fp16
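As a worked instance of the two formulas, the sketch below evaluates a single element both ways. The inputs (start=-912.2, step=4.64, index 2 in the first chunk) come from the demo config; the helper and variable names are illustrative.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


start = to_fp32(-912.2)
step = to_fp32(4.64)
i = 2              # global element index
chunk_start = 0    # element 2 lives in the first chunk
j = i - chunk_start

# Scalar formula: everything in fp32, one cast to fp16 at the end.
scalar_val = to_fp16(to_fp32(start + step * i))

# Vectorized formula: the base is narrowed to fp16 before the per-element step.
base = to_fp16(to_fp32(start + step * chunk_start))
vector_val = to_fp16(to_fp32(base + step * j))

print(scalar_val, vector_val)  # -903.0 -902.5
```

The base rounds from about -912.2 to -912.0, and that +0.2 shift moves the element from about -902.92 to about -902.72, across the midpoint between the fp16 neighbours -903.0 and -902.5.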


This affects fp16 and bf16 outputs: the scalar and vectorized paths of the CPU arange kernel produce inconsistent results. Kudos to @pdmeta for the discoveries and discussions!

The scalar path computes each element independently in float (accscalar_t):

    output[i] = float(start) + float(step) * float(i)   ->  cast to fp16

The vectorized path truncates the base to scalar_t (fp16) before per-element arithmetic:

    base = fp16(float(start) + float(step) * chunk_start)   <-- truncation here
    output[j] = base + float(step) * j                      ->  cast to fp16

The kernel has two lambdas. The first (scalar) lambda uses start, end, and step, which are held as accscalar_t. The second (vectorized) lambda uses the same variables but downcasts the base to scalar_t when it calls Vectorized<scalar_t>::arange.

The demo script below shows the impact of the issue by simulating PyTorch's scalar and vectorized paths and comparing their outputs.

"""
Demonstrates the precision difference between the scalar and vectorized
paths in PyTorch CPU's arange kernel for fp16 output.

The scalar path computes each element independently in float (accscalar_t):
    output[i] = float(start) + float(step) * float(i)   ->  cast to fp16

The vectorized path truncates the base to fp16 before per-element arithmetic:
    base = fp16(float(start) + float(step) * chunk_start)   <-- truncation here
    output[j] = base + float(step) * j                      ->  cast to fp16

This script simulates both paths in Python using struct for fp16/fp32 casts.
"""

import struct


def to_fp32(x):
    """Round a Python float to fp32 precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_fp16(x):
    """Round a Python float to fp16 precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


def cpu_scalar_path(start_f32, step_f32, n):
    """
    Simulates RangeFactoriesKernel.cpp scalar lambda (line 32-34):
        return start + step * (idx++);
    start and step are accscalar_t = float. Result cast to scalar_t = Half.
    """
    results = []
    for i in range(n):
        val_f32 = to_fp32(start_f32 + step_f32 * float(i))
        val_fp16 = to_fp16(val_f32)
        results.append(val_fp16)
    return results


def cpu_vectorized_path(start_f32, step_f32, n, vec_width=16):
    """
    Simulates RangeFactoriesKernel.cpp vectorized lambda (line 35-39):
        Vectorized<scalar_t>::arange(start + step * idx, step)

    The key: Vectorized<Half>::arange(T base, step_t step)
    has T=Half for the base parameter, so the float value gets truncated to fp16.
    Then base + j*step promotes Half back to float, but precision is already lost.
    """
    results = []
    for chunk_start in range(0, n, vec_width):
        # Base computed in float, then truncated to fp16 by the function signature
        base_f32 = to_fp32(start_f32 + step_f32 * float(chunk_start))
        base_fp16 = to_fp16(base_f32)  # <-- THIS IS THE BUG

        chunk_size = min(vec_width, n - chunk_start)
        for j in range(chunk_size):
            # base (fp16) promoted back to float for arithmetic with step (float)
            val_f32 = to_fp32(float(base_fp16) + step_f32 * float(j))
            val_fp16 = to_fp16(val_f32)
            results.append(val_fp16)
    return results


# === Config from the failing test ===
start = to_fp32(-912.2)
step = to_fp32(4.64)
n = 20

print(f"Config: start={start}, step={step}, n={n}, dtype=fp16")
print(f"fp16(start) = {to_fp16(start)}  (truncation error: {to_fp16(start) - start:+.4f})")
print()

scalar = cpu_scalar_path(start, step, n)
vectorized = cpu_vectorized_path(start, step, n)

print(f"{'i':>3}  {'Scalar':>10}  {'Vectorized':>10}  {'Match':>6}")
print("-" * 45)
mismatches = 0
for i in range(n):
    match = "ok" if scalar[i] == vectorized[i] else "DIFF"
    if match == "DIFF":
        mismatches += 1
    print(f"{i:3d}  {scalar[i]:10.1f}  {vectorized[i]:10.1f}  {match:>6}")

print(f"\n{mismatches} mismatches out of {n} elements")
print()
print("The scalar path keeps full fp32 precision for the base value.")
print("The vectorized path truncates base to fp16, shifting all values")
print("in the chunk and causing different fp16 rounding decisions.")
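The script above simulates fp16 only. Since the issue also affects bf16, a comparable simulation can be sketched by rounding fp32 to bf16 through its bit pattern. The to_bf16 helper is an assumption of this sketch (round-to-nearest-even on the upper 16 bits, NaN handling omitted), as are the compare_bf16 name and the default chunk width; none of it is taken from PyTorch.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_bf16(x):
    """Round to bf16 precision, returned as a Python float (NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 low bits that bf16 discards.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def compare_bf16(start, step, n, vec_width=16):
    """Return (scalar, vectorized) bf16 results for the two simulated paths."""
    start, step = to_fp32(start), to_fp32(step)
    scalar = [to_bf16(to_fp32(start + step * i)) for i in range(n)]
    vectorized = []
    for chunk_start in range(0, n, vec_width):
        base = to_bf16(to_fp32(start + step * chunk_start))  # narrowed base
        for j in range(min(vec_width, n - chunk_start)):
            vectorized.append(to_bf16(to_fp32(base + step * j)))
    return scalar, vectorized


scalar, vectorized = compare_bf16(-912.2, 4.64, 200)
diffs = [i for i in range(len(scalar)) if scalar[i] != vectorized[i]]
print(f"{len(diffs)} of {len(scalar)} elements differ")
```

How many elements differ depends on start, step, n, and chunk alignment, so the printed count is best read as a property of this one config rather than a general rate.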

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @jspark1105 @pingtakpetertang

Versions

Collecting environment information...
PyTorch version: 2.11.0.dev20260203+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-14)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.34

Python version: 3.14.2 | packaged by Anaconda, Inc. | (main, Dec 19 2025, 11:49:32) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-6.13.2-0_fbk11_0_g599ea5da5981-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: NVIDIA H100
GPU 1: NVIDIA H100
GPU 2: NVIDIA H100
GPU 3: NVIDIA H100
GPU 4: NVIDIA H100
GPU 5: NVIDIA H100
GPU 6: NVIDIA H100
GPU 7: NVIDIA H100

Nvidia driver version: 580.82.07
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 100%
CPU max MHz: 2400.0000
CPU min MHz: 1500.0000
BogoMIPS: 4793.01
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization: AMD-V
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==2.4.1
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.17.1.4
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.28.9
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] torch==2.11.0.dev20260203+cu128
[pip3] triton==3.6.0+git9844da95
[conda] numpy 2.4.1 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.8.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.17.1.4 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.3.83 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.9.90 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.3.90 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.8.93 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.28.9 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.8.90 pypi_0 pypi
[conda] torch 2.11.0.dev20260203+cu128 pypi_0 pypi
[conda] triton 3.6.0+git9844da95 pypi_0 pypi

extent analysis

Fix Plan

The issue arises from the truncation of the base value to fp16 before performing per-element arithmetic in the vectorized path. To fix this, we need to maintain the full fp32 precision for the base value.

Code Changes

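In the demo script, this corresponds to changing the simulated cpu_vectorized_path function so that the base value stays in fp32. Note that the merged fix (PR #178334) takes a different route in the kernel itself: it removes the vectorized path for Half and BFloat16. Here's the updated simulation: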

def cpu_vectorized_path(start_f32, step_f32, n, vec_width=16):
    """Simulated vectorized path with the base kept in fp32 (no fp16 truncation)."""
    results = []
    for chunk_start in range(0, n, vec_width):
        # Keep base in fp32 precision; to_fp32 mimics float arithmetic rounding
        base_f32 = to_fp32(start_f32 + step_f32 * float(chunk_start))

        chunk_size = min(vec_width, n - chunk_start)
        for j in range(chunk_size):
            # Perform arithmetic in fp32 precision, then cast once to fp16
            val_f32 = to_fp32(base_f32 + step_f32 * float(j))
            val_fp16 = to_fp16(val_f32)
            results.append(val_fp16)
    return results

By maintaining the base value in fp32 precision, we ensure that the per-element arithmetic is performed with full precision, reducing the truncation error.

Verification

To verify the fix, we can compare the results of the scalar and vectorized paths using the same demo script. The number of mismatches should be significantly reduced or eliminated.
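One way to run that comparison is sketched below. It re-implements the scalar path and both variants of the simulated vectorized path in a self-contained form, then counts mismatches for the config from the issue. The keep_base_fp32 flag and the count_mismatches helper are hypothetical names introduced for this sketch.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def scalar_path(start, step, n):
    """Each element computed in fp32 from its own index, then cast to fp16."""
    return [to_fp16(to_fp32(start + step * i)) for i in range(n)]


def vectorized_path(start, step, n, vec_width=16, keep_base_fp32=False):
    """Chunked path; the base is narrowed to fp16 unless keep_base_fp32 is set."""
    out = []
    for chunk_start in range(0, n, vec_width):
        base = to_fp32(start + step * chunk_start)
        if not keep_base_fp32:
            base = to_fp16(base)  # original behaviour: base narrowed to fp16
        for j in range(min(vec_width, n - chunk_start)):
            out.append(to_fp16(to_fp32(base + step * j)))
    return out


def count_mismatches(a, b):
    """Number of positions where the two result lists differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


start, step, n = to_fp32(-912.2), to_fp32(4.64), 20
reference = scalar_path(start, step, n)
before = count_mismatches(reference, vectorized_path(start, step, n))
after = count_mismatches(reference, vectorized_path(start, step, n, keep_base_fp32=True))
print(f"before: {before} mismatches, after: {after} mismatches")
```

Exact agreement is not guaranteed for every input, because the chunked form still rounds the base to fp32 one more time than the scalar form does. The merged PR sidesteps that question by dropping the vectorized path for these types.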

Extra Tips

  • When working with mixed precision arithmetic, it's essential to consider the precision of intermediate results to avoid truncation errors.
  • Using fp32 precision for intermediate results can help maintain accuracy, especially when performing arithmetic operations that involve large numbers or small differences.
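To illustrate why magnitude matters for these tips, the sketch below reports the gap between adjacent fp16 values at a few magnitudes. The fp16_spacing helper is hypothetical and assumes a finite, normal, nonzero fp16 input.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_spacing(x):
    """Gap between adjacent fp16 values at the magnitude of x (normal range only)."""
    # frexp returns (m, exp) with |value| = m * 2**exp and 0.5 <= m < 1.
    _, exp = math.frexp(abs(to_fp16(x)))
    # fp16 carries 11 significant bits (10 stored plus 1 implicit).
    return 2.0 ** (exp - 11)


for value in (1.0, 4.64, 912.2, 2048.0):
    print(f"near {value}: spacing {fp16_spacing(value)}")
```

At the magnitude of the start value in this issue the spacing is 0.5, which is why a base error of about 0.2 is enough to flip rounding decisions for later elements.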
