pytorch - 💡(How to fix) Fix Addcmul CPU bitwise numerics not matching eager w/ torch.compile [1 participants]

GitHub stats
pytorch/pytorch#176929
Fetched 2026-04-08 00:23:50

Root Cause

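Root cause: ATen's CPU addcmul kernel (aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp) computes self + scalar * t1 * t2 as a single C++ expression in a vectorized loop. When floating-point contraction is enabled (GCC defaults to -ffp-contract=fast outside strict ISO mode; recent Clang defaults to -ffp-contract=on), the compiler may contract part of the multiply-add chain into a hardware FMA instruction, and it chooses which sub-expression to fuse. An FMA rounds once where a separate multiply and add round twice, so the two can differ in the last bits. Our decomposition forces a specific sequence of separate operations, so it does not receive the same contraction as the ATen kernel.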
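The effect of contraction can be reproduced without PyTorch. The sketch below is only an illustration, not the ATen kernel: it emulates a fused multiply-add in double precision by computing a*b + c exactly with fractions.Fraction and rounding once, then compares that with the ordinary two-rounding a*b + c. The helper names fused_mul_add and separate_mul_add are hypothetical.

```python
from fractions import Fraction


def fused_mul_add(a: float, b: float, c: float) -> float:
    """Emulate an FMA: compute a*b + c exactly, then round to double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # Fraction -> float conversion rounds correctly


def separate_mul_add(a: float, b: float, c: float) -> float:
    """Ordinary evaluation: the product is rounded, then the sum is rounded."""
    product = a * b
    return product + c


# Inputs chosen so the exact product 1 - 2**-60 is not representable
# and rounds to 1.0 when the multiply is a separate operation.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fused_mul_add(a, b, c)
unfused = separate_mul_add(a, b, c)
print("fused:  ", fused)
print("unfused:", unfused)
print("bitwise equal:", fused == unfused)
```

Here the separate multiply loses the low-order term before the add, while the fused form keeps it, which is the same kind of divergence the issue reports at the 1-2 ULP level.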

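Fix / Workaround

The issue does not confirm a fix; it lists three possible fixes and currently accepts a ~1 ULP difference on CPU. The line extracted here comes from the repro, where it enables the dynamo decomposition path that shows the mismatch. It is not a workaround:

with torch._dynamo.config.patch(enable_dynamo_decompositions=True):
    actual = torch.compile(addcmul_fn, fullgraph=True)(x.clone(), t1, t2, value)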

🐛 Describe the bug


CPU addcmul_ bitwise mismatch

Summary: The dynamo decomposition of addcmul_ (tensor value path) is not bitwise-identical to ATen on CPU. ~18% of elements differ by up to 1-2 ULP.

All four decomposition strategies produce identical mismatch counts — confirming the issue is the compiler's FMA contraction of ATen's single-expression kernel vs any sequence of separate ops:

  • fma(t1*t2, value, self) → 751/4096 mismatches
  • self + (value*t1)*t2 → 751/4096 mismatches
  • self + (t1*t2)*value → 751/4096 mismatches
  • self + value*(t1*t2) → 751/4096 mismatches
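One observation that may help explain why the four counts coincide (an inference, not something the issue states): the repro uses value = 0.5, and multiplying by a power of two is exact in binary floating point as long as nothing underflows or overflows. The three unfused orderings therefore produce bitwise identical results, and fma(t1*t2, value, self) also reduces to self + 0.5*(t1*t2) with a single rounding of the sum. The sketch below checks the three unfused orderings on Python floats (float64 scalars, not float32 tensors); the function names are hypothetical.

```python
import random


def orderings(x, t1, t2, value):
    """The three unfused association orders listed in the issue."""
    return (
        x + (value * t1) * t2,
        x + (t1 * t2) * value,
        x + value * (t1 * t2),
    )


def count_disagreements(value, n=4096, seed=42):
    """Count samples where the three orderings are not all bitwise equal."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        x, t1, t2 = (rng.uniform(-4.0, 4.0) for _ in range(3))
        a, b, c = orderings(x, t1, t2, value)
        if not (a == b == c):
            count += 1
    return count


# With a power-of-two scale, every ordering agrees on every sample.
print("value=0.5 disagreements:", count_disagreements(0.5))
# With a scale that is not a power of two, the orderings can diverge.
print("value=0.3 disagreements:", count_disagreements(0.3))
```

If this reasoning applies to the float32 tensors as well, a value that is not a power of two would be a more discriminating test of the four strategies.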

Why CUDA is fine: On CUDA, ATen uses explicit std::fma intrinsics (DeviceAddCmulCdiv.cuh), and inductor_prims.fma lowers to tl.fma — both are explicit and match.

Possible fixes:

  1. Create an inductor prim that emits the exact C++ expression self + value * t1 * t2 in a single codegen kernel, letting the same compiler optimize it identically to ATen
  2. Control FP contraction flags (-ffp-contract) in inductor's C++ codegen to match ATen's build configuration
  3. Accept ~1 ULP difference on CPU (current approach)
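If option 3 is kept, a test needs a way to say "within N ULP" rather than "bitwise equal". The following is a minimal sketch for float32 values using only the standard library; ordered_bits_f32, ulp_distance_f32, and within_ulps are hypothetical helper names, not PyTorch APIs.

```python
import struct


def ordered_bits_f32(x: float) -> int:
    """Map a float32 value to an integer that increases with the value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Non-negative floats already sort by their bit pattern; negative
    # floats are sign-magnitude, so mirror them below zero.
    return bits if bits < 0x80000000 else 0x80000000 - bits


def ulp_distance_f32(a: float, b: float) -> int:
    """Number of representable float32 steps between a and b."""
    return abs(ordered_bits_f32(a) - ordered_bits_f32(b))


def within_ulps(expected, actual, max_ulps=2):
    """True if every pair of values is at most max_ulps apart."""
    return all(
        ulp_distance_f32(e, a) <= max_ulps for e, a in zip(expected, actual)
    )


expected = [1.0, -0.5, 0.0]
actual = [1.0 + 2.0**-23, -0.5, -0.0]  # first element is 1 ULP above 1.0
print(within_ulps(expected, actual, max_ulps=2))
print(within_ulps(expected, actual, max_ulps=0))
```

With tensors, the same comparison could be applied to the flattened values, for example via tensor.flatten().tolist().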

Repro:

import torch
import torch._dynamo.config

def addcmul_fn(x, tensor1, tensor2, value):
    return x.addcmul_(tensor1, tensor2, value=value)

torch.manual_seed(42)
x = torch.randn(64, 64)
t1 = torch.randn(64, 64)
t2 = torch.randn(64, 64)
value = torch.tensor(0.5)

expected = addcmul_fn(x.clone(), t1, t2, value)

with torch._dynamo.config.patch(enable_dynamo_decompositions=True):
    actual = torch.compile(addcmul_fn, fullgraph=True)(x.clone(), t1, t2, value)

diff = (expected - actual).abs()
mismatches = (diff > 0).sum().item()
print(f"mismatches: {mismatches} / {diff.numel()} ({100*mismatches/diff.numel():.1f}%)")
print(f"max abs diff: {diff.max().item()}")
print(f"bitwise equal: {torch.equal(expected, actual)}")

Error logs

No response

Versions

1d13be67fea5d1f44998483271fc258f550fa524

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo


Fix Plan

Option 1: Create an inductor prim that emits the exact C++ expression

One approach is an inductor prim that emits the exact C++ expression self + value * t1 * t2 in a single codegen kernel. Because the compiler then sees the same expression as in ATen's kernel, it should be able to contract it the same way, provided inductor compiles the generated C++ with a compiler and flags comparable to those of the ATen build.

Step-by-Step Solution:

  1. Create a new inductor prim: Add a prim in inductor_prims.py whose C++ lowering emits self + value * t1 * t2 as one expression rather than a sequence of separate ops.
  2. Update the decomposition: Route the CPU addcmul_ decomposition (tensor value path) to the new prim.
  3. Rebuild and retest: Rerun the repro and check that it reports 0 mismatches and that torch.equal(expected, actual) is True.

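Example Code (illustrative sketch; addcmul_prim and addcmul_decomp are hypothetical names, not existing inductor APIs):

# inductor_prims.py (sketch)
# Hypothetical prim: its C++ lowering would emit `self + value * t1 * t2`
# as a single expression, matching the expression in ATen's CPU kernel.
def addcmul_prim(self, value, t1, t2):
    return self + value * t1 * t2

# decomposition (sketch)
# Hypothetical: the CPU addcmul_ decomposition calls the prim instead of
# a chain of separate mul/add ops.
def addcmul_decomp(x, value, t1, t2):
    return addcmul_prim(x, value, t1, t2)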

Option 2: Control FP contraction flags

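Alternatively, inductor's C++ codegen can control the FP contraction flag (-ffp-contract) so that it matches ATen's build configuration. Matching the flag may not be sufficient on its own: with fast contraction the compiler still chooses which sub-expression to fuse, so a sequence of separate ops can be contracted differently from ATen's single expression.

Step-by-Step Solution:

  1. Update the C++ codegen: Compile inductor's generated C++ with the same -ffp-contract setting that the ATen build uses.
  2. Rebuild and retest: Rerun the repro and check that it reports 0 mismatches.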

Example Code:

// codegen_kernel.cpp
#include <inductor/codegen.h>  // hypothetical header, not a real PyTorch include

void codegen_kernel(float* x, float value
