pytorch - Addcmul CPU bitwise numerics not matching eager w/ torch.compile

pytorch · 2026-03-09 20:27:56

GitHub stats

pytorch/pytorch#176929•Fetched 2026-04-08 00:23:50

Author

mlazos

Participants

mlazos

Assignees

mlazos

Timeline (top)

mentioned ×18 · subscribed ×18 · labeled ×3 · assigned ×1

Root Cause

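Root cause: ATen's CPU addcmul kernel (aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp) computes self + scalar * t1 * t2 as a single C++ expression in a vectorized loop. When floating-point contraction is enabled (-ffp-contract=fast is the GCC default outside strict ISO mode; Clang's default, -ffp-contract=on, likewise permits contraction within a single expression), the compiler may contract the multiply-add chain into hardware FMA instructions and is free to choose which sub-expression to fuse. Our decomposition forces a specific sequence of separately rounded operations, so its results can differ in the last bits from whatever contraction the compiler picked for the ATen kernel.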
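To make the rounding difference concrete, here is a minimal pure-Python sketch. It emulates a fused multiply-add with exact rational arithmetic (one rounding at the end) and compares it with the ordinary two-step computation (rounding after the multiply and again after the add). The helper name fused_multiply_add and the input values are illustrative assumptions, and the sketch uses float64 rather than the float32 tensors in the issue.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA: compute a*b + c exactly, then round to float64 once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # int/int true division in Python is correctly rounded


# Inputs chosen so the exact product 1 - 2**-60 is not representable in float64.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = a * b + c                 # product rounds to 1.0, so the sum is 0.0
fused = fused_multiply_add(a, b, c)  # keeps the low bits: exactly -(2**-60)

print("separate:", separate)
print("fused:   ", fused)
```

The two results differ only because the intermediate product is rounded in one case and not in the other, which is the same effect the compiler introduces when it contracts part of ATen's expression.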

Fix Action

Fix / Workaround

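This is the call from the repro that enables the dynamo decomposition under test; it reproduces the mismatch rather than working around it:

with torch._dynamo.config.patch(enable_dynamo_decompositions=True):
    actual = torch.compile(addcmul_fn, fullgraph=True)(x.clone(), t1, t2, value)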

🐛 Describe the bug

CPU addcmul_ bitwise mismatch

Summary: The dynamo decomposition of addcmul_ (tensor value path) is not bitwise-identical to ATen on CPU. ~18% of elements differ by up to 1-2 ULP.

All four decomposition strategies produce identical mismatch counts — confirming the issue is the compiler's FMA contraction of ATen's single-expression kernel vs any sequence of separate ops:

fma(t1*t2, value, self) → 751/4096 mismatches
self + (value*t1)*t2 → 751/4096 mismatches
self + (t1*t2)*value → 751/4096 mismatches
self + value*(t1*t2) → 751/4096 mismatches
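One plausible reading of the identical counts, not stated in the issue, is that with value = 0.5 (a power of two) scaling is exact, so all four decompositions round identically and differ from ATen only where the compiler fused a multiply with the add. The sketch below simulates that in pure Python. It assumes the compiler contracts the last multiply with the add, i.e. fma(value*t1, t2, self); that choice, the helper names, and the use of float64 instead of float32 are all assumptions, so the printed count is not expected to equal 751.

```python
import random
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: a*b + c computed exactly, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def decompositions(x, t1, t2, value):
    """The four op sequences listed above; every separate op rounds on its own."""
    return (
        fma(t1 * t2, value, x),
        x + (value * t1) * t2,
        x + (t1 * t2) * value,
        x + value * (t1 * t2),
    )


def contracted(x, t1, t2, value):
    """Assumed compiler output for self + value * t1 * t2 (hypothetical)."""
    return fma(value * t1, t2, x)


def count_mismatches(n=4096, value=0.5, seed=42):
    """Count, per decomposition, how many samples differ from the contracted form."""
    rng = random.Random(seed)
    counts = [0, 0, 0, 0]
    for _ in range(n):
        x, t1, t2 = (rng.gauss(0.0, 1.0) for _ in range(3))
        ref = contracted(x, t1, t2, value)
        for i, got in enumerate(decompositions(x, t1, t2, value)):
            counts[i] += got != ref
    return counts


counts = count_mismatches()
print(counts)  # four equal, nonzero counts
```

Under these assumptions no reordering of separately rounded ops can recover the contracted result, which is consistent with the issue's conclusion that the decomposition strategy is not the variable that matters.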

Why CUDA is fine: On CUDA, ATen uses explicit std::fma intrinsics (DeviceAddCmulCdiv.cuh), and inductor_prims.fma lowers to tl.fma — both are explicit and match.

Possible fixes:

1. Create an inductor prim that emits the exact C++ expression self + value * t1 * t2 in a single codegen kernel, letting the same compiler optimize it identically to ATen
2. Control FP contraction flags (-ffp-contract) in inductor's C++ codegen to match ATen's build configuration
3. Accept ~1 ULP difference on CPU (current approach)

Repro:

import torch
import torch._dynamo.config

def addcmul_fn(x, tensor1, tensor2, value):
    return x.addcmul_(tensor1, tensor2, value=value)

torch.manual_seed(42)
x = torch.randn(64, 64)
t1 = torch.randn(64, 64)
t2 = torch.randn(64, 64)
value = torch.tensor(0.5)

expected = addcmul_fn(x.clone(), t1, t2, value)

with torch._dynamo.config.patch(enable_dynamo_decompositions=True):
    actual = torch.compile(addcmul_fn, fullgraph=True)(x.clone(), t1, t2, value)

diff = (expected - actual).abs()
mismatches = (diff > 0).sum().item()
print(f"mismatches: {mismatches} / {diff.numel()} ({100*mismatches/diff.numel():.1f}%)")
print(f"max abs diff: {diff.max().item()}")
print(f"bitwise equal: {torch.equal(expected, actual)}")
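The repro reports absolute differences, while the summary speaks in ULPs. A small helper like the following could convert a pair of float32 values into a ULP distance. It is a standard-library sketch; the names f32_ordinal and ulp_distance_f32 are hypothetical and not part of PyTorch.

```python
import struct


def f32_ordinal(x: float) -> int:
    """Map a float32 value to an integer that increases monotonically with x."""
    (bits,) = struct.unpack("<i", struct.pack("<f", x))
    # Non-negative floats already sort by their bit pattern; for negative
    # floats, drop the sign bit and negate so larger magnitudes sort lower.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)


def ulp_distance_f32(a: float, b: float) -> int:
    """Number of representable float32 values between a and b (0 if equal)."""
    return abs(f32_ordinal(a) - f32_ordinal(b))


one = 1.0
next_after_one = 1.0 + 2.0**-23  # the float32 immediately above 1.0

print(ulp_distance_f32(one, next_after_one))
print(ulp_distance_f32(-0.0, 0.0))
```

To apply it to the repro, one option is to flatten both tensors to Python lists (for example with expected.flatten().tolist()) and take the maximum distance over element pairs. NaN and infinity are not handled specially here.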

Error logs

No response

Versions

1d13be67fea5d1f44998483271fc258f550fa524

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

Fix Plan

Option 1: Create an inductor prim that emits the exact C++ expression

To fix this issue, we can create an inductor prim that emits the exact C++ expression self + value * t1 * t2 in a single codegen kernel. This will allow the compiler to optimize it identically to ATen.

Step-by-Step Solution:

1. Create a new inductor prim: Create a new prim in inductor_prims.py that emits the exact C++ expression self + value * t1 * t2.
2. Update the codegen kernel: Update the codegen kernel to use the new prim.
3. Rebuild and retest: Rebuild and retest the code to ensure that the issue is fixed.

Example Code:

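# Illustrative pseudocode only: addcmul_prim and codegen_kernel are
# hypothetical names, not existing inductor APIs.

# inductor_prims.py (sketch)
def addcmul_prim(self, value, t1, t2):
    # The C++ backend would need to emit this as ONE expression so the
    # compiler sees the same contraction opportunities as ATen's kernel.
    return self + value * t1 * t2

# codegen_kernel.py (sketch)
def codegen_kernel(x, value, t1, t2):
    # Route addcmul through the single-expression prim instead of
    # decomposing it into separate mul/add ops.
    return addcmul_prim(x, value, t1, t2)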

Option 2: Control FP contraction flags

To fix this issue, we can control FP contraction flags (-ffp-contract) in inductor's C++ codegen to match ATen's build configuration.

Step-by-Step Solution:

1. Update the C++ codegen: Update the C++ codegen to control FP contraction flags.
2. Rebuild and retest: Rebuild and retest the code to ensure that the issue is fixed.

Example Code:

// codegen_kernel.cpp
#include <inductor/codegen.h>

void codegen_kernel(float* x, float value
