pytorch - `torch.compile`: Inductor silently produces wrong results for `addcdiv_`/`addcmul_` when an `.item()` graph break occurs in a loop

Error Message

When a compiled function contains a loop that calls a helper function, and that helper uses `.item()` (causing a graph break) followed by an in-place ternary op (`addcdiv_` or `addcmul_`) whose arguments depend on the `.item()` scalar, Inductor silently produces incorrect results starting from the 3rd iteration. The error keeps growing with iteration count, reaching 6.5% relative error at 50 iterations.


🐛 Describe the bug

The `aot_eager` backend produces correct results, which points to an Inductor-specific issue.

Graph breaks are documented as affecting only performance, not correctness, so this bug violates that contract.

Minimal reproducer

import torch
import torch._dynamo

def adam_step(param, grad, m, v, step_t):
    step_t.add_(1)
    m.mul_(0.9).add_(grad, alpha=0.1)
    v.mul_(0.999).addcmul_(grad, grad, value=0.001)
    step = step_t.item()  # graph break
    bc1 = 1 - 0.9 ** step
    bc2 = 1 - 0.999 ** step
    step_size = 0.001 / bc1
    denom = (v.sqrt() / (bc2 ** 0.5)).add_(1e-8)
    param.addcdiv_(m, denom, value=-step_size)

def train_loop(param_data, grad_data):
    p = param_data.clone()
    g = grad_data.clone()
    m = torch.zeros_like(p)
    v = torch.zeros_like(p)
    s = torch.tensor(0.0, device=p.device)
    for _ in range(3):
        adam_step(p, g, m, v, s)
    return p

torch.manual_seed(42)
p = torch.randn(128, device="cuda")
g = torch.randn(128, device="cuda")

eager_out = train_loop(p.clone(), g.clone())

torch._dynamo.reset()
compiled_out = torch.compile(train_loop, backend="inductor")(p.clone(), g.clone())
torch.cuda.synchronize()

torch._dynamo.reset()
aot_out = torch.compile(train_loop, backend="aot_eager")(p.clone(), g.clone())

print(f"Eager vs Inductor:  {(eager_out - compiled_out).abs().max().item():.6e}")  # 4.263520e-04 ← WRONG
print(f"Eager vs aot_eager: {(eager_out - aot_out).abs().max().item():.6e}")       # 2.328306e-10 ← correct

Error growth

The error accumulates with each iteration and grows unboundedly:

| Iterations | Absolute error | Relative error |
| --- | --- | --- |
| 3 | 4.26e-04 | 0.017% |
| 5 | 2.39e-03 | 0.095% |
| 10 | 1.23e-02 | 0.49% |
| 20 | 4.41e-02 | 1.7% |
| 50 | 1.67e-01 | 6.5% |
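The report does not say how the two error columns are computed. One plausible reading, sketched here as an assumption, is the maximum absolute difference for the first column and that difference divided by the largest magnitude in the eager output for the second. `error_metrics` is a hypothetical helper and is not part of the reproducer.

```python
import torch

def error_metrics(reference: torch.Tensor, candidate: torch.Tensor) -> tuple[float, float]:
    """Return (max absolute error, max absolute error / max |reference|)."""
    abs_err = (reference - candidate).abs().max().item()
    scale = reference.abs().max().item()
    # Assumed definition of "relative error"; the report does not specify one.
    rel_err = abs_err / scale if scale > 0 else float("nan")
    return abs_err, rel_err

# Small hand-made tensors, only to show the call shape.
ref = torch.tensor([1.0, -2.0, 0.5])
out = torch.tensor([1.0, -1.9, 0.5])
abs_err, rel_err = error_metrics(ref, out)
print(f"abs={abs_err:.3e} rel={rel_err:.3%}")
```

To build a table like the one above, this would be called on `eager_out` and `compiled_out` from the reproducer after changing the loop bound in `train_loop`.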

Trigger conditions

All of the following are required to trigger the bug:

1. A `.item()` graph break inside a function called in a loop (≥ 3 iterations)
2. The `.item()` scalar changes each iteration (e.g., a step counter)
3. An in-place ternary op (`addcdiv_` or `addcmul_`) whose `value=` depends on the changing scalar
4. The `denom`/tensor argument also depends on the scalar (e.g., `v.sqrt() / f(scalar)`)
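To make the scalar-dependence conditions concrete, the sketch below evaluates the scalar part of `adam_step` for the first three steps in plain Python, using the constants from the reproducer. `adam_scalars` is a hypothetical helper introduced only for illustration. Both the `value=` argument (`-step_size`) and the factor that divides `v.sqrt()` change on every call, which is the situation the conditions describe.

```python
def adam_scalars(step: float, beta1: float = 0.9, beta2: float = 0.999, lr: float = 0.001):
    """Scalars that adam_step derives from step_t.item()."""
    bc1 = 1 - beta1 ** step
    bc2 = 1 - beta2 ** step
    step_size = lr / bc1        # passed to addcdiv_ as value=-step_size
    denom_scale = bc2 ** 0.5    # divides v.sqrt() when building denom
    return step_size, denom_scale

for step in (1.0, 2.0, 3.0):
    step_size, denom_scale = adam_scalars(step)
    print(f"step={step:.0f}  value={-step_size:.6f}  denom_scale={denom_scale:.6f}")
```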

What does NOT trigger the bug

| Variant | Result |
| --- | --- |
| Non-in-place `torch.addcdiv()` | ✅ OK |
| Manual arithmetic `x.sub_(m / denom * scale)` | ✅ OK |
| Inlining the function body (no function call) | ✅ OK |
| `print()`-based graph break with Python constants | ✅ OK |
| `aot_eager` backend | ✅ OK |
| 1–2 loop iterations | ✅ OK |
| `capture_scalar_outputs=True` | ✅ OK |
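The first two variants can be read as replacements for the final `param.addcdiv_(m, denom, value=-step_size)` line in `adam_step`. The sketch below shows both and checks, in eager mode only, that they compute the same update as the in-place op. The function names are hypothetical, and the report does not say how the out-of-place result is written back to the parameter, so the first variant simply returns a new tensor. This snippet does not exercise Inductor; the OK results listed above come from the report, not from this code.

```python
import torch

def update_out_of_place(param, m, denom, step_size):
    # Non-in-place variant: returns a new tensor instead of mutating param.
    return torch.addcdiv(param, m, denom, value=-step_size)

def update_manual(param, m, denom, step_size):
    # Manual arithmetic variant: x.sub_(m / denom * scale).
    return param.sub_(m / denom * step_size)

torch.manual_seed(0)
param = torch.randn(8)
m = torch.randn(8)
denom = torch.rand(8) + 0.5  # keep the divisor away from zero
step_size = 0.01

# Reference: the in-place ternary op used in the reproducer.
expected = param.clone().addcdiv_(m, denom, value=-step_size)
via_out_of_place = update_out_of_place(param.clone(), m, denom, step_size)
via_manual = update_manual(param.clone(), m, denom, step_size)
print(torch.allclose(expected, via_out_of_place, atol=1e-6),
      torch.allclose(expected, via_manual, atol=1e-6))
```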

Workaround

torch._dynamo.config.capture_scalar_outputs = True

This prevents the `.item()` graph break by keeping the scalar in the graph. With this setting, the difference between the eager and Inductor outputs drops to 0.
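The flag above is global. If changing it process-wide is undesirable, a scoped variant may be an option. The sketch below assumes `torch._dynamo.config.patch` can be used as a context manager, which is the case in recent PyTorch releases but is worth checking against the installed version. `run_with_scalar_capture` is a hypothetical wrapper name. Compilation happens on the first call, so the call itself has to sit inside the `with` block.

```python
import torch
import torch._dynamo

def run_with_scalar_capture(fn, *args, backend="inductor"):
    """Compile and run fn with capture_scalar_outputs enabled only for this call."""
    torch._dynamo.reset()  # drop graphs cached under the previous setting
    with torch._dynamo.config.patch(capture_scalar_outputs=True):
        # torch.compile is lazy: tracing happens on this first invocation,
        # so it must run while the patched config is active.
        return torch.compile(fn, backend=backend)(*args)
```

With the reproducer's definitions, `run_with_scalar_capture(train_loop, p.clone(), g.clone())` would stand in for the plain `torch.compile(...)` call.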

Impact

Any hand-written Adam/AdamW optimizer that calls `.item()` on the step counter (a common pattern in tutorials and custom implementations) and matches the trigger conditions above can produce silently incorrect parameter updates when compiled with `torch.compile(backend="inductor")`. No error or warning is raised: the output looks plausible but is wrong.

Versions

PyTorch: 2.13.0.dev20260520+cu126
CUDA: 12.6
GPU: Tesla T4
Python: 3.11
OS: Linux 5.4.0-42-generic (x86_64)
