pytorch: `torch.compile` with Inductor silently produces wrong results for `addcdiv_`/`addcmul_` when an `.item()` graph break occurs in a loop



🐛 Describe the bug

When a compiled function contains a loop that calls a helper function, and that helper function uses .item() (causing a graph break) followed by in-place ternary ops (addcdiv_ or addcmul_) whose arguments depend on the .item() scalar, Inductor produces silently incorrect results starting from the 3rd iteration. The error grows unboundedly with iteration count.

aot_eager backend produces correct results, confirming this is an Inductor-specific issue.

Graph breaks are documented as affecting only performance, not correctness — this bug violates that contract.

Minimal reproducer

import torch
import torch._dynamo

def adam_step(param, grad, m, v, step_t):
    step_t.add_(1)
    m.mul_(0.9).add_(grad, alpha=0.1)
    v.mul_(0.999).addcmul_(grad, grad, value=0.001)
    step = step_t.item()  # graph break
    bc1 = 1 - 0.9 ** step
    bc2 = 1 - 0.999 ** step
    step_size = 0.001 / bc1
    denom = (v.sqrt() / (bc2 ** 0.5)).add_(1e-8)
    param.addcdiv_(m, denom, value=-step_size)

def train_loop(param_data, grad_data):
    p = param_data.clone()
    g = grad_data.clone()
    m = torch.zeros_like(p)
    v = torch.zeros_like(p)
    s = torch.tensor(0.0, device=p.device)
    for _ in range(3):
        adam_step(p, g, m, v, s)
    return p

torch.manual_seed(42)
p = torch.randn(128, device="cuda")
g = torch.randn(128, device="cuda")

eager_out = train_loop(p.clone(), g.clone())

torch._dynamo.reset()
compiled_out = torch.compile(train_loop, backend="inductor")(p.clone(), g.clone())
torch.cuda.synchronize()

torch._dynamo.reset()
aot_out = torch.compile(train_loop, backend="aot_eager")(p.clone(), g.clone())

print(f"Eager vs Inductor:  {(eager_out - compiled_out).abs().max().item():.6e}")  # 4.263520e-04 ← WRONG
print(f"Eager vs aot_eager: {(eager_out - aot_out).abs().max().item():.6e}")       # 2.328306e-10 ← correct

Error growth

The error accumulates with each iteration and grows unboundedly:

| Iterations | Absolute error | Relative error |
|---|---|---|
| 3 | 4.26e-04 | 0.017% |
| 5 | 2.39e-03 | 0.095% |
| 10 | 1.23e-02 | 0.49% |
| 20 | 4.41e-02 | 1.7% |
| 50 | 1.67e-01 | 6.5% |

Trigger conditions

All of the following are required to trigger the bug:

  1. .item() graph break inside a function called in a loop (≥ 3 iterations)
  2. The .item() scalar changes each iteration (e.g., step counter)
  3. In-place ternary op (addcdiv_ or addcmul_) where value= depends on the changing scalar
  4. The denom/tensor argument also depends on the scalar (e.g., v.sqrt() / f(scalar))
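To make conditions 2 to 4 concrete, here is a minimal pure-Python sketch of the scalar arithmetic in the reproducer's adam_step. The function name bias_terms and the parameter names lr, beta1 and beta2 are labels chosen here for the constants 0.001, 0.9 and 0.999; they do not appear in the report. It only illustrates that both the value= argument and the divisor applied to v.sqrt() differ on every iteration.

```python
def bias_terms(step, lr=0.001, beta1=0.9, beta2=0.999):
    """Scalar quantities derived from the step counter in adam_step.

    Parameter names are illustrative; the report hard-codes these constants.
    """
    bc1 = 1 - beta1 ** step      # first-moment bias correction
    bc2 = 1 - beta2 ** step      # second-moment bias correction
    step_size = lr / bc1         # negated and passed as value= to addcdiv_
    denom_divisor = bc2 ** 0.5   # v.sqrt() is divided by this before adding eps
    return step_size, denom_divisor


if __name__ == "__main__":
    for step in (1, 2, 3):
        step_size, denom_divisor = bias_terms(step)
        print(f"step={step}  value={-step_size:.6f}  denom divisor={denom_divisor:.6f}")
```

Because each call sees a different scalar, any compiled code that reused a value from an earlier iteration would drift from eager, which would fit the reported pattern of a growing error. The report does not confirm that this is the root cause.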

What does NOT trigger the bug

| Variant | Result |
|---|---|
| Non-in-place torch.addcdiv() | ✅ OK |
| Manual arithmetic x.sub_(m / denom * scale) | ✅ OK |
| Inlining the function body (no function call) | ✅ OK |
| print()-based graph break with Python constants | ✅ OK |
| aot_eager backend | ✅ OK |
| 1–2 loop iterations | ✅ OK |
| capture_scalar_outputs=True | ✅ OK |
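The first two variants in the table are algebraically the same update as param.addcdiv_(m, denom, value=-step_size). The sketch below writes out all three forms and checks in eager mode that they agree. The helper names are hypothetical, and the code only demonstrates that the rewrites are equivalent. It does not verify their behaviour under Inductor, which rests on the report's table.

```python
import torch


def update_addcdiv_(param, m, denom, step_size):
    # Form used in the reproducer: in-place ternary op.
    return param.addcdiv_(m, denom, value=-step_size)


def update_manual_(param, m, denom, step_size):
    # "Manual arithmetic" variant: same math, written as explicit ops.
    return param.sub_(m / denom * step_size)


def update_out_of_place(param, m, denom, step_size):
    # Non-in-place variant: returns a new tensor, so the caller must rebind it.
    return torch.addcdiv(param, m, denom, value=-step_size)


if __name__ == "__main__":
    torch.manual_seed(0)
    p = torch.randn(8)
    m = torch.randn(8)
    denom = torch.rand(8) + 0.5  # keep the divisor away from zero
    a = update_addcdiv_(p.clone(), m, denom, 0.01)
    b = update_manual_(p.clone(), m, denom, 0.01)
    c = update_out_of_place(p.clone(), m, denom, 0.01)
    print(torch.allclose(a, b, atol=1e-6), torch.allclose(a, c, atol=1e-6))
```

The manual form rounds intermediate results differently from the fused op, so the comparison uses a small tolerance rather than exact equality.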

Workaround

torch._dynamo.config.capture_scalar_outputs = True

This prevents the .item() graph break, keeping the scalar in the graph. With this setting, the diff drops to 0.
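If changing the flag globally is undesirable, one option is to scope it to a single compile-and-call. The sketch below uses torch._dynamo.config.patch as a context manager. The helper name run_with_scalar_capture is hypothetical, and since torch._dynamo.config is a private module, its behaviour may differ between PyTorch versions.

```python
import torch
import torch._dynamo


def run_with_scalar_capture(fn, *args, backend="inductor"):
    """Compile and call `fn` with capture_scalar_outputs enabled for this call only.

    Hypothetical helper: the flag is read while Dynamo traces, so the first
    call (which triggers tracing) has to happen inside the patched scope.
    """
    # Drop previously compiled code so the function is traced with the new setting.
    # Note that this clears Dynamo's caches for the whole process.
    torch._dynamo.reset()
    with torch._dynamo.config.patch(capture_scalar_outputs=True):
        return torch.compile(fn, backend=backend)(*args)
```

With the reproducer's names, the call would look like run_with_scalar_capture(train_loop, p.clone(), g.clone()). The setting reverts to its previous value once the with block exits.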

Impact

Any hand-written Adam/AdamW optimizer that uses .item() for the step counter (a common pattern in tutorials and custom implementations) will produce silently incorrect gradient updates when compiled with torch.compile(backend="inductor"). No error or warning is raised — the output looks plausible but is wrong.

Versions

PyTorch: 2.13.0.dev20260520+cu126
CUDA: 12.6
GPU: Tesla T4
Python: 3.11
OS: Linux 5.4.0-42-generic (x86_64)
