pytorch: torch.compile + aot_eager: backward ignores inner autocast(enabled=False) vs eager


🐛 Describe the bug

Repro

import torch
import torch.nn as nn

w = nn.Linear(64, 64, device="cuda")  # fp32 weights
h = torch.randn(4, 64, device="cuda", dtype=torch.bfloat16, requires_grad=True)

def inner_fn(x):
    # Inner region disables autocast, so the matmul is meant to run in fp32.
    with torch.amp.autocast("cuda", enabled=False):
        return x.float() @ w.weight.float().T

def outer_fn(x, compile=False):
    if compile:
        return torch.compile(inner_fn, backend="aot_eager")(x)
    return inner_fn(x)

# Both forward passes run under the same outer bf16 autocast.
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    out_eager = outer_fn(h)
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    out_compiled = outer_fn(h, compile=True)

# Both backward passes are called outside any autocast context.
out_eager.sum().backward()
grad_eager = w.weight.grad.clone()

w.weight.grad = None  # reset so the second backward does not accumulate
out_compiled.sum().backward()
grad_compiled = w.weight.grad.clone()

assert torch.equal(grad_eager, grad_compiled)  # fails

Expected

Eager and compiled paths should match bitwise: the forward pass runs under the outer autocast(bf16), but the inner autocast(enabled=False) forces the matmul to fp32. backward() is called outside any autocast context, so the weight gradients should agree (fp32 in both cases for this setup).

Actual

grad_eager and grad_compiled differ (torch.equal is False).
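torch.equal only reports that the gradients differ, not by how much. As a hedged sketch (the helper name grad_report is hypothetical and not from the report), the mismatch can be summarized as follows. The stand-in tensors simulate a bf16 round trip purely for illustration; in practice grad_eager and grad_compiled from the repro would be passed in.

```python
import torch

def grad_report(grad_a, grad_b):
    """Summarize how far apart two gradient tensors are."""
    diff = (grad_a.float() - grad_b.float()).abs()
    return {
        "dtype_a": grad_a.dtype,
        "dtype_b": grad_b.dtype,
        # Only call torch.equal when dtypes match, to keep the check unambiguous.
        "bitwise_equal": grad_a.dtype == grad_b.dtype and torch.equal(grad_a, grad_b),
        "max_abs_diff": diff.max().item(),
        "num_mismatched": int((diff > 0).sum().item()),
    }

# Stand-in data (an assumption for illustration): an fp32 tensor and the same
# tensor after a round trip through bf16, which loses mantissa bits.
torch.manual_seed(0)
g = torch.randn(64, 64)
g_roundtrip = g.to(torch.bfloat16).float()

report = grad_report(g, g_roundtrip)
same = grad_report(g, g)
print(report)
```

A difference on the order of bf16 rounding error would be consistent with a bf16 cast somewhere in the compiled backward, though it would not prove it.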

Hypothesis (short)

In eager mode, autograd restores the per-op autocast state from forward, so the backward matmuls run without the outer bf16 autocast. When the function is compiled with the aot_eager backend, the backward graph is traced under the outer autocast by AOTAutograd, so bf16 casts can get baked into the compiled backward even when backward() itself runs outside autocast. As a result, the gradients no longer match eager.

Versions

torch 2.9.1

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93 @chauhang @penguinwu

extent analysis

TL;DR

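Per the reporter's hypothesis, the eager and compiled gradients diverge because the compiled backward is traced with the outer bf16 autocast active. Resolving it means making the autocast state that backward sees consistent between eager and compiled modes.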

Guidance

  • Verify that the torch.amp.autocast contexts are nested as intended and that the inner one sets enabled=False, so the matmul is forced to FP32.
  • Check the documentation for torch.compile and the aot_eager backend to see how compilation handles the autocast state that is active at trace time.
  • Add logging or debugging statements to track the autocast state and tensor dtypes during both the forward and backward passes.
  • Check whether the installed PyTorch version (2.9.1 in the report) has known issues or later fixes related to autocast and compilation.
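The dtype logging suggested in the list above can be sketched with tensor hooks. This is an illustrative eager-only example, not the reporter's code: the seen dictionary and log_grad helper are hypothetical names, and the device falls back to CPU autocast when no GPU is available.

```python
import torch
import torch.nn as nn

# Assumption: use CPU autocast when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

w = nn.Linear(64, 64, device=device)
h = torch.randn(4, 64, device=device, dtype=torch.bfloat16, requires_grad=True)

seen = {}  # name -> dtype of the gradient observed during backward

def log_grad(name):
    # Returns a hook that records the dtype of the incoming gradient.
    def hook(grad):
        seen[name] = grad.dtype
    return hook

with torch.amp.autocast(device, dtype=torch.bfloat16):
    with torch.amp.autocast(device, enabled=False):
        out = h.float() @ w.weight.float().T
    out.register_hook(log_grad("out"))

out.sum().backward()  # called outside autocast, as in the repro
seen["weight.grad"] = w.weight.grad.dtype
seen["h.grad"] = h.grad.dtype
print(out.dtype, seen)
```

Running the same hooks on the output of the compiled function and comparing the recorded dtypes is one way to look for a difference, but hooks only see gradients at tensor boundaries and will not reveal casts inside a compiled backward graph.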

Example

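# Verify the autocast state and data types.
# Assumes torch, outer_fn, and h are defined as in the repro.
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    # torch.autocast has no `enabled` attribute; query the state instead.
    print(torch.is_autocast_enabled("cuda"))
    out_eager = outer_fn(h)
    print(out_eager.dtype)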

Notes

The issue appears to lie in the interaction between autocast, torch.compile, and autograd. The reporter's hypothesis is that the compiled backward pass is traced under the outer autocast, which would explain the differing gradients. That hypothesis has not been confirmed here, so no definitive fix can be given without further debugging.

Recommendation

Apply a workaround that keeps the autocast state consistent between the eager and compiled paths, since the issue appears to stem from the interaction between autocast and compilation. Upgrading to a newer version of PyTorch may also help, but nothing in the report confirms this.
