pytorch - 💡(How to fix) Fix torch.compile + aot_eager: backward ignores inner autocast(enabled=False) vs eager

pytorch · 2026-04-09 11:01:18

🐛 Describe the bug

Repro

import torch
import torch.nn as nn

w = nn.Linear(64, 64, device="cuda")
h = torch.randn(4, 64, device="cuda", dtype=torch.bfloat16, requires_grad=True)

def inner_fn(x):
    with torch.amp.autocast("cuda", enabled=False):
        return x.float() @ w.weight.float().T

def outer_fn(x, compile=False):
    if compile:
        return torch.compile(inner_fn, backend="aot_eager")(x)
    return inner_fn(x)

with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    out_eager = outer_fn(h)
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    out_compiled = outer_fn(h, compile=True)

out_eager.sum().backward()
grad_eager = w.weight.grad.clone()

w.weight.grad = None
out_compiled.sum().backward()
grad_compiled = w.weight.grad.clone()

assert torch.equal(grad_eager, grad_compiled)  # fails

Expected

The eager and compiled paths should match bitwise. The forward pass runs under the outer autocast(bf16), but the inner autocast(enabled=False) forces the matmul to fp32. backward() is called outside any autocast context, so the weight gradients should agree (fp32 in both cases for this setup).

Actual

grad_eager and grad_compiled differ (torch.equal is False).

Hypothesis (short)

In eager mode, autograd restores the per-op autocast state from forward, so backward matmuls run without the outer bf16 autocast. When compiled with the aot_eager backend, the backward is traced under the outer autocast by AOT Autograd, so bf16 casts can get baked into the compiled backward even though backward() runs outside autocast. As a result, the gradients no longer match eager.
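One way to probe this hypothesis is to compute the expected weight gradient by hand, once in fp32 and once with bf16 operands, and see which one the compiled gradient resembles. The sketch below is not part of the original report: `reference_weight_grads` is a hypothetical helper, and the bf16 branch only approximates what autocast would do inside a compiled backward.

```python
import torch

def reference_weight_grads(x: torch.Tensor, out_features: int):
    """Weight gradient of (x.float() @ W.float().T).sum(), computed two ways.

    Hypothetical helper for testing the hypothesis; not a PyTorch API.
    """
    # d(sum)/d(out) is all ones, so grad_W = ones(batch, out).T @ x.
    grad_out = torch.ones(x.shape[0], out_features, device=x.device)
    x32 = x.detach().float()
    # fp32 path: the matmul as it would run with no autocast active.
    grad_fp32 = grad_out.T @ x32
    # bf16 path: the same matmul with operands cast to bf16, mimicking autocast.
    grad_bf16 = (grad_out.to(torch.bfloat16).T @ x32.to(torch.bfloat16)).float()
    return grad_fp32, grad_bf16
```

With the repro's tensors, `g32, g16 = reference_weight_grads(h, 64)` followed by `(grad_compiled - g16).abs().max()` and `(grad_compiled - g32).abs().max()` shows which reference is closer. A compiled gradient that sits closer to the bf16 reference would be consistent with the hypothesis, though it would not prove it.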

Versions

torch 2.9.1

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93 @chauhang @penguinwu

extent analysis

TL;DR

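The gradients appear to diverge because the compiled backward is traced under the outer bf16 autocast, while the eager backward runs without it. Making the autocast state at trace time match the state that backward() actually runs under should bring the two paths back in line.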

Guidance

Verify that the torch.amp.autocast contexts are nested as intended: the inner context must pass enabled=False so that the matmul runs in fp32.
Check the torch.compile and AOT Autograd documentation for how the aot_eager backend handles autocast state when it traces the backward graph.
Log the autocast state (torch.is_autocast_enabled) and the tensor dtypes during both the forward and backward passes to see where bf16 casts enter.
Check the release notes and issue tracker for your PyTorch version for known interactions between autocast and torch.compile.
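For the logging step, a bare `torch.equal` only reports that the gradients differ, not by how much. A minimal sketch of a more informative comparison follows; `grad_report` is a hypothetical helper name, not something from the issue or from PyTorch.

```python
import torch

def grad_report(a: torch.Tensor, b: torch.Tensor) -> dict:
    """Summarize how two gradient tensors differ (hypothetical helper)."""
    # Compare in fp32 so that mixed-dtype inputs can be subtracted safely.
    diff = (a.float() - b.float()).abs()
    return {
        "dtypes": (a.dtype, b.dtype),
        # Only call torch.equal when the dtypes already agree.
        "bitwise_equal": a.dtype == b.dtype and torch.equal(a, b),
        "max_abs_diff": diff.max().item(),
        "num_mismatched": int((diff > 0).sum().item()),
    }
```

Calling `grad_report(grad_eager, grad_compiled)` after the repro shows whether the mismatch is a handful of last-bit differences or a systematic loss of precision across most entries.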

Example

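# Verify the autocast state and data types (reuses outer_fn and h from the repro)
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    print(torch.is_autocast_enabled("cuda"))  # True inside the outer context
    out_eager = outer_fn(h)
    print(out_eager.dtype)  # torch.float32: the inner autocast(enabled=False) keeps the matmul in fp32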

Notes

The issue appears to stem from the interaction between autocast, compilation, and autograd. The reporter's hypothesis is that the compiled backward pass is traced under the outer autocast, which makes the gradients differ from eager. The report does not confirm this with a trace or a graph dump, so it is not enough to settle on a definitive fix.

Recommendation

Until a fix is confirmed, work around the problem by keeping the autocast state consistent between the point where the function is traced and the point where backward() runs. Upgrading to a newer PyTorch release may also help, but nothing in the issue confirms that a fix has shipped.
