pytorch - torch.compile: autograd.Function.apply with aliased inputs drops per-slot gradient contributions

GitHub stats
pytorch/pytorch#181146, fetched 2026-04-23 07:22:23
Comments: 0 · Participants: 1 · Reactions: 0 · Timeline: 140 events (mentioned ×66, subscribed ×66, labeled ×8)

🐛 Describe the bug

When the same Tensor is passed to two (or more) positional tensor arguments of a torch.autograd.Function.apply(...), and the backward returns different values for those positions (e.g. None for one slot and a tensor for another), torch.compile produces a graph that drops one of the per-slot gradient contributions.

In eager, autograd accumulates both slot contributions into a single .grad on the shared input. Under torch.compile, dynamo collapses the aliased inputs to a single HOP argument and then maps backward returns incorrectly — only one of the duplicated slots' return values ends up as the gradient for the deduped input.
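The eager accumulation described above can be seen in isolation with a small sketch. `TwoSlot` is a hypothetical toy function, not part of the repro; it returns a different gradient for each of its two slots so the sum is easy to recognise.

```python
import torch

class TwoSlot(torch.autograd.Function):
    """Toy function (hypothetical): two tensor slots, one gradient per slot."""

    @staticmethod
    def forward(ctx, a, b):
        return a + 2.0 * b

    @staticmethod
    def backward(ctx, dout):
        # Slot 0 receives dout, slot 1 receives 2 * dout.
        return dout, 2.0 * dout

x = torch.ones(3, requires_grad=True)
TwoSlot.apply(x, x).sum().backward()   # the same tensor fills both slots
print(x.grad)                          # eager adds both slots: 1 + 2 = 3 per element
```

This is the behaviour the compiled graph is expected to match: each apply() position contributes its own term to the shared input's .grad.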

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo @azahed98 @ydwu4 @bdhirsh @bobrenjc93 @anijain2305 @xmfan — related to the area you worked on in #180642 / #180670 / #180921, but the mechanism here is input aliasing rather than fwd side-effects, and it survives those fixes on current main.

Minimal repro (~50 LOC, pure PyTorch, no triton, no custom ops)

import torch

class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v, mask_v_grad: bool):
        ctx.save_for_backward(q, k, v)
        ctx.mask_v_grad = mask_v_grad
        return q * (k + v)

    @staticmethod
    def backward(ctx, dout):
        q, k, v = ctx.saved_tensors
        dq = dout * (k + v)
        dk = dout * q
        dv = dout * q
        if ctx.mask_v_grad:
            dk = dk + dv        # bwd pre-aggregates; returns None for v slot
            return dq, dk, None, None
        return dq, dk, dv, None

def f(q, k):
    return MyFn.apply(q, k, k, True).sum()   # same tensor k at slots 2 AND 3

device = "cuda"
for label, g in [("eager", f), ("compile", torch.compile(f, fullgraph=True))]:
    torch.manual_seed(0)
    q = torch.randn(16, device=device, requires_grad=True)
    k = torch.randn(16, device=device, requires_grad=True)
    g(q, k).backward()
    print(label, "k.grad=",
          None if k.grad is None else k.grad.abs().max().item())

Output:

eager   k.grad= 5.287691593170166
compile k.grad= None

Variants (confirms it's about the duplicated apply() slot, not the None return)

apply(q, k, k) alias | None return for v slot | eager k.grad | compile k.grad
yes                  | yes                    | 5.29         | None (fully lost)
yes                  | no                     | 5.29         | 2.64 (one slot's value only, not summed)
no                   | yes                    | 5.29         | 5.29
no                   | no                     | 2.64 / 2.64  | 2.64 / 2.64

Ablation — the broken graph comes from dynamo (not AOT or Inductor)

torch.compile(backend=…)                           | compile k.grad
"eager" (dynamo only, run captured graph in eager) | None
"aot_eager"                                        | None
"inductor"                                         | None

All three wrong → the bug is in dynamo's graph capture.

Captured graph (with a custom backend that prints gm)

autograd_function_apply(fwd_body_0, bwd_body_0, l_k_, l_q_, ...)   # only 2 tensor args

The apply(q, k, k, True) had three tensor positional args; dynamo deduped the two ks to a single HOP input. bwd_body_0 returns:

return (None, dq)     # slot 0 (for l_k_) -> None, slot 1 (for l_q_) -> dq

The original backward returned (dq, dk_after_adding_dv, None, None) — one per apply() positional. When dynamo collapsed the aliased k/v down to a single input, it picked the None from the v-slot as the grad for the deduped l_k_ instead of summing both apply() positions' per-slot returns.
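For readers who want to look at the captured graph themselves, a custom backend along the following lines may help. This is a minimal sketch: `printing_backend`, `captured`, and `toy` are illustrative names, and `toy` is a stand-in for the repro's `f`, which would be passed instead to see the `autograd_function_apply` call.

```python
import torch

captured = []  # GraphModules handed over by dynamo, kept for inspection

def printing_backend(gm: torch.fx.GraphModule, example_inputs):
    # Dynamo calls the backend once per captured graph.
    captured.append(gm)
    gm.print_readable()   # print the captured graph as readable Python
    return gm.forward     # run the captured graph unchanged, in eager

def toy(q, k):
    # Stand-in for the repro's f; uses k twice but has no autograd.Function.
    return (q * (k + k)).sum()

compiled = torch.compile(toy, backend=printing_backend, fullgraph=True)
out = compiled(torch.randn(4), torch.randn(4))
```

Because the backend returns `gm.forward` untouched, any wrong gradient observed this way comes from graph capture rather than from AOT Autograd or Inductor, which is the same reasoning as the ablation.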

Likely location

torch/_dynamo/variables/higher_order_ops.py (AutogradFunctionApplyVariable) or the autograd_function_apply HOP output-mapping logic that unifies per-slot backward returns onto deduped inputs.
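The mapping problem can be modelled without PyTorch. The sketch below is a toy model under a stated assumption: that the faulty mapping behaves like "last slot wins", which is consistent with the variants table but is not confirmed to be what dynamo does internally. All function names are hypothetical, and plain floats stand in for gradient tensors.

```python
def dedupe_slots(slot_inputs):
    """Return unique inputs (first-seen order) and a slot -> unique-index map."""
    unique, slot_to_unique = [], []
    for name in slot_inputs:
        if name not in unique:
            unique.append(name)
        slot_to_unique.append(unique.index(name))
    return unique, slot_to_unique

def map_last_slot_wins(slot_to_unique, slot_grads, n_unique):
    """Assumed faulty mapping: a later slot overwrites an earlier one."""
    out = [None] * n_unique
    for slot, grad in enumerate(slot_grads):
        out[slot_to_unique[slot]] = grad
    return out

def map_summed(slot_to_unique, slot_grads, n_unique):
    """Eager-like mapping: sum every non-None slot onto its unique input."""
    out = [None] * n_unique
    for slot, grad in enumerate(slot_grads):
        if grad is None:
            continue  # None means this slot contributes nothing
        idx = slot_to_unique[slot]
        out[idx] = grad if out[idx] is None else out[idx] + grad
    return out

# apply(q, k, k): the second and third slots alias the same input.
unique, slot_to_unique = dedupe_slots(["q", "k", "k"])
masked = [1.0, 5.0, None]    # backward pre-aggregated dk + dv, None for v slot
unmasked = [1.0, 2.5, 2.5]   # backward returned dk and dv separately

print(map_last_slot_wins(slot_to_unique, masked, len(unique)))    # [1.0, None]
print(map_last_slot_wins(slot_to_unique, unmasked, len(unique)))  # [1.0, 2.5]
print(map_summed(slot_to_unique, masked, len(unique)))            # [1.0, 5.0]
print(map_summed(slot_to_unique, unmasked, len(unique)))          # [1.0, 5.0]
```

The toy reproduces the pattern in the variants table: the overwrite mapping loses the gradient entirely in the masked case and keeps only one slot's value in the unmasked case, while summation gives the same total in both.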

Relation to #180642

#180642 covers side-effect-in-fwd + autograd_function_apply, fixed by #180670 / #180921. This is input-aliasing + asymmetric backward return, and still repros on current main after those fixes.

Error logs

No error — this is silent incorrectness. In a production model we hit this indirectly: a multi-position-shared tensor received a zero/partial gradient, parameters drifted, and a downstream MulBackward0 eventually produced NaN. That's how we stumbled onto it.

Versions

PyTorch version: 2.13.0a0+git9daaaa2
Is debug build: False
CUDA used to build PyTorch: 12.9
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-14)
Clang version: Could not collect
CMake version: version 4.3.1
Libc version: glibc-2.34

Python version: 3.10.19 (main, Feb 12 2026, 00:42:18) [Clang 21.1.4 ] (64-bit runtime)
Python platform: Linux-6.13.2-0_fbk11_0_g599ea5da5981-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.9.86
GPU: NVIDIA A100 (PG509-210)
Nvidia driver version: 580.82.07

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] optree==0.19.0
[pip3] torch==2.13.0a0+git9daaaa2
[pip3] triton==3.7.0+gitb4e20bbe

extent analysis

TL;DR

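A fix most likely belongs in torch/_dynamo/variables/higher_order_ops.py (AutogradFunctionApplyVariable) or in the autograd_function_apply HOP output-mapping logic: when aliased inputs are deduped, the per-slot backward returns need to be summed onto the deduped input rather than one of them being picked. The exact location is the reporter's estimate, not a confirmed diagnosis.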

Guidance

  • Identify the AutogradFunctionApplyVariable class in torch/_dynamo/variables/higher_order_ops.py and modify it to handle input aliasing by summing the per-slot backward returns for deduped inputs.
  • Verify the fix by running the minimal repro and checking that compile k.grad matches eager k.grad (about 5.29 in the reported run) instead of None.
  • Consider adding additional tests to cover different scenarios of input aliasing and asymmetric backward returns to ensure the fix is robust.
  • Review the autograd_function_apply HOP output-mapping logic to ensure it correctly unifies per-slot backward returns onto deduped inputs.
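The verification and testing steps above could be wrapped in a small helper that compares eager and compiled input gradients. This is a sketch under assumptions: `grads_match` and `make_inputs` are hypothetical names, and the lambda is a stand-in that should be replaced by the repro's `f` to exercise the aliased-slot case.

```python
import torch

def grads_match(fn, make_inputs, backend="eager", rtol=1e-5, atol=1e-6):
    """Return True if eager and compiled runs give matching input gradients."""
    def run(g):
        inputs = make_inputs()          # fresh leaf tensors for each run
        g(*inputs).backward()
        return [t.grad for t in inputs]

    eager_grads = run(fn)
    compiled_grads = run(torch.compile(fn, backend=backend))
    for ge, gc in zip(eager_grads, compiled_grads):
        if (ge is None) != (gc is None):
            return False                # one side lost the gradient entirely
        if ge is not None and not torch.allclose(ge, gc, rtol=rtol, atol=atol):
            return False                # partial or otherwise wrong gradient
    return True

def make_inputs():
    torch.manual_seed(0)
    return [torch.randn(16, requires_grad=True) for _ in range(2)]

# Stand-in function; swap in the repro's f to check the aliased apply() case.
ok = grads_match(lambda q, k: (q * (k + k)).sum(), make_inputs)
print(ok)
```

Returning a boolean instead of asserting keeps the helper usable both in a test and in an interactive session; the None check matters because the reported failure shows up as a missing gradient, not as an exception.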

Example

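# Illustrative sketch only, not dynamo's actual code: `sum_slot_grads` and its
# arguments are hypothetical names, and the real change may live elsewhere in
# torch/_dynamo/variables/higher_order_ops.py or the HOP output mapping.
def sum_slot_grads(inputs, grads):
    """Sum per-slot backward returns onto each unique input (by identity)."""
    summed = {}  # id(input) -> accumulated grad, in first-occurrence order
    for inp, grad in zip(inputs, grads):
        key = id(inp)
        summed.setdefault(key, None)
        if grad is None:
            continue  # None means this slot contributes no gradient
        summed[key] = grad if summed[key] is None else summed[key] + grad
    # One entry per unique (deduped) input, None if no slot contributed
    return list(summed.values())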

Notes

  • The provided fix is based on the assumption that the issue is caused by the incorrect handling of input aliasing in the AutogradFunctionApplyVariable class.
  • The fix may need to be modified or extended to cover different scenarios or edge cases.
  • The issue is specific to the torch.compile mode and does not affect the eager mode.

Recommendation

The change belongs in PyTorch itself (torch/_dynamo/variables/higher_order_ops.py or the HOP output-mapping logic), so it is a fix rather than a user-side workaround. Until it lands, the variants table indicates that compiled gradients are correct when no tensor is passed to more than one apply() slot, so avoiding aliased apply() arguments sidesteps the problem.
