pytorch: torch.compile: autograd.Function.apply with aliased inputs drops per-slot gradient contributions

pytorch · 2026-04-22 18:22:05


pytorch/pytorch#181146 • Fetched 2026-04-23 07:22:23


Author

aorenste


Error Message

Error logs

No error — this is silent incorrectness. In a production model we hit this indirectly: a multi-position-shared tensor received a zero/partial gradient, parameters drifted, and a downstream MulBackward0 eventually produced NaN. That's how we stumbled onto it.



🐛 Describe the bug

When the same Tensor is passed to two (or more) positional tensor arguments of a torch.autograd.Function.apply(...), and the backward returns different values for those positions (e.g. None for one slot and a tensor for another), torch.compile produces a graph that drops one of the per-slot gradient contributions.

In eager, autograd accumulates both slot contributions into a single .grad on the shared input. Under torch.compile, dynamo collapses the aliased inputs to a single HOP argument and then maps backward returns incorrectly — only one of the duplicated slots' return values ends up as the gradient for the deduped input.
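
As a reference for the eager behaviour described here, the sketch below runs the same `MyFn` as the minimal repro and checks that, with `k` passed in both the `k` and `v` slots, `k.grad` equals `2 * q` whether backward pre-aggregates into one slot or returns one gradient per slot. It runs on CPU, which is an assumption made for portability (the report used CUDA), and `eager_k_grad` is an illustrative helper name, not something from the report.

```python
import torch


class MyFn(torch.autograd.Function):
    # Same Function as in the minimal repro: out = q * (k + v).
    @staticmethod
    def forward(ctx, q, k, v, mask_v_grad: bool):
        ctx.save_for_backward(q, k, v)
        ctx.mask_v_grad = mask_v_grad
        return q * (k + v)

    @staticmethod
    def backward(ctx, dout):
        q, k, v = ctx.saved_tensors
        dq = dout * (k + v)
        dk = dout * q
        dv = dout * q
        if ctx.mask_v_grad:
            # Pre-aggregate: the whole k/v gradient goes to the k slot.
            return dq, dk + dv, None, None
        # One gradient per slot; autograd sums them for an aliased input.
        return dq, dk, dv, None


def eager_k_grad(mask_v_grad: bool):
    """Run the aliased call in eager and return (q values, k.grad)."""
    torch.manual_seed(0)
    q = torch.randn(16, requires_grad=True)
    k = torch.randn(16, requires_grad=True)
    MyFn.apply(q, k, k, mask_v_grad).sum().backward()
    return q.detach(), k.grad


for mask_v_grad in (True, False):
    q, k_grad = eager_k_grad(mask_v_grad)
    # With v aliased to k, d/dk sum(q * (k + k)) = 2 * q in both cases.
    print(mask_v_grad, torch.allclose(k_grad, 2 * q))
```

Any compiled run that yields something other than `2 * q` for `k.grad` has lost or mis-mapped one of the two slot contributions.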

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo @azahed98 @ydwu4 @bdhirsh @bobrenjc93 @anijain2305 @xmfan — related to the area you worked on in #180642 / #180670 / #180921, but the mechanism here is input aliasing rather than fwd side-effects, and it survives those fixes on current main.

Minimal repro (~50 LOC, pure PyTorch, no triton, no custom ops)

import torch

class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v, mask_v_grad: bool):
        ctx.save_for_backward(q, k, v)
        ctx.mask_v_grad = mask_v_grad
        return q * (k + v)

    @staticmethod
    def backward(ctx, dout):
        q, k, v = ctx.saved_tensors
        dq = dout * (k + v)
        dk = dout * q
        dv = dout * q
        if ctx.mask_v_grad:
            dk = dk + dv        # bwd pre-aggregates; returns None for v slot
            return dq, dk, None, None
        return dq, dk, dv, None

def f(q, k):
    return MyFn.apply(q, k, k, True).sum()   # same tensor k at slots 2 AND 3

device = "cuda"
for label, g in [("eager", f), ("compile", torch.compile(f, fullgraph=True))]:
    torch.manual_seed(0)
    q = torch.randn(16, device=device, requires_grad=True)
    k = torch.randn(16, device=device, requires_grad=True)
    g(q, k).backward()
    print(label, "k.grad=",
          None if k.grad is None else k.grad.abs().max().item())

Output:

eager   k.grad= 5.287691593170166
compile k.grad= None

Variants (confirms it's about the duplicated `apply()` slot, not the `None` return)

| `apply(q, k, k)` alias | `None` return for v slot | eager `k.grad` | compile `k.grad` |
|---|---|---|---|
| yes | yes | 5.29 | None (fully lost) |
| yes | no | 5.29 | 2.64 (one slot's value only, not summed) |
| no | yes | 5.29 | 5.29 |
| no | no | 2.64 / 2.64 | 2.64 / 2.64 |
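
The four rows above can be regenerated with a small driver. This is a sketch under assumptions that are not in the report: it runs on CPU and defaults to `backend="eager"` so that it needs neither CUDA nor Inductor, and `k_grad_max` and `run_matrix` are illustrative names. The compiled column depends on the PyTorch build in use, so no output is claimed here.

```python
import torch
import torch._dynamo


class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v, mask_v_grad: bool):
        ctx.save_for_backward(q, k, v)
        ctx.mask_v_grad = mask_v_grad
        return q * (k + v)

    @staticmethod
    def backward(ctx, dout):
        q, k, v = ctx.saved_tensors
        dq, dk, dv = dout * (k + v), dout * q, dout * q
        if ctx.mask_v_grad:
            return dq, dk + dv, None, None
        return dq, dk, dv, None


def k_grad_max(alias: bool, mask_v_grad: bool, compiled: bool, backend: str = "eager"):
    """Return max |k.grad| for one variant, or None if k received no gradient."""
    if alias:
        def f(q, k, v):
            # Same tensor k in both the k and v slots; v is unused.
            return MyFn.apply(q, k, k, mask_v_grad).sum()
    else:
        def f(q, k, v):
            return MyFn.apply(q, k, v, mask_v_grad).sum()

    g = f
    if compiled:
        torch._dynamo.reset()  # start each variant from a clean compile cache
        g = torch.compile(f, backend=backend, fullgraph=True)

    torch.manual_seed(0)
    q = torch.randn(16, requires_grad=True)
    k = torch.randn(16, requires_grad=True)
    v = k.detach().clone().requires_grad_(True)  # independent leaf, same values
    g(q, k, v).backward()
    return None if k.grad is None else k.grad.abs().max().item()


def run_matrix(backend: str = "eager"):
    for alias in (True, False):
        for mask_v_grad in (True, False):
            eager = k_grad_max(alias, mask_v_grad, compiled=False)
            comp = k_grad_max(alias, mask_v_grad, compiled=True, backend=backend)
            print(f"alias={alias} none_for_v={mask_v_grad} eager={eager} compile={comp}")


if __name__ == "__main__":
    run_matrix()
```

Calling `run_matrix("aot_eager")` or `run_matrix("inductor")` repeats the same comparison on another backend.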

Ablation — the broken graph comes from dynamo (not AOT or Inductor)

| `torch.compile(backend=…)` | compile `k.grad` |
|---|---|
| `"eager"` (dynamo only, run captured graph in eager) | None |
| `"aot_eager"` | None |
| `"inductor"` | None |

All three wrong → the bug is in dynamo's graph capture.

Captured graph (with a custom backend that prints `gm`)

autograd_function_apply(fwd_body_0, bwd_body_0, l_k_, l_q_, ...)   # only 2 tensor args

The apply(q, k, k, True) had three tensor positional args; dynamo deduped the two ks to a single HOP input. bwd_body_0 returns:

return (None, dq)     # slot 0 (for l_k_) -> None, slot 1 (for l_q_) -> dq

The original backward returned (dq, dk_after_adding_dv, None, None) — one per apply() positional. When dynamo collapsed the aliased k/v down to a single input, it picked the None from the v-slot as the grad for the deduped l_k_ instead of summing both apply() positions' per-slot returns.

Likely location

torch/_dynamo/variables/higher_order_ops.py (AutogradFunctionApplyVariable) or the autograd_function_apply HOP output-mapping logic that unifies per-slot backward returns onto deduped inputs.

Relation to #180642

#180642 covers side-effect-in-fwd + autograd_function_apply, fixed by #180670 / #180921. This is input-aliasing + asymmetric backward return, and still repros on current main after those fixes.

Error logs

Versions

PyTorch version: 2.13.0a0+git9daaaa2
Is debug build: False
CUDA used to build PyTorch: 12.9
ROCM used to build PyTorch: N/A

OS: CentOS Stream 9 (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-14)
Clang version: Could not collect
CMake version: version 4.3.1
Libc version: glibc-2.34

Python version: 3.10.19 (main, Feb 12 2026, 00:42:18) [Clang 21.1.4 ] (64-bit runtime)
Python platform: Linux-6.13.2-0_fbk11_0_g599ea5da5981-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.9.86
GPU: NVIDIA A100 (PG509-210)
Nvidia driver version: 580.82.07

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] optree==0.19.0
[pip3] torch==2.13.0a0+git9daaaa2
[pip3] triton==3.7.0+gitb4e20bbe

extent analysis

TL;DR

The report points to torch/_dynamo/variables/higher_order_ops.py (the AutogradFunctionApplyVariable class) as the likely location. A fix would need dynamo to handle input aliasing and asymmetric backward returns by summing the per-slot gradients for each deduped input.

Guidance

Identify the AutogradFunctionApplyVariable class in torch/_dynamo/variables/higher_order_ops.py and modify it to handle input aliasing by summing the per-slot backward returns for deduped inputs.
Verify the fix by running the provided minimal repro code and checking that the k.grad value is correctly calculated in both eager and compiled modes.
Consider adding additional tests to cover different scenarios of input aliasing and asymmetric backward returns to ensure the fix is robust.
Review the autograd_function_apply HOP output-mapping logic to ensure it correctly unifies per-slot backward returns onto deduped inputs.
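
The verification and testing steps above could be built on a small comparison harness such as the one sketched here. All three helper names are hypothetical, and `backend="eager"` is an assumption chosen so the check exercises dynamo's graph capture without needing Inductor. A gradient that is `None` on one side and a tensor on the other counts as a mismatch, which is exactly the failure mode in the report.

```python
import torch


def grads_match(eager_grads, compiled_grads, rtol=1e-5, atol=1e-6):
    """True if every input got the same gradient; None only matches None."""
    if len(eager_grads) != len(compiled_grads):
        return False
    for e, c in zip(eager_grads, compiled_grads):
        if (e is None) != (c is None):
            return False  # one side lost (or gained) a gradient
        if e is not None and not torch.allclose(e, c, rtol=rtol, atol=atol):
            return False
    return True


def collect_grads(fn, make_inputs):
    """Build fresh leaf inputs, run fn(...).backward(), return each .grad."""
    inputs = make_inputs()
    fn(*inputs).backward()
    return [t.grad for t in inputs]


def check_compile_matches_eager(fn, make_inputs, backend="eager"):
    """Compare gradients from eager and from torch.compile for the same fn."""
    eager = collect_grads(fn, make_inputs)
    compiled_fn = torch.compile(fn, backend=backend, fullgraph=True)
    compiled = collect_grads(compiled_fn, make_inputs)
    return grads_match(eager, compiled)
```

To use it with the minimal repro, `make_inputs` would seed the RNG and return fresh `(q, k)` leaves with `requires_grad=True`, and a test would assert `check_compile_matches_eager(f, make_inputs)` for each aliasing and `None`-return variant.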

Example

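# Illustrative sketch only: this is NOT the actual dynamo code, and
# accumulate_slot_grads is a hypothetical helper, not a PyTorch API.
# It shows the intended mapping from per-slot backward returns to deduped inputs.
def accumulate_slot_grads(slot_inputs, slot_grads):
    """Sum per-slot backward returns onto each unique (deduped) input."""
    deduped = {}  # id(input) -> accumulated grad (None if no slot gave one)
    for inp, grad in zip(slot_inputs, slot_grads):
        key = id(inp)  # aliased slots share the same object, hence the same key
        if grad is None:
            deduped.setdefault(key, None)  # a None slot contributes nothing
        elif deduped.get(key) is None:
            deduped[key] = grad            # first real contribution
        else:
            deduped[key] = deduped[key] + grad  # sum further contributions
    return deduped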

Notes

The provided fix is based on the assumption that the issue is caused by the incorrect handling of input aliasing in the AutogradFunctionApplyVariable class.
The fix may need to be modified or extended to cover different scenarios or edge cases.
The issue is specific to the torch.compile mode and does not affect the eager mode.

Recommendation

The report does not give a user-level workaround. The suggested direction is a fix in dynamo, likely in torch/_dynamo/variables/higher_order_ops.py, so that aliased inputs and asymmetric backward returns are handled by summing the per-slot gradients. If that diagnosis is right, the shared input tensor should then receive the same gradient under torch.compile as it does in eager.
