pytorch: torch.compile produces wrong results for F.multilabel_margin_loss when target contains -1

🐛 Describe the bug

torch.compile (with both the inductor and aot_eager backends) computes an incorrect forward loss and incorrect backward gradients for F.multilabel_margin_loss when the target tensor contains -1 (the standard padding/ignore marker).

The eager implementation correctly treats -1 as a stop marker for positive class indices. The compiled decomposition does not handle -1 correctly, resulting in a different (wrong) loss value and gradient.
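To make the stop-marker behavior concrete, here is a minimal pure-Python sketch of the loss and gradient for a single sample, assuming the documented formula (hinge terms between each positive class and each non-positive class, divided by the number of classes). The function name `multilabel_margin_loss_ref` is hypothetical and this is not the PyTorch kernel itself, only a reference for checking numbers by hand.

```python
def multilabel_margin_loss_ref(x, target):
    """Loss and gradient for one sample, reading target up to the first -1."""
    n = len(x)
    # Positive class indices: everything before the first negative entry.
    positives = []
    for t in target:
        if t < 0:
            break
        positives.append(t)
    is_positive = [i in positives for i in range(n)]

    loss = 0.0
    grad = [0.0] * n
    for j in positives:
        for i in range(n):
            if is_positive[i]:
                continue
            # Hinge term max(0, 1 - (x[j] - x[i])), divided by the class count.
            margin = 1.0 - (x[j] - x[i])
            if margin > 0:
                loss += margin / n
                grad[j] -= 1.0 / n
                grad[i] += 1.0 / n
    return loss, grad


loss, grad = multilabel_margin_loss_ref([1.0, -1.0, 0.5, 0.3, -0.2], [0, 1, -1, -1, -1])
print(round(loss, 4))               # 1.48
print([round(g, 4) for g in grad])  # [-0.4, -0.6, 0.4, 0.4, 0.2]
```

With the inputs from the reproducer below, this sketch gives the same loss and gradient as the eager values reported in this issue.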

Minimal reproducer

import torch
import torch.nn.functional as F

device = "cuda"
x = torch.tensor([[1.0, -1.0, 0.5, 0.3, -0.2]], device=device, requires_grad=True)
target = torch.tensor([[0, 1, -1, -1, -1]], device=device, dtype=torch.long)

# Eager (correct)
x_e = x.detach().clone().requires_grad_(True)
loss_e = F.multilabel_margin_loss(x_e, target)
loss_e.backward()
print(f"Eager loss: {loss_e.item()}")       # 1.48
print(f"Eager grad: {x_e.grad}")            # [[-0.4, -0.6, 0.4, 0.4, 0.2]]

# Compiled (WRONG)
torch._dynamo.reset()
x_c = x.detach().clone().requires_grad_(True)
compiled_fn = torch.compile(lambda x: F.multilabel_margin_loss(x, target))
loss_c = compiled_fn(x_c)
loss_c.backward()
print(f"Compiled loss: {loss_c.item()}")     # 1.96 ← WRONG
print(f"Compiled grad: {x_c.grad}")          # [[-1.6, -0.6, 1.0, 1.0, 0.2]] ← WRONG

Expected behavior

Compiled output should match eager output exactly:

  • Loss: 1.48 (not 1.96)
  • Grad: [[-0.4, -0.6, 0.4, 0.4, 0.2]] (not [[-1.6, -0.6, 1.0, 1.0, 0.2]])

Observed behavior

| Target pattern | Eager loss | Compiled loss | FWD diff | BWD max diff |
|---|---|---|---|---|
| [0,1,2,3,4] (no padding) | ✅ correct | ✅ correct | 0 | 0 |
| [0,1,2,3,-1] | ✅ correct | ❌ wrong | 0.049 | 0.20 |
| [0,1,-1,-1,-1] | ✅ correct | ❌ wrong | 1.60 | 1.20 |
| [0,-1,-1,-1,-1] | ✅ correct | ❌ wrong | 0.69 | 0.80 |
| [-1,0,1,2,3] | ✅ correct | ❌ wrong | 7.81 | 4.00 |

Root cause

The bug is in the AOT autograd decomposition of aten::multilabel_margin_loss_forward. Both aot_eager and inductor backends produce the same wrong result, confirming the issue is in the decomposition layer, not in Inductor codegen.

The eager C++ kernel treats -1 in target as a "stop reading positive indices" marker. The decomposition does not appear to implement these stop-marker semantics correctly, so extra terms are included in the hinge loss computation.
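One way to see which extra terms might be involved is to redo the arithmetic by hand. The sketch below is a hypothesis inferred only from the numbers printed by the reproducer, not from the decomposition source: it assumes each -1 slot is read as class 0 and its hinge terms are not masked out. The helper `hinge_sum` is a hypothetical name.

```python
def hinge_sum(x, positive_slots):
    """Sum hinge terms per slot in positive_slots; repeated slots count again."""
    n = len(x)
    positive_set = set(positive_slots)
    total = 0.0
    for j in positive_slots:
        for i in range(n):
            if i not in positive_set:
                total += max(0.0, 1.0 - (x[j] - x[i]))
    return total / n


x = [1.0, -1.0, 0.5, 0.3, -0.2]
# Eager reading: stop at the first -1, so the positives are classes 0 and 1.
eager_reading = hinge_sum(x, [0, 1])
# ASSUMPTION: each -1 slot is treated as class 0 and is not masked.
hypothetical_reading = hinge_sum(x, [0, 1, 0, 0, 0])
print(round(eager_reading, 4))         # 1.48
print(round(hypothetical_reading, 4))  # 1.96
```

Under this assumption the loss matches the compiled value of 1.96, and counting the active hinge terms the same way gives the compiled gradient [[-1.6, -0.6, 1.0, 1.0, 0.2]]. That is consistent with the report but does not confirm the mechanism. The inputs for the other rows of the table are not given, so they cannot be checked this way.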

Impact

Any multi-label classification model using F.multilabel_margin_loss with -1 padding under torch.compile silently trains with incorrect gradients. There is no error or warning.

Versions

  • PyTorch: 2.13.0.dev20260520+cu126
  • CUDA: 12.6
  • GPU: Tesla T4
  • Python: 3.11

cc @chauhang @penguinwu
