pytorch - [Inductor][CPU][bf16] Gradient mismatch in F.margin_ranking_loss backward with bfloat16 tensors and float scalar



🐛 Describe the bug

There is a gradient mismatch between Eager mode and torch.compile (Inductor) when running F.margin_ranking_loss backward with bfloat16 tensors and a Python float scalar as the margin.

In Eager mode, the gradient propagates correctly (producing non-zero values), but in Inductor the gradient is entirely zero. The discrepancy occurs at a boundary condition where the loss evaluates to exactly 0.0; whether any gradient flows there depends on how the Python float scalar is promoted or cast during the computation.

This reproducer uses CPU tensors. I also tested the same reproducer on CUDA, where Eager and Inductor both produce zero gradients, so the mismatch appears to be CPU-specific.
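The boundary can be checked with plain arithmetic. The sketch below is pure Python and does not call PyTorch; `round_to_bfloat16` and `grad_wrt_input1` are hypothetical helpers written only for illustration. It assumes, without confirmation from this issue, that one code path rounds the margin to bfloat16 before the addition while another keeps it at higher precision, and that the clamp passes gradient when the pre-clamp value equals 0.

```python
import struct


def round_to_bfloat16(value: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    Illustration only: goes through float32 and ignores NaN/inf handling.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Rounding bias: 0x7FFF plus the lowest bit that will be kept.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000  # bfloat16 is the upper 16 bits of a float32
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def grad_wrt_input1(pre_clamp: float, target: float = 1.0, n: int = 2) -> float:
    """Gradient of mean(max(0, -t * (x1 - x2) + margin)) w.r.t. one x1 entry.

    ASSUMPTION: gradient passes when pre_clamp >= 0 (inclusive at the boundary).
    """
    return -target / n if pre_clamp >= 0 else 0.0


x_val = 1.703125   # input1 value from the reproducer
margin_val = 1.7   # Python float margin from the reproducer

# The input value survives bfloat16 rounding unchanged; the margin does not.
x_bf16 = round_to_bfloat16(x_val)
margin_bf16 = round_to_bfloat16(margin_val)

# Pre-clamp value with target = 1 and input2 = 0, under each assumption.
pre_clamp_rounded = -1.0 * (x_bf16 - 0.0) + margin_bf16   # margin rounded first
pre_clamp_full = -1.0 * (x_bf16 - 0.0) + margin_val       # margin kept as given

grad_rounded = grad_wrt_input1(pre_clamp_rounded)
grad_full = grad_wrt_input1(pre_clamp_full)

print(margin_bf16, pre_clamp_rounded, grad_rounded)
print(pre_clamp_full, grad_full)
```

Under these assumptions, the rounded-margin path lands exactly on the boundary and yields -0.5 per element, while the full-precision path lands slightly below it and yields 0.0. These are the same two values reported for Eager and Inductor, but the sketch does not establish which backend takes which path internally.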

To Reproduce

import torch

def test_margin_ranking_loss_boundary():
    x_val = 1.703125  # exactly representable in bfloat16
    margin_val = 1.7  # Python float; not exactly representable in bfloat16

    input1 = torch.tensor([x_val, x_val], dtype=torch.bfloat16, requires_grad=True)
    input2 = torch.tensor([0.0, 0.0], dtype=torch.bfloat16, requires_grad=True)
    target = torch.tensor([1.0, 1.0], dtype=torch.bfloat16)

    def fn(x, y, t):
        return torch.nn.functional.margin_ranking_loss(x, y, t, margin=margin_val, reduction='mean')

    loss_eager = fn(input1, input2, target)  # eager run
    loss_eager.backward()
    grad_eager = input1.grad.clone()
    print(f"Eager grad:    {grad_eager.tolist()}")

    input1.grad.zero_()  # reset accumulated gradients before the compiled run
    input2.grad.zero_()

    opt_fn = torch.compile(fn, backend='inductor')
    loss_inductor = opt_fn(input1, input2, target)  # compiled run on the same inputs
    loss_inductor.backward()
    grad_inductor = input1.grad.clone()
    print(f"Inductor grad: {grad_inductor.tolist()}")

    assert torch.allclose(grad_eager, grad_inductor), "Gradient mismatch at boundary due to precision loss!"

if __name__ == "__main__":
    test_margin_ranking_loss_boundary()

Output:

(torch-nightly) xyt19@Oasis:/tmp$ python bug.py
Eager grad:    [-0.5, -0.5]
Inductor grad: [0.0, 0.0]
Traceback (most recent call last):
  File "/tmp/bug.py", line 31, in <module>
    test_margin_ranking_loss_boundary()
  File "/tmp/bug.py", line 28, in test_margin_ranking_loss_boundary
    assert torch.allclose(grad_eager, grad_inductor), "Gradient mismatch at boundary due to precision loss!"
AssertionError: Gradient mismatch at boundary due to precision loss!

Expected behavior

torch.compile should yield the same gradients as Eager mode for bfloat16 inputs interacting with float scalars, maintaining consistent type promotion and precision semantics.
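As a possible cross-check, offered as a sketch and not as an official diagnostic, the same values can be run through eager mode in float32, where the margin 1.7 is not rounded onto the input value. The helper name `float32_reference_grad` is hypothetical; the only PyTorch call it relies on is `torch.nn.functional.margin_ranking_loss`, as in the reproducer.

```python
import torch
import torch.nn.functional as F


def float32_reference_grad(x_val: float = 1.703125, margin_val: float = 1.7):
    """Return d(loss)/d(input1) for the reproducer's values, computed in float32."""
    x = torch.tensor([x_val, x_val], dtype=torch.float32, requires_grad=True)
    y = torch.zeros(2, dtype=torch.float32)
    t = torch.ones(2, dtype=torch.float32)
    loss = F.margin_ranking_loss(x, y, t, margin=margin_val, reduction="mean")
    loss.backward()
    return x.grad.tolist()


reference_grad = float32_reference_grad()
print(f"float32 reference grad: {reference_grad}")
```

In float32 the value fed to the clamp is slightly negative (about -0.003125), so the reference gradient is zero. That agrees with the Inductor and CUDA results described in this report, though it does not by itself settle which behavior bfloat16 eager mode on CPU is meant to have.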

Versions

PyTorch version: 2.13.0.dev20260521+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 (main, Mar 11 2026, 17:46:40) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.0.140
Nvidia driver version: 596.49
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_tensor_ir.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.21.1
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.13.0.dev20260521+cu130
[pip3] torchaudio==2.11.0.dev20260525+cu130
[pip3] torchvision==0.28.0.dev20260525+cu130
[pip3] triton==3.7.0+git88b227e2
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas 13.1.1.3 pypi_0 pypi
[conda] nvidia-cuda-cupti 13.0.85 pypi_0 pypi
[conda] nvidia-cuda-nvrtc 13.0.88 pypi_0 pypi
[conda] nvidia-cuda-runtime 13.0.96 pypi_0 pypi
[conda] nvidia-cudnn-cu13 9.20.0.48 pypi_0 pypi
[conda] nvidia-cufft 12.0.0.61 pypi_0 pypi
[conda] nvidia-curand 10.4.0.35 pypi_0 pypi
[conda] nvidia-cusolver 12.0.4.66 pypi_0 pypi
[conda] nvidia-cusparse 12.6.3.3 pypi_0 pypi
[conda] nvidia-cusparselt-cu13 0.8.1 pypi_0 pypi
[conda] nvidia-nccl-cu13 2.29.7 pypi_0 pypi
[conda] nvidia-nvjitlink 13.0.88 pypi_0 pypi
[conda] nvidia-nvtx 13.0.85 pypi_0 pypi
[conda] torch 2.13.0.dev20260521+cu130 pypi_0 pypi
[conda] torchaudio 2.11.0.dev20260525+cu130 pypi_0 pypi
[conda] torchvision 0.28.0.dev20260525+cu130 pypi_0 pypi
[conda] triton 3.7.0+git88b227e2 pypi_0 pypi

cc @nairbv @mruberry @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
