[Inductor] Correctness mismatch in torch.cumulative_trapezoid with bfloat16

pytorch · 2026-05-29 05:59:06





🐛 Describe the bug

There is a correctness mismatch between Eager mode and torch.compile (Inductor backend) when using torch.cumulative_trapezoid with bfloat16 inputs.
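For readers unfamiliar with the operator, here is a minimal pure-Python sketch of the cumulative trapezoid rule on 1-D data. The helper name `cumulative_trapezoid_1d` is hypothetical and only illustrates the arithmetic; it makes no claim about how PyTorch or Inductor order or round the same operations internally.

```python
def cumulative_trapezoid_1d(y, x):
    """Running trapezoid-rule integral of samples y taken at points x.

    Returns a list with len(y) - 1 entries, one per interval.
    """
    total = 0.0
    out = []
    for i in range(len(y) - 1):
        # Area of one trapezoid: interval width times the mean of its two heights.
        total += (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0
        out.append(total)
    return out


print(cumulative_trapezoid_1d([1.0, 2.0, 3.0], [0.0, 1.0, 3.0]))  # [1.5, 6.5]
```

Each output element depends on a sum of two neighbouring y values, a product with an x difference, and a running sum, so there are several places where low-precision rounding can enter.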

The mismatch appears in the forward output and in both gradients (grad_y and grad_x) from the backward pass. The maximum absolute difference in grad_x reaches about 0.378, and all three torch.allclose checks fail with rtol=1e-3 and atol=1e-3.
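For scale when reading these tolerances: bfloat16 keeps 8 bits of significand precision, so neighbouring representable values near 1.0 are 2^-7 = 0.0078125 apart. The sketch below emulates float32-to-bfloat16 rounding in pure Python to show the size of one rounding step. The helper `to_bfloat16` is hypothetical, assumes round-to-nearest-even, and does not handle NaN or infinity. It says nothing by itself about whether the reported mismatch is benign.

```python
import struct


def to_bfloat16(value):
    """Round a Python float to the nearest bfloat16 value, returned as a float.

    Sketch only: assumes round-to-nearest-even and finite inputs.
    """
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # bfloat16 keeps the top 16 bits; add a bias so the cut rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_bfloat16(1.001))   # 1.0
print(to_bfloat16(1.004))   # 1.0078125
print(to_bfloat16(2.5195))  # 2.515625
```

A single rounding of a value near 2.5 already moves it by about 0.004, which is larger than atol=1e-3.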

This may be related to #185587 since both involve torch.cumulative_trapezoid under Inductor, but this reproducer is independent: it does not use dynamic=True, polygamma, Conv2d, or GroupNorm, and the main mismatch is in backward for an expanded x tensor.

Reproducer

import torch

assert torch.cuda.is_available()

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)

torch.manual_seed(0)

device = "cuda"


def make_inputs():
    torch.manual_seed(0)

    y = torch.randn(8, 1, 5, 16, device=device, dtype=torch.float32)
    y = y.to(torch.bfloat16).detach().requires_grad_(True)

    torch.manual_seed(1000)

    x_base = torch.randn(16, device=device, dtype=torch.float32)
    x_base = x_base.detach().requires_grad_(True)

    return y, x_base


def fn(y, x_base):
    x = x_base.view(1, 1, 1, 16).expand_as(y)
    return torch.cumulative_trapezoid(y, x, dim=-1)


def run_eager():
    y, x_base = make_inputs()

    out = fn(y, x_base)
    loss = out.float().sum()
    grad_y, grad_x = torch.autograd.grad(loss, (y, x_base))

    return out.detach(), grad_y.detach(), grad_x.detach()


def run_inductor():
    y, x_base = make_inputs()

    torch._dynamo.reset()
    compiled_fn = torch.compile(fn, backend="inductor")

    out = compiled_fn(y, x_base)
    loss = out.float().sum()
    grad_y, grad_x = torch.autograd.grad(loss, (y, x_base))

    return out.detach(), grad_y.detach(), grad_x.detach()


out_eager, grad_y_eager, grad_x_eager = run_eager()
out_inductor, grad_y_inductor, grad_x_inductor = run_inductor()

print("\noutput max abs diff:")
print((out_eager.float() - out_inductor.float()).abs().max())

print("\ngrad_y max abs diff:")
print((grad_y_eager.float() - grad_y_inductor.float()).abs().max())

print("\ngrad_x max abs diff:")
print((grad_x_eager - grad_x_inductor).abs().max())

print("\ngrad_x eager:")
print(grad_x_eager)

print("\ngrad_x inductor:")
print(grad_x_inductor)

print("\ngrad_x diff:")
print(grad_x_eager - grad_x_inductor)

print("\nallclose output:")
print(torch.allclose(out_eager.float(), out_inductor.float(), rtol=1e-3, atol=1e-3))

print("\nallclose grad_y:")
print(torch.allclose(grad_y_eager.float(), grad_y_inductor.float(), rtol=1e-3, atol=1e-3))

print("\nallclose grad_x:")
print(torch.allclose(grad_x_eager, grad_x_inductor, rtol=1e-3, atol=1e-3))

assert torch.allclose(out_eager.float(), out_inductor.float(), rtol=1e-3, atol=1e-3)
assert torch.allclose(grad_y_eager.float(), grad_y_inductor.float(), rtol=1e-3, atol=1e-3)
assert torch.allclose(grad_x_eager, grad_x_inductor, rtol=1e-3, atol=1e-3)
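One common way to triage an eager-versus-compiled mismatch is to measure each result against a float64 run on the same inputs instead of comparing the two to each other. The sketch below is an assumption-laden helper, not part of the original report: `run` and `errors_vs_fp64` are hypothetical names, `fn` is copied from the reproducer, and the float64 eager result is treated as the reference.

```python
import torch


def fn(y, x_base):
    # Same computation as the reproducer.
    x = x_base.view(1, 1, 1, 16).expand_as(y)
    return torch.cumulative_trapezoid(y, x, dim=-1)


def run(f, y, x_base):
    """Run f and return (output, grad_y, grad_x), all cast to float64."""
    y = y.detach().clone().requires_grad_(True)
    x_base = x_base.detach().clone().requires_grad_(True)
    out = f(y, x_base)
    grad_y, grad_x = torch.autograd.grad(out.double().sum(), (y, x_base))
    return out.detach().double(), grad_y.double(), grad_x.double()


def errors_vs_fp64(f, y, x_base):
    """Max abs error of f's output, grad_y, grad_x against float64 eager fn."""
    ref = run(fn, y.double(), x_base.double())
    got = run(f, y, x_base)
    return [(g - r).abs().max().item() for g, r in zip(got, ref)]
```

With the reproducer's inputs, `errors_vs_fp64(fn, y, x_base)` and `errors_vs_fp64(torch.compile(fn, backend="inductor"), y, x_base)` can be printed side by side. If the compiled errors are no larger than the eager ones, the gap may come from different intermediate rounding rather than a wrong result; if they are clearly larger, that supports a correctness bug.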

Output

(torch-nightly) xyt19@Oasis:/tmp$ python bug.py
torch: 2.13.0.dev20260521+cu130
cuda: 13.0

output max abs diff:
tensor(0.0204, device='cuda:0')

grad_y max abs diff:
tensor(0.0312, device='cuda:0')

grad_x max abs diff:
tensor(0.3781, device='cuda:0')

grad_x eager:
tensor([ 14.5715,  26.7449,  -2.5195,   0.3438,  30.8165, -40.6359, -40.6240,
         -7.2207,  34.6648, -17.5710, -14.7471,  19.5552,   1.4834,  -8.3599,
          4.0752,  -0.5771], device='cuda:0')

grad_x inductor:
tensor([ 14.5546,  26.9564,  -2.8976,   0.4818,  30.8799, -40.5858, -40.7334,
         -7.1762,  34.6669, -17.4636, -14.8274,  19.5194,   1.5355,  -8.4003,
          4.0609,  -0.5711], device='cuda:0')

grad_x diff:
tensor([ 0.0169, -0.2115,  0.3781, -0.1381, -0.0634, -0.0501,  0.1094, -0.0445,
        -0.0021, -0.1075,  0.0804,  0.0358, -0.0521,  0.0404,  0.0143, -0.0060],
       device='cuda:0')

allclose output:
False

allclose grad_y:
False

allclose grad_x:
False
Traceback (most recent call last):
  File "/tmp/bug.py", line 85, in <module>
    assert torch.allclose(out_eager.float(), out_inductor.float(), rtol=1e-3, atol=1e-3)
AssertionError
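The printed grad_x tensors can also be examined element by element. The sketch below copies the values from the output above (rounded to four decimals as printed) and reports the absolute and relative gaps; `diff_report` is a hypothetical helper name.

```python
grad_x_eager = [14.5715, 26.7449, -2.5195, 0.3438, 30.8165, -40.6359, -40.6240,
                -7.2207, 34.6648, -17.5710, -14.7471, 19.5552, 1.4834, -8.3599,
                4.0752, -0.5771]
grad_x_inductor = [14.5546, 26.9564, -2.8976, 0.4818, 30.8799, -40.5858, -40.7334,
                   -7.1762, 34.6669, -17.4636, -14.8274, 19.5194, 1.5355, -8.4003,
                   4.0609, -0.5711]


def diff_report(a, b):
    """Return (index, abs_diff, abs_diff / |a[i]|) for each element pair."""
    rows = []
    for i, (u, v) in enumerate(zip(a, b)):
        abs_diff = abs(u - v)
        rows.append((i, abs_diff, abs_diff / abs(u)))
    return rows


report = diff_report(grad_x_eager, grad_x_inductor)
worst_abs = max(report, key=lambda r: r[1])
worst_rel = max(report, key=lambda r: r[2])
print("largest abs diff: index %d, %.4f" % worst_abs[:2])
print("largest rel diff: index %d, %.1f%%" % (worst_rel[0], 100 * worst_rel[2]))
```

The largest absolute gap is at index 2 (about 0.378, matching the printed maximum), while the largest relative gap is at index 3, where the eager value is small. Because x_base is expanded to the shape of y, each grad_x element is a sum over 8 * 1 * 5 = 40 broadcast copies, so cancellation between terms may amplify small per-term differences.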

Versions

PyTorch version: 2.13.0.dev20260521+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 (main, Mar 11 2026, 17:46:40) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.0.140
Nvidia driver version: 596.49
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_tensor_ir.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.21.1
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.13.0.dev20260521+cu130
[pip3] torchaudio==2.11.0.dev20260525+cu130
[pip3] torchvision==0.28.0.dev20260525+cu130
[pip3] triton==3.7.0+git88b227e2
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas 13.1.1.3 pypi_0 pypi
[conda] nvidia-cuda-cupti 13.0.85 pypi_0 pypi
[conda] nvidia-cuda-nvrtc 13.0.88 pypi_0 pypi
[conda] nvidia-cuda-runtime 13.0.96 pypi_0 pypi
[conda] nvidia-cudnn-cu13 9.20.0.48 pypi_0 pypi
[conda] nvidia-cufft 12.0.0.61 pypi_0 pypi
[conda] nvidia-curand 10.4.0.35 pypi_0 pypi
[conda] nvidia-cusolver 12.0.4.66 pypi_0 pypi
[conda] nvidia-cusparse 12.6.3.3 pypi_0 pypi
[conda] nvidia-cusparselt-cu13 0.8.1 pypi_0 pypi
[conda] nvidia-nccl-cu13 2.29.7 pypi_0 pypi
[conda] nvidia-nvjitlink 13.0.88 pypi_0 pypi
[conda] nvidia-nvtx 13.0.85 pypi_0 pypi
[conda] torch 2.13.0.dev20260521+cu130 pypi_0 pypi
[conda] torchaudio 2.11.0.dev20260525+cu130 pypi_0 pypi
[conda] torchvision 0.28.0.dev20260525+cu130 pypi_0 pypi
[conda] triton 3.7.0+git88b227e2 pypi_0 pypi

cc @albanD @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
