pytorch - ✅(Solved) Fix [inductor] torch.compile produces incorrect results when applying `index_put` and `sort` on the return of `pad` [1 pull requests, 2 comments, 3 participants]

GitHub stats
pytorch/pytorch#177631 (fetched 2026-04-08 00:47:07)
Comments: 2
Participants: 3
Timeline: 94
Reactions: 0
Timeline (top): mentioned ×37, subscribed ×37, labeled ×12, commented ×2

Fix Action

Fixed

PR fix notes

PR #177816: [inductor] Preserve mutation ordering deps in recompute_size_and_body

Description (problem / solution / changelog)

Summary

Fixes https://github.com/pytorch/pytorch/issues/177631

SchedulerNode.recompute_size_and_body() calls _compute_attrs(), which recomputes read_writes from scratch via extract_read_writes(). This only discovers natural data-flow dependencies and silently drops the manually added fake dependencies (WeakDep, StarDep) that compute_dependencies() inserted for mutation ordering.

When the CPU backend's fuse() method calls recompute_size_and_body() to reconcile loop ranges before fusion, the WeakDep that enforces "sort must run before the mutation of its input" is lost. The fused mutation kernel then gets scheduled before the sort, causing the sort to read already-mutated data and produce incorrect results.

Root cause chain

  1. remove_noop_ops eliminates no-op constant_pad_nd(a, [0,0]), making sort(b) become sort(arg0_1) — both sort and the input mutation now target the same buffer
  2. compute_dependencies() correctly adds WeakDep('buf0') to the mutation node, ensuring it runs after the sort's FallbackKernel
  3. During fusion, the CPU backend calls recompute_size_and_body() on the mutation node to reconcile loop ranges — this recomputes read_writes from scratch, dropping the WeakDep
  4. The fused mutation kernel now has no ordering constraint relative to the sort, and gets scheduled first
  5. Sort reads the already-mutated input, producing [[1,1],[0,0]] instead of [[0,0],[0,0]]
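Steps 4 and 5 can be mimicked in plain eager mode by running the two operations by hand in each order. This is only an illustration of what losing the ordering edge does; it is not the code Inductor generates, and the helper names are made up for this example.

```python
import torch


def sort_then_mutate(a):
    # Order enforced by the WeakDep: the sort reads the input before it is mutated.
    values, _ = torch.sort(a)
    a[0] = 1
    return values


def mutate_then_sort(a):
    # Order once the WeakDep is dropped: the mutation kernel runs first.
    a[0] = 1
    values, _ = torch.sort(a)
    return values


print(sort_then_mutate(torch.zeros(2, 2)))  # values: [[0., 0.], [0., 0.]]
print(mutate_then_sort(torch.zeros(2, 2)))  # values: [[1., 1.], [0., 0.]]
```

The second result is exactly the wrong output reported for the compiled function.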

Fix

Save and restore fake dependencies (WeakDep, StarDep) across _compute_attrs(), following the same pattern already used in refresh_dependencies().
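The save-and-restore pattern can be sketched with a toy model. Everything below (`ToyNode`, `recompute_buggy`, `recompute_fixed`, and the simplified dependency classes) is hypothetical and only mirrors the names used in the PR; it is not Inductor's actual SchedulerNode code.

```python
from dataclasses import dataclass, field


# Hypothetical, heavily simplified stand-ins for Inductor's dependency types.
@dataclass(frozen=True)
class MemoryDep:  # data-flow read that a fresh analysis rediscovers
    name: str


@dataclass(frozen=True)
class WeakDep:  # fake dep used only to enforce ordering
    name: str


@dataclass(frozen=True)
class StarDep:  # fake dep on an entire buffer
    name: str


FAKE_DEP_TYPES = (WeakDep, StarDep)


@dataclass
class ToyNode:
    data_flow_reads: frozenset  # what a from-scratch analysis would find
    reads: set = field(default_factory=set)

    def _extract_reads(self):
        # Like recomputing read_writes from scratch: only data-flow deps survive.
        return set(self.data_flow_reads)

    def recompute_buggy(self):
        # Fake deps added earlier for mutation ordering are silently lost.
        self.reads = self._extract_reads()

    def recompute_fixed(self):
        # Save the fake deps, recompute, then add them back.
        saved = {d for d in self.reads if isinstance(d, FAKE_DEP_TYPES)}
        self.reads = self._extract_reads() | saved
```

For a node whose reads are `{MemoryDep("buf1"), WeakDep("buf0")}`, `recompute_buggy()` leaves only `MemoryDep("buf1")`, while `recompute_fixed()` keeps the `WeakDep("buf0")` that orders the mutation after the sort.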

Test plan

  • Added test_sort_after_noop_pad_and_mutation to test/inductor/test_torchinductor.py
  • Verified passing on both CPU and CUDA
  • All existing sort tests pass (no regressions)
  • All existing input mutation tests pass (no regressions)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @ezyang

Changed files

  • test/inductor/test_torchinductor.py (modified, +20/-0)
  • torch/_inductor/scheduler.py (modified, +9/-0)

🐛 Describe the bug

When compiling a model with no-op padding, broadcast index_put, and sort, the compiled model outputs incorrect results.

Here is the code to reproduce:

import torch


def fn(a):
    b = torch.nn.functional.pad(a, (0, 0))
    a[0] = 1  # a[0, 0] = 1 will not cause inconsistency
    sorted_tensor, _ = torch.sort(b) # return b here will not cause inconsistency
    return sorted_tensor

x1 = torch.zeros((2, 2), dtype=torch.float32)
x2 = torch.zeros((2, 2), dtype=torch.float32)

cfunc = torch.compile(fn, backend='inductor')
out1 = fn(x1)
out2 = cfunc(x2)

print(f'{out1=}')
print(f'{out2=}')
torch.testing.assert_close(out1, out2, equal_nan=True)

'''
out1=tensor([[0., 0.],
        [0., 0.]])
out2=tensor([[1., 1.],
        [0., 0.]])
'''

After some investigation, I found the following:

  1. Changing a[0] = 1 to a[0, 0] = 1 does not cause the inconsistency.
  2. Returning b directly instead of torch.sort(b) does not cause the inconsistency either.
  3. The aot_eager and eager backends do not have this issue.
  4. This appears to be a regression: the issue does not occur on torch 2.8.
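As a baseline for these findings, the eager semantics that the compiled graph has to preserve can be checked directly: even with zero padding, `torch.nn.functional.pad` returns a tensor that does not share memory with its input, so the later in-place write to `a` leaves `b` untouched. That is why eager mode returns all zeros, and why Inductor must keep the sort ordered before the mutation once `remove_noop_ops` makes both refer to the same buffer.

```python
import torch
import torch.nn.functional as F

a = torch.zeros(2, 2)
b = F.pad(a, (0, 0))  # value-wise a no-op
a[0] = 1              # in-place write to the original tensor only

# In eager mode b is an independent copy, so it still holds zeros.
print(torch.equal(b, torch.zeros(2, 2)))  # True
print(torch.sort(b).values)               # all zeros, matching out1
```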

Error logs

No response

Versions

Versions of relevant libraries:
  • [pip3] numpy==2.4.2
  • [pip3] torch==2.12.0.dev20260316+cpu
  • [pip3] torchvision==0.26.0.dev20260223+cpu
  • [pip3] triton==3.6.0
  • [conda] numpy 2.4.2 pypi_0 pypi
  • [conda] torch 2.12.0.dev20260316+cpu pypi_0 pypi
  • [conda] torchvision 0.26.0.dev20260223+cpu pypi_0 pypi
  • [conda] triton 3.6.0 pypi_0 pypi

cc @ezyang @gchanan @kadeng @msaroufim @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

Fix Plan

To avoid the incorrect results when compiling a model with no-op padding, broadcast index_put, and sort, either use a PyTorch build that contains the fix or change the code so that it no longer triggers the regression.

Update PyTorch Version

Use a PyTorch build that includes PR #177816. The affected build in the report is the 2.12.0.dev20260316 nightly; the reporter did not see the issue on torch 2.8, so staying on (or temporarily going back to) 2.8 is a fallback if a build with the fix is not available yet.

Modify the Code

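If updating PyTorch is not feasible, the code can be changed to avoid the bug. Based on the findings, changing a[0] = 1 to a[0, 0] = 1 prevents the inconsistency. Note that this writes only one element of a instead of the whole first row, so it is equivalent only if nothing else depends on the mutated a. Returning b instead of torch.sort(b) also avoids the mismatch, but it changes the function's result.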

Here is an example of how to modify the code:

import torch

def fn(a):
    b = torch.nn.functional.pad(a, (0, 0))
    # Change a[0]=1 to a[0,0]=1 to avoid the bug
    a[0, 0] = 1  
    sorted_tensor, _ = torch.sort(b) 
    return sorted_tensor

x1 = torch.zeros((2, 2), dtype=torch.float32)
x2 = torch.zeros((2, 2), dtype=torch.float32)

cfunc = torch.compile(fn, backend='inductor')
out1 = fn(x1)
out2 = cfunc(x2)

print(f'{out1=}')
print(f'{out2=}')
torch.testing.assert_close(out1, out2, equal_nan=True)

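For completeness, this is the variant from the second finding. It no longer sorts, so it only confirms that the sort is part of the trigger; it is not a usable workaround: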

import torch

def fn(a):
    b = torch.nn.functional.pad(a, (0, 0))
    a[0] = 1  
    # Return b directly to avoid the bug
    # sorted_tensor, _ = torch.sort(b) 
    return b

x1 = torch.zeros((2, 2), dtype=torch.float32)
x2 = torch.zeros((2, 2), dtype=torch.float32)

cfunc = torch.compile(fn, backend='inductor')
out1 = fn(x1)
out2 = cfunc(x2)

print(f'{out1=}')
print(f'{out2=}')
torch.testing.assert_close(out1, out2, equal_nan=True)

Verification

To verify the fix, run the modified code and check that out1 and out2 are equal, for example with torch.testing.assert_close.
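Because fn mutates its argument, the eager and compiled runs each need their own freshly created input, as x1 and x2 provide in the repro. A small helper along these lines keeps that pattern in one place; `compare_with_eager` and its `compile_fn` parameter are made up for this sketch.

```python
import functools

import torch


def fn(a):
    b = torch.nn.functional.pad(a, (0, 0))
    a[0] = 1
    sorted_tensor, _ = torch.sort(b)
    return sorted_tensor


def compare_with_eager(fn, make_input, compile_fn=None):
    """Run fn eagerly and compiled, each on a fresh input, and compare.

    make_input must build a new tensor on every call, because fn
    mutates its argument in place.
    """
    if compile_fn is None:
        compile_fn = functools.partial(torch.compile, backend="inductor")
    expected = fn(make_input())
    actual = compile_fn(fn)(make_input())
    # Raises AssertionError if the compiled result differs from eager.
    torch.testing.assert_close(actual, expected, equal_nan=True)
    return expected, actual


# Usage (uses Inductor by default):
# compare_with_eager(fn, lambda: torch.arange(4.0).reshape(2, 2))
```

A non-constant input such as torch.arange(4.0).reshape(2, 2) makes the check stricter than all zeros, since sorted and unsorted outputs can then differ.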

Extra Tips

  • Test with inputs other than all-zero tensors as well: with zeros, b and torch.sort(b) are indistinguishable, so the check cannot tell whether sorting happened.
  • If you are using a version of PyTorch where this bug is fixed, you can revert the workaround and use the original a[0] = 1 form.
