pytorch - ✅(Solved) Fix AOTInductor fails to compile dynamic-shape CUDA gather pattern with `ValueError: The argument '((0)) + 48' is not comparable` [1 pull requests, 1 comments, 1 participants]

pytorch · 2026-04-10 07:22:51

GitHub stats

pytorch/pytorch#179900 • Fetched 2026-04-11 06:12:01

Author: FFXIVYYDS

Participants: FFXIVYYDS

Timeline (top): mentioned ×18, subscribed ×18, labeled ×7, unsubscribed ×5

Error Message

torch._inductor.exc.InductorError: ValueError: The argument '((0)) + 48' is not comparable.

Fix Action

Fix / Workaround

A small workaround replaces the gather call with a plain slice. The repro toggles it with the USE_WORKAROUND environment variable:

if os.getenv("USE_WORKAROUND", "0") == "1":
    hist_tokens = seq_hidden[:, -hist_len:, :]
else:
    hist_tokens = seq_hidden.gather(1, gather_idx.unsqueeze(-1).expand(-1, -1, hidden_dim))

To compare the failing and working paths:

python3 repro_not_comparable.py
USE_WORKAROUND=1 python3 repro_not_comparable.py

PR fix notes

PR #180545: fix AOTInductor fails to compile dynamic-shape CUDA gather pattern with ValueError: The argument '((0)) + 48' is not comparable

Repository: pytorch/pytorch
Author: FFXIVYYDS
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/180545

Description (problem / solution / changelog)

GitHub comment draft for follow-up PR on issue #179900

Thanks for taking a look.

This PR is a follow-up to issue #179900:
https://github.com/pytorch/pytorch/issues/179900

That issue was closed after it was reported that the original repro worked on main. I re-checked the situation on our side and the conclusion is:

- the original issue in #179900 is not fully resolved on current main
- main no longer fails at the first ValueError: The argument '((0)) + 48' is not comparable site from the original minimal repro
- however, after that first failure point is avoided, the same underlying symbolic comparability problem still appears at later compilation stages
- for our real AOTI model, this means the end-to-end compile is still broken unless we patch additional call sites

What I found

Initially I also thought the fix might have come from:

PR #175975

But after checking commit history more carefully, that is not the real reason the original failure path disappeared.

The key change is actually from:

PR #174521
commit: 037c0f4053b5136cb22f88b3cb4e18f0ab588cfd

Patch:
https://github.com/pytorch/pytorch/commit/037c0f4053b5136cb22f88b3cb4e18f0ab588cfd.patch

The important part is not mainly the stride_hints() change from:

result.append(self.size_hint_or_throw(s))

to:

result.append(self.optimization_hint(s, fallback=0))

The crucial part is that this commit added:

if isinstance(expr, sympy.Expr):
    expr = expr.expand(identity=True)

That expansion is what avoids the original comparability failure in the first path.
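
To illustrate why expansion can matter, here is a minimal sketch in plain sympy. It assumes the `((0))` in the error message is an opaque wrapper function around `0`. The `Opaque` class below is a hypothetical stand-in written for this example, not PyTorch's actual implementation. sympy's `expand(identity=True)` dispatches to any `_eval_expand_identity` method it finds on a sub-expression, so the sketch uses that hook to strip the wrapper.

```python
import sympy


class Opaque(sympy.Function):
    """Hypothetical wrapper that sympy cannot evaluate numerically."""

    def _eval_expand_identity(self, **hints):
        # Called by expr.expand(identity=True): drop the wrapper.
        return self.args[0]


wrapped = Opaque(sympy.Integer(0)) + 48

# Max() rejects arguments that are numbers but cannot be compared.
try:
    sympy.Max(wrapped, 1)
    max_failed = False
except ValueError as err:
    max_failed = True
    print("before expand:", err)

# After expansion the wrapper is gone and the expression is a plain integer.
expanded = wrapped.expand(identity=True)
print("after expand:", expanded, "->", sympy.Max(expanded, 1))
```

In this toy setting the unexpanded expression trips the same kind of "not comparable" check, while the expanded one collapses to an ordinary integer that `Max` accepts. The `isinstance(expr, sympy.Expr)` guard in the commit is presumably there because plain Python ints have no `expand` method.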

Why this PR is still needed

After understanding that, I tried applying the same idea to the later failing sites in the compilation pipeline.

With only the earlier fix in place, I was still able to reproduce new failures from the same model/test, still ending in:

torch._inductor.exc.InductorError: ValueError: The argument '((0)) + 48' is not comparable.

In our case, after the first path is fixed, the next failures show up in:

torch/_inductor/sizevars.py
torch/_inductor/codegen/simd.py

More specifically, I found that applying the same expand(identity=True) handling at these two places fixes the remaining compile failure for this case:

1) `torch/_inductor/sizevars.py`

@@ -616,6 +618,8 @@ class SizeVarAllocator:
         """
         strides = []
         index = self.simplify(index)
+        if isinstance(index, Expr):
+            index = index.expand(identity=True)
         # remove any offset
         index = index - sympy_subs(
             index, {v: sympy.S.Zero for v in support_vars if v != 0}

2) `torch/_inductor/codegen/simd.py`

@@ -933,6 +933,8 @@ class SIMDKernel(Kernel[CSEVariableType], Generic[CSEVariableType]):
         {xindex: 512, rindex: 1024}
         """
         index_to_tile_indexes = {k: v.expr for k, v in self.range_tree_nodes.items()}
+        if isinstance(index, sympy.Expr):
+            index = index.expand(identity=True)
         index_in_tile_vars = sympy_subs(index, index_to_tile_indexes)  # type: ignore[arg-type]
         strides = {}
         for range_tree in self.range_trees:

With both of those changes applied, the model finally compiles successfully on my side.

Result on my side

After patching those two call sites, I get:

python3 repro_inductor_non_comparable_min.py --device cuda
[INFO] device=cuda api=aoti_compile_and_package
[OK] compile success: /tmp/non_comparable_min.pt2

Test coverage

This PR includes a test based on the #179900 scenario.

I can reproduce the remaining failures with:

python3 test/inductor/test_aot_inductor.py -k test_issue_179900 -v

Without the two new expand(identity=True) additions, this test still hits the later-stage non-comparable failures on my side.

With the patch in this PR, the same case compiles successfully.

Summary

So the current situation appears to be:

- issue #179900 exposed a real symbolic comparability problem
- one earlier path was effectively fixed by the expand(identity=True) handling introduced in PR #174521
- but that did not eliminate the problem globally
- the same class of failure still exists in later compilation/codegen paths
- this PR extends the same fix pattern to the remaining sites that still fail for this case

This is why I am sending this follow-up PR: from my testing, the issue is not fully gone on main; the failure just moved deeper into the pipeline.

Could you please take a look and confirm whether extending expand(identity=True) in these two additional places is the right upstream direction?

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @laithsakka @jansel @eellison @shunting314 @peterbell10

Changed files

test/inductor/test_aot_inductor.py (modified, +66/-0)
torch/_inductor/codegen/simd.py (modified, +2/-0)
torch/_inductor/sizevars.py (modified, +2/-0)

🐛 Describe the bug

AOTInductor fails to compile a small CUDA repro with dynamic shapes when the graph contains a gather(...) pattern whose indices are derived from seq_len.

This is a compile-time failure, not a runtime failure.

The failure reproduces with:

ep = torch.export.export(m, inputs, dynamic_shapes=dynamic_shapes)
pkg_path = torch._inductor.aoti_compile_and_package(ep)

Compilation fails with:

torch._inductor.exc.InductorError: ValueError: The argument '((0)) + 48' is not comparable.

A small workaround that replaces:

hist_tokens = seq_hidden.gather(1, gather_idx.unsqueeze(-1).expand(-1, -1, hidden_dim))

with:

hist_tokens = seq_hidden[:, -hist_len:, :]

makes the same script compile successfully in the same environment.

I am not claiming the two forms are generally equivalent. The point is that removing this gather-based symbolic indexing pattern is enough to avoid the compiler failure.
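
To make the equivalence caveat concrete, the index arithmetic from the repro can be replayed in plain Python. This is only a sketch: it mirrors `hist_offsets`, the `clamp` call, and the slice using ordinary lists, and the helper names are made up for illustration. It suggests the two forms select the same positions whenever `seq_len >= hist_len`, which the repro's `Dim("t", min=3)` constraint guarantees, and diverge for shorter sequences.

```python
def gather_positions(seq_len: int, hist_len: int) -> list[int]:
    # Mirrors: (seq_len + arange(hist_len) - hist_len).clamp(0, seq_len - 1)
    offsets = [i - hist_len for i in range(hist_len)]
    return [min(max(seq_len + off, 0), seq_len - 1) for off in offsets]


def slice_positions(seq_len: int, hist_len: int) -> list[int]:
    # Mirrors: seq_hidden[:, -hist_len:, :] along the sequence axis
    return list(range(seq_len))[-hist_len:]


for seq_len in (50, 3, 1):
    g = gather_positions(seq_len, hist_len=2)
    s = slice_positions(seq_len, hist_len=2)
    print(seq_len, g, s, "same" if g == s else "different")
```

For `seq_len=1` the gather form repeats position 0 twice while the slice returns a single position, so the output shapes would differ outside the range the repro allows.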

Minimal repro:

import os
import torch
from torch import nn
from torch.export import Dim

class Repro(nn.Module):
    def __init__(self):
        super().__init__()
        self.refined_pattern_len = 3

    def forward(self, seq_hidden, seq_lens, target_hidden):
        batch_size, seq_len, hidden_dim = seq_hidden.shape
        start_pos = seq_len - seq_lens
        hist_len = self.refined_pattern_len - 1
        hist_offsets = torch.arange(hist_len, device=seq_hidden.device) - hist_len
        hist_idx = (seq_len + hist_offsets.view(1, -1)).expand(batch_size, -1)
        hist_mask = (hist_idx >= start_pos.view(batch_size, 1)) & (hist_idx < seq_len)
        gather_idx = hist_idx.clamp(min=0, max=seq_len - 1)

        if os.getenv("USE_WORKAROUND", "0") == "1":
            hist_tokens = seq_hidden[:, -hist_len:, :]
        else:
            hist_tokens = seq_hidden.gather(1, gather_idx.unsqueeze(-1).expand(-1, -1, hidden_dim))

        hist_tokens = hist_tokens * hist_mask.unsqueeze(-1).to(hist_tokens.dtype)
        target_pattern = torch.cat([hist_tokens, target_hidden.unsqueeze(1)], dim=1)
        target_mask = torch.cat([hist_mask, torch.ones(batch_size, 1, device=seq_hidden.device, dtype=torch.bool)], dim=1)

        out = (target_pattern * target_mask.unsqueeze(-1).to(target_pattern.dtype)).sum(dim=(1, 2), keepdim=True)
        return out.to("cpu")

def main():
    device = "cuda"
    m = Repro().eval().to(device)
    m.requires_grad_(False)

    seq_hidden = torch.randn(2, 50, 64, device=device)
    seq_lens = torch.tensor([0, 48], device=device)
    target_hidden = torch.randn(2, 64, device=device)
    inputs = (seq_hidden, seq_lens, target_hidden)

    b = Dim("b", max=4096)
    t = Dim("t", min=3, max=1024)
    dynamic_shapes = {
        "seq_hidden": {0: b, 1: t},
        "seq_lens": {0: b},
        "target_hidden": {0: b},
    }

    ep = torch.export.export(m, inputs, dynamic_shapes=dynamic_shapes)
    pkg_path = torch._inductor.aoti_compile_and_package(ep)
    print("OK:", pkg_path)

if __name__ == "__main__":
    main()

Error logs

Repro steps:

python3 repro_not_comparable.py
USE_WORKAROUND=1 python3 repro_not_comparable.py


---

## Error logs

```text
root@7cc3211a0e36:/workspace# python3 repro_not_comparable.py
/usr/lib/python3.12/copyreg.py:99: FutureWarning: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
  return cls.__new__(cls, *args)
Traceback (most recent call last):
  File "/workspace/repro_not_comparable.py", line 61, in <module>
    main()
  File "/workspace/repro_not_comparable.py", line 56, in main
    pkg_path = torch._inductor.aoti_compile_and_package(ep)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/codegen/simd.py", line 2504, in tile_ranges
    strides = V.graph.sizevars.stride_hints(dep.index, rw.range_vars)
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/sizevars.py", line 1119, in stride_hints
    result.append(self.size_hint_or_throw(s))
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/sizevars.py", line 658, in size_hint_or_throw
    out = self.symbolic_hint(expr)
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/sizevars.py", line 592, in symbolic_hint
    return sympy_subs(expr, self.backed_var_to_val)
  File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 1197, in sympy_subs
    return sympy.sympify(expr).xreplace(
  File "/usr/local/lib/python3.12/dist-packages/sympy/functions/elementary/miscellaneous.py", line 559, in _new_args_filter
    raise ValueError("The argument '%s' is not comparable." % arg)
torch._inductor.exc.InductorError: ValueError: The argument '((0)) + 48' is not comparable.

root@7cc3211a0e36:/workspace# USE_WORKAROUND=1 python3 repro_not_comparable.py
/usr/lib/python3.12/copyreg.py:99: FutureWarning: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
  return cls.__new__(cls, *args)
OK: /tmp/torchinductor_root/c3rzhrszwvt2a3jrnry2nviww3qhgw7girvx3pxhv7b4ndtyhca2/cptqcwcgluyq4ar6tidw7eawjcqak3tanq7j3xwqrybxdi5zgojh.wrapper.pt2
```

Versions

- Docker image: pytorch/pytorch:2.11.0-cuda12.8-cudnn9-devel
- PyTorch: 2.11.0+cu128
- Python: 3.12.3
- CUDA used by PyTorch: 12.8
- cuDNN: 91900
- GPU: Tesla T4
- GPU capability: sm_75
- NVIDIA driver: 535.54.03
- Host kernel: 3.10.0-1160.el7.x86_64
- Container OS: Ubuntu 24.04.4 LTS

cc @chauhang @penguinwu @ezyang @bobrenjc93 @aditvenk @laithsakka @desertfire @yushangdi @benjaminglass1 @jataylo @iupaikov-amd

extent analysis

TL;DR

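The compile failure can be sidestepped by replacing the gather call with a slice, although the reporter does not claim the two forms are generally equivalent. The upstream direction proposed in PR #180545 (open, not merged) is to apply expand(identity=True) at the remaining failing call sites in torch/_inductor/sizevars.py and torch/_inductor/codegen/simd.py.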

Guidance

1) Identify the problematic code: the gather operation in the line hist_tokens = seq_hidden.gather(1, gather_idx.unsqueeze(-1).expand(-1, -1, hidden_dim)) triggers the compilation failure.
2) Apply the workaround: replace that line with hist_tokens = seq_hidden[:, -hist_len:, :] to avoid the symbolic indexing pattern.
3) Verify the fix: run the script with the workaround and check that compilation succeeds.
4) Investigate alternative solutions: if the workaround is not equivalent to the original code, look for a replacement for the gather operation that preserves the original functionality.

Example

The provided workaround can be applied as follows:

if os.getenv("USE_WORKAROUND", "0") == "1":
    hist_tokens = seq_hidden[:, -hist_len:, :]
else:
    hist_tokens = seq_hidden.gather(1, gather_idx.unsqueeze(-1).expand(-1, -1, hidden_dim))

Notes

The workaround may not be equivalent to the original code, and its applicability depends on the specific use case. Further investigation is needed to determine the best solution.

Recommendation

Apply the workaround to unblock compilation only where the slice matches the intended semantics. For a fix that keeps the gather operation, follow PR #180545, which is still open.
