pytorch: InductorError on in-place multidimensional dynamic slicing with Tensor-derived slice bounds


Error Message

torch._inductor.exc.InductorError: LoweringException: NotImplementedError: View target: aten.slice.Tensor ... args[3]: u0


🐛 Describe the bug

torch.compile with Inductor fails on an in-place multidimensional slice assignment when the slice bound is derived from a Tensor value.

Eager execution succeeds, but the compiled version fails during Inductor lowering with:

torch._inductor.exc.InductorError: LoweringException: NotImplementedError: View
  target: aten.slice.Tensor
  ...
  args[3]: u0

The problematic line is:

mask[i, :span, :span] = 1.0

where span is obtained from a CUDA int64 Tensor:

span = tensor_span[i, 0]

I understand that data-dependent slicing may have limited support, but this case reaches Inductor lowering and fails with an internal InductorError rather than being rejected earlier as an unsupported graph pattern. This pattern is also common in mask construction, e.g., constructing per-sample 2D masks from sequence lengths or valid spans.
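For the mask-construction pattern described above, one possible workaround is to avoid Tensor-derived slice bounds and build the mask from broadcast comparisons instead. The sketch below is not from the issue: the helper name build_span_mask and its size parameter are hypothetical, it assumes every span is non-negative (a negative span would mean something different in Python slice semantics), and whether it compiles cleanly under Inductor on the reporter's build has not been verified here.

```python
import torch


def build_span_mask(tensor_span: torch.Tensor, size: int = 100) -> torch.Tensor:
    """Hypothetical helper: per-sample 2D mask without data-dependent slicing.

    Assumes tensor_span has shape (batch, 1) or wider, with non-negative
    integer spans in column 0.
    """
    spans = tensor_span[:, 0]                                # (batch,)
    idx = torch.arange(size, device=tensor_span.device)      # (size,)
    # valid[b, k] is True when k < spans[b]
    valid = idx.unsqueeze(0) < spans.unsqueeze(1)            # (batch, size)
    # The outer AND marks the top-left spans[b] x spans[b] block of each sample.
    mask = valid.unsqueeze(2) & valid.unsqueeze(1)           # (batch, size, size)
    return mask.to(torch.float32)


if __name__ == "__main__":
    # Eager-mode check against the loop formulation used in the repro.
    tensor_span = torch.tensor([[8], [3]], dtype=torch.int64)
    expected = torch.zeros(2, 100, 100)
    for i in range(tensor_span.shape[0]):
        span = tensor_span[i, 0]
        expected[i, :span, :span] = 1.0
    torch.testing.assert_close(build_span_mask(tensor_span), expected)
    print("Broadcast mask matches the loop version.")
```

Because the output shape depends only on the batch size and the fixed size argument, no slice bound has to be read out of a Tensor, which is the step that produces the u0 argument in the failing lowering. This does not address the underlying Inductor error; it only sidesteps the pattern that triggers it.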

Minimal repro

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import platform
import traceback

import torch
import torch.nn as nn


class DynamicSliceMask(nn.Module):
    def forward(self, tensor_span):
        batch_size = tensor_span.shape[0]
        mask = torch.zeros(
            (batch_size, 100, 100),
            dtype=torch.float32,
            device=tensor_span.device,
        )

        for i in range(batch_size):
            span = tensor_span[i, 0]
            mask[i, :span, :span] = 1.0

        return mask


def print_env():
    print("Python:", platform.python_version())
    print("Platform:", platform.platform())
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA device count:", torch.cuda.device_count())
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", ""))
    if torch.cuda.is_available():
        print("Current CUDA device:", torch.cuda.current_device())
        print("CUDA device name:", torch.cuda.get_device_name(0))


def main():
    print_env()

    if not torch.cuda.is_available():
        raise RuntimeError("This repro expects a CUDA device.")

    device = "cuda"
    torch.manual_seed(0)
    torch.cuda.manual_seed_all(0)

    model = DynamicSliceMask().to(device).eval()

    tensor_span = torch.tensor([[8]], dtype=torch.int64, device=device)

    print("\nInput:")
    print(tensor_span)

    with torch.no_grad():
        eager_out = model(tensor_span)

    print("\nEager succeeded.")
    print("Eager output shape:", tuple(eager_out.shape))
    print("Eager output sum:", eager_out.sum().item())

    compiled_model = torch.compile(
        model,
        backend="inductor",
        fullgraph=True,
        dynamic=True,
    )

    print("\nRunning compiled model...")
    try:
        with torch.no_grad():
            compiled_out = compiled_model(tensor_span)

        print("Compiled succeeded.")
        print("Compiled output shape:", tuple(compiled_out.shape))
        print("Compiled output sum:", compiled_out.sum().item())
        torch.testing.assert_close(eager_out, compiled_out)
        print("Eager and compiled outputs match.")

    except Exception:
        print("\nCompiled execution failed with exception:")
        traceback.print_exc()
        raise


if __name__ == "__main__":
    main()

Actual behavior

Eager execution succeeds:

Input:
tensor([[8]], device='cuda:0')

Eager succeeded.
Eager output shape: (1, 100, 100)
Eager output sum: 64.0

Compiled execution fails:

Running compiled model...

Compiled execution failed with exception:
torch._inductor.exc.InductorError: LoweringException: NotImplementedError: View
  target: aten.slice.Tensor
  args[0]: TensorBox(
    View(
      StorageBox(
        ComputedBuffer(name='buf2', layout=FlexibleLayout('cuda:0', torch.float32, size=[1, 100, 100], stride=[10000, 100, 1]), data=Pointwise(
          'cuda',
          torch.float32,
          def inner_fn(index):
              _, i1, i2 = index
              tmp0 = ops.constant(0, torch.float32)
              return tmp0
          ,
          ranges=[1, 100, 100],
          origin_node=full_default,
          origins=OrderedSet([full_default]),
          stack_traces = {,
            File ".../for_test_1.py", line 15, in forward,
              mask = torch.zeros(,
          ,
          }
        )
      ),
      size=[100, 100],
      reindex=lambda i0, i1: [0, i0, i1],
      origins=OrderedSet([select_3, full_default]),
      stack_traces = {,
        File ".../for_test_1.py", line 23, in forward,
          mask[i, :span, :span] = 1.0,
      ,
      }
    )
  )
  args[1]: 0
  args[2]: 0
  args[3]: u0

Found from:
   File ".../for_test_1.py", line 23, in forward
    mask[i, :span, :span] = 1.0

Versions

PyTorch version:  2.13.0a0+git059c270
Is debug build: True
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-59-generic-x86_64-with-glibc2.35
Is CUDA available: True

cc @chauhang @penguinwu @ezyang @bobrenjc93 @aditvenk @laithsakka @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
