pytorch - 💡(How to fix) Fix PyTorch 2.11 regression: Division by zero exception on empty tensor with torch.compile and dynamic size [12 comments, 4 participants]

GitHub stats: pytorch/pytorch#178530, fetched 2026-04-08 01:35:46
Comments: 12 · Participants: 4 · Timeline events: 250 · Reactions: 2
Timeline (top): mentioned ×108, subscribed ×108, commented ×12, labeled ×11

Root Cause

I had Claude analyze the root cause, and I think its analysis is correct:

Code Example

import torch

def compute(dates, query_date, initial_dfs, target_dfs, half_life):
    """
    `dates`      : sorted int32 simulation stopping-dates tensor, shape (N,)
    `query_date` : scalar int32 – maturity date of the trade
    `initial_dfs`: discount factors at benchmark tenors,    shape (B,)
    `target_dfs` : target discount factors at benchmark tenors, shape (B,)
    `half_life`  : mean-reversion half-life scalar

    Returns a tensor of shape (t, B) where t = number of stopping dates
    that precede the maturity date.  When the trade is already matured,
    t = 0 and the output is an empty (0, B) tensor.
    """
    # t is a data-dependent size -> unbacked symint inside torch.compile
    t = torch.searchsorted(dates, query_date).item()
    torch._check(t >= 0)
    torch._check(t <= dates.shape[0])

    sim_yfs = torch.arange(t, device=dates.device, dtype=initial_dfs.dtype)
    decay   = torch.exp(
        -torch.log(torch.tensor(2.0, device=dates.device, dtype=initial_dfs.dtype))
        / half_life * sim_yfs
    )
    # outer-product broadcast: (1, B) ** (t, 1)  ->  (t, B)
    return (
        (initial_dfs.unsqueeze(0) ** decay.unsqueeze(1))
        * (target_dfs.unsqueeze(0) ** (1.0 - decay.unsqueeze(1)))
    )


# ---------------------------------------------------------------------------
# Setup
# ---------------------------------------------------------------------------

device = "cuda"
dtype  = torch.float64

dates       = torch.tensor([100, 200, 300, 400, 500], device=device, dtype=torch.int32)
initial_dfs = torch.rand(12, device=device, dtype=dtype)
target_dfs  = torch.rand(12, device=device, dtype=dtype)
half_life   = torch.tensor(1.0, device=device, dtype=dtype)

# query_date=350 -> t=3  (trade still alive,  3 future stopping dates)
future_date  = torch.tensor(350, device=device, dtype=torch.int32)

# query_date=50  -> t=0  (trade already matured, 0 future stopping dates)
matured_date = torch.tensor(50, device=device, dtype=torch.int32)

# ---------------------------------------------------------------------------
# 1. Without torch.compile – works correctly for both t>0 and t=0
# ---------------------------------------------------------------------------

print("--- Without torch.compile ---")

r1 = compute(dates, future_date, initial_dfs, target_dfs, half_life)
print(f"  t=3 (future trade):          shape = {r1.shape}")   # torch.Size([3, 12])

r2 = compute(dates, matured_date, initial_dfs, target_dfs, half_life)
print(f"  t=0 (already-matured trade): shape = {r2.shape}")   # torch.Size([0, 12])

# ---------------------------------------------------------------------------
# 2. With torch.compile(dynamic=True) – works for t>0, crashes for t=0
# ---------------------------------------------------------------------------

compiled = torch.compile(compute, fullgraph=True, dynamic=True)

r3 = compiled(dates, future_date, initial_dfs, target_dfs, half_life)
print(f"  t=3 (future trade):          shape = {r3.shape}")   # torch.Size([3, 12])

r4 = compiled(dates, matured_date, initial_dfs, target_dfs, half_life)
print(f"  shape = {r4.shape}")   # never reached – crashes above

---

# ---------------------------------------------------------------------------
# ROOT CAUSE
# ---------------------------------------------------------------------------

torch._inductor.runtime.triton_heuristics.Grid2DWithYZOverflow.generate()
builds the Triton kernel launcher as a dynamically exec'd string:

def launcher(..., ynumel, xnumel, stream):
    y_grid_raw_ = -((ynumel) // -(YBLOCK))          # = 0 when ynumel=0
    y_grid_div_ = -((y_grid_raw_) // -(65535))      # = 0
    grid_0 = -((xnumel) // -(XBLOCK))
    grid_1 = -((y_grid_raw_) // -(y_grid_div_))     # line 5: 0 // 0 -> ZeroDivisionError
    grid_2 = y_grid_div_
    runner(grid_0, grid_1, grid_2, stream, ...)

Grid2DWithYZOverflow is chosen (over Grid2D) when the y-dimension is an
unbounded symbolic integer, specifically when torch.compile cannot prove
statically that the dimension is <= 65535.  Here, `t` comes from
searchsorted().item() and is constrained only by torch._check calls (the
repro uses torch._check(t >= 0) rather than torch._check_is_size(t)), which
makes it an unbacked symint with no static upper bound.
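
The arithmetic above can be checked without a GPU. The following is a minimal pure-Python sketch of the launcher's grid computation. The names `ceildiv` and `launcher_grid` and the block sizes of 32 are illustrative choices made here, not Inductor's actual generated code.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division, written the same way as in the launcher string."""
    return -(a // -b)


def launcher_grid(ynumel: int, xnumel: int, YBLOCK: int = 32, XBLOCK: int = 32):
    """Mimic the grid arithmetic of the generated launcher (illustrative only)."""
    y_grid_raw = ceildiv(ynumel, YBLOCK)
    y_grid_div = ceildiv(y_grid_raw, 65535)
    grid_0 = ceildiv(xnumel, XBLOCK)
    # When ynumel == 0, y_grid_div is 0 and this line divides by zero.
    grid_1 = ceildiv(y_grid_raw, y_grid_div)
    grid_2 = y_grid_div
    return grid_0, grid_1, grid_2


# Non-empty y-dimension: the arithmetic succeeds.
print(launcher_grid(ynumel=3, xnumel=12))

# Empty y-dimension: the same arithmetic raises.
try:
    launcher_grid(ynumel=0, xnumel=12)
except ZeroDivisionError as exc:
    print("ynumel=0 ->", type(exc).__name__)
```

The sketch only reproduces the integer arithmetic; it says nothing about how Inductor picks YBLOCK or when it selects this grid type.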


=======================================================================
SUGGESTED FIX
=======================================================================

In torch/_inductor/runtime/triton_heuristics.py, class Grid2DWithYZOverflow:

  BEFORE (buggy):
      self.y_grid = self.ceildiv("y_grid_raw_", "y_grid_div_")

  AFTER (fixed):
      self.y_grid = (
          f"(0 if y_grid_div_ == 0 else {self.ceildiv('y_grid_raw_', 'y_grid_div_')})"
      )

Alternatively, add an early-exit guard at the top of the launcher:

      grid.prefix.insert(0, "if ynumel == 0: return")
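
As a rough check of the first variant, here is a pure-Python sketch of the guarded grid arithmetic. `guarded_grid` and the block sizes of 32 are hypothetical names and values used only for illustration; the real fix would live in the string that Grid2DWithYZOverflow generates.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division, written the same way as in the launcher string."""
    return -(a // -b)


def guarded_grid(ynumel: int, xnumel: int, YBLOCK: int = 32, XBLOCK: int = 32):
    """Grid arithmetic with the proposed zero guard on y_grid_div (sketch)."""
    y_grid_raw = ceildiv(ynumel, YBLOCK)
    y_grid_div = ceildiv(y_grid_raw, 65535)
    grid_0 = ceildiv(xnumel, XBLOCK)
    # Guard: skip the division entirely when there is no y-work to split.
    grid_1 = 0 if y_grid_div == 0 else ceildiv(y_grid_raw, y_grid_div)
    grid_2 = y_grid_div
    return grid_0, grid_1, grid_2


print(guarded_grid(ynumel=0, xnumel=12))   # empty y-dimension no longer raises
print(guarded_grid(ynumel=3, xnumel=12))   # non-empty case is unchanged
```

The guard only removes the Python exception. Whether the runner then treats a grid containing 0 as a no-op is not confirmed here, which is one reason the early-return variant may be the safer reading of the two.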

 

================================================================================
ADDITIONAL NOTES
================================================================================

- The same pattern appears twice in triton_heuristics.py (lines ~4134 and ~4220).
  Both instances need to be fixed.

- triton.next_power_of_2(0) == 0, which is a related footgun. If any heuristics
  use next_power_of_2(ynumel) as a BLOCK size and then divide by that BLOCK
  size in a grid expression, the same ZeroDivisionError will occur.

- The correct semantics for a kernel launched with ynumel=0 is a no-op:
  no work is done, the output tensors remain empty.
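
The next_power_of_2 footgun can be illustrated without Triton installed. In the sketch below, `next_power_of_2` is a stand-in written for this note using the usual bit-smearing trick; that it matches Triton's implementation line for line is an assumption. `safe_block` is a hypothetical guard, not an existing Inductor helper.

```python
def next_power_of_2(n: int) -> int:
    """Stand-in for triton.next_power_of_2 (assumed bit-smearing implementation)."""
    n -= 1
    for shift in (1, 2, 4, 8, 16, 32):
        n |= n >> shift
    return n + 1


def ceildiv(a: int, b: int) -> int:
    """Ceiling division, written the same way as in the launcher string."""
    return -(a // -b)


def safe_block(numel: int) -> int:
    """Hypothetical guard: never let a block size drop below 1."""
    return max(1, next_power_of_2(numel))


# An empty dimension yields a block size of 0 with the stand-in ...
print(next_power_of_2(0), next_power_of_2(5))

# ... so dividing by it raises, while the clamped block size does not.
try:
    ceildiv(0, next_power_of_2(0))
except ZeroDivisionError as exc:
    print("unguarded ->", type(exc).__name__)
print("guarded ->", ceildiv(0, safe_block(0)))
```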

---

Component  : torch.compile / torch._inductor / triton_heuristics
Version    : 2.11.0+cu130  (regression not present in earlier versions)

================================================================================
ENVIRONMENT  (python -c "import torch.utils.collect_env as c; c.main()")
================================================================================

PyTorch version: 2.11.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise (10.0.22631 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 4.1.0
Libc version: N/A

Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 13.0.48
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA RTX A6000
Nvidia driver version: 582.16
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
  Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz (x2 sockets)

Versions of relevant libraries:
  torch==2.11.0+cu130
  torchaudio==2.11.0+cu130
  torchvision==0.26.0+cu130
  triton-windows==3.5.1.post22
  numpy==2.2.6

🐛 Describe the bug

The reproduction code works with torch 2.9.1 but crashes with torch 2.11.0.

Summary:

torch.compile(dynamic=True) crashes with ZeroDivisionError: integer division or modulo by zero when a compiled function is called with a tensor whose y-dimension is 0 (empty tensor) and the compiled kernel uses Grid2DWithYZOverflow grid type. The crash occurs inside the dynamically-generated launcher function at the line: grid_1 = -((y_grid_raw_) // -(y_grid_div_))

When ynumel = 0:

  • y_grid_raw_ = -((0) // -(YBLOCK)) = 0
  • y_grid_div_ = -((0) // -(65535)) = 0
  • grid_1 = -((0) // -(0)) --> ZeroDivisionError (0 // 0)


cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @chauhang @penguinwu @ezyang @bobrenjc93 @aditvenk @laithsakka @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

Fix Plan

To fix the ZeroDivisionError issue in torch.compile with dynamic=True, we need to modify the Grid2DWithYZOverflow class in torch/_inductor/runtime/triton_heuristics.py.

Here are the steps:

  • Open the triton_heuristics.py file and locate the Grid2DWithYZOverflow class.
  • Replace the line self.y_grid = self.ceildiv("y_grid_raw_", "y_grid_div_") with:

      self.y_grid = (
          f"(0 if y_grid_div_ == 0 else {self.ceildiv('y_grid_raw_', 'y_grid_div_')})"
      )

Alternatively, you can add an early-exit guard at the top of the launcher:

grid.prefix.insert(0, "if ynumel == 0: return")

Make sure to apply the fix to both instances of the pattern in triton_heuristics.py (lines ~4134 and ~4220).

Verification

After applying the fix, you can verify that the issue is resolved by running the original code with torch.compile(dynamic=True). The code should now work correctly for both t>0 and t=0 cases without crashing with a ZeroDivisionError.

Extra Tips

Note that the same pattern appears twice in triton_heuristics.py, so it's essential to fix both instances to avoid similar issues. Additionally, be aware of the triton.next_power_of_2(0) == 0 behavior, which can also lead to division by zero errors if used as a block size in grid expressions.
