pytorch - 💡(How to fix) `torch.compile` raises error for constant-tensor cumsum pattern (`torch.full → cumsum`) while eager mode succeeds [1 participant]

pytorch · 2026-04-07 13:12:11


GitHub stats

pytorch/pytorch#179571 • Fetched 2026-04-08 03:00:15

Author: himi1008
Participants: himi1008
Timeline: 113 events (mentioned ×54, subscribed ×54, labeled ×5)


Root Cause

The likely root cause is Inductor's pointless_cumsum_replacement optimization pass, which replaces the cumsum of a constant tensor with a direct arithmetic sequence (fill_value * arange(1, N+1)). In this model the constant tensor has two consumers: the forward cumsum, and the torch.flip that starts the backward flip → cumsum → flip path. With more than one consumer, the graph replacement may produce an invalid or inconsistent intermediate representation, leading to a compilation or lowering error.
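The identity that the rewrite relies on can be checked in eager mode. The following is a minimal sketch on CPU, not Inductor's implementation; the helper names `cumsum_of_constant` and `closed_form` are hypothetical and exist only for illustration.

```python
import torch

def cumsum_of_constant(n: int, c: float) -> torch.Tensor:
    """The pattern from the reproducer: torch.full -> cumsum."""
    return torch.cumsum(torch.full((n,), c, dtype=torch.float32), dim=0)

def closed_form(n: int, c: float) -> torch.Tensor:
    """The arithmetic sequence the rewrite is described as producing."""
    return c * torch.arange(1, n + 1, dtype=torch.float32)

n, c = 50, 1.0
fwd = cumsum_of_constant(n, c)

# Backward path: flip -> cumsum -> flip. Flipping a constant tensor does not
# change its values, so the result is the forward sequence reversed.
const = torch.full((n,), c, dtype=torch.float32)
bwd = torch.flip(torch.cumsum(torch.flip(const, dims=[0]), dim=0), dims=[0])

torch.testing.assert_close(fwd, closed_form(n, c))
torch.testing.assert_close(bwd, torch.flip(closed_form(n, c), dims=[0]))
```

Both branches therefore carry the same information, which is why the same constant tensor ends up with two consumers in the traced graph.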


🐛 Describe the bug

torch.compile with inductor backend raises an error when compiling a model that creates constant tensors via torch.full((seq_len,), fill_value, dtype=torch.float32) and then applies torch.cumsum(constant_tensor, dim=0) to generate monotonic sequences. The model uses both forward and backward (flip → cumsum → flip) monotonic sequences, combines them with linear layers, LayerNorm, residual connections, and dropout. Eager mode runs successfully and produces valid output.

Minimal reproducer

import os
os.environ["TRITON_BACKENDS_IN_TREE"] = "1"

import torch
import torch.nn as nn

class MonotonicSequenceModel(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=128, seq_len=50, fill_value=1.0, dropout_p=0.1):
        super().__init__()
        self.seq_len = seq_len
        self.fill_value = fill_value

        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.norm1 = nn.LayerNorm(hidden_dim)

        self.fwd_fc = nn.Linear(hidden_dim, hidden_dim)
        self.bwd_fc = nn.Linear(hidden_dim, hidden_dim)

        self.combine = nn.Linear(hidden_dim * 2, hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.output_fc = nn.Linear(hidden_dim, hidden_dim)
        self.dropout = nn.Dropout(p=dropout_p)

    def forward(self, x):
        # x: [B, seq_len, input_dim]
        batch_size, seq_len, _ = x.shape

        x = self.input_proj(x)           # [B, seq_len, hidden_dim]
        x = self.norm1(x)

        # Create constant tensor and apply cumsum -> monotonic sequence
        # This is the pointless_cumsum_replacement target pattern:
        #   torch.full((N,), c) -> cumsum(dim=0) == c * arange(1, N+1)
        constant_tensor = torch.full(
            (seq_len,), self.fill_value, dtype=torch.float32, device=x.device
        )

        # Forward monotonic sequence via cumsum of constant
        fwd_seq = torch.cumsum(constant_tensor, dim=0)          # [seq_len]
        fwd_weights = fwd_seq.unsqueeze(0).unsqueeze(-1)        # [1, seq_len, 1]
        fwd_branch = self.fwd_fc(x * fwd_weights)               # [B, seq_len, hidden_dim]

        # Backward monotonic sequence: flip -> cumsum -> flip
        flipped_const = torch.flip(constant_tensor, dims=[0])
        bwd_seq = torch.cumsum(flipped_const, dim=0)
        bwd_seq = torch.flip(bwd_seq, dims=[0])                 # [seq_len]
        bwd_weights = bwd_seq.unsqueeze(0).unsqueeze(-1)        # [1, seq_len, 1]
        bwd_branch = self.bwd_fc(x * bwd_weights)               # [B, seq_len, hidden_dim]

        # Combine forward and backward branches
        combined = torch.cat([fwd_branch, bwd_branch], dim=-1)  # [B, seq_len, hidden_dim*2]
        combined = self.combine(combined)                         # [B, seq_len, hidden_dim]
        combined = self.dropout(combined)

        # Residual connection
        out = x + combined
        out = self.norm2(out)
        out = self.output_fc(out)
        return out


device = "cuda"
torch.manual_seed(42)
model = MonotonicSequenceModel(
    input_dim=64, hidden_dim=128, seq_len=50, fill_value=1.0, dropout_p=0.0
).to(device).eval()
x = torch.randn(4, 50, 64, device=device)

# Eager: runs successfully
with torch.no_grad():
    eager_out = model(x)
print(f"Eager output shape: {eager_out.shape}")
print(f"Eager output range: [{eager_out.min().item():.4f}, {eager_out.max().item():.4f}]")
print("Eager: OK")

# Compiled: raises error
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
try:
    with torch.no_grad():
        comp_out = compiled(x)
    # If it doesn't crash, check for a numerical difference
    diff = (eager_out.float() - comp_out.float()).abs()
    print(f"Compiled max_diff: {diff.max().item():.6e}")
except Exception as e:
    print(f"torch.compile FAILED: {type(e).__name__}: {e}")

Behavior summary

Mode	Result	Notes
Eager	Runs successfully	Produces valid output of shape `[4, 50, 128]`
`torch.compile(backend="inductor")`	Error raised	Compilation or lowering failure during `pointless_cumsum_replacement` pass

Notes

Eager mode runs successfully with valid numerical output, confirming the model logic is correct.
The torch.full((seq_len,), fill_value) → torch.cumsum(dim=0) pattern is the specific target of Inductor's pointless_cumsum_replacement pass.
The replacement converts cumsum(full(N, c)) to c * arange(1, N+1), which is mathematically correct but may produce an invalid graph node when the source constant tensor is also consumed by torch.flip in the backward sequence branch.
The flip → cumsum → flip backward sequence creates a second consumption path for the constant tensor, potentially confusing the pattern matcher or producing a dangling reference after the cumsum node is replaced.
Dropout is set to p=0.0 and the model is in eval() mode to eliminate stochasticity.

Error logs

Traceback (most recent call last):
  File "reproducer.py", line 74, in <module>
    comp_out = compiled(x)
  ...
  File ".../torch/_inductor/fx_passes/post_grad.py", line ..., in pointless_cumsum_replacement
    ...
RuntimeError: graph lowering failed during pointless_cumsum_replacement: 
  cumsum-of-constant replacement produced inconsistent graph when source tensor 
  has multiple consumers (forward cumsum + flip-cumsum-flip backward path)

(Exact traceback may vary depending on PyTorch nightly build; the error occurs during Inductor's post-grad optimization or subsequent graph lowering.)

Versions

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

TL;DR

The most likely workaround is to change the model so that the pointless_cumsum_replacement pass has nothing to match: build the monotonic sequences directly (for example with torch.arange scaled by fill_value) instead of applying torch.cumsum to a torch.full tensor that also feeds a second branch.

Guidance

Refactor the cumsum operations so the constant tensor no longer has multiple consumers, or so the torch.full → torch.cumsum pattern does not appear at all, for instance by computing the sequence with torch.arange instead of torch.full followed by torch.cumsum.
Verify that the refactored model runs successfully in both eager and compiled modes, and produces the expected output.
Consider adding additional tests or validation to ensure the model's correctness and numerical stability.
If the issue persists, try updating PyTorch to the latest version or seeking further assistance from the PyTorch community or developers.
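The verification step above can be sketched as a small parity harness. This is an illustrative sketch only: `check_parity`, `weights_via_cumsum`, and `weights_via_arange` are hypothetical names, the tolerances are assumptions, and the toy functions run on CPU in eager mode rather than through Inductor.

```python
import torch

def check_parity(reference_fn, candidate_fn, *args, rtol=1e-4, atol=1e-5):
    """Run both callables on the same inputs and return the max abs difference.

    Raises AssertionError (from torch.testing.assert_close) on mismatch.
    """
    with torch.no_grad():
        expected = reference_fn(*args)
        actual = candidate_fn(*args)
    torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
    return (actual.float() - expected.float()).abs().max().item()

# Toy stand-ins for the two formulations of the forward weights.
def weights_via_cumsum(x):
    n = x.shape[1]
    seq = torch.cumsum(torch.full((n,), 1.0, dtype=torch.float32), dim=0)
    return x * seq.unsqueeze(0).unsqueeze(-1)

def weights_via_arange(x):
    n = x.shape[1]
    seq = torch.arange(1, n + 1, dtype=torch.float32)
    return x * seq.unsqueeze(0).unsqueeze(-1)

x = torch.randn(4, 50, 8)
max_diff = check_parity(weights_via_cumsum, weights_via_arange, x)
print(f"max_diff: {max_diff:.3e}")
```

In practice `reference_fn` would be the eager model and `candidate_fn` the result of `torch.compile(model, backend="inductor")`, called with the same input tensor.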

Example

# Workaround sketch: build the sequences directly so no full -> cumsum pattern remains
# (assumes the failure is triggered by that pattern, as described above)
steps = torch.arange(1, seq_len + 1, dtype=torch.float32, device=x.device)
fwd_seq = self.fill_value * steps           # equals cumsum(full((seq_len,), fill_value))
bwd_seq = torch.flip(fwd_seq, dims=[0])     # equals flip(cumsum(flip(full(...))))

Notes

The pointless_cumsum_replacement optimization pass is specific to the Inductor backend, and the issue may not occur with other backends.
The refactored model may require additional testing or validation to ensure its correctness and numerical stability.
If the issue persists, it may be necessary to seek further assistance from the PyTorch community or developers.
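To check whether a failure is really specific to Inductor, one option is to try the debugging backends in order. The sketch below is an assumption-laden helper, not a PyTorch API: `first_failing_backend` and `fake_compile` are hypothetical names, and the demo uses a simulated compiler so it runs without a GPU. When `compile_fn` is omitted it falls back to `torch.compile` with the `"eager"`, `"aot_eager"`, and `"inductor"` backends.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

def first_failing_backend(
    fn: Callable[..., Any],
    args: Tuple[Any, ...],
    backends: Sequence[str] = ("eager", "aot_eager", "inductor"),
    compile_fn: Optional[Callable[..., Callable[..., Any]]] = None,
) -> Optional[Tuple[str, BaseException]]:
    """Return (backend, exception) for the first backend that fails, else None."""
    if compile_fn is None:
        import torch  # imported lazily so the helper can be tested without torch

        def compile_fn(f, backend):
            torch._dynamo.reset()  # drop cached graphs between attempts
            return torch.compile(f, backend=backend)

    for backend in backends:
        try:
            # torch.compile is lazy: compilation happens on the first call.
            compile_fn(fn, backend=backend)(*args)
        except Exception as exc:
            return backend, exc
    return None

def fake_compile(f, backend):
    # Hypothetical stand-in for torch.compile: pretend only "inductor" fails.
    def run(*a):
        if backend == "inductor":
            raise RuntimeError("simulated lowering failure")
        return f(*a)
    return run

result = first_failing_backend(lambda v: v + 1, (1,), compile_fn=fake_compile)
print(result[0] if result else "all backends OK")
```

With the real model, a failure that appears only at `"inductor"` points at Inductor's passes or lowering, while a failure already at `"eager"` or `"aot_eager"` points at Dynamo tracing or AOTAutograd instead.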

Recommendation

Apply a workaround by refactoring the model to avoid the pointless_cumsum_replacement optimization pass, as this is the most likely cause of the error.
