pytorch - 💡 (How to fix) `torch.compile` raises an error for the constant-tensor cumsum pattern (`torch.full → cumsum`) while eager mode succeeds [1 participant]

pytorch/pytorch#179571 (fetched 2026-04-08 03:00:15)
🐛 Describe the bug

torch.compile with inductor backend raises an error when compiling a model that creates constant tensors via torch.full((seq_len,), fill_value, dtype=torch.float32) and then applies torch.cumsum(constant_tensor, dim=0) to generate monotonic sequences. The model uses both forward and backward (flip → cumsum → flip) monotonic sequences, combines them with linear layers, LayerNorm, residual connections, and dropout. Eager mode runs successfully and produces valid output.

The root cause appears to be Inductor's pointless_cumsum_replacement optimization pass, which replaces cumsum of a constant tensor with a direct arithmetic sequence (fill_value * arange(1, N+1)). In this model the constant tensor has two consumers: the forward branch applies cumsum to it directly, and the backward branch applies flip → cumsum → flip. When the pass rewrites the cumsum node, the replacement may produce an invalid or inconsistent intermediate representation, leading to a compilation or lowering error.
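The identity that the pass is described as relying on can be checked in eager mode on CPU. The following is a minimal sketch, not Inductor's implementation; the helper names (cumsum_of_constant, closed_form, backward_sequence) are hypothetical and exist only for this illustration.

```python
import torch


def cumsum_of_constant(n: int, c: float) -> torch.Tensor:
    """Eager reference: cumsum over a tensor filled with the constant c."""
    return torch.cumsum(torch.full((n,), c, dtype=torch.float32), dim=0)


def closed_form(n: int, c: float) -> torch.Tensor:
    """Closed form described in the issue: c * arange(1, n + 1)."""
    return c * torch.arange(1, n + 1, dtype=torch.float32)


def backward_sequence(n: int, c: float) -> torch.Tensor:
    """Backward path from the reproducer: flip -> cumsum -> flip."""
    const = torch.full((n,), c, dtype=torch.float32)
    return torch.flip(torch.cumsum(torch.flip(const, dims=[0]), dim=0), dims=[0])


n, c = 50, 1.0  # same seq_len and fill_value as the reproducer
forward_matches = torch.equal(cumsum_of_constant(n, c), closed_form(n, c))
backward_matches = torch.equal(
    backward_sequence(n, c), torch.flip(closed_form(n, c), dims=[0])
)
print(forward_matches, backward_matches)
```

With fill_value = 1.0 every value is a small integer, so exact comparison is appropriate. For other fill values, a tolerance-based comparison such as torch.allclose is the safer choice.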

Minimal reproducer

import os
os.environ["TRITON_BACKENDS_IN_TREE"] = "1"

import torch
import torch.nn as nn

class MonotonicSequenceModel(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=128, seq_len=50, fill_value=1.0, dropout_p=0.1):
        super().__init__()
        self.seq_len = seq_len
        self.fill_value = fill_value

        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.norm1 = nn.LayerNorm(hidden_dim)

        self.fwd_fc = nn.Linear(hidden_dim, hidden_dim)
        self.bwd_fc = nn.Linear(hidden_dim, hidden_dim)

        self.combine = nn.Linear(hidden_dim * 2, hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.output_fc = nn.Linear(hidden_dim, hidden_dim)
        self.dropout = nn.Dropout(p=dropout_p)

    def forward(self, x):
        # x: [B, seq_len, input_dim]
        batch_size, seq_len, _ = x.shape

        x = self.input_proj(x)           # [B, seq_len, hidden_dim]
        x = self.norm1(x)

        # Create constant tensor and apply cumsum -> monotonic sequence
        # This is the pointless_cumsum_replacement target pattern:
        #   torch.full((N,), c) -> cumsum(dim=0) == c * arange(1, N+1)
        constant_tensor = torch.full(
            (seq_len,), self.fill_value, dtype=torch.float32, device=x.device
        )

        # Forward monotonic sequence via cumsum of constant
        fwd_seq = torch.cumsum(constant_tensor, dim=0)          # [seq_len]
        fwd_weights = fwd_seq.unsqueeze(0).unsqueeze(-1)        # [1, seq_len, 1]
        fwd_branch = self.fwd_fc(x * fwd_weights)               # [B, seq_len, hidden_dim]

        # Backward monotonic sequence: flip -> cumsum -> flip
        flipped_const = torch.flip(constant_tensor, dims=[0])
        bwd_seq = torch.cumsum(flipped_const, dim=0)
        bwd_seq = torch.flip(bwd_seq, dims=[0])                 # [seq_len]
        bwd_weights = bwd_seq.unsqueeze(0).unsqueeze(-1)        # [1, seq_len, 1]
        bwd_branch = self.bwd_fc(x * bwd_weights)               # [B, seq_len, hidden_dim]

        # Combine forward and backward branches
        combined = torch.cat([fwd_branch, bwd_branch], dim=-1)  # [B, seq_len, hidden_dim*2]
        combined = self.combine(combined)                         # [B, seq_len, hidden_dim]
        combined = self.dropout(combined)

        # Residual connection
        out = x + combined
        out = self.norm2(out)
        out = self.output_fc(out)
        return out


device = "cuda"
torch.manual_seed(42)
model = MonotonicSequenceModel(
    input_dim=64, hidden_dim=128, seq_len=50, fill_value=1.0, dropout_p=0.0
).to(device).eval()
x = torch.randn(4, 50, 64, device=device)

# Eager: runs successfully
with torch.no_grad():
    eager_out = model(x)
print(f"Eager output shape: {eager_out.shape}")
print(f"Eager output range: [{eager_out.min().item():.4f}, {eager_out.max().item():.4f}]")
print("Eager: OK")

# Compiled: raises error
torch._dynamo.reset()
compiled = torch.compile(model, backend="inductor")
try:
    with torch.no_grad():
        comp_out = compiled(x)
    # If it doesn't crash, check for a numerical difference against eager
    diff = (eager_out.float() - comp_out.float()).abs()
    print(f"Compiled max_diff: {diff.max().item():.6e}")
except Exception as e:
    print(f"torch.compile FAILED: {type(e).__name__}: {e}")

Behavior summary

| Mode | Result | Notes |
| --- | --- | --- |
| Eager | Runs successfully | Produces valid output of shape [4, 50, 128] |
| torch.compile(backend="inductor") | Error raised | Compilation or lowering failure during pointless_cumsum_replacement pass |

Notes

  • Eager mode runs successfully with valid numerical output, confirming the model logic is correct.
  • The torch.full((seq_len,), fill_value) → torch.cumsum(dim=0) pattern is the specific target of Inductor's pointless_cumsum_replacement pass.
  • The replacement converts cumsum(full(N, c)) to c * arange(1, N+1), which is mathematically correct but may produce an invalid graph node when the source constant tensor is also consumed by torch.flip in the backward sequence branch.
  • The flip → cumsum → flip backward sequence creates a second consumption path for the constant tensor, potentially confusing the pattern matcher or producing a dangling reference after the cumsum node is replaced.
  • Dropout is set to p=0.0 and model is in eval() mode to eliminate stochasticity.
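The "second consumption path" described in the notes above can be made visible by tracing a reduced version of the pattern and counting the users of the full node. This is a sketch under assumptions: it uses make_fx from torch.fx.experimental.proxy_tensor on CPU, the function name weights is hypothetical, and the graph Inductor actually sees in its post-grad passes may differ from this trace.

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx


def weights(x: torch.Tensor) -> torch.Tensor:
    """Reduced form of the reproducer: one constant, two consumption paths."""
    seq_len = x.shape[1]
    const = torch.full((seq_len,), 1.0, dtype=torch.float32, device=x.device)
    fwd = torch.cumsum(const, dim=0)
    bwd = torch.flip(torch.cumsum(torch.flip(const, dims=[0]), dim=0), dims=[0])
    return x * fwd.unsqueeze(0).unsqueeze(-1) + x * bwd.unsqueeze(0).unsqueeze(-1)


# Trace to an ATen-level FX graph using a small example input.
graph_module = make_fx(weights)(torch.randn(2, 5, 3))

# Locate the node created by torch.full and list the nodes that consume it.
full_nodes = [
    node
    for node in graph_module.graph.nodes
    if node.op == "call_function" and "full" in str(node.target)
]
for node in full_nodes:
    print(node.name, "->", sorted(user.name for user in node.users))

consumer_counts = [len(node.users) for node in full_nodes]
```

In this trace the constant feeds both a cumsum node and a flip node, which is the shape of graph the notes suspect the pattern matcher handles incorrectly.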

Error logs

Traceback (most recent call last):
  File "reproducer.py", line 74, in <module>
    comp_out = compiled(x)
  ...
  File ".../torch/_inductor/fx_passes/post_grad.py", line ..., in pointless_cumsum_replacement
    ...
RuntimeError: graph lowering failed during pointless_cumsum_replacement: 
  cumsum-of-constant replacement produced inconsistent graph when source tensor 
  has multiple consumers (forward cumsum + flip-cumsum-flip backward path)

(Exact traceback may vary depending on PyTorch nightly build; the error occurs during Inductor's post-grad optimization or subsequent graph lowering.)
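Because the exact traceback varies between builds, it helps to record the full traceback text rather than only the exception type and message that the reproducer prints. The following is a minimal sketch using only the standard library; run_and_capture and always_fails are hypothetical names, and always_fails merely stands in for a failing compiled call.

```python
import traceback


def run_and_capture(fn, *args, **kwargs):
    """Call fn; return (result, None) on success or (None, traceback_text) on failure."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # format_exc() returns the same text the interpreter would print.
        return None, traceback.format_exc()


def always_fails(_):
    """Stand-in for a call such as compiled(x) that raises during compilation."""
    raise RuntimeError("stand-in for a compile-time failure")


result, tb_text = run_and_capture(always_fails, 0)
print(tb_text)
```

In the reproducer, the same helper could wrap the compiled(x) call so that the complete traceback can be attached to the issue.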

Versions

PyTorch version: 2.12.0.dev20260327+cu126
Python: 3.10.12
OS: Ubuntu 22.04.5 LTS (WSL2)
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

TL;DR

The most likely workaround is to rewrite the model so that it no longer matches the pointless_cumsum_replacement pattern, for example by building the monotonic sequences directly with torch.arange instead of applying cumsum to a torch.full tensor that is shared by the forward and backward branches.

Guidance

  • Identify the cumsum calls whose input is a constant tensor created by torch.full, and rewrite them so the constant no longer feeds more than one path, for example by reordering the forward and backward sequence branches or by computing the sequences without the full → cumsum pattern that pointless_cumsum_replacement targets.
  • Verify that the refactored model runs in both eager and compiled modes and that the compiled output matches the eager output within a tolerance.
  • Add a regression test that compares eager and compiled outputs, so the failure is caught if it reappears.
  • If the issue persists, try a newer PyTorch nightly build or follow up on pytorch/pytorch#179571.
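The verification step in the guidance above can be sketched as a small comparison helper. This is an illustration under assumptions: max_abs_diff, weights_cumsum and weights_arange are hypothetical names, the example runs on CPU in eager mode only, and the fill value 0.1 is chosen to show why a tolerance is used instead of exact equality.

```python
import torch


def max_abs_diff(reference, candidate, *inputs) -> float:
    """Run both callables on the same inputs and return the largest absolute difference."""
    with torch.no_grad():
        expected = reference(*inputs)
        actual = candidate(*inputs)
    if expected.shape != actual.shape:
        raise ValueError(f"shape mismatch: {expected.shape} vs {actual.shape}")
    return (expected.float() - actual.float()).abs().max().item()


FILL_VALUE = 0.1  # not exactly representable in float32, so rounding may differ


def weights_cumsum(x: torch.Tensor) -> torch.Tensor:
    """Original formulation: cumsum over a constant tensor."""
    const = torch.full((x.shape[1],), FILL_VALUE, dtype=torch.float32)
    return x * torch.cumsum(const, dim=0).unsqueeze(0).unsqueeze(-1)


def weights_arange(x: torch.Tensor) -> torch.Tensor:
    """Candidate refactor: closed-form sequence built with arange."""
    steps = torch.arange(1, x.shape[1] + 1, dtype=torch.float32)
    return x * (FILL_VALUE * steps).unsqueeze(0).unsqueeze(-1)


torch.manual_seed(0)
x = torch.randn(4, 50, 64)
diff = max_abs_diff(weights_cumsum, weights_arange, x)
print(f"max abs diff: {diff:.3e}")
```

The same helper accepts any two callables, so an eager module and its torch.compile wrapper could be passed as reference and candidate once compilation succeeds.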

Example

# Untested workaround sketch: build the sequences with arange so that no
# torch.full -> cumsum pattern is left for the pass to match.
steps = torch.arange(1, seq_len + 1, dtype=torch.float32, device=x.device)
fwd_seq = self.fill_value * steps          # same values as cumsum(full((seq_len,), fill_value))
bwd_seq = torch.flip(fwd_seq, dims=[0])    # same values as flip(cumsum(flip(full(...))))

Notes

  • The pointless_cumsum_replacement optimization pass is specific to the Inductor backend, so the issue may not occur with other backends.
  • The refactored model should be re-validated against the original eager output, because replacing cumsum with a closed-form expression can change floating-point rounding for some fill values.
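One way to test the claim that the failure is specific to Inductor is to compile the same function with several backends and see where it first fails. The sketch below assumes the built-in backend names "eager", "aot_eager" and "inductor"; try_backends and toy are hypothetical names, and the demo call deliberately skips "inductor" so that it runs without a compiler toolchain.

```python
import torch


def try_backends(fn, example, backends=("eager", "aot_eager", "inductor")):
    """Compile fn with each backend and record 'ok' or the exception type name."""
    results = {}
    for name in backends:
        try:
            torch._dynamo.reset()  # drop cached graphs between attempts
            with torch.no_grad():
                torch.compile(fn, backend=name)(example)
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {type(exc).__name__}"
    return results


def toy(x: torch.Tensor) -> torch.Tensor:
    """Reduced full -> cumsum pattern from the reproducer."""
    const = torch.full((x.shape[1],), 1.0, dtype=torch.float32, device=x.device)
    return x * torch.cumsum(const, dim=0).unsqueeze(0).unsqueeze(-1)


report = try_backends(toy, torch.randn(2, 5, 3), backends=("eager", "aot_eager"))
print(report)
```

If "eager" and "aot_eager" succeed while "inductor" fails on the full model, that supports the view that the problem lies in an Inductor pass rather than in graph capture.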

Recommendation

Apply the workaround: refactor the model so that it does not trigger the pointless_cumsum_replacement optimization pass, since that pass is the most likely cause of the error.
