pytorch - ✅(Solved) Fix: `torch.compile` raises RuntimeError on valid `torch.addmm` with mismatched bias shape where eager succeeds [1 pull request, 1 participant]

GitHub stats
pytorch/pytorch#178040 (fetched 2026-04-08 01:07:35)
Comments: 0
Participants: 1
Timeline events: 55
Reactions: 0
Timeline (top): subscribed ×25, mentioned ×24, labeled ×6

Error Message

RuntimeError: The size of tensor a (512) must match the size of tensor b (10000) at non-singleton dimension 1

Root Cause

In eager mode, torch.addmm(bias, input, weight, beta=0.0, alpha=0.1) computes 0.0 * bias + 0.1 * (input @ weight), which reduces to 0.1 * (input @ weight), so the bias values are irrelevant. The torch.addmm documentation states that when beta is 0 the bias argument is ignored and any nan or inf in it is not propagated. The eager CUDA kernel appears to skip the bias contribution entirely in that case, so no shape check is performed on the bias.

In torch.compile / Inductor, the lowering validates tensor shapes statically before executing, detecting the shape mismatch between bias (dim 512) and the matrix product output (dim 10000) regardless of beta value.

This is a valid compile regression: users can reasonably pass dummy bias tensors with beta=0.0 to effectively perform scaled matrix multiply.
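
The difference between the two behaviours can be sketched without PyTorch. The following is a toy model on nested lists, not PyTorch code: `addmm_reference`, its `strict` flag, and the `matmul` helper are hypothetical names introduced here for illustration. With `strict=False` the bias is never read when beta is 0 (the eager behaviour reported above); with `strict=True` the bias shape is validated up front regardless of beta (the behaviour attributed to the compiled path).

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product; raises if inner dimensions differ."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions of a and b differ")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def addmm_reference(bias: Sequence[float], a: Matrix, b: Matrix,
                    beta: float = 1.0, alpha: float = 1.0,
                    strict: bool = False) -> Matrix:
    """Toy model of beta * bias + alpha * (a @ b) for a 1-D bias.

    strict=False: with beta == 0 the bias is never read or shape-checked.
    strict=True: the bias shape is always validated, whatever beta is.
    """
    product = matmul(a, b)
    n_cols = len(product[0])
    bias_fits = len(bias) in (1, n_cols)  # 1-D broadcast rule
    if (strict or beta != 0) and not bias_fits:
        raise RuntimeError(
            f"The size of tensor a ({len(bias)}) must match the size of "
            f"tensor b ({n_cols}) at non-singleton dimension 1"
        )
    if beta == 0:
        # Bias is skipped entirely, so nan/inf in it cannot leak through.
        return [[alpha * v for v in row] for row in product]
    return [[beta * bias[j % len(bias)] + alpha * v for j, v in enumerate(row)]
            for row in product]


a = [[1.0, 2.0]]                 # shape [1, 2]
b = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 3.0]]            # shape [2, 3]
wrong_bias = [float("nan")] * 2  # shape [2], but the output needs [3]

# Lenient path: the mismatched, nan-filled bias is ignored when beta == 0.
lenient = addmm_reference(wrong_bias, a, b, beta=0.0, alpha=0.5)
print(lenient)
```

Calling the same function with `strict=True` raises a RuntimeError whose message is modelled on the one in the issue, which is the inconsistency being reported.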

PR fix notes

PR #180716: Fix torch.compile addmm with beta=0 and mismatched bias

Description (problem / solution / changelog)

Fixes #178040. Fixes torch.compile raising a RuntimeError on torch.addmm calls where beta=0 and the bias shape doesn't match the output shape. When beta=0 the bias term is zeroed out, so its shape is irrelevant. This now matches eager.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos

Changed files

  • test/inductor/test_torchinductor.py (modified, +21/-0)
  • torch/_inductor/decomposition.py (modified, +6/-0)
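
The six added lines in torch/_inductor/decomposition.py are not reproduced on this page. As a rough illustration of the general idea only (dropping the bias before anything can inspect its shape), here is a sketch on a toy tuple-based IR. The `decompose_addmm` function and the node layout are assumptions made for this example and are not taken from the PR or from Inductor.

```python
from typing import Any, Tuple

Node = Tuple[Any, ...]


def decompose_addmm(node: Node) -> Node:
    """Rewrite ("addmm", bias, a, b, beta, alpha) on a toy tuple IR.

    Hypothetical illustration only, not the code from PR #180716.
    """
    op, bias, a, b, beta, alpha = node
    if op != "addmm":
        raise ValueError(f"expected an addmm node, got {op!r}")
    product: Node = ("mm", a, b)
    scaled = product if alpha == 1 else ("mul", product, alpha)
    if beta == 0:
        # The bias operand is dropped, so its shape is never examined.
        return scaled
    scaled_bias = bias if beta == 1 else ("mul", bias, beta)
    return ("add", scaled, scaled_bias)


rewritten = decompose_addmm(("addmm", "bias", "x", "w_t", 0.0, 0.1))
print(rewritten)
```

In this sketch the beta=0 case produces a node tree that no longer mentions the bias at all, so a later shape-validation pass has nothing to reject.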


🐛 Describe the bug

torch.compile raises RuntimeError: The size of tensor a (512) must match the size of tensor b (10000) at non-singleton dimension 1 on a valid torch.addmm call that eager mode executes successfully. This is a compile regression — the model runs fine in eager mode but fails under compilation.

The root cause is a torch.addmm(bias, input, weight.t(), beta=0.0, alpha=0.1) call where the bias tensor has a wrong size (d_model=512 instead of vocab_size=10000), but in eager mode beta=0.0 causes the bias term to be zeroed out, making its shape irrelevant. torch.compile / Inductor appears to lower this in a way that validates the bias shape even when beta=0.0.

This was discovered via a fuzzer-generated Transformer model targeting the unfuse_bias_add_to_pointwise optimization pattern.

Minimal reproducer

import torch

d_model = 512
vocab_size = 10000
batch_size = 4

x = torch.randn(batch_size, d_model, device="cuda")
weight = torch.randn(vocab_size, d_model, device="cuda")
bias = torch.zeros(d_model, device="cuda")  # shape [512], not [10000]

# Eager: succeeds because beta=0.0 zeros out the bias term
try:
    out = torch.addmm(bias, x, weight.t(), beta=0.0, alpha=0.1)
    print(f"eager: OK shape={out.shape}")
except RuntimeError as e:
    print(f"eager: ERROR — {e}")

# Compiled: raises shape mismatch error
torch._dynamo.reset()

@torch.compile(fullgraph=True)
def compiled_addmm(bias, x, weight):
    return torch.addmm(bias, x, weight.t(), beta=0.0, alpha=0.1)

try:
    out = compiled_addmm(bias, x, weight)
    print(f"compile: OK shape={out.shape}")
except RuntimeError as e:
    print(f"compile: ERROR — {e}")

Full model-level reproducer (as found by fuzzer)

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerStyleModel(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=3, vocab_size=10000, max_seq_len=512):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = nn.Parameter(torch.randn(1, max_seq_len, d_model) * 0.02)
        self.ffn_weight1 = nn.Parameter(torch.randn(d_model * 4, d_model) * 0.02)
        self.ffn_bias1 = nn.Parameter(torch.zeros(d_model * 4))
        self.ffn_weight2 = nn.Parameter(torch.randn(d_model, d_model * 4) * 0.02)
        self.ffn_bias2 = nn.Parameter(torch.zeros(d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.output_proj = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        batch_size, seq_len = x.shape
        x = self.embedding(x) * self.d_model ** 0.5
        x = x + self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x)
        x_flat = x.view(-1, self.d_model)
        ffn_intermediate = torch.addmm(self.ffn_bias1, x_flat, self.ffn_weight1.t(), beta=1.0, alpha=1.0)
        ffn_intermediate = F.gelu(ffn_intermediate)
        ffn_intermediate = ffn_intermediate * 0.5
        ffn_intermediate = self.dropout(ffn_intermediate)
        ffn_output = torch.addmm(self.ffn_bias2, ffn_intermediate, self.ffn_weight2.t(), beta=1.0, alpha=1.0)
        ffn_output = torch.tanh(ffn_output)
        ffn_output = ffn_output + x_flat
        ffn_output = self.ln1(ffn_output)
        ffn_output = ffn_output.view(batch_size, seq_len, self.d_model)
        x_flat2 = ffn_output.view(-1, self.d_model)
        # BUG: bias shape [512] mismatches output [10000], but beta=0.0 makes it work in eager
        logits = torch.addmm(torch.zeros(self.d_model, device=x.device), x_flat2,
                              self.output_proj.weight.t(), beta=0.0, alpha=0.1)
        logits = logits + self.output_proj.bias
        logits = torch.softmax(logits, dim=-1)
        logits = logits * 0.9 + 0.05
        logits = logits.view(batch_size, seq_len, -1)
        return logits

model = TransformerStyleModel().cuda().eval()
x = torch.randint(0, 10000, (2, 32), dtype=torch.long, device="cuda")

# Eager: succeeds
try:
    with torch.no_grad():
        out = model(x)
    print(f"eager: OK shape={out.shape}")
except RuntimeError as e:
    print(f"eager: ERROR — {e}")

# Compiled: fails
torch._dynamo.reset()
compiled_model = torch.compile(model, fullgraph=True)
try:
    with torch.no_grad():
        out = compiled_model(x)
    print(f"compile: OK shape={out.shape}")
except RuntimeError as e:
    print(f"compile: ERROR — {e}")

Behavior summary

Operation | Eager | torch.compile | Consistent?
torch.addmm(bias_512, x, weight_10000x512.t(), beta=0.0) | Succeeds | RuntimeError (shape mismatch) | No
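
A consistency check like the one summarized in this table can be automated with a small differential harness. The sketch below is framework-free: `run_guarded`, `compare_backends`, and the two stand-in callables are hypothetical names chosen for this example. With PyTorch, the reference would be the eager function and the candidate its `torch.compile`d counterpart, and tensor results would be compared with a tolerance rather than `==`.

```python
from typing import Any, Callable, Dict


def run_guarded(fn: Callable[..., Any], *args: Any) -> Dict[str, Any]:
    """Run fn and record whether it succeeded or raised RuntimeError."""
    try:
        return {"ok": True, "value": fn(*args), "error": None}
    except RuntimeError as exc:
        return {"ok": False, "value": None, "error": str(exc)}


def compare_backends(reference: Callable[..., Any],
                     candidate: Callable[..., Any],
                     *args: Any) -> Dict[str, Any]:
    """Run both callables on the same inputs and flag any disagreement."""
    ref = run_guarded(reference, *args)
    cand = run_guarded(candidate, *args)
    consistent = ref["ok"] == cand["ok"] and ref["value"] == cand["value"]
    return {"reference": ref, "candidate": cand, "consistent": consistent}


def eager_like(x: int) -> int:
    """Stand-in for the eager path: always succeeds."""
    return x * 2


def compiled_like(x: int) -> int:
    """Stand-in for the compiled path: rejects the input."""
    raise RuntimeError("shape mismatch")


report = compare_backends(eager_like, compiled_like, 3)
print(report["consistent"], report["candidate"]["error"])
```

A report with `consistent` set to False and an error on only one side corresponds to the "No" in the last column of the table.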

Ablation

This bug was discovered in E6 (full system) round-3, where the advanced feedback and self-repair pipeline generated models targeting the unfuse_bias_add_to_pointwise optimization pattern.

Error logs

Eager mode (correct behavior — succeeds):

(no error — returns tensor of shape [batch*seq, vocab_size])

torch.compile (incorrect — should also succeed):

RuntimeError: The size of tensor a (512) must match the size of tensor b (10000)
at non-singleton dimension 1

Versions

PyTorch version: 2.12.0.dev20260315+cu126
OS: Ubuntu 22.04.5 LTS (x86_64)
Python version: 3.10.12
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

Fix Plan

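PR #180716 addresses this on the Inductor side, so compiled torch.addmm with beta=0 should no longer reject a mismatched bias. On builds that predate that change, a user-side workaround is to give the bias tensor a shape that broadcasts to the output of the matrix product, even though beta=0.0 means its values are never used.

Workaround steps:

  • In the minimal reproducer, create the bias with shape [vocab_size] instead of [d_model].
  • In the model, size the dummy bias passed to the beta=0.0 call from the output projection rather than from d_model.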

Code Changes

# Minimal reproducer: give the bias the output width
bias = torch.zeros(vocab_size, device="cuda")  # shape [10000]

# Model forward: vocab_size is not in scope here, so take the size from the output projection
logits = torch.addmm(torch.zeros(self.output_proj.out_features, device=x.device), x_flat2,
                      self.output_proj.weight.t(), beta=0.0, alpha=0.1)

Verification

To verify that the fix worked, run the compiled model with the updated bias tensor shape and check that it no longer raises a RuntimeError.

# Eager: succeeds
try:
    with torch.no_grad():
        out = model(x)
    print(f"eager: OK shape={out.shape}")
except RuntimeError as e:
    print(f"eager: ERROR — {e}")

# Compiled: should now succeed
torch._dynamo.reset()
compiled_model = torch.compile(model, fullgraph=True)
try:
    with torch.no_grad():
        out = compiled_model(x)
    print(f"compile: OK shape={out.shape}")
except RuntimeError as e:
    print(f"compile: ERROR — {e}")

Extra Tips

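  • Keep the bias broadcastable to the output shape [rows of input, columns of weight] when calling torch.addmm, even when beta=0.0, so the call behaves the same in eager and compiled mode across PyTorch versions.
  • When beta=0.0, either create the dummy bias with torch.zeros at the correct shape, or drop the bias and compute alpha * torch.mm(input, weight) directly, which is what the call reduces to.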
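
The shape rule behind these tips can be checked before any tensor is created. The sketch below works on plain shape tuples; `addmm_output_shape` and `bias_broadcasts_to` are hypothetical helper names introduced here, and the broadcast rule is the usual trailing-dimension rule (each bias dimension must equal the output dimension or be 1).

```python
from typing import Sequence, Tuple


def addmm_output_shape(mat1: Sequence[int], mat2: Sequence[int]) -> Tuple[int, int]:
    """Shape of mat1 @ mat2 for two 2-D shapes."""
    if len(mat1) != 2 or len(mat2) != 2:
        raise ValueError("addmm expects two 2-D matrices")
    if mat1[1] != mat2[0]:
        raise ValueError(f"cannot multiply {tuple(mat1)} by {tuple(mat2)}")
    return (mat1[0], mat2[1])


def bias_broadcasts_to(bias: Sequence[int], out: Sequence[int]) -> bool:
    """True if a tensor of shape `bias` can broadcast to shape `out`."""
    if len(bias) > len(out):
        return False
    # Compare trailing dimensions; each must be equal to the target or be 1.
    return all(b == o or b == 1 for b, o in zip(reversed(bias), reversed(out)))


# Shapes from the minimal reproducer: x is [4, 512], weight.t() is [512, 10000].
out_shape = addmm_output_shape((4, 512), (512, 10000))
print(out_shape)
print(bias_broadcasts_to((512,), out_shape))    # the bias from the issue
print(bias_broadcasts_to((10000,), out_shape))  # the corrected bias
```

The bias of shape [512] from the issue fails this check while a bias of shape [10000] passes, matching the mismatch reported in the error message.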
