pytorch - ✅(Solved) Fix `torch.compile` silently succeeds on `TransformerEncoder` with all-masked `src_key_padding_mask` where eager raises RuntimeError [1 pull requests, 2 comments, 3 participants]

pytorch • 2026-03-28 05:47:24

pytorch/pytorch#178677 • Fetched 2026-04-08 01:40:20

Root Cause

In eager mode, TransformerEncoder uses src_key_padding_mask to convert its input to a nested tensor via torch._nested_tensor_from_mask. With all positions masked (all True), every constituent tensor in the nested tensor has zero elements, and the subsequent to_padded_tensor call fails because it requires at least one constituent tensor with non-zero numel.

The compiled path takes a different execution route — Dynamo traces through the transformer layers without going through the nested tensor conversion, sidestepping the check entirely.

This is an eager vs compiled consistency bug: both paths should either succeed or fail for the same input.
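
The round trip described above can be illustrated with the public torch.nested API. This is a minimal sketch, not the code path inside TransformerEncoder: the eager path uses the private torch._nested_tensor_from_mask, and the sequence lengths and feature size here are assumptions chosen for illustration.

```python
import torch

# Two sequences with different true lengths (2 and 3 tokens), feature size 4.
# These shapes are illustrative assumptions, not values from the issue.
seqs = [torch.randn(2, 4), torch.randn(3, 4)]

# Pack the sequences without padding. PyTorch may emit a prototype-stage
# UserWarning here, similar to the one in the eager error log.
nt = torch.nested.nested_tensor(seqs)

# Convert back to a dense batch; shorter sequences are filled with 0.0.
padded = torch.nested.to_padded_tensor(nt, 0.0)

print(padded.shape)
```

With a normal mask each constituent keeps its unmasked tokens. With an all-True mask every constituent would be empty, which is the case the eager error message rejects.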

PR fix notes

PR #179627: Fix all masked Transformer eager/compile inconsistency

Repository: pytorch/pytorch
Author: dsashidh
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/179627

Description (problem / solution / changelog)

Fixes #178677: eager/compiled inconsistency when src_key_padding_mask is all True. Compiled mode skips the nested tensor optimization (which breaks under torch.compile; see the 481-482 comment), so it bypasses the error check that eager hits. The PR adds upfront validation so that both modes reject the invalid input.

Changed files

test/test_transformers.py (modified, +30/-0)
torch/nn/modules/transformer.py (modified, +7/-0)

🐛 Describe the bug

torch.compile silently succeeds when nn.TransformerEncoder receives a src_key_padding_mask with all True values (all positions masked), while eager mode raises RuntimeError: to_padded_tensor: at least one constituent tensor should have non-zero numel.

In PyTorch's convention, True in src_key_padding_mask means "this position is padding and should be masked." With an all-True mask, every position is padding — the eager path attempts to create nested tensors via torch._nested_tensor_from_mask, which fails because no constituent tensor has non-zero elements. The compiled path bypasses the nested tensor path entirely and computes a result from all-masked (NaN/zero) data.
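
To see why an all-True mask is ill-defined for attention, the sketch below applies the masking convention to a single row of attention scores. It is a simplified model written for illustration, not the kernel used by nn.TransformerEncoder; the helper name masked_attention_weights and the score values are assumptions.

```python
import torch

def masked_attention_weights(scores, key_padding_mask):
    """Softmax over keys, with padded keys (True in the mask) excluded.

    Hypothetical helper: a simplified stand-in for attention masking.
    """
    # Padded keys get a score of -inf so that softmax assigns them zero weight.
    masked = scores.masked_fill(key_padding_mask, float("-inf"))
    return torch.softmax(masked, dim=-1)

scores = torch.tensor([[0.5, 1.0, -0.2, 0.3]])           # one query, four keys
partial = torch.tensor([[False, False, True, True]])     # last two keys are padding
full = torch.ones(1, 4, dtype=torch.bool)                # every key is padding

w_partial = masked_attention_weights(scores, partial)    # valid distribution
w_full = masked_attention_weights(scores, full)          # softmax over all -inf

print(w_partial)
print(w_full)
```

In this simplified model the partially masked row yields weights that sum to 1 with zeros at the padded keys, while the fully masked row yields NaN in every position. Fused kernels may treat fully masked rows differently, which is consistent with the "NaN/zero" wording above.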

Affected files

File	Source	Pattern
`randperm_index_pattern-5.py`	E8 (struct+route+thompson), round-4	`randperm_index_pattern`

Full model-level reproducer

import torch
import torch.nn as nn

class RandomPermutationTransformer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8,
                 num_layers=6, slice_shape=8):
        super().__init__()
        self.slice_shape = slice_shape
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = nn.Parameter(torch.randn(1, 1000, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=2048, dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 256), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(256, 2)
        )
        self.bn = nn.BatchNorm1d(d_model)

    def forward(self, x, attention_mask=None):
        batch_size, seq_len = x.shape
        embedded = self.embedding(x)
        embedded = embedded + self.pos_encoding[:, :seq_len, :]
        if attention_mask is not None:
            transformer_out = self.transformer(
                embedded, src_key_padding_mask=attention_mask
            )
        else:
            transformer_out = self.transformer(embedded)
        pooled = transformer_out.mean(dim=1)
        pooled = self.bn(pooled)
        perm = torch.randperm(batch_size, device=pooled.device)
        selected_indices = perm[:self.slice_shape]
        selected_pooled = pooled[selected_indices]
        logits = self.classifier(selected_pooled)
        return logits


model = RandomPermutationTransformer(
    vocab_size=10000, d_model=512, nhead=8,
    num_layers=6, slice_shape=8
).cuda()
model.eval()

input_ids = torch.randint(0, 10000, (16, 128), dtype=torch.long).cuda()
# All positions masked — edge case
attention_mask = torch.ones(16, 128, dtype=torch.bool).cuda()

# Eager: fails
try:
    with torch.no_grad():
        model(input_ids, attention_mask)
    print("eager: OK")
except RuntimeError as e:
    print(f"eager: ERROR — {e}")

# Compiled: succeeds (inconsistent)
torch._dynamo.reset()
compiled_model = torch.compile(model)
try:
    with torch.no_grad():
        out = compiled_model(input_ids, attention_mask)
    print(f"compile: OK — shape={out.shape}")
except Exception as e:
    print(f"compile: ERROR — {e}")

Behavior summary

Mode	Result	Output
Eager	RuntimeError	`to_padded_tensor: at least one constituent tensor should have non-zero numel`
`torch.compile`	Success (inconsistent)	`torch.Size([8, 2])` — numerical output from all-masked data

The bug requires TransformerEncoder with src_key_padding_mask=all_True. It is specific to the edge case where every position is masked. With normal masks (some True, some False), both eager and compiled paths succeed consistently.
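
For contrast, a normal mask is usually derived from per-sequence lengths, so every row keeps at least one unmasked position. A minimal sketch, assuming right-padded sequences; the helper name padding_mask_from_lengths and the example lengths are hypothetical.

```python
import torch

def padding_mask_from_lengths(lengths, seq_len):
    """Build a (batch, seq_len) bool mask where True marks padding.

    Hypothetical helper; assumes each sequence is right-padded to seq_len.
    """
    positions = torch.arange(seq_len, device=lengths.device).unsqueeze(0)
    # A position is padding when its index is at or beyond the true length.
    return positions >= lengths.unsqueeze(1)

lengths = torch.tensor([5, 3, 1])          # true lengths of three sequences
mask = padding_mask_from_lengths(lengths, seq_len=5)

print(mask)
```

A mask built this way can only be all True if every length is zero, which is the edge case this issue is about.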

Error logs

Eager mode (fails):

UserWarning: The PyTorch API of nested tensors is in prototype stage...
RuntimeError: to_padded_tensor: at least one constituent tensor should have non-zero numel

torch.compile (succeeds inconsistently):

compile: OK — shape=torch.Size([8, 2])

Versions

PyTorch version: 2.12.0.dev20260315+cu126
OS: Ubuntu 22.04.5 LTS (x86_64)
Python version: 3.10.12
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki @cpuhrsch @bhosmer @drisspg @soulitzer @davidberard98 @YuqingJ @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

Fix Plan

PR #179627 is still open, so until an upstream fix lands the inconsistency can be worked around in user code. Check src_key_padding_mask before calling TransformerEncoder and reject the all-masked case explicitly, so that eager and compiled runs behave the same way.

Step-by-Step Solution

Check for all-masked positions: Before calling TransformerEncoder, check if all positions in src_key_padding_mask are True.
Raise an error or handle the edge case: If all positions are masked, either raise a RuntimeError or handle this edge case by returning a specific value or tensor.

Example Code

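def forward(self, x, attention_mask=None):
    batch_size, seq_len = x.shape
    embedded = self.embedding(x)
    embedded = embedded + self.pos_encoding[:, :seq_len, :]
    if attention_mask is not None:
        # Reject the all-masked edge case up front so that eager and
        # compiled runs fail in the same way.
        if attention_mask.all():
            raise RuntimeError(
                "src_key_padding_mask masks every position; "
                "at least one position must be unmasked"
            )
        transformer_out = self.transformer(
            embedded, src_key_padding_mask=attention_mask
        )
    else:
        transformer_out = self.transformer(embedded)
    # The remainder is unchanged from the reproducer.
    pooled = transformer_out.mean(dim=1)
    pooled = self.bn(pooled)
    perm = torch.randperm(batch_size, device=pooled.device)
    selected_indices = perm[:self.slice_shape]
    selected_pooled = pooled[selected_indices]
    logits = self.classifier(selected_pooled)
    return logits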
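
A check placed inside forward branches on tensor values, which under torch.compile may introduce a graph break. One alternative, sketched here as an assumption rather than as the approach taken in the PR, is to validate the mask in plain Python before handing the inputs to the eager or compiled callable. The name call_with_mask_check is hypothetical.

```python
import torch

def call_with_mask_check(model_fn, x, attention_mask=None):
    """Validate the mask eagerly, then delegate to model_fn.

    Hypothetical wrapper: model_fn may be the eager module or the result of
    torch.compile(model). The check runs outside the traced region.
    """
    if attention_mask is not None and bool(attention_mask.all()):
        raise RuntimeError(
            "src_key_padding_mask masks every position; "
            "at least one position must be unmasked"
        )
    return model_fn(x, attention_mask)
```

Usage would look like call_with_mask_check(compiled_model, input_ids, attention_mask). The trade-off is that callers who bypass the wrapper also bypass the check.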

Verification

To verify the fix, run the model in both eager and compiled modes with an all-True mask. With the check above in place, both runs should fail with the same RuntimeError instead of diverging.
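
The comparison can be scripted with a small helper that records whether each callable succeeded or raised. This is a sketch under stated assumptions: the helper names run_outcome and modes_agree are hypothetical, and the torch.compile usage appears only in a comment so the block does not depend on a working compiler toolchain.

```python
import torch

def run_outcome(fn, *args):
    """Return ("ok", None) if fn(*args) returns, else ("error", message)."""
    try:
        with torch.no_grad():
            fn(*args)
        return "ok", None
    except Exception as exc:  # compiled runs may raise wrapped exception types
        return "error", str(exc)

def modes_agree(eager_fn, compiled_fn, *args):
    """True when both callables succeed or both raise for the same inputs."""
    eager_status, _ = run_outcome(eager_fn, *args)
    compiled_status, _ = run_outcome(compiled_fn, *args)
    return eager_status == compiled_status

# Usage with the reproducer's names (not executed here):
#   torch._dynamo.reset()
#   compiled_model = torch.compile(model)
#   assert modes_agree(model, compiled_model, input_ids, attention_mask)
```

The helper compares only the success or failure status, not the error text, because the two modes may word the same failure differently.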

Extra Tips

When working with masked positions in TransformerEncoder, always verify that not all positions are masked to avoid inconsistencies between eager and compiled modes.
Consider adding additional checks or handling for other edge cases that may arise during the execution of your model.
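
As one example of such an additional check, the sketch below reports which batch rows are fully masked, which can help trace an all-True mask back to the offending inputs. Whether a batch with only some fully masked rows is a problem is not covered by the issue, so treat this as a diagnostic aid under that assumption; the helper name fully_masked_rows is hypothetical.

```python
import torch

def fully_masked_rows(key_padding_mask):
    """Return the indices of batch rows in which every position is padding.

    Hypothetical diagnostic helper; expects a (batch, seq_len) bool mask.
    """
    row_is_all_padding = key_padding_mask.all(dim=1)
    return torch.nonzero(row_is_all_padding).flatten().tolist()

example_mask = torch.tensor([
    [True, True, True],     # fully masked
    [False, True, True],    # one real token
    [True, True, True],     # fully masked
])
bad_rows = fully_masked_rows(example_mask)

print(bad_rows)
```

The all-masked case from this issue corresponds to the returned list covering every row in the batch.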
