pytorch - ✅(Solved) Fix `torch.compile` silently succeeds on `TransformerEncoder` with all-masked `src_key_padding_mask` where eager raises RuntimeError [1 pull requests, 2 comments, 3 participants]

GitHub stats
pytorch/pytorch#178677, fetched 2026-04-08 01:40:20
Comments: 2
Participants: 3
Timeline: 75
Reactions: 0
Timeline (top): mentioned ×30, subscribed ×30, labeled ×10, commented ×2

Root Cause

The TransformerEncoder in eager mode converts src_key_padding_mask to a nested tensor via torch._nested_tensor_from_mask. With all positions masked (all True), this creates nested tensors with zero elements, and the subsequent to_padded_tensor call fails because it requires at least one constituent tensor with non-zero numel.

The compiled path takes a different execution route — Dynamo traces through the transformer layers without going through the nested tensor conversion, sidestepping the check entirely.

This is an eager vs compiled consistency bug: both paths should either succeed or fail for the same input.
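
A rough way to see why the all-True case leaves the eager path with nothing to pack is to count the positions each sequence keeps once padding is dropped. The sketch below does not call the private torch._nested_tensor_from_mask API; kept_token_counts is a hypothetical helper that only mirrors the bookkeeping described above.

```python
import torch


def kept_token_counts(src_key_padding_mask: torch.Tensor) -> torch.Tensor:
    """Return, per sequence, how many positions are NOT padding.

    Hypothetical helper for illustration; True in the mask means "padding".
    """
    return src_key_padding_mask.logical_not().sum(dim=1)


# Mixed mask: the last two positions of each row are padding.
mixed = torch.zeros(2, 4, dtype=torch.bool)
mixed[:, 2:] = True

# All-True mask: every position of every row is padding.
all_masked = torch.ones(2, 4, dtype=torch.bool)

mixed_counts = kept_token_counts(mixed)
all_masked_counts = kept_token_counts(all_masked)
```

With the all-True mask every count is zero, which is the situation the to_padded_tensor error message describes: no constituent tensor has any elements.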

PR fix notes

PR #179627: Fix all masked Transformer eager/compile inconsistency

Description (problem / solution / changelog)

Fixes #178677: an eager/compiled inconsistency when src_key_padding_mask is all True. Compiled mode skips the nested tensor optimization (which breaks under torch.compile; see the 481-482 comment), so it bypasses the error check that eager hits. The PR adds upfront validation so that both modes reject the invalid input.

Changed files

  • test/test_transformers.py (modified, +30/-0)
  • torch/nn/modules/transformer.py (modified, +7/-0)

🐛 Describe the bug

torch.compile silently succeeds when nn.TransformerEncoder receives a src_key_padding_mask with all True values (all positions masked), while eager mode raises RuntimeError: to_padded_tensor: at least one constituent tensor should have non-zero numel.

In PyTorch's convention, True in src_key_padding_mask means "this position is padding and should be masked." With an all-True mask, every position is padding — the eager path attempts to create nested tensors via torch._nested_tensor_from_mask, which fails because no constituent tensor has non-zero elements. The compiled path bypasses the nested tensor path entirely and computes a result from all-masked (NaN/zero) data.
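
As a small illustration of that convention, the sketch below builds a padding mask from per-sequence lengths. The helper name padding_mask_from_lengths and the example lengths are assumptions made for this illustration; they do not come from the issue.

```python
import torch


def padding_mask_from_lengths(lengths: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Build a (batch, seq_len) bool mask where True marks a padding position.

    Hypothetical helper: position j of row i is padding when j >= lengths[i].
    """
    positions = torch.arange(seq_len, device=lengths.device)
    return positions.unsqueeze(0) >= lengths.unsqueeze(1)


# Three sequences with 3, 1, and 0 real tokens, padded to length 4.
lengths = torch.tensor([3, 1, 0])
mask = padding_mask_from_lengths(lengths, seq_len=4)

# Rows that consist only of padding, and whether the whole batch does.
fully_masked_rows = mask.all(dim=1)
whole_batch_masked = bool(mask.all())
```

The reproducer in this issue corresponds to the whole-batch case, where mask.all() is True for the entire tensor; here only the third row is fully padded.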

Affected files

| File | Source | Pattern |
| --- | --- | --- |
| randperm_index_pattern-5.py | E8 (struct+route+thompson), round-4 | randperm_index_pattern |

Full model-level reproducer

import torch
import torch.nn as nn

class RandomPermutationTransformer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8,
                 num_layers=6, slice_shape=8):
        super().__init__()
        self.slice_shape = slice_shape
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = nn.Parameter(torch.randn(1, 1000, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=2048, dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers
        )
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 256), nn.ReLU(),
            nn.Dropout(0.2), nn.Linear(256, 2)
        )
        self.bn = nn.BatchNorm1d(d_model)

    def forward(self, x, attention_mask=None):
        batch_size, seq_len = x.shape
        embedded = self.embedding(x)
        embedded = embedded + self.pos_encoding[:, :seq_len, :]
        if attention_mask is not None:
            transformer_out = self.transformer(
                embedded, src_key_padding_mask=attention_mask
            )
        else:
            transformer_out = self.transformer(embedded)
        pooled = transformer_out.mean(dim=1)
        pooled = self.bn(pooled)
        perm = torch.randperm(batch_size, device=pooled.device)
        selected_indices = perm[:self.slice_shape]
        selected_pooled = pooled[selected_indices]
        logits = self.classifier(selected_pooled)
        return logits


model = RandomPermutationTransformer(
    vocab_size=10000, d_model=512, nhead=8,
    num_layers=6, slice_shape=8
).cuda()
model.eval()

input_ids = torch.randint(0, 10000, (16, 128), dtype=torch.long).cuda()
# All positions masked — edge case
attention_mask = torch.ones(16, 128, dtype=torch.bool).cuda()

# Eager: fails
try:
    with torch.no_grad():
        model(input_ids, attention_mask)
    print("eager: OK")
except RuntimeError as e:
    print(f"eager: ERROR — {e}")

# Compiled: succeeds (inconsistent)
torch._dynamo.reset()
compiled_model = torch.compile(model)
try:
    with torch.no_grad():
        out = compiled_model(input_ids, attention_mask)
    print(f"compile: OK — shape={out.shape}")
except Exception as e:
    print(f"compile: ERROR — {e}")

Behavior summary

| Mode | Result | Output |
| --- | --- | --- |
| Eager | RuntimeError | to_padded_tensor: at least one constituent tensor should have non-zero numel |
| torch.compile | Success (inconsistent) | torch.Size([8, 2]), numerical output from all-masked data |

The bug requires nn.TransformerEncoder with a src_key_padding_mask whose entries are all True, so it is specific to the edge case where every position is masked. With normal masks (some True, some False), the eager and compiled paths both succeed.

Error logs

Eager mode (fails):

UserWarning: The PyTorch API of nested tensors is in prototype stage...
RuntimeError: to_padded_tensor: at least one constituent tensor should have non-zero numel

torch.compile (succeeds inconsistently):

compile: OK — shape=torch.Size([8, 2])

Versions

PyTorch version: 2.12.0.dev20260315+cu126
OS: Ubuntu 22.04.5 LTS (x86_64)
Python version: 3.10.12
GPU: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA: 12.6

cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki @cpuhrsch @bhosmer @drisspg @soulitzer @davidberard98 @YuqingJ @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo @ezyang @msaroufim @bdhirsh @anijain2305

topic: fuzzer

extent analysis

Fix Plan

To remove the inconsistency, eager and compiled modes must treat the all-masked edge case the same way. PR #179627 does this in torch/nn/modules/transformer.py by validating the mask up front. On a build without that change, a user-side workaround is to check whether every position in src_key_padding_mask is True before calling TransformerEncoder.

Step-by-Step Solution

  1. Check for all-masked positions: Before calling TransformerEncoder, check if all positions in src_key_padding_mask are True.
  2. Raise an error or handle the edge case: If all positions are masked, either raise a RuntimeError or handle this edge case by returning a specific value or tensor.

Example Code

def forward(self, x, attention_mask=None):
    batch_size, seq_len = x.shape
    embedded = self.embedding(x)
    embedded = embedded + self.pos_encoding[:, :seq_len, :]
    if attention_mask is not None:
        # True marks padding, so an all-True mask leaves no position to attend to.
        # Reject it up front so eager and compiled runs fail the same way.
        # This branch depends on tensor data, so torch.compile is expected to
        # graph-break here rather than trace through it.
        if attention_mask.all():
            raise RuntimeError("All positions are masked")
        transformer_out = self.transformer(
            embedded, src_key_padding_mask=attention_mask
        )
    else:
        transformer_out = self.transformer(embedded)
    # ... rest of the code

Verification

To verify the fix, run the model in both eager and compiled modes with an all-True mask. With the check above in place, both runs should raise the same RuntimeError instead of diverging.
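
One possible shape for that comparison is sketched below, on CPU with a deliberately tiny encoder. GuardedEncoder, outcome, and compare_modes are hypothetical names introduced for this sketch, and backend="eager" is an assumption chosen to keep compilation light; the issue itself used the default backend on CUDA.

```python
import torch
import torch.nn as nn


class GuardedEncoder(nn.Module):
    """Hypothetical wrapper that rejects an all-True padding mask up front."""

    def __init__(self, d_model=16, nhead=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=32, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x, mask):
        # Same user-side check as in the example code above.
        if mask.all():
            raise RuntimeError("All positions are masked")
        return self.encoder(x, src_key_padding_mask=mask)


def outcome(fn, *args):
    """Run fn under no_grad; return "ok" or the raised exception's type name."""
    try:
        with torch.no_grad():
            fn(*args)
        return "ok"
    except Exception as exc:
        return type(exc).__name__


def compare_modes(model, *args, backend="eager"):
    """Return (eager outcome, compiled outcome) for the same inputs."""
    compiled = torch.compile(model, backend=backend)
    return outcome(model, *args), outcome(compiled, *args)


model = GuardedEncoder().eval()
x = torch.randn(2, 4, 16)
all_masked = torch.ones(2, 4, dtype=torch.bool)

eager_result, compiled_result = compare_modes(model, x, all_masked)
print(f"eager: {eager_result}, compiled: {compiled_result}")
```

If the guard behaves as intended, the two entries should match. Passing a partially masked input, or a different backend, through compare_modes is a quick way to check that the ordinary case is still consistent as well.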

Extra Tips

  • When working with masked positions in TransformerEncoder, always verify that not all positions are masked to avoid inconsistencies between eager and compiled modes.
  • Consider adding additional checks or handling for other edge cases that may arise during the execution of your model.
