pytorch: [RFC] nn.BlockAttentionResidual: depth-wise softmax attention over layer outputs as a drop-in residual replacement [3 comments, 3 participants]

pytorch · 2026-03-16 15:54:08


pytorch/pytorch#177537•Fetched 2026-04-08 00:47:33


Standard residual connections in nn.TransformerEncoder accumulate all layer outputs with fixed unit weights: x = x + sublayer(x). This uniform aggregation causes hidden-state magnitudes to grow as O(L) with depth under PreNorm, progressively diluting each layer's relative contribution — a phenomenon documented as "PreNorm dilution" [Xiong et al., 2020; Li et al., 2026].

This RFC proposes adding nn.BlockAttentionResidual, a drop-in residual operator based on the Attention Residuals (AttnRes) paper published today by the Kimi Team (Moonshot AI): https://github.com/MoonshotAI/Attention-Residuals

The core idea: replace the fixed accumulation h_l = Σ v_i with learned, input-dependent softmax attention over preceding layer outputs: h_l = Σ α_{i→l} · v_i, where α_{i→l} are softmax weights computed from a single learned pseudo-query w_l ∈ ℝ^d per layer.
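Written out, the weighting can be stated as below. This is a sketch under two assumptions: the sources are indexed v_0, ..., v_{l-1} with v_0 the token embedding (an indexing convention chosen here for illustration), and the keys are RMS-normalized as in the reference implementation's docstring.

```latex
\alpha_{i \to l}
  = \frac{\exp\!\left(w_l^{\top}\,\mathrm{RMSNorm}(v_i)\right)}
         {\sum_{j=0}^{l-1} \exp\!\left(w_l^{\top}\,\mathrm{RMSNorm}(v_j)\right)},
\qquad
h_l = \sum_{i=0}^{l-1} \alpha_{i \to l}\, v_i .

% Zero-initialized pseudo-query: every logit is 0, so the weights are uniform.
w_l = 0
\;\Longrightarrow\;
\alpha_{i \to l} = \frac{1}{l},
\qquad
h_l = \frac{1}{l} \sum_{i=0}^{l-1} v_i .
```

With w_l = 0 the layer input is the plain average of its sources, which is the standard residual sum divided by the number of sources.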


🚀 The feature, motivation and pitch

🚀 Feature Request / RFC

Summary

Motivation

The problem with current nn.TransformerEncoder:

# Current behaviour — every layer in TransformerEncoder:
x = x + self._sa_block(self.norm1(x), ...)   # fixed weight: 1
x = x + self._ff_block(self.norm2(x))        # fixed weight: 1

Each layer only sees its immediate predecessor h_{l-1}, a single compressed state that conflates all earlier outputs. Three concrete limitations follow:

No selective access: attention vs. MLP layers receive the same aggregated state despite potentially benefiting from different depth-wise mixtures.
Irreversible loss: information lost through accumulation cannot be selectively retrieved by deeper layers.
Output magnitude growth: later layers are compelled to produce increasingly large outputs to remain influential over the growing residual sum, which can destabilize training.
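The dilution effect can be illustrated with a toy simulation. This is only a sketch: the sublayer outputs are independent random vectors (an assumption made here, not real attention or FFN outputs), so the residual norm grows roughly with the square root of depth, while strongly correlated outputs would approach linear growth.

```python
import torch

torch.manual_seed(0)
d, depth = 256, 64

x = torch.randn(d)          # stand-in for the token embedding
ratios = []                 # ||sublayer output|| / ||residual stream|| per layer

for _ in range(depth):
    out = torch.randn(d)    # stand-in sublayer output with roughly constant norm
    x = x + out             # standard residual: fixed weight 1
    ratios.append((out.norm() / x.norm()).item())

print(f"layer 1 share: {ratios[0]:.2f}, layer {depth} share: {ratios[-1]:.2f}")
```

The share of a single layer's output in the residual stream shrinks as depth grows, which is why later layers need larger outputs to stay influential.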

Why this belongs in PyTorch core (not just a third-party library):

The efficiency of the scalable variant (Block AttnRes) requires non-trivial infrastructure: a two-phase inference schedule, online softmax merging, and (for distributed training) cross-stage caching that reduces pipeline communication from O(Ld) to O(Nd). These are hard to implement correctly and efficiently without access to PyTorch internals, and they are exactly the kind of primitive that benefits from a well-tested, maintained core implementation.
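For readers unfamiliar with online softmax merging, here is a minimal sketch of the generic identity: two partial softmax results over disjoint source sets can be combined exactly using their running maxima and exponent sums. The issue does not specify how the paper's two-phase schedule splits the sources, so the 4 + 2 split below is arbitrary and the helper names are hypothetical.

```python
import torch


def partial_softmax(logits, values):
    """Summarize one chunk of sources: (max logit, sum of exp, unnormalized weighted sum)."""
    m = logits.max(dim=0).values                     # (B, T)
    p = torch.exp(logits - m)                        # (n, B, T), numerically stable
    acc = (p.unsqueeze(-1) * values).sum(dim=0)      # (B, T, d)
    return m, p.sum(dim=0), acc


def merge(a, b):
    """Combine two chunk summaries as if softmax had been taken over all sources."""
    m_a, s_a, acc_a = a
    m_b, s_b, acc_b = b
    m = torch.maximum(m_a, m_b)
    scale_a, scale_b = torch.exp(m_a - m), torch.exp(m_b - m)
    s = s_a * scale_a + s_b * scale_b
    acc = acc_a * scale_a.unsqueeze(-1) + acc_b * scale_b.unsqueeze(-1)
    return m, s, acc


torch.manual_seed(0)
logits = torch.randn(6, 2, 5)       # (sources, B, T)
values = torch.randn(6, 2, 5, 8)    # (sources, B, T, d)

m, s, acc = merge(
    partial_softmax(logits[:4], values[:4]),
    partial_softmax(logits[4:], values[4:]),
)
merged = acc / s.unsqueeze(-1)

# Reference: softmax over all six sources at once
reference = (torch.softmax(logits, dim=0).unsqueeze(-1) * values).sum(dim=0)
print(torch.allclose(merged, reference, atol=1e-5))
```

Because each chunk is reduced to three small tensors, cached block results could in principle be merged with freshly computed ones without restacking every source.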

Evidence of effectiveness (from the paper, validated at scale):

Consistent validation loss improvement across all 5 model sizes tested
At 5.6 PFLOP/s-days compute budget: Block AttnRes matches baseline trained with 1.25× more compute
Evaluated on a 48B-parameter MoE model pretrained on 1.4T tokens; AttnRes improves over the baseline on all 15 downstream benchmarks (MMLU, GPQA, HumanEval, MATH, etc.)
Training overhead: < 4% under pipeline parallelism, negligible otherwise
Inference latency overhead: < 2%

Proposed API

The simplest composable interface — a standalone nn.Module that wraps any nn.TransformerEncoder:

import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=12,
)

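residual = nn.BlockAttentionResidual(
    d_model=512,
    num_layers=24,    # sublayers: 12 encoder layers x (attn + ffn); must be divisible by num_blocks
    num_blocks=8,     # N; paper finds N≈8 recovers most of Full AttnRes gain
)

# Usage: residual manages layer output history, encoder layers delegate to it.
# This is the proposed call signature; the reference implementation's forward
# takes (embedding, sublayers) instead.
src = torch.randn(2, 64, 512)
out = residual(encoder, src)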

Alternatively, as an opt-in parameter on nn.TransformerEncoder itself:

encoder = nn.TransformerEncoder(
    layer,
    num_layers=12,
    residual="block_attention",   # "standard" (default) | "block_attention"
    attn_residual_num_blocks=8,
)

I lean toward the standalone module approach as it avoids adding parameters to TransformerEncoder, consistent with the direction noted in the docs:

"Given the fast pace of innovation in transformer-like architectures, we recommend exploring building blocks in core..."

Reference Implementation

Below is a self-contained, pure-PyTorch implementation of BlockAttentionResidual faithful to the paper (Algorithm 1 / Figure 2). It is intentionally written for clarity over performance; a PR would add proper kernel fusion and the two-phase inference schedule.

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List


class BlockAttentionResidual(nn.Module):
    """
    Block Attention Residuals (Block AttnRes) as described in:
        "Attention Residuals", Kimi Team, 2025.
        https://github.com/MoonshotAI/Attention-Residuals

    Replaces standard additive residual connections in a Transformer stack
    with learned, input-dependent softmax attention over block-level
    summaries of preceding layer outputs.

    Instead of:
        h_l = h_{l-1} + f_{l-1}(h_{l-1})          # uniform weight = 1

    Each layer computes:
        h_l = Σ_{i} α_{i→l} · v_i                  # softmax weights α

    where v_i are block-level summaries of earlier layer outputs and
    α_{i→l} = softmax( w_l^T · RMSNorm(v_i) ) for a learned pseudo-query w_l.

    Layers are grouped into N blocks of size S = num_layers / num_blocks.
    Within each block, layer outputs are summed into a single representation b_n.
    Cross-block attention is then applied over [b_0, b_1, ..., b_{n-1}] plus
    the evolving intra-block partial sum, reducing memory from O(Ld) to O(Nd).

    Args:
        d_model:    Hidden dimension of the transformer.
        num_layers: Total number of transformer sublayers (attn + ffn each count
                    as one sublayer, so a standard TransformerEncoderLayer with
                    one attn + one ffn = 2 sublayers).
        num_blocks: Number of blocks N. Paper finds N≈8 recovers most of the
                    gain of Full AttnRes across scales. Must divide num_layers.

    Complexity:
        Memory:     O(N · T · d) for block representations vs O(L · T · d) for
                    Full AttnRes (T = sequence length).
        Training:   Negligible overhead without pipeline parallelism; <4% with it.
        Inference:  <2% latency overhead on typical workloads.

    Notes:
        - All pseudo-query vectors w_l are zero-initialized, so attention weights
          are uniform at the start of training: each layer input is the mean of
          its sources, i.e. the standard residual sum scaled by 1/num_sources.
          This is critical for training stability (validated in the paper).
        - RMSNorm on keys prevents layers with large-magnitude outputs from
          dominating the softmax. Especially important for block-level
          representations that accumulate over multiple layers.
        - The token embedding is always included as source b_0, allowing
          any layer to attend back to the original input.
    """

    def __init__(self, d_model: int, num_layers: int, num_blocks: int = 8) -> None:
        super().__init__()

        if num_layers % num_blocks != 0:
            raise ValueError(
                f"num_layers ({num_layers}) must be divisible by num_blocks ({num_blocks})"
            )

        self.d_model = d_model
        self.num_layers = num_layers
        self.num_blocks = num_blocks
        self.block_size = num_layers // num_blocks  # S

        # One learned pseudo-query per sublayer, zero-initialized (critical).
        # Shape: (num_layers, d_model)
        self.pseudo_queries = nn.Parameter(torch.zeros(num_layers, d_model))

        # One RMSNorm per sublayer for key normalization.
        self.key_norms = nn.ModuleList([
            nn.RMSNorm(d_model) for _ in range(num_layers)
        ])

    def _attn_residual_op(
        self,
        layer_idx: int,
        block_reps: List[torch.Tensor],
        partial_block: torch.Tensor,
    ) -> torch.Tensor:
        """
        Compute the attention-weighted mixture of block representations
        for a single sublayer.

        Args:
            layer_idx:    Global sublayer index (0-based).
            block_reps:   List of completed block representations [b_0, ..., b_{n-1}].
                          b_0 is always the token embedding.
            partial_block: Intra-block partial sum b_n^i (current block so far).

        Returns:
            h_l: Input to sublayer `layer_idx`, shape (B, T, d_model).
        """
        # Stack all sources: completed blocks + current partial sum
        # V: (num_sources, B, T, d_model)
        sources = block_reps + [partial_block]
        V = torch.stack(sources, dim=0)

        # Normalize keys (prevents magnitude-dominant blocks from winning)
        norm = self.key_norms[layer_idx]
        K = norm(V)  # (num_sources, B, T, d_model)

        # Compute attention logits with the learned pseudo-query w_l
        # w_l: (d_model,)  →  logits: (num_sources, B, T)
        w_l = self.pseudo_queries[layer_idx]
        logits = torch.einsum("d, n b t d -> n b t", w_l, K)

        # Softmax over the source (depth) dimension
        weights = torch.softmax(logits, dim=0)  # (num_sources, B, T)

        # Weighted sum
        h = torch.einsum("n b t, n b t d -> b t d", weights, V)
        return h

    def forward(
        self,
        embedding: torch.Tensor,
        sublayers: nn.ModuleList,
        **sublayer_kwargs,
    ) -> torch.Tensor:
        """
        Run all sublayers with Block AttnRes residual connections.

        Args:
            embedding:       Token embeddings, shape (B, T, d_model).
                             Treated as b_0 — always available as a source.
            sublayers:       nn.ModuleList of sublayer callables (each takes a
                             tensor and returns a tensor of the same shape).
                             Typically the attention and FFN blocks of a
                             TransformerEncoderLayer, unrolled into a flat list.
            **sublayer_kwargs: Passed through to each sublayer (e.g., attn_mask).

        Returns:
            Output tensor of shape (B, T, d_model).
        """
        assert len(sublayers) == self.num_layers, (
            f"Expected {self.num_layers} sublayers, got {len(sublayers)}"
        )

        # b_0 is always the token embedding — allows any layer to attend back to input
        block_reps: List[torch.Tensor] = [embedding]
        partial_block: torch.Tensor = torch.zeros_like(embedding)

        for layer_idx, sublayer in enumerate(sublayers):
            # Check if we are at a block boundary (start of new block)
            pos_in_block = layer_idx % self.block_size

            if pos_in_block == 0 and layer_idx > 0:
                # Completed the previous block — save its representation and reset
                block_reps.append(partial_block)
                partial_block = torch.zeros_like(embedding)

            # Compute attention-weighted input for this sublayer
            h = self._attn_residual_op(layer_idx, block_reps, partial_block)

            # Apply the sublayer (e.g., self-attention or FFN)
            sublayer_out = sublayer(h, **sublayer_kwargs)

            # Accumulate into the current block's partial sum
            partial_block = partial_block + sublayer_out

        return partial_block
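To make the block bookkeeping in `forward` easier to follow, the sketch below replays the same loop without tensors and records what each sublayer would attend over. `trace_sources` is a hypothetical helper written for this illustration only.

```python
def trace_sources(num_layers: int, num_blocks: int):
    """Mirror the forward loop's bookkeeping: (layer_idx, completed sources, outputs in partial block)."""
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks")
    block_size = num_layers // num_blocks

    completed = 1      # b_0, the token embedding
    in_partial = 0     # sublayer outputs summed into the current partial block
    rows = []
    for layer_idx in range(num_layers):
        if layer_idx % block_size == 0 and layer_idx > 0:
            completed += 1     # previous block is finalized and appended
            in_partial = 0     # partial sum is reset to zeros
        rows.append((layer_idx, completed, in_partial))
        in_partial += 1        # this sublayer's output joins the partial sum
    return rows


rows = trace_sources(num_layers=12, num_blocks=6)
for layer_idx, completed, in_partial in rows:
    print(f"sublayer {layer_idx:2d}: {completed} completed sources "
          f"+ partial block holding {in_partial} outputs")
```

Two properties of the code as written show up in the trace. The first sublayer of every block attends over a partial block that is still all zeros, so that empty source still receives softmax weight. The value returned by `forward` is the last partial block alone, not a mixture over all blocks. Whether either matches the paper's algorithm is not established by this issue.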

Minimal usage example (nn.Linear placeholders stand in for the attention and FFN sublayers of nn.TransformerEncoderLayer):

import torch
import torch.nn as nn

d_model, nhead, num_layers, num_blocks = 512, 8, 12, 6
batch_size, seq_len = 2, 64

# Build sublayers as a flat ModuleList (attn + ffn per TransformerEncoderLayer)
# In a real integration this would hook into TransformerEncoderLayer internals.
sublayers = nn.ModuleList([
    nn.Linear(d_model, d_model)   # placeholder; real layers are attn/ffn blocks
    for _ in range(num_layers)
])

attn_res = BlockAttentionResidual(d_model=d_model, num_layers=num_layers, num_blocks=num_blocks)

embedding = torch.randn(batch_size, seq_len, d_model)
output = attn_res(embedding, sublayers)
print(output.shape)  # torch.Size([2, 64, 512])
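The placeholders above could be replaced by real attention and FFN sublayers along the following lines. This is a sketch, not the integration the RFC proposes: `AttnSublayer`, `FFNSublayer`, and `build_sublayers` are hypothetical names, and applying a pre-norm inside each sublayer is an assumption made here.

```python
import torch
import torch.nn as nn


class AttnSublayer(nn.Module):
    """Hypothetical wrapper: pre-norm self-attention mapping (B, T, d) -> (B, T, d)."""

    def __init__(self, d_model: int, nhead: int) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)
        out, _ = self.attn(y, y, y, need_weights=False)
        return out  # no "+ x": the residual operator owns the skip path


class FFNSublayer(nn.Module):
    """Hypothetical wrapper: pre-norm feed-forward block."""

    def __init__(self, d_model: int, dim_feedforward: int) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.net = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(self.norm(x))


def build_sublayers(d_model: int, nhead: int, dim_feedforward: int,
                    num_encoder_layers: int) -> nn.ModuleList:
    """Unroll encoder layers into a flat [attn, ffn, attn, ffn, ...] list."""
    layers = []
    for _ in range(num_encoder_layers):
        layers += [AttnSublayer(d_model, nhead), FFNSublayer(d_model, dim_feedforward)]
    return nn.ModuleList(layers)


sublayers = build_sublayers(d_model=64, nhead=4, dim_feedforward=128, num_encoder_layers=6)
x = torch.randn(2, 10, 64)
y = sublayers[1](sublayers[0](x))
print(len(sublayers), y.shape)
```

Six encoder layers unroll into twelve sublayers, so this list would pair with `BlockAttentionResidual(d_model=64, num_layers=12, num_blocks=6)` from the reference implementation.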

Scope of a potential PR

A first PR would include:

torch/nn/modules/attention_residual.py — BlockAttentionResidual module
torch/nn/modules/__init__.py — export
test/nn/test_attention_residual.py — unit tests (correctness, zero-init stability, shape checks, num_layers % num_blocks != 0 error)
docs/source/nn.rst — API documentation entry
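As an illustration of what the zero-init stability test might assert, the sketch below checks that a zero pseudo-query yields uniform weights and still receives a gradient. It uses a standalone function with a parameter-free RMS normalization rather than the proposed module; both are simplifications made here.

```python
import torch


def attn_residual_op(sources, w, eps: float = 1e-6):
    """Depth-wise softmax mixture over a list of (B, T, d) sources with pseudo-query w."""
    v = torch.stack(sources, dim=0)                                   # (n, B, T, d)
    k = v * torch.rsqrt(v.pow(2).mean(dim=-1, keepdim=True) + eps)    # RMS-normalized keys
    weights = torch.softmax(torch.einsum("d,nbtd->nbt", w, k), dim=0)
    return torch.einsum("nbt,nbtd->btd", weights, v), weights


torch.manual_seed(0)
sources = [torch.randn(2, 5, 16) for _ in range(3)]
w = torch.zeros(16, requires_grad=True)      # zero-initialized pseudo-query

h, weights = attn_residual_op(sources, w)
uniform = torch.allclose(weights, torch.full_like(weights, 1.0 / 3.0))
is_mean = torch.allclose(h, torch.stack(sources, dim=0).mean(dim=0), atol=1e-6)

# Zero init must not block learning: the pseudo-query still gets a gradient
h.sum().backward()
print(uniform, is_mean, w.grad.abs().sum().item() > 0)
```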

Out of scope for v1 (follow-up PRs):

CUDA kernel for the two-phase inference schedule
Cross-stage caching for pipeline parallelism
Full AttnRes variant (O(Ld) memory, practical only without activation recomputation)
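For contrast with the block variant, a Full AttnRes loop might look like the sketch below, where every sublayer output is kept as its own source. This is one reading of the issue's description, not the paper's algorithm; `full_attn_res` is a hypothetical name, and how the final hidden state is read out is left open by returning all outputs.

```python
import torch
import torch.nn as nn


def full_attn_res(embedding, sublayers, pseudo_queries, eps: float = 1e-6):
    """Each sublayer attends over the embedding and all earlier sublayer outputs."""
    outputs = [embedding]                       # grows to L + 1 tensors: O(L * T * d) memory
    for w_l, sublayer in zip(pseudo_queries, sublayers):
        v = torch.stack(outputs, dim=0)         # (num_sources, B, T, d)
        k = v * torch.rsqrt(v.pow(2).mean(dim=-1, keepdim=True) + eps)
        alpha = torch.softmax(torch.einsum("d,nbtd->nbt", w_l, k), dim=0)
        h = torch.einsum("nbt,nbtd->btd", alpha, v)
        outputs.append(sublayer(h))
    return outputs


torch.manual_seed(0)
num_layers, d_model = 6, 16
sublayers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_layers)])
pseudo_queries = torch.zeros(num_layers, d_model)   # zero init, as in the block variant

outputs = full_attn_res(torch.randn(2, 5, d_model), sublayers, pseudo_queries)
print(len(outputs), outputs[-1].shape)
```

The list of sources grows by one tensor per sublayer, which is the O(Ld) cost that the block variant caps at N + 1 sources.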

Alternatives considered

| Alternative | Why not |
| --- | --- |
| Keep in ecosystem only | The two-phase inference schedule and pipeline caching require PyTorch internals; a correct, performant implementation is non-trivial to build externally |
| Extend `TransformerEncoder` with a flag | More invasive API change; standalone module is composable with any architecture |
| Full AttnRes instead of Block | O(Ld) memory is prohibitive at scale; Block AttnRes with N≈8 recovers most of the gain (paper: loss gap shrinks to 0.001 at largest scale) |
| DenseFormer-style fixed scalar weights | Ablations in the paper show input-independent mixing provides no gain over baseline (1.767 vs 1.766); input-dependent softmax is what drives improvement |
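The difference between input-independent and input-dependent mixing in the last row can be seen from the shapes of the weights. In this sketch the static scalars are random stand-ins for learned parameters, and key normalization is omitted for brevity.

```python
import torch

torch.manual_seed(0)
v = torch.randn(3, 2, 4, 8)      # (sources, B, T, d)

# Input-independent mixing: one scalar per source, shared by every token
static_w = torch.randn(3)        # stand-in for learned scalars
h_static = torch.einsum("n,nbtd->btd", static_w, v)

# Input-dependent mixing: weights are computed from the sources themselves,
# so each token position gets its own distribution over depth
w = torch.randn(8)               # stand-in for a learned pseudo-query
dyn_w = torch.softmax(torch.einsum("d,nbtd->nbt", w, v), dim=0)
h_dyn = torch.einsum("nbt,nbtd->btd", dyn_w, v)

print(static_w.shape, dyn_w.shape)
```

The static weights have one entry per source, while the softmax weights carry batch and sequence dimensions and therefore vary from token to token.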

References

Paper: Attention Residuals, Kimi Team, 2025 — https://github.com/MoonshotAI/Attention-Residuals
PreNorm dilution: Xiong et al., On Layer Normalization in the Transformer Architecture, 2020
PyTorch TransformerEncoderLayer docs note on fast innovation pace: https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html

I'm happy to work on this. Before writing a full PR, I'd like to confirm:

Is this in scope for torch.nn core, or would maintainers prefer it lives in a separate package (e.g., torchvision / ecosystem)?
Is the standalone-module API preferred over an opt-in flag on TransformerEncoder?
Any concerns about the num_layers being a constructor argument rather than inferred at forward time?

Alternatives

No response

Additional context

No response

cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki

extent analysis

Fix Plan

To implement the proposed BlockAttentionResidual module in PyTorch, follow these steps:

Create a new Python file attention_residual.py in the torch/nn/modules directory.
Define the BlockAttentionResidual class with the specified __init__, _attn_residual_op, and forward methods.
Initialize the pseudo_queries and key_norms attributes in the __init__ method.
Implement the attention residual operation in the _attn_residual_op method.
Define the forward method to apply the attention residual connections to the input tensor.

Example code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockAttentionResidual(nn.Module):
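    def __init__(self, d_model, num_layers, num_blocks):
        super().__init__()
        # Reject configurations where layers cannot be split into equal blocks
        if num_layers % num_blocks != 0:
            raise ValueError(
                f"num_layers ({num_layers}) must be divisible by num_blocks ({num_blocks})"
            )
        self.d_model = d_model
        self.num_layers = num_layers
        self.num_blocks = num_blocks
        self.block_size = num_layers // num_blocks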
        self.pseudo_queries = nn.Parameter(torch.zeros(num_layers, d_model))
        self.key_norms = nn.ModuleList([nn.RMSNorm(d_model) for _ in range(num_layers)])

    def _attn_residual_op(self, layer_idx, block_reps, partial_block):
        # Compute attention logits and weights
        sources = block_reps + [partial_block]
        V = torch.stack(sources, dim=0)
        K = self.key_norms[layer_idx](V)
        w_l = self.pseudo_queries[layer_idx]
        logits = torch.einsum("d, n b t d -> n b t", w_l, K)
        weights = torch.softmax(logits, dim=0)
        # Compute attention-weighted sum
        h = torch.einsum("n b t, n b t d -> b t d", weights, V)
        return h

    def forward(self, embedding, sublayers, **sublayer_kwargs):
        block_reps = [embedding]
        partial_block = torch.zeros_like(embedding)
        for layer_idx, sublayer in enumerate(sublayers):
            pos_in_block = layer_idx % self.block_size
            if pos_in_block == 0 and layer_idx > 0:
                block_reps.append(partial_block)
                partial_block = torch.zeros_like(embedding)
            h = self._attn_residual_op(layer_idx, block_reps, partial_block)
            sublayer_out = sublayer(h, **sublayer_kwargs)
            partial_block = partial_block + sublayer_out
        return partial_block

Verification

To verify the implementation, create a test case that applies the BlockAttentionResidual module to a sample input tensor and checks the output shape and values.

Example test code:

import torch
import unittest

class TestBlockAttentionResidual(unittest
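The example test above is cut off in the source. A minimal sketch of what such a test could check is given below, assuming a condensed copy of the module with a divisibility check and a parameter-free RMS normalization in place of nn.RMSNorm. Both are simplifications made here, and the test names are hypothetical.

```python
import unittest

import torch
import torch.nn as nn


class BlockAttentionResidual(nn.Module):
    """Condensed copy of the Fix Plan module, for testing purposes only."""

    def __init__(self, d_model, num_layers, num_blocks):
        super().__init__()
        if num_layers % num_blocks != 0:
            raise ValueError("num_layers must be divisible by num_blocks")
        self.block_size = num_layers // num_blocks
        self.pseudo_queries = nn.Parameter(torch.zeros(num_layers, d_model))

    def _mix(self, layer_idx, sources):
        v = torch.stack(sources, dim=0)
        k = v * torch.rsqrt(v.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
        logits = torch.einsum("d,nbtd->nbt", self.pseudo_queries[layer_idx], k)
        return torch.einsum("nbt,nbtd->btd", torch.softmax(logits, dim=0), v)

    def forward(self, embedding, sublayers):
        block_reps, partial = [embedding], torch.zeros_like(embedding)
        for layer_idx, sublayer in enumerate(sublayers):
            if layer_idx % self.block_size == 0 and layer_idx > 0:
                block_reps.append(partial)
                partial = torch.zeros_like(embedding)
            h = self._mix(layer_idx, block_reps + [partial])
            partial = partial + sublayer(h)
        return partial


class TestBlockAttentionResidual(unittest.TestCase):
    def test_output_shape(self):
        module = BlockAttentionResidual(d_model=16, num_layers=4, num_blocks=2)
        sublayers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
        out = module(torch.randn(2, 5, 16), sublayers)
        self.assertEqual(tuple(out.shape), (2, 5, 16))

    def test_indivisible_layers_raise(self):
        with self.assertRaises(ValueError):
            BlockAttentionResidual(d_model=16, num_layers=12, num_blocks=8)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBlockAttentionResidual)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The suite is run through `TextTestRunner` rather than `unittest.main()` so that it does not exit the interpreter when embedded in a larger script.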
