pytorch: [RFC] nn.BlockAttentionResidual: depth-wise softmax attention over layer outputs as a drop-in residual replacement [3 comments, 3 participants]

pytorch/pytorch#177537 · Fetched 2026-04-08 00:47:33


🚀 The feature, motivation and pitch

🚀 Feature Request / RFC

Summary

Standard residual connections in nn.TransformerEncoder accumulate all layer outputs with fixed unit weights: x = x + sublayer(x). This uniform aggregation causes hidden-state magnitudes to grow as O(L) with depth under PreNorm, progressively diluting each layer's relative contribution — a phenomenon documented as "PreNorm dilution" [Xiong et al., 2020; Li et al., 2026].

This RFC proposes adding nn.BlockAttentionResidual, a drop-in residual operator based on the Attention Residuals (AttnRes) paper published today by the Kimi Team (Moonshot AI): https://github.com/MoonshotAI/Attention-Residuals

The core idea: replace the fixed accumulation h_l = Σ v_i with learned, input-dependent softmax attention over preceding layer outputs: h_l = Σ α_{i→l} · v_i, where α_{i→l} are softmax weights computed from a single learned pseudo-query w_l ∈ ℝ^d per layer.
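
The mixing rule above can be illustrated for a single token position in plain Python. This is a minimal sketch, not the paper's implementation: the helper names (`softmax`, `rms_norm`, `depth_mix`) and the toy vectors are assumptions made for illustration, and the RMSNorm here has no learned gain.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rms_norm(v, eps=1e-6):
    """RMSNorm without a learned gain: v / sqrt(mean(v^2) + eps)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def depth_mix(w, sources):
    """Return (alpha, h) with alpha = softmax_i(w . RMSNorm(v_i))
    and h = sum_i alpha_i * v_i."""
    logits = [sum(wi * ki for wi, ki in zip(w, rms_norm(v))) for v in sources]
    alpha = softmax(logits)
    d = len(sources[0])
    h = [sum(a * v[j] for a, v in zip(alpha, sources)) for j in range(d)]
    return alpha, h

# Toy layer outputs v_0, v_1, v_2 for one token (hypothetical values)
sources = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

# Zero pseudo-query: every logit is 0, so the weights are uniform
alpha0, h0 = depth_mix([0.0, 0.0], sources)

# Non-zero pseudo-query: sources aligned with w receive more weight
alpha1, h1 = depth_mix([2.0, 0.0], sources)
```

With the zero pseudo-query the result is the plain average of the sources, which is the unit-weight residual sum divided by the number of sources. Once w_l moves away from zero, the weights depend on the normalized layer outputs themselves, which is what makes the mixture input-dependent.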


Motivation

The problem with current nn.TransformerEncoder:

# Current behaviour — every layer in TransformerEncoder:
x = x + self._sa_block(self.norm1(x), ...)   # fixed weight: 1
x = x + self._ff_block(self.norm2(x))        # fixed weight: 1

Each layer only sees its immediate predecessor h_{l-1}, a single compressed state that conflates all earlier outputs. Three concrete limitations follow:

  1. No selective access: attention vs. MLP layers receive the same aggregated state despite potentially benefiting from different depth-wise mixtures.
  2. Irreversible loss: information lost through accumulation cannot be selectively retrieved by deeper layers.
  3. Output magnitude growth: later layers are compelled to produce increasingly large outputs to remain influential over the growing residual sum, which can destabilize training.

Why this belongs in PyTorch core (not just a third-party library):

The efficiency of the scalable variant (Block AttnRes) requires non-trivial infrastructure: a two-phase inference schedule, online softmax merging, and (for distributed training) cross-stage caching that reduces pipeline communication from O(Ld) to O(Nd). These are hard to implement correctly and efficiently without access to PyTorch internals, and they are exactly the kind of primitive that benefits from a well-tested, maintained core implementation.
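
To make "online softmax merging" concrete, here is the generic merge identity in plain Python. It shows that a softmax-weighted sum over all sources can be assembled from two independently computed chunks by rescaling to a shared maximum. How the paper actually splits its two phases is not described in this issue, so the chunking below is an assumption, and the function names are hypothetical.

```python
import math

def partial_softmax_sum(logits, values):
    """Summarise one chunk as (max_logit, denom, weighted_sum),
    with denom and weighted_sum measured relative to max_logit."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    denom = sum(exps)
    acc = sum(e * v for e, v in zip(exps, values))
    return m, denom, acc

def merge(a, b):
    """Merge two chunk summaries by rescaling both to a shared max."""
    m_a, d_a, s_a = a
    m_b, d_b, s_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, d_a * scale_a + d_b * scale_b, s_a * scale_a + s_b * scale_b

def finalize(summary):
    """Turn a summary into the softmax-weighted average."""
    _, denom, acc = summary
    return acc / denom

logits = [0.5, -1.0, 2.0, 0.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]   # scalar stand-ins for the v_i

# One-shot reference over all five sources
full = finalize(partial_softmax_sum(logits, values))

# Two chunks computed separately, then merged
phase1 = partial_softmax_sum(logits[:3], values[:3])
phase2 = partial_softmax_sum(logits[3:], values[3:])
merged = finalize(merge(phase1, phase2))
```

Because only the running maximum, denominator, and weighted sum need to be kept per chunk, the earlier sources do not have to stay resident once their summary has been computed.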

Evidence of effectiveness (from the paper, validated at scale):

  • Consistent validation loss improvement across all 5 model sizes tested
  • At a compute budget of 5.6 PFLOP/s-days, Block AttnRes matches a baseline trained with 1.25× more compute
  • Evaluated on a 48B-parameter MoE model pretrained on 1.4T tokens; AttnRes improves over the baseline on all 15 downstream benchmarks (MMLU, GPQA, HumanEval, MATH, etc.)
  • Training overhead: < 4% under pipeline parallelism, negligible otherwise
  • Inference latency overhead: < 2%

Proposed API

The simplest composable interface — a standalone nn.Module that wraps any nn.TransformerEncoder:

import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=12,
)

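residual = nn.BlockAttentionResidual(   # proposed module; not part of torch.nn today
    d_model=512,
    num_layers=24,    # counted in sublayers: 12 encoder layers × (attn + ffn)
    num_blocks=8,     # N; paper finds N≈8 recovers most of Full AttnRes gain
)

# Usage: residual manages layer output history, encoder layers delegate to it.
# Illustrative call only; the reference implementation below takes
# (embedding, sublayers) rather than (encoder, src).
src = torch.randn(2, 64, 512)
out = residual(encoder, src)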

Alternatively, as an opt-in parameter on nn.TransformerEncoder itself:

encoder = nn.TransformerEncoder(
    layer,
    num_layers=12,
    residual="block_attention",   # "standard" (default) | "block_attention"
    attn_residual_num_blocks=8,
)

I lean toward the standalone module approach as it avoids adding parameters to TransformerEncoder, consistent with the direction noted in the docs:

"Given the fast pace of innovation in transformer-like architectures, we recommend exploring building blocks in core..."


Reference Implementation

Below is a self-contained, pure-PyTorch implementation of BlockAttentionResidual faithful to the paper (Algorithm 1 / Figure 2). It is intentionally written for clarity over performance; a PR would add proper kernel fusion and the two-phase inference schedule.

import torch
import torch.nn as nn
from typing import List


class BlockAttentionResidual(nn.Module):
    """
    Block Attention Residuals (Block AttnRes) as described in:
        "Attention Residuals", Kimi Team, 2025.
        https://github.com/MoonshotAI/Attention-Residuals

    Replaces standard additive residual connections in a Transformer stack
    with learned, input-dependent softmax attention over block-level
    summaries of preceding layer outputs.

    Instead of:
        h_l = h_{l-1} + f_{l-1}(h_{l-1})          # uniform weight = 1

    Each layer computes:
        h_l = Σ_{i} α_{i→l} · v_i                  # softmax weights α

    where v_i are block-level summaries of earlier layer outputs and
    α_{i→l} = softmax( w_l^T · RMSNorm(v_i) ) for a learned pseudo-query w_l.

    Layers are grouped into N blocks of size S = num_layers / num_blocks.
    Within each block, layer outputs are summed into a single representation b_n.
    Cross-block attention is then applied over [b_0, b_1, ..., b_{n-1}] plus
    the evolving intra-block partial sum, reducing memory from O(Ld) to O(Nd).

    Args:
        d_model:    Hidden dimension of the transformer.
        num_layers: Total number of transformer sublayers (attn + ffn each count
                    as one sublayer, so a standard TransformerEncoderLayer with
                    one attn + one ffn = 2 sublayers).
        num_blocks: Number of blocks N. Paper finds N≈8 recovers most of the
                    gain of Full AttnRes across scales. Must divide num_layers.

    Complexity:
        Memory:     O(N · T · d) for block representations vs O(L · T · d) for
                    Full AttnRes (T = sequence length).
        Training:   Negligible overhead without pipeline parallelism; <4% with it.
        Inference:  <2% latency overhead on typical workloads.

    Notes:
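        - All pseudo-query vectors w_l are zero-initialized, so every logit is 0
          and the softmax weights start out uniform: each sublayer initially
          receives an unweighted average of its sources, i.e. the standard
          residual sum up to a constant scale.
          This is critical for training stability (validated in the paper).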
        - RMSNorm on keys prevents layers with large-magnitude outputs from
          dominating the softmax. Especially important for block-level
          representations that accumulate over multiple layers.
        - The token embedding is always included as source b_0, allowing
          any layer to attend back to the original input.
    """

    def __init__(self, d_model: int, num_layers: int, num_blocks: int = 8) -> None:
        super().__init__()

        if num_layers % num_blocks != 0:
            raise ValueError(
                f"num_layers ({num_layers}) must be divisible by num_blocks ({num_blocks})"
            )

        self.d_model = d_model
        self.num_layers = num_layers
        self.num_blocks = num_blocks
        self.block_size = num_layers // num_blocks  # S

        # One learned pseudo-query per sublayer, zero-initialized (critical).
        # Shape: (num_layers, d_model)
        self.pseudo_queries = nn.Parameter(torch.zeros(num_layers, d_model))

        # One RMSNorm per sublayer for key normalization.
        self.key_norms = nn.ModuleList([
            nn.RMSNorm(d_model) for _ in range(num_layers)
        ])

    def _attn_residual_op(
        self,
        layer_idx: int,
        block_reps: List[torch.Tensor],
        partial_block: torch.Tensor,
    ) -> torch.Tensor:
        """
        Compute the attention-weighted mixture of block representations
        for a single sublayer.

        Args:
            layer_idx:    Global sublayer index (0-based).
            block_reps:   List of completed block representations [b_0, ..., b_{n-1}].
                          b_0 is always the token embedding.
            partial_block: Intra-block partial sum b_n^i (current block so far).

        Returns:
            h_l: Input to sublayer `layer_idx`, shape (B, T, d_model).
        """
        # Stack all sources: completed blocks + current partial sum
        # V: (num_sources, B, T, d_model)
        sources = block_reps + [partial_block]
        V = torch.stack(sources, dim=0)

        # Normalize keys (prevents magnitude-dominant blocks from winning)
        norm = self.key_norms[layer_idx]
        K = norm(V)  # (num_sources, B, T, d_model)

        # Compute attention logits with the learned pseudo-query w_l
        # w_l: (d_model,)  →  logits: (num_sources, B, T)
        w_l = self.pseudo_queries[layer_idx]
        logits = torch.einsum("d, n b t d -> n b t", w_l, K)

        # Softmax over the source (depth) dimension
        weights = torch.softmax(logits, dim=0)  # (num_sources, B, T)

        # Weighted sum
        h = torch.einsum("n b t, n b t d -> b t d", weights, V)
        return h

    def forward(
        self,
        embedding: torch.Tensor,
        sublayers: nn.ModuleList,
        **sublayer_kwargs,
    ) -> torch.Tensor:
        """
        Run all sublayers with Block AttnRes residual connections.

        Args:
            embedding:       Token embeddings, shape (B, T, d_model).
                             Treated as b_0 — always available as a source.
            sublayers:       nn.ModuleList of sublayer callables (each takes a
                             tensor and returns a tensor of the same shape).
                             Typically the attention and FFN blocks of a
                             TransformerEncoderLayer, unrolled into a flat list.
            **sublayer_kwargs: Passed through to each sublayer (e.g., attn_mask).

        Returns:
            Output tensor of shape (B, T, d_model).
        """
        assert len(sublayers) == self.num_layers, (
            f"Expected {self.num_layers} sublayers, got {len(sublayers)}"
        )

        # b_0 is always the token embedding — allows any layer to attend back to input
        block_reps: List[torch.Tensor] = [embedding]
        partial_block: torch.Tensor = torch.zeros_like(embedding)

        for layer_idx, sublayer in enumerate(sublayers):
            # Check if we are at a block boundary (start of new block)
            pos_in_block = layer_idx % self.block_size

            if pos_in_block == 0 and layer_idx > 0:
                # Completed the previous block — save its representation and reset
                block_reps.append(partial_block)
                partial_block = torch.zeros_like(embedding)

            # Compute attention-weighted input for this sublayer
            h = self._attn_residual_op(layer_idx, block_reps, partial_block)

            # Apply the sublayer (e.g., self-attention or FFN)
            sublayer_out = sublayer(h, **sublayer_kwargs)

            # Accumulate into the current block's partial sum
            partial_block = partial_block + sublayer_out

        return partial_block
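
The block bookkeeping in forward() can be traced without tensors. The sketch below mirrors only the loop's control flow and reports how many sources each sublayer attends over. `source_schedule` is a hypothetical helper written for this illustration and is not part of the proposal.

```python
def source_schedule(num_layers, num_blocks):
    """Return a list of (block_index, num_sources), one entry per sublayer.

    num_sources counts the embedding b_0, the completed blocks, and the
    current block's partial sum, mirroring the bookkeeping in forward().
    """
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks")
    block_size = num_layers // num_blocks
    completed = 1  # block_reps starts as [embedding]
    schedule = []
    for layer_idx in range(num_layers):
        if layer_idx % block_size == 0 and layer_idx > 0:
            completed += 1  # previous block's sum is appended to block_reps
        # + 1 for the partial sum of the block currently being built
        schedule.append((layer_idx // block_size, completed + 1))
    return schedule

schedule = source_schedule(num_layers=12, num_blocks=6)
counts = [n for _, n in schedule]
```

For 12 sublayers in 6 blocks the source count grows from 2 to 7, so it never exceeds num_blocks + 1. Note that in the reference loop the first sublayer of each block sees a partial sum that is still all zeros, and that zero tensor still takes part in the softmax as a source.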

Minimal usage example (nn.Linear placeholders stand in for the attention and FFN sublayers of nn.TransformerEncoderLayer):

import torch
import torch.nn as nn

d_model, nhead, num_layers, num_blocks = 512, 8, 12, 6
batch_size, seq_len = 2, 64

# Build sublayers as a flat ModuleList (attn + ffn per TransformerEncoderLayer)
# In a real integration this would hook into TransformerEncoderLayer internals.
sublayers = nn.ModuleList([
    nn.Linear(d_model, d_model)   # placeholder; real layers are attn/ffn blocks
    for _ in range(num_layers)
])

attn_res = BlockAttentionResidual(d_model=d_model, num_layers=num_layers, num_blocks=num_blocks)

embedding = torch.randn(batch_size, seq_len, d_model)
output = attn_res(embedding, sublayers)
print(output.shape)  # torch.Size([2, 64, 512])

Scope of a potential PR

A first PR would include:

  • torch/nn/modules/attention_residual.py: BlockAttentionResidual module
  • torch/nn/modules/__init__.py: export
  • test/nn/test_attention_residual.py: unit tests (correctness, zero-init stability, shape checks, error when num_layers % num_blocks != 0)
  • docs/source/nn.rst: API documentation entry
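
As a rough idea of what the divisibility test could look like, here is a dependency-free sketch using unittest. `resolve_block_size` is a hypothetical stand-in for the constructor check, and it assumes sublayers are counted as in the docstring (attention and FFN each count once). A real test would exercise the module itself.

```python
import unittest

def resolve_block_size(num_encoder_layers, num_blocks, sublayers_per_layer=2):
    """Hypothetical helper: count sublayers, validate, return block size S."""
    num_sublayers = num_encoder_layers * sublayers_per_layer
    if num_sublayers % num_blocks != 0:
        raise ValueError(
            f"num_layers ({num_sublayers}) must be divisible by "
            f"num_blocks ({num_blocks})"
        )
    return num_sublayers // num_blocks

class TestBlockConfig(unittest.TestCase):
    def test_divisible(self):
        # 12 encoder layers -> 24 sublayers -> 8 blocks of 3 sublayers
        self.assertEqual(resolve_block_size(12, 8), 3)

    def test_not_divisible(self):
        # 5 encoder layers -> 10 sublayers, which 8 does not divide
        with self.assertRaises(ValueError):
            resolve_block_size(5, 8)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestBlockConfig)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```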

Out of scope for v1 (follow-up PRs):

  • CUDA kernel for the two-phase inference schedule
  • Cross-stage caching for pipeline parallelism
  • Full AttnRes variant (O(Ld) memory, practical only without activation recomputation)

Alternatives considered

| Alternative | Why not |
| --- | --- |
| Keep in ecosystem only | The two-phase inference schedule and pipeline caching require PyTorch internals; a correct, performant implementation is non-trivial to build externally |
| Extend TransformerEncoder with a flag | More invasive API change; standalone module is composable with any architecture |
| Full AttnRes instead of Block | O(Ld) memory is prohibitive at scale; Block AttnRes with N≈8 recovers most of the gain (paper: loss gap shrinks to 0.001 at largest scale) |
| DenseFormer-style fixed scalar weights | Ablations in the paper show input-independent mixing provides no gain over baseline (1.767 vs 1.766); input-dependent softmax is what drives improvement |
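
The O(Ld) versus O(Nd) argument can be turned into a back-of-the-envelope count. This sketch takes the source counts directly from the big-O terms (L sublayer outputs versus N block summaries) and ignores constant offsets such as the embedding. The 2-byte element size and the toy shapes are assumptions, and `cached_bytes` is a hypothetical helper.

```python
def cached_bytes(num_sources, batch, seq_len, d_model, bytes_per_elem=2):
    """Bytes needed to keep num_sources tensors of shape (B, T, d)."""
    return num_sources * batch * seq_len * d_model * bytes_per_elem

L, N = 24, 8            # sublayers vs. blocks (assumed example sizes)
B, T, d = 2, 64, 512    # toy batch, sequence length, hidden size

full_attnres = cached_bytes(L, B, T, d)    # one source per sublayer
block_attnres = cached_bytes(N, B, T, d)   # one source per block
ratio = full_attnres / block_attnres
```

The ratio is simply L / N, so the saving is independent of batch size, sequence length, and hidden size.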

References


I'm happy to work on this. Before writing a full PR, I'd like to confirm:

  1. Is this in scope for torch.nn core, or would maintainers prefer it lives in a separate package (e.g., torchvision / ecosystem)?
  2. Is the standalone-module API preferred over an opt-in flag on TransformerEncoder?
  3. Any concerns about num_layers being a constructor argument rather than inferred at forward time?

Alternatives

No response

Additional context

No response

cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki

extent analysis

Fix Plan

To implement the proposed BlockAttentionResidual module in PyTorch, follow these steps:

  • Create a new Python file attention_residual.py in the torch/nn/modules directory.
  • Define the BlockAttentionResidual class with the specified __init__, _attn_residual_op, and forward methods.
  • Initialize the pseudo_queries and key_norms attributes in the __init__ method.
  • Implement the attention residual operation in the _attn_residual_op method.
  • Define the forward method to apply the attention residual connections to the input tensor.

Example code:

import torch
import torch.nn as nn

class BlockAttentionResidual(nn.Module):
    def __init__(self, d_model, num_layers, num_blocks):
        super().__init__()
        # Blocks must tile the sublayer stack exactly
        if num_layers % num_blocks != 0:
            raise ValueError(
                f"num_layers ({num_layers}) must be divisible by num_blocks ({num_blocks})"
            )
        self.d_model = d_model
        self.num_layers = num_layers
        self.num_blocks = num_blocks
        self.block_size = num_layers // num_blocks
        # One zero-initialized pseudo-query and one key RMSNorm per sublayer
        self.pseudo_queries = nn.Parameter(torch.zeros(num_layers, d_model))
        self.key_norms = nn.ModuleList([nn.RMSNorm(d_model) for _ in range(num_layers)])

    def _attn_residual_op(self, layer_idx, block_reps, partial_block):
        # Compute attention logits and weights
        sources = block_reps + [partial_block]
        V = torch.stack(sources, dim=0)
        K = self.key_norms[layer_idx](V)
        w_l = self.pseudo_queries[layer_idx]
        logits = torch.einsum("d, n b t d -> n b t", w_l, K)
        weights = torch.softmax(logits, dim=0)
        # Compute attention-weighted sum
        h = torch.einsum("n b t, n b t d -> b t d", weights, V)
        return h

    def forward(self, embedding, sublayers, **sublayer_kwargs):
        block_reps = [embedding]
        partial_block = torch.zeros_like(embedding)
        for layer_idx, sublayer in enumerate(sublayers):
            pos_in_block = layer_idx % self.block_size
            if pos_in_block == 0 and layer_idx > 0:
                block_reps.append(partial_block)
                partial_block = torch.zeros_like(embedding)
            h = self._attn_residual_op(layer_idx, block_reps, partial_block)
            sublayer_out = sublayer(h, **sublayer_kwargs)
            partial_block = partial_block + sublayer_out
        return partial_block

Verification

To verify the implementation, create a test case that applies the BlockAttentionResidual module to a sample input tensor and checks the output shape and values.

Example test code:

import torch
import unittest

class TestBlockAttentionResidual(unittest
