transformers - ✅(Solved) Fix [refactor] gpt-oss `eager_attention_forward` for modularity (Ex: models with dual eager attn: sink/no sink) [5 pull requests, 1 comments, 2 participants]

casinca · 2026-03-31T12:42:42Z

[transformers] PR 45142: refactor gpt-oss : rename eager attention forward to eager attention forward with sink - Repository: huggingface/transformers - Author… # PR #45142: refactor(gpt-oss): rename `eager_attention_forward` to `eager_attention_forward_with_sink` - Repository: huggingface/transformers - Author: casinca - State: closed | merged: False - Link: https://github.com/huggingface/transformers/pull/45142 ## Description (problem / solution / changelog) # What does this PR do?  Following: This would resolve #45141 For modularity and also linked to this PR #45144. Potentially re-using mulitple eagers for hybrid models (eg MiMo non-sink/sink eager attn) ## Code Agent Policy The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, **we ask that new users do not submit pure code agent PRs** at this time. You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents not to open any PRs or issues for the moment. PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this repeatedly or maliciously. This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, this policy is likely to be updated regularly in the near future. For more information, please read [`CONTRIBUTING.md`](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md). - [x] I confirm that this is not a pure code agent PR. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section? - [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [x] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  ## Changed files - `src/transformers/models/gpt_oss/modeling_gpt_oss.py` (modified, +2/-2) - `src/transformers/models/gpt_oss/modular_gpt_oss.py` (modified, +2/-2) --- # PR #45144: Add Xiaomi MiMo-V2 - Repository: huggingface/transformers - Author: casinca - State: open | merged: False - Link: https://github.com/huggingface/transformers/pull/45144 ## Description (problem /

transformers2026-03-31 12:42:42

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

huggingface/transformers#45141•Fetched 2026-04-08 01:57:47

View on GitHub

Comments

Participants

Timeline

Reactions

Author

casinca

Participants

casinca

Rocketknight1

Timeline (top)

mentioned ×3subscribed ×3cross-referenced ×2closed ×1

Error Message

I'm asking this because if I import both with aliases the modular converter triggers name error for the modeling file.

Root Cause

I'm asking this because if I import both with aliases the modular converter triggers name error for the modeling file.

Fix Action

Fix / Workaround

So that the full attention layers can benefit from SDPA/FLA2/flex backends without being entangled with incompatible sink + SWA layers. Besides speed, this allows me to do a nicer dispatching in the init and forward (+ see motivation below):

Dispatch attention: sink layers use GPT-OSS always-sink eager, non-sink layers use standard Llama eager

    # (which is compatible with SDPA/FA2/flex backends)
    ...
    if add_sink:
        self.sinks = nn.Parameter(torch.empty(num_attn_heads), requires_grad=False)
        self._eager_attention_forward = eager_attention_forward_with_sink
    else:
        self.sinks = None
        self._eager_attention_forward = eager_attention_forward

Sink layers always use their custom eager attention (incompatible with SDPA/FA2/flex).

    # Non-sink layers can dispatch to any configured backend.
    if self.sinks is not None:
        attention_interface = self._eager_attention_forward
    else:
        attention_interface = ALL_ATTENTION_FUNCTIONS.get_interface(
            self.config._attn_implementation, self._eager_attention_forward
        )

Code Example

# (in the init...)

        # Dispatch attention: sink layers use GPT-OSS always-sink eager, non-sink layers use standard Llama eager
        # (which is compatible with SDPA/FA2/flex backends)
        ...
        if add_sink:
            self.sinks = nn.Parameter(torch.empty(num_attn_heads), requires_grad=False)
            self._eager_attention_forward = eager_attention_forward_with_sink
        else:
            self.sinks = None
            self._eager_attention_forward = eager_attention_forward

      ...
     # (later in the forward...)

        # Sink layers always use their custom eager attention (incompatible with SDPA/FA2/flex).
        # Non-sink layers can dispatch to any configured backend.
        if self.sinks is not None:
            attention_interface = self._eager_attention_forward
        else:
            attention_interface = ALL_ATTENTION_FUNCTIONS.get_interface(
                self.config._attn_implementation, self._eager_attention_forward
            )

---

def eager_attention_forward(
--
module: nn.Module,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: float,
dropout: float = 0.0,
sinks: Optional[torch.Tensor] = None,
):
key_states = repeat_kv(key, module.num_key_value_groups)
value_states = repeat_kv(value, module.num_key_value_groups)
attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
if attention_mask is not None:
causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
attn_weights = attn_weights + causal_mask
 
if sinks is not None:
sinks = module.attention_sink_bias.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
attn_weights = torch.cat([attn_weights, sinks], dim=-1)
 
attn_weights = attn_weights - attn_weights.max(dim=-1, keepdim=True).values
probs = F.softmax(attn_weights, dim=-1, dtype=attn_weights.dtype)
 
if sinks is not None:
probs = probs[..., :-1]  # we drop the sink here
 
attn_weights = nn.functional.dropout(probs, p=dropout, training=module.training)
attn_output = torch.matmul(attn_weights, value_states)
attn_output = attn_output.transpose(1, 2).contiguous()
return attn_output, attn_weights

RAW_BUFFERClick to expand / collapse

Feature request

Hi,

I'm finalizing a PR for integrating MiMo-V2, but for my MiMoV2FlashAttention class I wanted to reuse, for modularity, both the classic (no sink) eager_attention_forward from LLama and the perma sink one from Arthur in gpt-oss.

~Pseudo code (a bit outdated):


       # (in the init...)

        # Dispatch attention: sink layers use GPT-OSS always-sink eager, non-sink layers use standard Llama eager
        # (which is compatible with SDPA/FA2/flex backends)
        ...
        if add_sink:
            self.sinks = nn.Parameter(torch.empty(num_attn_heads), requires_grad=False)
            self._eager_attention_forward = eager_attention_forward_with_sink
        else:
            self.sinks = None
            self._eager_attention_forward = eager_attention_forward

      ...
     # (later in the forward...)

        # Sink layers always use their custom eager attention (incompatible with SDPA/FA2/flex).
        # Non-sink layers can dispatch to any configured backend.
        if self.sinks is not None:
            attention_interface = self._eager_attention_forward
        else:
            attention_interface = ALL_ATTENTION_FUNCTIONS.get_interface(
                self.config._attn_implementation, self._eager_attention_forward
            )

Motivation

I'm asking this because if I import both with aliases the modular converter triggers name error for the modeling file.

Even so, let's say Vasqu prefers I try to inherit from GLM4, I think having the gpt-oss eager attention not having the same name as the classic (non sink) eager could be a good thing regardless for future reuse.

Tagging @ArthurZucker for the gpt-oss impl and Matt for his opinion @Rocketknight1, I think this is safe.

for reference the OG Mimo is a single class doing both conditionally forced eager (and if someone activates the backends, it will get caught at runtime by the attention_interface.

def eager_attention_forward(
--
module: nn.Module,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: float,
dropout: float = 0.0,
sinks: Optional[torch.Tensor] = None,
):
key_states = repeat_kv(key, module.num_key_value_groups)
value_states = repeat_kv(value, module.num_key_value_groups)
attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
if attention_mask is not None:
causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
attn_weights = attn_weights + causal_mask
 
if sinks is not None:
sinks = module.attention_sink_bias.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
attn_weights = torch.cat([attn_weights, sinks], dim=-1)
 
attn_weights = attn_weights - attn_weights.max(dim=-1, keepdim=True).values
probs = F.softmax(attn_weights, dim=-1, dtype=attn_weights.dtype)
 
if sinks is not None:
probs = probs[..., :-1]  # we drop the sink here
 
attn_weights = nn.functional.dropout(probs, p=dropout, training=module.training)
attn_output = torch.matmul(attn_weights, value_states)
attn_output = attn_output.transpose(1, 2).contiguous()
return attn_output, attn_weights

Your contribution

#45142

extent analysis

TL;DR

To resolve the naming conflict, consider renaming the eager_attention_forward function from the gpt-oss implementation to a distinct name, such as eager_attention_forward_with_sink, to allow for modular reuse.

Guidance

Identify the naming conflict between the eager_attention_forward functions from LLama and gpt-oss, and consider renaming one of them to avoid the conflict.
Verify that the renamed function is correctly referenced in the MiMoV2FlashAttention class, particularly in the init and forward methods.
Ensure that the ALL_ATTENTION_FUNCTIONS dictionary is updated to include the renamed function, if necessary.
Test the modified code to ensure that the naming conflict is resolved and the attention layers function as expected.

Example

# Renamed function in gpt-oss implementation
def eager_attention_forward_with_sink(
    module: nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: float,
    dropout: float = 0.0,
    sinks: Optional[torch.Tensor] = None,
):
    # function implementation remains the same

Notes

The solution assumes that the naming conflict is the primary issue and that renaming one of the functions will resolve the conflict. However, additional modifications may be necessary to ensure that the attention layers function correctly with the renamed function.

Recommendation

Apply the workaround by renaming the eager_attention_forward function from the gpt-oss implementation to eager_attention_forward_with_sink, as this will allow for modular reuse and avoid the naming conflict.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#task chaining #parallel task #integration issue #index setup #retrieval issue

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

transformers - ✅(Solved) Fix [refactor] gpt-oss `eager_attention_forward` for modularity (Ex: models with dual eager attn: sink/no sink) [5 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Dispatch attention: sink layers use GPT-OSS always-sink eager, non-sink layers use standard Llama eager

Sink layers always use their custom eager attention (incompatible with SDPA/FA2/flex).

PR fix notes

PR #45142: refactor(gpt-oss): rename eager_attention_forward to eager_attention_forward_with_sink

Description (problem / solution / changelog)

What does this PR do?

Code Agent Policy

Before submitting

Who can review?

Changed files

PR #45144: Add Xiaomi MiMo-V2

Description (problem / solution / changelog)

What does this PR do?

Code Agent Policy

Before submitting

Who can review?

Additional useful info

Changed files

PR #45327: [docs] modular transformers

Description (problem / solution / changelog)

Changed files

Code Example

Feature request

Motivation

Your contribution

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING

PR #45142: refactor(gpt-oss): rename `eager_attention_forward` to `eager_attention_forward_with_sink`