transformers - ✅(Solved) Fix [refactor] gpt-oss `eager_attention_forward` for modularity (Ex: models with dual eager attn: sink/no sink) [5 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
huggingface/transformers#45141Fetched 2026-04-08 01:57:47
View on GitHub
Comments
1
Participants
2
Timeline
12
Reactions
0
Author
Timeline (top)
mentioned ×3subscribed ×3cross-referenced ×2closed ×1

Error Message

I'm asking this because if I import both with aliases the modular converter triggers name error for the modeling file.

Root Cause

I'm asking this because if I import both with aliases the modular converter triggers name error for the modeling file.

Fix Action

Fix / Workaround

So that the full attention layers can benefit from SDPA/FLA2/flex backends without being entangled with incompatible sink + SWA layers. Besides speed, this allows me to do a nicer dispatching in the init and forward (+ see motivation below):

Dispatch attention: sink layers use GPT-OSS always-sink eager, non-sink layers use standard Llama eager

    # (which is compatible with SDPA/FA2/flex backends)
    ...
    if add_sink:
        self.sinks = nn.Parameter(torch.empty(num_attn_heads), requires_grad=False)
        self._eager_attention_forward = eager_attention_forward_with_sink
    else:
        self.sinks = None
        self._eager_attention_forward = eager_attention_forward

Sink layers always use their custom eager attention (incompatible with SDPA/FA2/flex).

    # Non-sink layers can dispatch to any configured backend.
    if self.sinks is not None:
        attention_interface = self._eager_attention_forward
    else:
        attention_interface = ALL_ATTENTION_FUNCTIONS.get_interface(
            self.config._attn_implementation, self._eager_attention_forward
        )

PR fix notes

PR #45142: refactor(gpt-oss): rename eager_attention_forward to eager_attention_forward_with_sink

Description (problem / solution / changelog)

What does this PR do?

<!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable -->

Following: This would resolve #45141

For modularity and also linked to this PR #45144. Potentially re-using mulitple eagers for hybrid models (eg MiMo non-sink/sink eager attn)

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, we ask that new users do not submit pure code agent PRs at this time. You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**. Please tag fewer than 3 people. Models: - text models: @ArthurZucker @Cyrilvallez - vision models: @yonigozlan @molbap - audio models: @eustlb @ebezzam @vasqu - multimodal models: @zucchini-nlp - graph models: @clefourrier Library: - generate: @zucchini-nlp (visual-language models) or @gante (all others) - continuous batching: @remi-or @ArthurZucker @McPatate - pipelines: @Rocketknight1 - tokenizers: @ArthurZucker and @itazap - trainer: @SunMarc - attention: @vasqu @ArthurZucker @CyrilVallez - model loading (from pretrained, etc): @CyrilVallez - distributed: @3outeille @ArthurZucker - CIs: @ydshieh Integrations: - ray/raytune: @richardliaw, @amogkam - Big Model Inference: @SunMarc - quantization: @SunMarc - kernels: @drbh - peft: @BenjaminBossan @githubnemo Devices/Backends: - AMD ROCm: @ivarflakstad - Intel XPU: @IlyasMoutawwakil - Ascend NPU: @ivarflakstad Documentation: @stevhliu Research projects are not maintained and should be taken as is. -->

Changed files

  • src/transformers/models/gpt_oss/modeling_gpt_oss.py (modified, +2/-2)
  • src/transformers/models/gpt_oss/modular_gpt_oss.py (modified, +2/-2)

PR #45144: Add Xiaomi MiMo-V2

Description (problem / solution / changelog)

What does this PR do?

Hello, this PR aims to add the MiMo-V2-Flash model to the Transformers library Fixes https://github.com/huggingface/transformers/issues/42954

MiMo-V2 is "The last of the OSS SOTAs" that isn't natively supported by the Transformers library, so I hope we can make this work.

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, we ask that new users do not submit pure code agent PRs at this time. You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**. Please tag fewer than 3 people. Models: - text models: @ArthurZucker @Cyrilvallez - vision models: @yonigozlan @molbap - audio models: @eustlb @ebezzam @vasqu - multimodal models: @zucchini-nlp - graph models: @clefourrier Library: - generate: @zucchini-nlp (visual-language models) or @gante (all others) - continuous batching: @remi-or @ArthurZucker @McPatate - pipelines: @Rocketknight1 - tokenizers: @ArthurZucker and @itazap - trainer: @SunMarc - attention: @vasqu @ArthurZucker @CyrilVallez - model loading (from pretrained, etc): @CyrilVallez - distributed: @3outeille @ArthurZucker - CIs: @ydshieh Integrations: - ray/raytune: @richardliaw, @amogkam - Big Model Inference: @SunMarc - quantization: @SunMarc - kernels: @drbh - peft: @BenjaminBossan @githubnemo Devices/Backends: - AMD ROCm: @ivarflakstad - Intel XPU: @IlyasMoutawwakil - Ascend NPU: @ivarflakstad Documentation: @stevhliu Research projects are not maintained and should be taken as is. -->

Only pinging HF engineers who were involved in the PRs mentioned below:

Modular Sensei @vasqu , I took your comments from the other PR, to find the best candidates to inherit from. Although see https://github.com/huggingface/transformers/pull/45144#issuecomment-4162843437 for some important points.

@ArthurZucker for gpt-oss similarity and text model

For reference, there were already 2 previous PRs (1 closed and 1 non modular/abandoned):

 

Afaik, MiMo in the codebase is a bit novel as it's the only hybrid SWA that combines both dual theta RoPE and dual head dims + attn sinks (like gpt-oss). Most similar models are either one or the other, but not both.
So modularity wasn't easy mode pass, pass, pass.

I added some verbose NOTES for context/choices, and make the review easier. I'll obviously remove these once things are settled.

Additional useful info

(copy pasta from my own readme impl)

  • First layer is fully dense (GA+FFN, not MoE)
  • No shared experts in MoE
  • SWA and GA layers have different RoPE theta bases
  • SWA and GA layers have a different number of KV groups (GQA)
  • Values head dim is decoupled from QK head dim
  • Attention sink is only applied for SWA layers
  • Partial RoPE (rotating first 33% of the head dim)

Changed files

  • docs/source/en/_toctree.yml (modified, +2/-0)
  • docs/source/en/model_doc/mimo_v2_flash.md (added, +124/-0)
  • src/transformers/conversion_mapping.py (modified, +4/-0)
  • src/transformers/models/__init__.py (modified, +1/-0)
  • src/transformers/models/auto/configuration_auto.py (modified, +2/-0)
  • src/transformers/models/auto/modeling_auto.py (modified, +2/-0)
  • src/transformers/models/mimo_v2_flash/__init__.py (added, +27/-0)
  • src/transformers/models/mimo_v2_flash/configuration_mimo_v2_flash.py (added, +136/-0)
  • src/transformers/models/mimo_v2_flash/convert_mimo_v2_flash_weights_to_hf.py (added, +123/-0)
  • src/transformers/models/mimo_v2_flash/modeling_mimo_v2_flash.py (added, +695/-0)
  • src/transformers/models/mimo_v2_flash/modular_mimo_v2_flash.py (added, +491/-0)
  • tests/models/mimo_v2_flash/__init__.py (added, +0/-0)
  • tests/models/mimo_v2_flash/test_modeling_mimo_v2_flash.py (added, +435/-0)

PR #45327: [docs] modular transformers

Description (problem / solution / changelog)

refactors the how to add a model with modular transformers doc:

  • structure:

    • flipped the order so you learn how to write the modular file first before generating it
    • remove the motivator examples with BERT/RoBERTa
    • merge the two super() sections with context on when to use each
    • replace full code blocks with diffs so its easier to see whats actually changing
  • missing features:

    • @strict, __post_init__, __all__, AttributeError(), base_model_tp_plan, base_model_pp_plan
  • updates:

    • update the### Config section with new patterns like __post_init__, AttributeError() and class-level type annotations for new attributes
    • super_kwargs example references GemmaForCausalLM now
    • adds the modular-detector-v2 Space

let me know if there are other commonly missed modular practices!

todo:

  • add link to #45271 once merged on how to add vision processing components
  • add link to #45152 once merged on how to write model tests and required PR checks
  • add link to #45130 once merged on how to use @auto_docstring decorator

Changed files

  • docs/source/en/_toctree.yml (modified, +3/-3)
  • docs/source/en/modular_transformers.md (modified, +334/-405)

Code Example

# (in the init...)

        # Dispatch attention: sink layers use GPT-OSS always-sink eager, non-sink layers use standard Llama eager
        # (which is compatible with SDPA/FA2/flex backends)
        ...
        if add_sink:
            self.sinks = nn.Parameter(torch.empty(num_attn_heads), requires_grad=False)
            self._eager_attention_forward = eager_attention_forward_with_sink
        else:
            self.sinks = None
            self._eager_attention_forward = eager_attention_forward

      ...
     # (later in the forward...)

        # Sink layers always use their custom eager attention (incompatible with SDPA/FA2/flex).
        # Non-sink layers can dispatch to any configured backend.
        if self.sinks is not None:
            attention_interface = self._eager_attention_forward
        else:
            attention_interface = ALL_ATTENTION_FUNCTIONS.get_interface(
                self.config._attn_implementation, self._eager_attention_forward
            )

---

def eager_attention_forward(
--
module: nn.Module,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: float,
dropout: float = 0.0,
sinks: Optional[torch.Tensor] = None,
):
key_states = repeat_kv(key, module.num_key_value_groups)
value_states = repeat_kv(value, module.num_key_value_groups)
attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
if attention_mask is not None:
causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
attn_weights = attn_weights + causal_mask
 
if sinks is not None:
sinks = module.attention_sink_bias.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
attn_weights = torch.cat([attn_weights, sinks], dim=-1)
 
attn_weights = attn_weights - attn_weights.max(dim=-1, keepdim=True).values
probs = F.softmax(attn_weights, dim=-1, dtype=attn_weights.dtype)
 
if sinks is not None:
probs = probs[..., :-1]  # we drop the sink here
 
attn_weights = nn.functional.dropout(probs, p=dropout, training=module.training)
attn_output = torch.matmul(attn_weights, value_states)
attn_output = attn_output.transpose(1, 2).contiguous()
return attn_output, attn_weights
RAW_BUFFERClick to expand / collapse

Feature request

Hi,

I'm finalizing a PR for integrating MiMo-V2, but for my MiMoV2FlashAttention class I wanted to reuse, for modularity, both the classic (no sink) eager_attention_forward from LLama and the perma sink one from Arthur in gpt-oss.

So that the full attention layers can benefit from SDPA/FLA2/flex backends without being entangled with incompatible sink + SWA layers. Besides speed, this allows me to do a nicer dispatching in the init and forward (+ see motivation below):

~Pseudo code (a bit outdated):


       # (in the init...)

        # Dispatch attention: sink layers use GPT-OSS always-sink eager, non-sink layers use standard Llama eager
        # (which is compatible with SDPA/FA2/flex backends)
        ...
        if add_sink:
            self.sinks = nn.Parameter(torch.empty(num_attn_heads), requires_grad=False)
            self._eager_attention_forward = eager_attention_forward_with_sink
        else:
            self.sinks = None
            self._eager_attention_forward = eager_attention_forward

      ...
     # (later in the forward...)

        # Sink layers always use their custom eager attention (incompatible with SDPA/FA2/flex).
        # Non-sink layers can dispatch to any configured backend.
        if self.sinks is not None:
            attention_interface = self._eager_attention_forward
        else:
            attention_interface = ALL_ATTENTION_FUNCTIONS.get_interface(
                self.config._attn_implementation, self._eager_attention_forward
            )

Motivation

I'm asking this because if I import both with aliases the modular converter triggers name error for the modeling file.

Even so, let's say Vasqu prefers I try to inherit from GLM4, I think having the gpt-oss eager attention not having the same name as the classic (non sink) eager could be a good thing regardless for future reuse.

Tagging @ArthurZucker for the gpt-oss impl and Matt for his opinion @Rocketknight1, I think this is safe.

 

for reference the OG Mimo is a single class doing both conditionally forced eager (and if someone activates the backends, it will get caught at runtime by the attention_interface.

def eager_attention_forward(
--
module: nn.Module,
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: Optional[torch.Tensor],
scaling: float,
dropout: float = 0.0,
sinks: Optional[torch.Tensor] = None,
):
key_states = repeat_kv(key, module.num_key_value_groups)
value_states = repeat_kv(value, module.num_key_value_groups)
attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
if attention_mask is not None:
causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
attn_weights = attn_weights + causal_mask
 
if sinks is not None:
sinks = module.attention_sink_bias.reshape(1, -1, 1, 1).expand(query.shape[0], -1, query.shape[-2], -1)
attn_weights = torch.cat([attn_weights, sinks], dim=-1)
 
attn_weights = attn_weights - attn_weights.max(dim=-1, keepdim=True).values
probs = F.softmax(attn_weights, dim=-1, dtype=attn_weights.dtype)
 
if sinks is not None:
probs = probs[..., :-1]  # we drop the sink here
 
attn_weights = nn.functional.dropout(probs, p=dropout, training=module.training)
attn_output = torch.matmul(attn_weights, value_states)
attn_output = attn_output.transpose(1, 2).contiguous()
return attn_output, attn_weights

Your contribution

#45142

extent analysis

TL;DR

To resolve the naming conflict, consider renaming the eager_attention_forward function from the gpt-oss implementation to a distinct name, such as eager_attention_forward_with_sink, to allow for modular reuse.

Guidance

  • Identify the naming conflict between the eager_attention_forward functions from LLama and gpt-oss, and consider renaming one of them to avoid the conflict.
  • Verify that the renamed function is correctly referenced in the MiMoV2FlashAttention class, particularly in the init and forward methods.
  • Ensure that the ALL_ATTENTION_FUNCTIONS dictionary is updated to include the renamed function, if necessary.
  • Test the modified code to ensure that the naming conflict is resolved and the attention layers function as expected.

Example

# Renamed function in gpt-oss implementation
def eager_attention_forward_with_sink(
    module: nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: float,
    dropout: float = 0.0,
    sinks: Optional[torch.Tensor] = None,
):
    # function implementation remains the same

Notes

The solution assumes that the naming conflict is the primary issue and that renaming one of the functions will resolve the conflict. However, additional modifications may be necessary to ensure that the attention layers function correctly with the renamed function.

Recommendation

Apply the workaround by renaming the eager_attention_forward function from the gpt-oss implementation to eager_attention_forward_with_sink, as this will allow for modular reuse and avoid the naming conflict.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING