transformers - ✅(Solved) Fix Allow for "pure" linear attention based Qwen3.5 models [1 pull requests, 4 comments, 3 participants]

GitHub stats
huggingface/transformers#45146
Fetched 2026-04-08 01:57:45
Comments
4
Participants
3
Timeline
11
Reactions
0
Timeline (top)
commented ×4, mentioned ×2, subscribed ×2, closed ×1

Error Message

Constructing Qwen3_5TextConfig(full_attention_interval=0) crashes with a zero division error.

Fix Action

Fixed

PR fix notes

PR #45148: Allow for all layers in Qwen3.5 architecture to be Gated Deltanet.

Description (problem / solution / changelog)

What does this PR do?

Fixes #45146

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, we ask that new users do not submit pure code agent PRs at this time. You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Changed files

  • src/transformers/models/qwen3_5/configuration_qwen3_5.py (modified, +7/-4)
  • src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +6/-2)

Code Example

import torch

from transformers.models import Qwen3_5ForCausalLM, Qwen3_5TextConfig

config = Qwen3_5TextConfig(
    full_attention_interval=0  # 0 means GDN layers only
)  # Before the fix, this crashes with a zero division error

model = Qwen3_5ForCausalLM(config).cuda()

input_ids = torch.tensor([[1, 2, 3, 4, 5]]).cuda()

model(input_ids)  # Before the fix, this crashes when the cache accesses its transformer_layers mapping

---

# configuration_qwen3_5.py, line 108
interval_pattern = kwargs.pop("full_attention_interval", 4)
if interval_pattern <= 0: # Edge case to support full linear attention
    self.layer_types = ["linear_attention"] * self.num_hidden_layers
else:
    self.layer_types = [
        "linear_attention" if bool((i + 1) % interval_pattern) else "full_attention"
        for i in range(self.num_hidden_layers)
    ]

---

# modeling_qwen3_5.py, line 136
layer_idx = (self.transformer_layers[0] if len(self.transformer_layers) > 0 else 0) if layer_idx not in self.transformer_layers else layer_idx
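
The nested conditional in the one-liner above is easy to misread, so here is a minimal sketch that unpacks it into a standalone function. The name resolve_layer_idx is hypothetical and not part of transformers, and the issue text itself marks this cache change as outdated relative to the current dev branch. The sketch only illustrates the fallback order: keep the index if it belongs to a full attention layer, otherwise use the first full attention layer, otherwise fall back to 0.

```python
from typing import Sequence


def resolve_layer_idx(layer_idx: int, transformer_layers: Sequence[int]) -> int:
    """Hypothetical helper mirroring the one-line fallback from the issue."""
    # The requested layer is itself a full attention layer: use it directly.
    if layer_idx in transformer_layers:
        return layer_idx
    # Otherwise fall back to the first full attention layer, if there is one.
    if len(transformer_layers) > 0:
        return transformer_layers[0]
    # Pure linear attention model: no full attention layers exist, so use 0.
    return 0


if __name__ == "__main__":
    print(resolve_layer_idx(7, [3, 7]))  # index is a full attention layer
    print(resolve_layer_idx(1, [3, 7]))  # falls back to the first full attention layer
    print(resolve_layer_idx(1, []))      # no full attention layers at all
```

The last call is the case the issue is about: with an empty transformer_layers list, indexing transformer_layers[0] unconditionally would fail, whereas the guarded version returns 0.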

Feature request

This feature request proposes allowing the creation of "pure" linear attention Qwen3.5 models, meaning that every layer may be a Gated Deltanet token mixer.

The following code should therefore be "allowed":

import torch

from transformers.models import Qwen3_5ForCausalLM, Qwen3_5TextConfig

config = Qwen3_5TextConfig(
    full_attention_interval=0  # 0 means GDN layers only
)  # Currently crashes with a zero division error

model = Qwen3_5ForCausalLM(config).cuda()

input_ids = torch.tensor([[1, 2, 3, 4, 5]]).cuda()

model(input_ids)  # Currently crashes when the cache accesses its transformer_layers mapping

Motivation

Qwen3.5 introduces a hybrid architecture with interleaved Gated Deltanet and softmax attention layers. All models published by Qwen contain some softmax attention layers, so the implementation assumes that at least one is present.

With the first model implementation that supports Gated Deltanet as a token mixer (and possibly other types of linear attention backbones) available, the community might be interested in building other "pure" GDN models on top of the qwen3_5 architecture.

Your contribution

The changes to the code would be quite easy to implement.

  1. Adjust the init of Qwen3_5TextConfig to handle the case where full_attention_interval is 0:
# configuration_qwen3_5.py, line 108
interval_pattern = kwargs.pop("full_attention_interval", 4)
if interval_pattern <= 0: # Edge case to support full linear attention
    self.layer_types = ["linear_attention"] * self.num_hidden_layers
else:
    self.layer_types = [
        "linear_attention" if bool((i + 1) % interval_pattern) else "full_attention"
        for i in range(self.num_hidden_layers)
    ]
  2. Adjust the logic for the cache to figure out the current sequence length:

OUTDATED:

# modeling_qwen3_5.py, line 136
layer_idx = (self.transformer_layers[0] if len(self.transformer_layers) > 0 else 0) if layer_idx not in self.transformer_layers else layer_idx

UPDATE: The current dev branch already uses a transformers-wide cache implementation. In the PR (#45148) I adapted the change accordingly.
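
The edge case from step 1 can be checked in isolation, without instantiating a model. Below is a minimal sketch that lifts the proposed layer_types logic into a free function; build_layer_types is a hypothetical name used only for illustration, and the default interval of 4 is taken from the snippet in step 1.

```python
def build_layer_types(num_hidden_layers: int, full_attention_interval: int = 4) -> list[str]:
    """Hypothetical standalone version of the proposed layer_types logic."""
    # Edge case: a non-positive interval means every layer is linear attention.
    # Without this guard, the modulo below would divide by zero for interval 0.
    if full_attention_interval <= 0:
        return ["linear_attention"] * num_hidden_layers
    # Otherwise every interval-th layer (1-based) is a full attention layer.
    return [
        "linear_attention" if (i + 1) % full_attention_interval else "full_attention"
        for i in range(num_hidden_layers)
    ]


if __name__ == "__main__":
    print(build_layer_types(8))     # hybrid pattern with the default interval of 4
    print(build_layer_types(4, 0))  # pure linear attention
```

With the default interval, layers 4 and 8 (1-based) come out as full attention and the rest as linear attention; with an interval of 0, all layers are linear attention and no division is attempted.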

extent analysis

TL;DR

To fix the issue, adjust the Qwen3_5TextConfig initialization to handle the case where full_attention_interval is 0 and update the cache logic to accommodate "pure" linear attention models.

Guidance

  • Update the Qwen3_5TextConfig initialization to set self.layer_types to ["linear_attention"] * self.num_hidden_layers when full_attention_interval is 0.
  • Modify the cache logic to correctly determine the current sequence length for "pure" linear attention models.
  • Review the changes made in PR #45148 to ensure they are compatible with the updated cache implementation.
  • Test the updated model with the provided example code to verify that it no longer crashes with a zero division error or when the cache accesses the transformer layer mapping.

Example

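import torch

from transformers.models import Qwen3_5ForCausalLM, Qwen3_5TextConfig

config = Qwen3_5TextConfig(
    full_attention_interval=0  # 0 means GDN layers only
)
model = Qwen3_5ForCausalLM(config).cuda()
input_ids = torch.tensor([[1, 2, 3, 4, 5]]).cuda()
model(input_ids)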

Notes

The provided code changes are specific to the qwen3_5 architecture and may not be applicable to other models. Additionally, the cache snippet shown in the issue is marked as outdated: the current dev branch uses a transformers-wide cache implementation, and the cache-related change in PR #45148 should be reviewed for compatibility with it.

Recommendation

Apply the workaround by updating the Qwen3_5TextConfig initialization and cache logic as described, as this will allow for the creation of "pure" linear attention Qwen3.5 models.
