transformers: [Inquiry] What is the current status and roadmap for supporting packed sequences (packing) on Qwen3.5/hybrid linear attention models?



Context

With the release of Qwen3.5, we see a powerful hybrid architecture that heavily relies on GatedDeltaNet linear attention (powered by flash-linear-attention / FLA) and causal convolutions, alongside traditional self-attention layers.

When training standard Transformer LLMs such as Llama or Mistral, packing (padding-free training via TRL or custom trainers) is a vital optimization: it eliminates padding waste and boosts throughput, especially during SFT and RLHF.
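To make the padding waste concrete, here is a minimal sketch in plain Python that compares a right-padded batch with a single packed row. The helpers `pad_batch` and `pack_batch` are hypothetical names used for illustration only; they are not part of TRL or transformers, and real trainers operate on tensors rather than lists.

```python
from typing import List, Tuple


def pad_batch(samples: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sample to the length of the longest one."""
    max_len = max(len(s) for s in samples)
    return [s + [pad_id] * (max_len - len(s)) for s in samples]


def pack_batch(samples: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate samples into one row; position ids restart at 0 per sample."""
    input_ids: List[int] = []
    position_ids: List[int] = []
    for s in samples:
        input_ids.extend(s)
        position_ids.extend(range(len(s)))
    return input_ids, position_ids


# Three toy samples of lengths 5, 2 and 3 (token ids are arbitrary).
samples = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]

padded = pad_batch(samples)
packed_ids, packed_pos = pack_batch(samples)

padded_tokens = sum(len(row) for row in padded)  # slots the padded batch occupies
real_tokens = len(packed_ids)                    # slots the packed row occupies
waste = 1 - real_tokens / padded_tokens          # fraction of padded slots that are padding

print(f"padded={padded_tokens} packed={real_tokens} waste={waste:.0%}")
```

In this toy case a third of the padded batch is padding. The packed row removes it, but the only trace of where one sample ends and the next begins is the restart of the position ids, which is why boundary handling matters for stateful layers.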

Our Question

We would like to learn about the official stance, current support status, or future roadmap regarding packed sequence training specifically for Qwen3.5 (and similar linear attention hybrid models) within the transformers ecosystem.

Specifically, we want to clarify:

Current Compatibility: Does the current implementation of Qwen3.5 in transformers safely support packing=True out of the box when specialized kernels like FLA or causal-conv1d are enabled?

Boundary Handling: If packing is supported or planned, how does the framework keep sample boundaries within a packed sequence isolated? For example, are seq_idx or cu_seqlens passed down so that the convolution states and linear attention blocks cannot leak information across samples?

Community Efforts: Are there any ongoing PRs, refactoring branches, or recommended best practices/workarounds that the community should follow if we want to leverage packing for Qwen3.5?
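To illustrate the boundary-handling concern, the following is a toy sketch in plain Python, not the FLA or causal-conv1d implementation. It assumes a new sample starts wherever the position id is 0, derives `cu_seqlens` and `seq_idx` from that, and shows how a causal convolution mixes samples unless it is told where the boundaries are. All function names here are hypothetical; whether and how Qwen3.5 in transformers forwards such metadata to its kernels is exactly the open question above.

```python
from typing import List, Optional


def cu_seqlens_from_position_ids(position_ids: List[int]) -> List[int]:
    """Cumulative boundaries, assuming each sample starts at position id 0."""
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    return starts + [len(position_ids)]


def seq_idx_from_cu_seqlens(cu_seqlens: List[int]) -> List[int]:
    """Per-token sample index, e.g. [0, 0, 1, 1, 1] for boundaries [0, 2, 5]."""
    seq_idx: List[int] = []
    for k in range(len(cu_seqlens) - 1):
        seq_idx.extend([k] * (cu_seqlens[k + 1] - cu_seqlens[k]))
    return seq_idx


def toy_causal_conv1d(
    x: List[float], weights: List[float], seq_idx: Optional[List[int]] = None
) -> List[float]:
    """y[t] = sum_j weights[j] * x[t - j]; with seq_idx, taps from other samples are skipped."""
    out: List[float] = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            s = t - j
            if s < 0:
                continue  # before the start of the packed row
            if seq_idx is not None and seq_idx[s] != seq_idx[t]:
                continue  # token s belongs to a different sample
            acc += w * x[s]
        out.append(acc)
    return out


# Packed row holding three samples of lengths 5, 2 and 3.
position_ids = [0, 1, 2, 3, 4, 0, 1, 0, 1, 2]
cu_seqlens = cu_seqlens_from_position_ids(position_ids)
seq_idx = seq_idx_from_cu_seqlens(cu_seqlens)

x = [1.0] * len(position_ids)
weights = [1.0, 1.0, 1.0]  # kernel size 3, chosen only for illustration

leaky = toy_causal_conv1d(x, weights)              # ignores boundaries
isolated = toy_causal_conv1d(x, weights, seq_idx)  # respects boundaries

print("cu_seqlens:", cu_seqlens)
print("seq_idx:   ", seq_idx)
print("leaky:     ", leaky)
print("isolated:  ", isolated)
```

With boundaries ignored, the first tokens of the second and third samples still see the tail of the previous sample, so `leaky` and `isolated` differ at exactly those positions. The same reasoning applies to the recurrent state of a linear attention layer, which would need to be reset at every boundary.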

We are keen to understand the best path forward for scaling Qwen3.5 training using Hugging Face tooling, and we would be more than happy to help test any experimental branches or contribute to discussions if this is actively being worked on.

Thanks for the amazing work on supporting these new architectures!
