transformers - 💡(How to fix) Fix Qwen3.5 + `flash_attention_2` crashes: 3D M-RoPE position_ids leak to `_is_packed_sequence` [2 comments, 3 participants]

transformers • 2026-03-13 00:11:25


GitHub stats

huggingface/transformers#44643•Fetched 2026-04-08 00:43:01


Timeline (top): mentioned ×4, subscribed ×4, commented ×2, closed ×1

Error Message

Fine-tuning Qwen3.5-9B with attn_implementation="flash_attention_2" crashes with CUDA error: an illegal memory access inside flash_attn_varlen_func.

Root Cause

Qwen3_5TextModel.forward passes 3D M-RoPE position_ids [3, batch, seq_len] to decoder layers. The attention layer doesn't use them (it uses position_embeddings), but they leak through **kwargs to _flash_attention_forward → _is_packed_sequence, which misinterprets the 3D shape and incorrectly routes to flash_varlen_fn with garbage cu_seqlens.

Qwen3VLTextModel.forward (same M-RoPE architecture) avoids this by passing the 2D text_position_ids to decoder layers instead:

# Qwen3-VL (correct): passes 2D text_position_ids
position_ids=text_position_ids,

# Qwen3.5 (buggy): passes 3D position_ids that leaks to flash attention
position_ids=position_ids,
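
The leak described here is ordinary Python keyword forwarding. The following is a minimal sketch of that mechanism, not the transformers code: every function name is a hypothetical stand-in, and tensor shapes are represented as plain tuples so the example runs without PyTorch. It shows how an argument that the attention layer never reads can still reach a helper further down when each call forwards **kwargs, and how a helper written for 2D input then misreads a 3D shape.

```python
# Hypothetical stand-ins; none of these are real transformers functions.

def interpret_positions(shape):
    """Toy helper written for 2D [batch, seq_len] input; it never checks the rank."""
    return {"batch": shape[0], "seq_len": shape[1]}


def attention_backend(hidden, position_ids_shape=None, **kwargs):
    """Toy backend: inspects position ids only if they happen to arrive."""
    if position_ids_shape is None:
        return None
    return interpret_positions(position_ids_shape)


def attention_layer(hidden, position_embeddings, **kwargs):
    # The layer itself uses only position_embeddings.
    # Everything else is forwarded untouched, including position ids.
    return attention_backend(hidden, **kwargs)


def decoder_layer(hidden, position_embeddings, **kwargs):
    return attention_layer(hidden, position_embeddings, **kwargs)


mrope_shape = (3, 2, 16)  # [3, batch, seq_len], the 3D M-RoPE layout
text_shape = (2, 16)      # [batch, seq_len], the 2D text layout

leaked = decoder_layer("h", "rope", position_ids_shape=mrope_shape)
correct = decoder_layer("h", "rope", position_ids_shape=text_shape)

print(leaked)   # {'batch': 3, 'seq_len': 2}
print(correct)  # {'batch': 2, 'seq_len': 16}
```

In the toy version the 3D shape is silently read as a batch of 3 with sequence length 2. Passing the 2D value at the decoder-layer call site, as Qwen3-VL does, keeps the downstream helper within the input layout it was written for.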

Fix Action

Workaround

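import transformers.modeling_flash_attention_utils as _fa_utils

# Keep a reference to the original check so 2D inputs behave as before.
_orig = _fa_utils._is_packed_sequence

def _patched(position_ids, **kwargs):
    # 3D M-RoPE position_ids ([3, batch, seq_len]) must not be treated as packed.
    if position_ids is not None and position_ids.dim() > 2:
        return False
    return _orig(position_ids, **kwargs)

# Apply before the first forward pass.
_fa_utils._is_packed_sequence = _patched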
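
If the workaround may be imported more than once (for example from several training scripts), it can help to make the patch idempotent. The sketch below is one way to do that under stated assumptions: install_rank_guard and the _rank_guard marker attribute are hypothetical names introduced here, and FakeTensor and fake_module are stand-ins so the example runs without transformers or PyTorch. Forwarding *args as well as **kwargs is a defensive choice of this sketch, not something the original workaround does.

```python
import types


def install_rank_guard(module):
    """Wrap module._is_packed_sequence so inputs with more than 2 dims return False."""
    original = module._is_packed_sequence
    if getattr(original, "_rank_guard", False):
        return original  # already patched; avoid stacking wrappers

    def guarded(position_ids, *args, **kwargs):
        if position_ids is not None and position_ids.dim() > 2:
            return False
        return original(position_ids, *args, **kwargs)

    guarded._rank_guard = True  # hypothetical marker used only by this sketch
    module._is_packed_sequence = guarded
    return guarded


# Stand-ins so the sketch runs without transformers or PyTorch.
class FakeTensor:
    def __init__(self, ndim):
        self._ndim = ndim

    def dim(self):
        return self._ndim


# The stand-in "original" always answers True so delegation is visible.
fake_module = types.SimpleNamespace(
    _is_packed_sequence=lambda position_ids, *args, **kwargs: True
)

first = install_rank_guard(fake_module)
second = install_rank_guard(fake_module)  # second call is a no-op

print(first is second)                                 # True
print(fake_module._is_packed_sequence(FakeTensor(3)))  # False
print(fake_module._is_packed_sequence(FakeTensor(2)))  # True
```

With the real library, the call would presumably be install_rank_guard(transformers.modeling_flash_attention_utils), made before the first forward pass. A module-level patch like this only affects callers that look the function up through the module namespace at call time; code that bound the name earlier with a from-import keeps the unpatched reference.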


System Info

transformers: 5.3.0, PyTorch: 2.6.0+cu124, flash-attn: 2.8.3, Python: 3.10, Linux


Suggested Fix

Either pass text_position_ids (2D) to decoder layers in Qwen3_5TextModel.forward (matching Qwen3-VL), or add if position_ids.ndim > 2: return False to _is_packed_sequence.


Who can help?

@zucchini-nlp @ArthurZucker @Cyrilvallez

extent analysis

Fix Plan

To fix the issue, apply one of the following solutions.

Pass text_position_ids (2D) to the decoder layers in Qwen3_5TextModel.forward, matching Qwen3-VL. The change is to the keyword argument at the decoder-layer call site:

position_ids=text_position_ids,

Add a check to _is_packed_sequence so that 3D position_ids are not treated as packed:

def _is_packed_sequence(position_ids, **kwargs):
    # Inputs with more than 2 dims (such as M-RoPE [3, batch, seq_len]) are not packed.
    if position_ids is not None and position_ids.ndim > 2:
        return False
    # ... original implementation continues unchanged ...

Alternatively, apply the monkeypatch from the Workaround section at runtime, without editing the library.

Verification

To verify the fix, re-run the fine-tuning process with attn_implementation="flash_attention_2" and check that it no longer crashes with a CUDA error.

Extra Tips

Ensure that the transformers and flash-attn libraries are up-to-date, as newer versions may include fixes for similar issues.
When working with custom models and attention implementations, carefully review the input shapes and dimensions to avoid similar errors.
