transformers - 💡(How to fix) Fix Qwen3.5 + `flash_attention_2` crashes: 3D M-RoPE position_ids leak to `_is_packed_sequence` [2 comments, 3 participants]

GitHub stats: huggingface/transformers#44643 (fetched 2026-04-08 00:43:01)
Comments: 2 · Participants: 3 · Timeline events: 13 · Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, commented ×2, closed ×1

Error Message

Fine-tuning Qwen3.5-9B with attn_implementation="flash_attention_2" crashes with CUDA error: an illegal memory access inside flash_attn_varlen_func.

Root Cause

Qwen3_5TextModel.forward passes 3D M-RoPE position_ids of shape [3, batch, seq_len] to the decoder layers. The attention layer does not use them (it uses position_embeddings), but they leak through **kwargs into _flash_attention_forward and from there into _is_packed_sequence, which misinterprets the 3D shape as a packed sequence and incorrectly routes the call to flash_varlen_fn with garbage cu_seqlens.

Qwen3VLTextModel.forward (same M-RoPE architecture) avoids this by passing the 2D text_position_ids to decoder layers instead:

# Qwen3-VL (correct): passes 2D text_position_ids
position_ids=text_position_ids,

# Qwen3.5 (buggy): passes 3D position_ids that leaks to flash attention
position_ids=position_ids,

Fix Action

Workaround

import transformers.modeling_flash_attention_utils as _fa_utils
_orig = _fa_utils._is_packed_sequence

def _patched(position_ids, **kwargs):
    if position_ids is not None and position_ids.dim() > 2:
        return False
    return _orig(position_ids, **kwargs)

_fa_utils._is_packed_sequence = _patched
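
A slightly more defensive variant of the same workaround is sketched below. It is not part of the issue: the helper names guard_packed_check and apply_patch and the _mrope_guarded marker attribute are made up for illustration. The sketch assumes that _is_packed_sequence may be called with extra positional or keyword arguments, so it forwards both, and it marks the wrapper so that applying the patch twice does not stack wrappers.

```python
import functools


def guard_packed_check(orig):
    """Wrap a packed-sequence check so inputs with more than 2 dims return False.

    Hypothetical helper: `orig` is expected to be the library's
    `_is_packed_sequence`, but any callable taking `position_ids` first works.
    """
    if getattr(orig, "_mrope_guarded", False):
        return orig  # already wrapped, do not stack a second guard

    @functools.wraps(orig)
    def wrapper(position_ids, *args, **kwargs):
        # 3D M-RoPE position_ids ([3, batch, seq_len]) are not a packed layout.
        if position_ids is not None and position_ids.dim() > 2:
            return False
        return orig(position_ids, *args, **kwargs)

    wrapper._mrope_guarded = True  # marker attribute, an assumption of this sketch
    return wrapper


def apply_patch():
    """Install the guard; call once before building the model."""
    # Imported lazily so this file can be imported without transformers installed.
    import transformers.modeling_flash_attention_utils as fa_utils

    fa_utils._is_packed_sequence = guard_packed_check(fa_utils._is_packed_sequence)
```

Calling apply_patch() before loading the model should behave like the workaround above. It still replaces a private function, so it is best treated as a temporary measure until the model code passes 2D position ids.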


Qwen3.5 + flash_attention_2 crashes: 3D M-RoPE position_ids leak to _is_packed_sequence

System Info

  • transformers: 5.3.0, PyTorch: 2.6.0+cu124, flash-attn: 2.8.3, Python: 3.10, Linux

Reproduction

Fine-tuning Qwen3.5-9B with attn_implementation="flash_attention_2" crashes with CUDA error: an illegal memory access inside flash_attn_varlen_func.

Root Cause

Qwen3_5TextModel.forward passes 3D M-RoPE position_ids of shape [3, batch, seq_len] to the decoder layers. The attention layer does not use them (it uses position_embeddings), but they leak through **kwargs into _flash_attention_forward and from there into _is_packed_sequence, which misinterprets the 3D shape as a packed sequence and incorrectly routes the call to flash_varlen_fn with garbage cu_seqlens.

Qwen3VLTextModel.forward (same M-RoPE architecture) avoids this by passing the 2D text_position_ids to decoder layers instead:

# Qwen3-VL (correct): passes 2D text_position_ids
position_ids=text_position_ids,

# Qwen3.5 (buggy): passes 3D position_ids that leaks to flash attention
position_ids=position_ids,

Suggested Fix

Either pass text_position_ids (2D) to decoder layers in Qwen3_5TextModel.forward (matching Qwen3-VL), or add if position_ids.ndim > 2: return False to _is_packed_sequence.

Workaround

import transformers.modeling_flash_attention_utils as _fa_utils
_orig = _fa_utils._is_packed_sequence

def _patched(position_ids, **kwargs):
    if position_ids is not None and position_ids.dim() > 2:
        return False
    return _orig(position_ids, **kwargs)

_fa_utils._is_packed_sequence = _patched

Who can help?

@zucchini-nlp @ArthurZucker @Cyrilvallez

extent analysis

Fix Plan

To fix the issue, we can apply one of the following solutions:

  • Pass text_position_ids (2D) to the decoder layers in Qwen3_5TextModel.forward, matching Qwen3-VL:
        position_ids=text_position_ids,
  • Add an early return to _is_packed_sequence so that 3D position_ids are never treated as a packed sequence:
        def _is_packed_sequence(position_ids, **kwargs):
            # 3D M-RoPE position_ids are not a packed-sequence layout
            if position_ids is not None and position_ids.ndim > 2:
                return False
            ...  # original implementation continues unchanged

Alternatively, apply the monkey-patch from the Workaround section of the issue, which wraps _is_packed_sequence and returns False for any position_ids with more than two dimensions.

Verification

To verify the fix, re-run the fine-tuning process with attn_implementation="flash_attention_2" and check that it no longer crashes with a CUDA error.
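
Because a full fine-tuning run is slow, a quick CPU-only pre-check can confirm that a guard is installed before the job starts. The sketch below is an assumption-laden illustration, not something from the issue: FakePositionIds and guard_is_active are hypothetical names, and the check assumes the function accepts a batch_size keyword argument.

```python
class FakePositionIds:
    """Stand-in exposing only the shape API a guard inspects (not a real tensor)."""

    def __init__(self, *shape):
        self.shape = tuple(shape)

    def dim(self):
        return len(self.shape)

    @property
    def ndim(self):
        return len(self.shape)


def guard_is_active(is_packed_fn, batch=2, seq_len=8):
    """Return True if is_packed_fn rejects a 3D [3, batch, seq_len] input."""
    probe = FakePositionIds(3, batch, seq_len)
    try:
        # `batch_size` as a keyword is an assumption about the call signature.
        return is_packed_fn(probe, batch_size=batch) is False
    except Exception:
        # An unguarded check would attempt real tensor ops on the stand-in and fail.
        return False
```

With transformers installed, passing transformers.modeling_flash_attention_utils._is_packed_sequence to guard_is_active should return True only after a guard or an upstream fix is in place. This complements the real fine-tuning run and does not replace it.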

Extra Tips

  • Ensure that the transformers and flash-attn libraries are up-to-date, as newer versions may include fixes for similar issues.
  • When working with custom models and attention implementations, carefully review the input shapes and dimensions to avoid similar errors.
