transformers - ✅(Solved) Fix [Bug] Flash Attention crashes with illegal memory access on Qwen3.5 due to 3D position_ids being misinterpreted as packed sequence [2 pull requests, 6 comments, 3 participants]

Q: Expected behavior

Model forward pass with `attn_implementation="flash_attention_2"` should complete successfully without CUDA errors. After the fix (adding dimensionality check for >2D position_ids), all 8 standard attention layers in Qwen3.5-9B pass flash attention forward correctly.

transformers • 2026-03-21 15:38:54

huggingface/transformers#44910•Fetched 2026-04-08 01:12:36

When using attn_implementation="flash_attention_2" with Qwen3.5 models, all forward passes crash with CUDA error: an illegal memory access was encountered. This affects both training and inference.

Root cause: Qwen3.5 uses a hybrid architecture (GatedDeltaNet linear attention + standard attention) and passes 3D position_ids with shape [3, batch_size, seq_len] (for multi-dimensional rotary embedding). The function _is_packed_sequence() in modeling_flash_attention_utils.py misinterprets this 3D tensor as a packed sequence indicator, causing cu_seqlens to be constructed with 3× the actual token count. Flash Attention then reads beyond the q/k/v tensor boundaries, resulting in an illegal memory access.

Error Message

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Fix

Add a dimensionality check in _is_packed_sequence() to reject tensors with more than 2 dimensions, since packed sequences always use 2D position_ids [batch, seq_len]:

def _is_packed_sequence(position_ids, batch_size):
    if position_ids is None:
        return False
    if position_ids.dim() > 2:
        return False
    increasing_position_sequences = (
        torch.arange(position_ids.shape[1], device=position_ids.device) + position_ids.min()
    )
    return batch_size == 1 and (increasing_position_sequences - position_ids).abs().sum().bool()

This fix has been validated: all 8 standard attention layers in Qwen3.5-9B pass flash attention forward successfully after applying the patch.

PR fix notes

PR #44911: Fix flash attention crash with 3D position_ids (Qwen3.5)

Repository: huggingface/transformers
Author: ouroborosscr
State: closed | merged: False
Link: https://github.com/huggingface/transformers/pull/44911

Description (problem / solution / changelog)

Qwen3.5 uses 3D position_ids [3, batch, seq_len] for multi-dimensional rotary embedding. _is_packed_sequence() misinterprets this as a packed sequence, causing cu_seqlens to be constructed with 3x the actual token count. Flash attention then reads beyond tensor boundaries, resulting in CUDA illegal memory access.

Add a dimensionality check to reject >2D position_ids, since packed sequences always use 2D [batch, seq_len] format.

What does this PR do?

Qwen3.5 uses a hybrid architecture (GatedDeltaNet + standard attention) with 3D position_ids of shape [3, batch_size, seq_len] for multi-dimensional rotary embedding. The function _is_packed_sequence() in modeling_flash_attention_utils.py does not handle >2D tensors, causing it to misidentify the input as a packed sequence. This leads to cu_seqlens being constructed with 3× the actual token count, and flash_attn_varlen_func reads beyond tensor boundaries, resulting in CUDA error: illegal memory access.

The fix: Add if position_ids.dim() > 2: return False at the top of _is_packed_sequence(), since packed sequences always use 2D [batch, seq_len] position_ids.

Intercepted evidence before crash:

q: torch.Size([256, 16, 256])           ← 256 tokens
cu_seqlens_q: tensor([0, 256, 512, 768]) ← claims 768 tokens (3×256)
q total=256 vs cu_seqlens_q[-1]=768      ← MISMATCH → illegal memory access

Fixes #44910

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section?
Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
Did you write any new necessary tests?

Who can review?

@vasqu @ArthurZucker @CyrilVallez (attention)

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Changed files

src/transformers/modeling_flash_attention_utils.py (modified, +8/-4)

PR #1487: [multimodal] add language_model_only flag for models like qwen3.5

Repository: NovaSky-AI/SkyRL
Author: erictang000
State: closed | merged: True
Link: https://github.com/NovaSky-AI/SkyRL/pull/1487

Description (problem / solution / changelog)

Add `language_model_only` flag for multimodal models (Qwen3.5)

Summary

Add language_model_only config flag across policy, ref, and inference engine configs to skip vision encoder initialization for multimodal models like Qwen3.5, reducing GPU memory usage
Fix FSDP weight sync: remap CausalLM param names (model.layers.*) to vLLM's expected namespace (language_model.model.layers.*) via new weight_prefix in FSDPWeightExtractor
Make FSDP wrap policy resilient to missing vision-only layer classes (warn + skip instead of crash)
Add flash-linear-attention and causal-conv1d dependencies; unblock causal-conv1d install override -- required for performant GDN layer execution
Add run_qwen3.5_0.8b.sh example with use_sample_packing=false (GDN layers are incompatible with packing)
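
The weight-name remapping in the summary above can be pictured with a small sketch. This is not the SkyRL implementation: the helper name `add_weight_prefix` and the sample parameter name are hypothetical, and only the `model.layers.*` to `language_model.model.layers.*` mapping comes from the PR description.

```python
def add_weight_prefix(named_params, weight_prefix="language_model."):
    """Prefix every parameter name, e.g. model.layers.0... -> language_model.model.layers.0..."""
    return {f"{weight_prefix}{name}": value for name, value in named_params.items()}


# Hypothetical parameter name and placeholder value, for illustration only.
causal_lm_params = {"model.layers.0.self_attn.q_proj.weight": "tensor-0"}
remapped = add_weight_prefix(causal_lm_params)
print(list(remapped))  # ['language_model.model.layers.0.self_attn.q_proj.weight']
```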

Runs

FSDP and Megatron reward matching (plot: https://github.com/user-attachments/assets/efb388d2-52b2-4789-ae88-0d29b93acdff)

Test plan

Run run_qwen3.5_0.8b.sh on 4 GPUs -- verify weight sync, no GDN fallback warnings, avg_final_rewards trends up
Run existing non-multimodal FSDP test to confirm no regression
Verify config validation rejects mismatched language_model_only across policy/ref/generator

Changed files

examples/train/megatron/run_megatron_qwen3.5.sh (modified, +4/-4)
examples/train/models/run_qwen3.5_0.8b.sh (added, +68/-0)
pyproject.toml (modified, +4/-2)
skyrl/backends/skyrl_train/distributed/fsdp_utils.py (modified, +11/-1)
skyrl/backends/skyrl_train/inference_engines/ray_wrapped_inference_engine.py (modified, +2/-0)
skyrl/backends/skyrl_train/inference_servers/utils.py (modified, +1/-0)
skyrl/backends/skyrl_train/workers/fsdp/fsdp_worker.py (modified, +21/-2)
skyrl/backends/skyrl_train/workers/model_wrapper.py (modified, +20/-12)
skyrl/backends/skyrl_train_backend.py (modified, +1/-0)
skyrl/train/config/config.py (modified, +9/-0)
skyrl/train/entrypoints/main_base.py (modified, +1/-0)
skyrl/train/utils/utils.py (modified, +7/-1)
uv.lock (modified, +64/-26)

System Info

[Bug] Flash Attention crashes with `illegal memory access` on Qwen3.5 due to 3D `position_ids` being misinterpreted as packed sequence

We fixed it in https://github.com/ouroborosscr/transformers/tree/fix/qwen35-flash-attn-3d-position-ids

Description

Reproduction

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",  # or any Qwen3.5 variant
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    attn_implementation="flash_attention_2",
)

# This crashes immediately
input_ids = torch.randint(100, 5000, (1, 256), device="cuda")
with torch.no_grad():
    out = model(input_ids=input_ids, use_cache=False)

Error:

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Traceback (abbreviated):

File "transformers/modeling_flash_attention_utils.py", line 692, in _flash_attention_forward
    out = flash_varlen_fn(
File "flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
File "flash_attn/flash_attn_interface.py", line 165, in _flash_attn_varlen_forward
    out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Root Cause Analysis

Qwen3.5 hybrid architecture

Qwen3.5 uses a mixed attention architecture: 24 layers of Qwen3_5GatedDeltaNet (linear attention) and 8 layers of Qwen3_5Attention (standard attention, at layers 3, 7, 11, 15, 19, 23, 27, 31). Only the standard attention layers use flash attention.

Qwen3.5 passes 3D position_ids with shape [3, batch_size, seq_len] for its multi-dimensional rotary embedding (3 sets of position indices).
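
The layer split described above can be sketched in a few lines. This is only an illustration: the every-fourth-layer rule is inferred from the listed indices (3, 7, ..., 31), and the function name `layer_kinds` is hypothetical, not part of transformers.

```python
def layer_kinds(num_layers=32, full_attention_every=4):
    """Label each decoder layer, assuming every 4th layer uses standard attention.

    The interval is an assumption read off the indices listed in the issue.
    """
    return [
        "full_attention" if (i + 1) % full_attention_every == 0 else "linear_attention"
        for i in range(num_layers)
    ]


kinds = layer_kinds()
full_layers = [i for i, kind in enumerate(kinds) if kind == "full_attention"]
print(full_layers)                      # [3, 7, 11, 15, 19, 23, 27, 31]
print(kinds.count("linear_attention"))  # 24
```

Only the eight layers in `full_layers` reach the flash attention code path, which is why the crash is tied to those layers.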

The bug

In modeling_flash_attention_utils.py, the function _is_packed_sequence() (line 444) does not handle tensors with more than 2 dimensions:

def _is_packed_sequence(position_ids, batch_size):
    if position_ids is None:
        return False
    increasing_position_sequences = (
        torch.arange(position_ids.shape[1], device=position_ids.device) + position_ids.min()
    )
    return batch_size == 1 and (increasing_position_sequences - position_ids).abs().sum().bool()

When position_ids has shape [3, 1, 256]:

position_ids.shape[1] returns 1 (not 256 as expected for a 2D [batch, seq_len] tensor)
The function returns True, misidentifying this as a packed sequence
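
To make the false positive concrete, here is a plain-Python stand-in for the check, using nested lists in place of tensors so it runs without torch. The helper names and the `reject_over_2d` switch are hypothetical. The arithmetic mirrors the function quoted above, including the way a length-1 ramp broadcasts across the last dimension.

```python
def ndim(nested):
    """Count nesting levels of a rectangular list-of-lists (stand-in for Tensor.dim())."""
    depth = 0
    while isinstance(nested, list):
        depth += 1
        nested = nested[0]
    return depth


def innermost_rows(nested):
    """Yield every innermost row (the last dimension) of a nested list."""
    if not isinstance(nested[0], list):
        yield nested
    else:
        for item in nested:
            yield from innermost_rows(item)


def is_packed_sequence_sketch(position_ids, batch_size, reject_over_2d=False):
    if position_ids is None:
        return False
    if reject_over_2d and ndim(position_ids) > 2:
        return False  # the proposed guard
    rows = list(innermost_rows(position_ids))
    lowest = min(min(row) for row in rows)
    # shape[1] is seq_len for [batch, seq_len] but batch_size for [3, batch, seq_len]
    ramp = [i + lowest for i in range(len(position_ids[0]))]
    total = 0
    for row in rows:
        if len(ramp) == 1:
            total += sum(abs(ramp[0] - p) for p in row)  # length-1 ramp broadcasts
        elif len(ramp) == len(row):
            total += sum(abs(r - p) for r, p in zip(ramp, row))
        else:
            raise ValueError("shapes do not broadcast")
    return batch_size == 1 and total != 0


plain_2d = [list(range(256))]                     # [1, 256], one ordinary sequence
packed_2d = [[0, 1, 2, 0, 1, 2, 3]]               # [1, 7], two sequences packed together
rope_3d = [[list(range(256))] for _ in range(3)]  # [3, 1, 256]

print(is_packed_sequence_sketch(plain_2d, 1))                      # False
print(is_packed_sequence_sketch(packed_2d, 1))                     # True
print(is_packed_sequence_sketch(rope_3d, 1))                       # True (false positive)
print(is_packed_sequence_sketch(rope_3d, 1, reject_over_2d=True))  # False
```

With a 3D input the ramp collapses to a single value, so every position other than the minimum contributes a nonzero difference and the check reports a packed sequence.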

This triggers the packed-sequence code path at line 677:

elif is_fa_with_varlen_kwargs or is_fa_with_position_ids:
    if cu_seq_lens_q is None or cu_seq_lens_k is None:
        q, k, v, (cu_seq_lens_q, cu_seq_lens_k), (max_length_q, max_length_k) = _prepare_from_posids(
            query_states, key_states, value_states, position_ids
        )

Inside prepare_fa_kwargs_from_position_ids() (line 362), which _prepare_from_posids() calls:

position_ids = position_ids.reshape(-1)  # [3, 1, 256] → [768]
indices_q = (position_ids == 0).nonzero().view(-1)  # Finds 3 zero positions

This constructs cu_seqlens = [0, 256, 512, 768], claiming 3 sequences with 768 total tokens. But the actual q/k/v tensors only contain 256 tokens. Flash Attention reads up to index 768, causing the illegal memory access.
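
The boundary construction can be reproduced without torch. The sketch below assumes the boundaries are the indices of every zero position plus the flattened length, which matches the values reported above. The function names are hypothetical stand-ins, not the library code.

```python
def flatten(nested):
    """Flatten a nested list, mirroring position_ids.reshape(-1)."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def cu_seqlens_from_position_ids(position_ids):
    flat = flatten(position_ids)
    starts = [i for i, p in enumerate(flat) if p == 0]  # every 0 is read as a sequence start
    return starts + [len(flat)]                         # final boundary = flattened length


seq_len = 256
rope_3d = [[list(range(seq_len))] for _ in range(3)]  # shape [3, 1, 256]
cu_seqlens = cu_seqlens_from_position_ids(rope_3d)
print(cu_seqlens)                # [0, 256, 512, 768]
print(cu_seqlens[-1] - seq_len)  # 512 token rows that do not exist in q
```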

Intercepted parameters confirming the mismatch

🔍 varlen_fwd parameters:
  q: torch.Size([256, 16, 256])           ← 256 tokens
  cu_seqlens_q: tensor([0, 256, 512, 768]) ← claims 768 tokens
  q total=256 vs cu_seqlens_q[-1]=768      ← MISMATCH → crash
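
A mismatch like this can be caught on the host before the kernel launches. The helper below is a hypothetical debugging aid, not something transformers or flash-attn provides. It takes the token count and the boundaries as plain Python values.

```python
def check_varlen_inputs(num_tokens, cu_seqlens):
    """Raise if cu_seqlens is not a valid partition of num_tokens packed rows."""
    if not cu_seqlens or cu_seqlens[0] != 0:
        raise ValueError("cu_seqlens must start at 0")
    if any(b < a for a, b in zip(cu_seqlens, cu_seqlens[1:])):
        raise ValueError("cu_seqlens must be non-decreasing")
    if cu_seqlens[-1] != num_tokens:
        raise ValueError(
            f"cu_seqlens[-1]={cu_seqlens[-1]} but q holds {num_tokens} tokens"
        )


check_varlen_inputs(256, [0, 256])  # consistent: passes silently

message = None
try:
    check_varlen_inputs(256, [0, 256, 512, 768])  # the values intercepted above
except ValueError as err:
    message = str(err)
print(message)  # cu_seqlens[-1]=768 but q holds 256 tokens
```

With real tensors, the same check could be fed `q.shape[0]` and `cu_seqlens_q.tolist()`. The extra device-to-host copy is acceptable while debugging but is best left out of a hot path.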

Fix

Add a dimensionality check in _is_packed_sequence() to reject tensors with more than 2 dimensions, since packed sequences always use 2D position_ids [batch, seq_len]:

def _is_packed_sequence(position_ids, batch_size):
    if position_ids is None:
        return False
    if position_ids.dim() > 2:
        return False
    increasing_position_sequences = (
        torch.arange(position_ids.shape[1], device=position_ids.device) + position_ids.min()
    )
    return batch_size == 1 and (increasing_position_sequences - position_ids).abs().sum().bool()

This fix has been validated: all 8 standard attention layers in Qwen3.5-9B pass flash attention forward successfully after applying the patch.
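
If editing the installed file is not an option, the same guard could in principle be applied at runtime by wrapping the existing function. This is a sketch under assumptions: it presumes the library looks up `_is_packed_sequence` as a module-level name when it is called, which has not been verified against every transformers release. The wrapper `with_dim_guard` and the `FakeTensor` stand-in are hypothetical.

```python
import functools


def with_dim_guard(original_check):
    """Wrap a packed-sequence check so position_ids with more than 2 dims never count as packed."""

    @functools.wraps(original_check)
    def guarded(position_ids, batch_size):
        if position_ids is not None and position_ids.dim() > 2:
            return False
        return original_check(position_ids, batch_size)

    return guarded


# Assumed usage (unverified; run it before the first forward pass):
#   import transformers.modeling_flash_attention_utils as fa_utils
#   fa_utils._is_packed_sequence = with_dim_guard(fa_utils._is_packed_sequence)


class FakeTensor:
    """Tiny stand-in exposing only .dim(), so the wrapper can be exercised without torch."""

    def __init__(self, ndim):
        self._ndim = ndim

    def dim(self):
        return self._ndim


always_packed = with_dim_guard(lambda position_ids, batch_size: True)
print(always_packed(FakeTensor(3), 1))  # False: 3D input short-circuits
print(always_packed(FakeTensor(2), 1))  # True: 2D input reaches the original check
```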

Environment

Model: Qwen3.5-9B (hybrid GatedDeltaNet + standard attention)
GPU: NVIDIA A100-SXM4-80GB
PyTorch: 2.9.0 / 2.10.0 (both affected)
Transformers: 5.3.0
flash-attn: 2.8.3
CUDA: 12.8

Impact

Affects all Qwen3.5 variants (and potentially any future model using >2D position_ids)
Blocks both training and inference when using flash_attention_2
No workaround other than falling back to sdpa or eager attention implementations
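
The fallback mentioned above amounts to changing one loading argument. A minimal sketch, with a hypothetical helper name, that rewrites the keyword arguments used in the reproduction script:

```python
def with_attention_fallback(load_kwargs, fallback="sdpa"):
    """Return a copy of the from_pretrained kwargs with flash_attention_2 swapped for a fallback."""
    patched = dict(load_kwargs)
    if patched.get("attn_implementation") == "flash_attention_2":
        patched["attn_implementation"] = fallback
    return patched


load_kwargs = {"device_map": {"": 0}, "attn_implementation": "flash_attention_2"}
safe_kwargs = with_attention_fallback(load_kwargs)
print(safe_kwargs["attn_implementation"])  # sdpa

# Intended use, mirroring the reproduction script:
#   model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B", **safe_kwargs)
```

Passing `fallback="eager"` selects the other implementation named in the issue.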

Who can help?

@vasqu @ArthurZucker (attention)

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Observed during GRPO reinforcement learning training of Qwen3.5-9B with the TRL GRPOTrainer. When using attn_implementation="flash_attention_2" with Qwen3.5, all forward passes crash with a CUDA illegal memory access. Minimal reproduction:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    attn_implementation="flash_attention_2",
)

input_ids = torch.randint(100, 5000, (1, 256), device="cuda")
with torch.no_grad():
    out = model(input_ids=input_ids, use_cache=False)  # crashes here

Error:

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

Root cause:

Qwen3.5 is a hybrid architecture (24 GatedDeltaNet layers + 8 standard attention layers). It uses 3D position_ids with shape [3, batch_size, seq_len] for multi-dimensional rotary embedding.

_is_packed_sequence() in modeling_flash_attention_utils.py (line 444) does not handle >2D tensors:

def _is_packed_sequence(position_ids, batch_size):
    if position_ids is None:
        return False
    increasing_position_sequences = (
        torch.arange(position_ids.shape[1], device=position_ids.device) + position_ids.min()
    )
    return batch_size == 1 and (increasing_position_sequences - position_ids).abs().sum().bool()

When position_ids has shape [3, 1, 256], position_ids.shape[1] returns 1 instead of the sequence length, and the function returns True, misidentifying this as a packed sequence.

This triggers prepare_fa_kwargs_from_position_ids() which does:

position_ids = position_ids.reshape(-1)  # [3, 1, 256] → [768]
indices_q = (position_ids == 0).nonzero()  # finds 3 zero positions
# constructs cu_seqlens = [0, 256, 512, 768] — claims 768 tokens

But q/k/v only contain 256 tokens. Flash attention reads up to index 768, causing the illegal memory access.

Intercepted evidence:

q: torch.Size([256, 16, 256])           ← 256 tokens
cu_seqlens_q: tensor([0, 256, 512, 768]) ← claims 768 tokens
q total=256 vs cu_seqlens_q[-1]=768      ← MISMATCH → crash

Environment:

Model: Qwen3.5-9B
GPU: NVIDIA A100-SXM4-80GB
PyTorch: 2.9.0 and 2.10.0 (both affected)
transformers: 5.3.0
flash-attn: 2.8.3
CUDA: 12.8

Fix: Add if position_ids.dim() > 2: return False in _is_packed_sequence(). Packed sequences always use 2D [batch, seq_len] position_ids.

Expected behavior

Model forward pass with attn_implementation="flash_attention_2" should complete successfully without CUDA errors. After the fix (adding dimensionality check for >2D position_ids), all 8 standard attention layers in Qwen3.5-9B pass flash attention forward correctly.

extent analysis

Fix Plan

To resolve the issue, you need to modify the _is_packed_sequence() function in modeling_flash_attention_utils.py to correctly handle 3D position_ids tensors.

Here are the steps:

Open the modeling_flash_attention_utils.py file.
Locate the _is_packed_sequence() function.
Add a dimensionality check at the beginning of the function to return False for tensors with more than 2 dimensions.

Example code:

def _is_packed_sequence(position_ids, batch_size):
    if position_ids is None:
        return False
    if position_ids.dim() > 2:  # Add this line to check for >2D tensors
        return False
    increasing_position_sequences = (
        torch.arange(position_ids.shape[1], device=position_ids.device) + position_ids.min()
    )
    return batch_size == 1 and (increasing_position_sequences - position_ids).abs().sum().bool()

Verification

To verify that the fix worked, run the reproduction code again:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    attn_implementation="flash_attention_2",
)

input_ids = torch.randint(100, 5000, (1, 256), device="cuda")
with torch.no_grad():
    out = model(input_ids=input_ids, use_cache=False)

The model forward pass should now complete successfully without CUDA errors.

Extra Tips

Patch the copy of modeling_flash_attention_utils.py that your Python process actually imports; for a pip-installed transformers this is the file inside the environment's site-packages, not a separate source checkout.
If you use a virtual environment, activate it before editing so the change lands in that environment's copy.
This fix assumes that packed sequences always use 2D [batch, seq_len] position_ids. If this assumption is not valid, further modifications may be necessary.
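
One way to find out which file the interpreter would load is to ask the import system. The helper name `module_file` is hypothetical. It relies only on the standard library's `importlib.util.find_spec` and returns None when the package is not installed in the active environment.

```python
import importlib.util


def module_file(module_name):
    """Return the path Python would load module_name from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:  # raised when a parent package is missing or fails to import
        return None
    return spec.origin if spec is not None else None


print(module_file("transformers.modeling_flash_attention_utils"))
```

Running this inside the same environment as the training or inference job shows the exact file to patch.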
