transformers - ✅ (Solved) Qwen3VL/Qwen2.5VL VisionAttention breaks torch.compile with flash_attention_2 [1 pull request, 5 comments, 4 participants]

transformers · 2026-03-24 04:53:11

GitHub stats

huggingface/transformers#44962 • Fetched 2026-04-08 01:21:26

Error Message

TorchRuntimeError: flash_attn::_flash_attn_varlen_forward() Expected a value of type 'int' for argument 'max_seqlen_q' but instead found type 'FakeTensor'.

Root Cause

Qwen3VLVisionAttention (and Qwen2_5_VLVisionAttention) computes max_seqlen as a 0-d tensor. That tensor is then passed to flash_attn_varlen_func via max_length_q / max_length_k, which expects an int. In eager execution this works because PyTorch silently coerces the 0-d tensor. Under torch.compile, however, Dynamo traces it as a FakeTensor, and the flash_attn C++ op (flash_attn::_flash_attn_varlen_forward) rejects it with the error shown above.

Fix Action

Add .item() to convert the 0-d tensor to a Python int:

max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()

This is consistent with how transformers/modeling_flash_attention_utils.py already handles the same issue (line 354):

# This is a limitation of flash attention API, as the function `flash_attn_varlen_func`
# requires `max_length_q`, `max_length_k` to be passed as `int` and not `torch.Tensor`.
max_length_q = max_length_q.item()
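
To make the type difference concrete, here is a minimal CPU-only sketch. The cu_seqlens values are invented for illustration (three packed sequences of lengths 4, 9, and 3), and only torch is assumed to be installed; flash-attn is not needed to run it.

```python
import torch

# Cumulative sequence lengths for three packed sequences (lengths 4, 9, 3).
# These values are illustrative, not taken from the issue.
cu_seqlens = torch.tensor([0, 4, 13, 16], dtype=torch.int32)

# Unpatched form: the result is a 0-d tensor, not a Python int.
max_seqlen_tensor = (cu_seqlens[1:] - cu_seqlens[:-1]).max()

# Patched form: .item() unwraps the 0-d tensor into a Python int.
max_seqlen_int = max_seqlen_tensor.item()

print(type(max_seqlen_tensor).__name__, max_seqlen_tensor.dim())  # Tensor 0
print(type(max_seqlen_int).__name__, max_seqlen_int)              # int 9
```

The first value is what the unpatched vision attention modules hand to flash_attn_varlen_func; the second is what the op's signature asks for.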

PR fix notes

PR #44973: Fix max_seqlen type in vision attention for torch.compile + FA2

Repository: huggingface/transformers
Author: andylizf
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/44973

Description (problem / solution / changelog)

What does this PR do?

Adds .item() to max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max() in all vision attention modules that pass this value to flash_attn_varlen_func.

Context

On released versions (e.g. 4.52.4), using torch.compile + attn_implementation="flash_attention_2" crashes because max_seqlen is a 0-d tensor and the flash_attn C++ op expects int:

TorchRuntimeError: flash_attn::_flash_attn_varlen_forward()
Expected a value of type 'int' for argument 'max_seqlen_q'
but instead found type 'FakeTensor'.

On main, this is already handled downstream by _process_flash_attention_kwargs which converts via .item() when is_tracing() is True (as @JJJYmmm pointed out). So this change is defense-in-depth on main, but a necessary fix for released versions.

Adding .item() at the source is consistent with how modeling_flash_attention_utils.py documents the issue (lines 352-353):

# This is a limitation of flash attention API, as the function `flash_attn_varlen_func`
# requires `max_length_q`, `max_length_k` to be passed as `int` and not `torch.Tensor`.
max_length_q = max_length_q.item()

Note: Qwen3.5's Qwen3_5VisionAttention (line 1004) has the same pattern without .item() — it works on main only because of the downstream fix, not because of a different implementation.
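
The idea of converting at the source, regardless of whether something downstream also converts, can be sketched with a small defensive helper. This is not the transformers implementation: to_python_int and FakeScalar are hypothetical names introduced here, and the helper is duck-typed so the sketch runs without torch installed.

```python
def to_python_int(value):
    """Return `value` as a plain Python int.

    Hypothetical helper, not part of transformers. Anything exposing a
    zero-argument .item() (for example a 0-d torch.Tensor) is unwrapped first.
    """
    if hasattr(value, "item"):
        value = value.item()
    return int(value)


class FakeScalar:
    """Stand-in for a 0-d tensor so the sketch runs without torch."""

    def __init__(self, v):
        self._v = v

    def item(self):
        return self._v


print(to_python_int(FakeScalar(9)))  # 9
print(to_python_int(9))              # 9
```

Calling the helper on a value that is already an int is a no-op, which is why applying the conversion both at the source and downstream is harmless.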

Affected models (19 files)

Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3.5, Qwen3.5-MoE, Qwen3-VL-MoE, Qwen2.5-Omni, Qwen3-Omni-MoE, GLM-4V, GLM-4V-MoE, GLM-Image, GLM-OCR, ERNIE-4.5-VL-MoE, PaddleOCR-VL, Video-LLaMA-3

Reproduction (on transformers ≤ 4.52.4)

import torch
torch.set_float32_matmul_precision("high")
from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    dtype=torch.bfloat16, attn_implementation="flash_attention_2",
).cuda().eval()
model = torch.compile(model, mode="max-autotune-no-cudagraphs")

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")
img = Image.new("RGB", (875, 1024), color=(128, 128, 128))
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Describe."}]},
    {"role": "user", "content": [{"type": "image", "image": img}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[img], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)  # crashes on ≤4.52, works on main

Fixes #44962

Changed files

src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py (modified, +1/-1)
src/transformers/models/glm4v/modeling_glm4v.py (modified, +1/-1)
src/transformers/models/glm4v_moe/modeling_glm4v_moe.py (modified, +1/-1)
src/transformers/models/glm_image/modeling_glm_image.py (modified, +1/-1)
src/transformers/models/glm_image/modular_glm_image.py (modified, +1/-1)
src/transformers/models/glm_ocr/modeling_glm_ocr.py (modified, +1/-1)
src/transformers/models/glm_ocr/modular_glm_ocr.py (modified, +1/-1)
src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py (modified, +1/-1)
src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py (modified, +2/-2)
src/transformers/models/qwen2_5_omni/modular_qwen2_5_omni.py (modified, +2/-2)
src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +1/-1)
src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (modified, +1/-1)
src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +1/-1)
src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +1/-1)
src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py (modified, +2/-2)
src/transformers/models/qwen3_vl/modeling_qwen3_vl.py (modified, +1/-1)
src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py (modified, +1/-1)
src/transformers/models/video_llama_3/modeling_video_llama_3.py (modified, +1/-1)
src/transformers/models/video_llama_3/modular_video_llama_3.py (modified, +1/-1)

Bug description

Qwen3VLVisionAttention (and Qwen2_5_VLVisionAttention) computes max_seqlen as a 0-d tensor:

# src/transformers/models/qwen3_vl/modeling_qwen3_vl.py, line 221
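max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()

Under torch.compile with flash_attention_2, this 0-d tensor reaches the flash_attn C++ op, which rejects it:
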
TorchRuntimeError: flash_attn::_flash_attn_varlen_forward() Expected a value of type 'int'
for argument 'max_seqlen_q' but instead found type 'FakeTensor'.

Fix

Add .item() to convert the 0-d tensor to a Python int:

max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()

This is consistent with how transformers/modeling_flash_attention_utils.py already handles the same issue (line 354):

# This is a limitation of flash attention API, as the function `flash_attn_varlen_func`
# requires `max_length_q`, `max_length_k` to be passed as `int` and not `torch.Tensor`.
max_length_q = max_length_q.item()

Affected models

Qwen3VLVisionAttention (src/transformers/models/qwen3_vl/modeling_qwen3_vl.py, line 221)
Qwen2_5_VLVisionAttention (src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py, line 246)

Reproduction

import torch
torch.set_float32_matmul_precision("high")

from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-Embedding-2B", trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()
model = torch.compile(model, mode="max-autotune-no-cudagraphs")

img = Image.new("RGB", (875, 1024), color=(128, 128, 128))
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Describe."}]},
    {"role": "user", "content": [{"type": "image", "image": img}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[img], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)  # crashes

Environment

transformers: 4.52.4 (also confirmed on main @ 2026-03-24)
flash-attn: 2.8.3
torch: 2.7.1+cu128
GPU: H100

extent analysis

Fix Plan

To fix the issue, you need to convert the 0-d tensor max_seqlen to a Python int by adding .item().

Modify the Qwen3VLVisionAttention and Qwen2_5_VLVisionAttention classes in the respective files:
- src/transformers/models/qwen3_vl/modeling_qwen3_vl.py, line 221
- src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py, line 246

max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
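
To check whether a given source file still contains the unpatched assignment, a small scan along these lines may help. It is a sketch only: find_unpatched_lines is a hypothetical helper, and the regular expression matches just the exact assignment quoted above, so variants written differently would be missed.

```python
import re

# Matches the assignment only when `.max()` is NOT followed by `.item()`.
UNPATCHED = re.compile(
    r"max_seqlen\s*=\s*\(cu_seqlens\[1:\]\s*-\s*cu_seqlens\[:-1\]\)"
    r"\.max\(\)(?!\.item\(\))"
)


def find_unpatched_lines(source: str) -> list[int]:
    """Return 1-based line numbers that still assign a 0-d tensor to max_seqlen."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if UNPATCHED.search(line)
    ]


# Illustrative input: one unpatched line followed by one patched line.
sample = """\
max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()
max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
"""
print(find_unpatched_lines(sample))  # [1]
```

For a real check, the source string would be read from the installed modeling file (for example via pathlib and the module's __file__ attribute) instead of the inline sample.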

Verification

To verify that the fix worked, run the reproduction code again. The code should no longer crash with a TorchRuntimeError.

import torch
torch.set_float32_matmul_precision("high")

from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

# ... (rest of the reproduction code remains the same)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)  # should no longer crash

Extra Tips

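If you can, upgrade transformers: according to the PR notes, main already converts max_seqlen via .item() downstream in _process_flash_attention_kwargs when tracing, so the crash is specific to released versions such as 4.52.4.
torch.compile can behave differently from eager execution, as this issue shows: an implicit tensor-to-int coercion that works eagerly fails once Dynamo traces the value as a FakeTensor. Test compiled code paths explicitly instead of assuming they match eager behavior.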
