transformers - ✅ (Solved) Fix Qwen3VL/Qwen2.5VL VisionAttention breaks torch.compile with flash_attention_2 [1 pull request, 5 comments, 4 participants]

GitHub stats
huggingface/transformers#44962 (fetched 2026-04-08 01:21:26)
Comments: 5 · Participants: 4 · Timeline: 10 · Reactions: 0
Timeline (top): commented ×5, subscribed ×2, cross-referenced ×1, mentioned ×1

Error Message

TorchRuntimeError: flash_attn::_flash_attn_varlen_forward() Expected a value of type 'int' for argument 'max_seqlen_q' but instead found type 'FakeTensor'.

Root Cause

Qwen3VLVisionAttention (and Qwen2_5_VLVisionAttention) computes max_seqlen as a 0-d tensor. That tensor is then passed to flash_attn_varlen_func via max_length_q / max_length_k, which expect an int. During eager execution this works because PyTorch silently coerces the 0-d tensor. Under torch.compile, however, Dynamo traces it as a FakeTensor, and the flash_attn C++ op (flash_attn::_flash_attn_varlen_forward) rejects it with the error shown above.

Fix Action

Add .item() to convert the 0-d tensor to a Python int:

max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()

This is consistent with how transformers/modeling_flash_attention_utils.py already handles the same issue (line 354):

# This is a limitation of flash attention API, as the function `flash_attn_varlen_func`
# requires `max_length_q`, `max_length_k` to be passed as `int` and not `torch.Tensor`.
max_length_q = max_length_q.item()
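
To make the type difference concrete, here is a minimal CPU-only sketch, assuming PyTorch is installed. The cu_seqlens values are made up for illustration and are not taken from the issue; only the expression itself comes from the modeling code quoted above.

```python
import torch

# Illustrative cumulative sequence lengths for three packed sequences
# of lengths 4, 6 and 3 (values are assumptions, not from the issue).
cu_seqlens = torch.tensor([0, 4, 10, 13], dtype=torch.int32)

# Original expression: the result is a 0-d tensor, not a Python int.
max_seqlen_tensor = (cu_seqlens[1:] - cu_seqlens[:-1]).max()

# Fixed expression: .item() extracts a plain Python int.
max_seqlen_int = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()

print(type(max_seqlen_tensor).__name__, max_seqlen_tensor.dim())
print(type(max_seqlen_int).__name__, max_seqlen_int)
```

The first value is still a tensor with zero dimensions, which is what Dynamo replaces with a FakeTensor during tracing; the second is an ordinary int that satisfies the op's argument type.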

PR fix notes

PR #44973: Fix max_seqlen type in vision attention for torch.compile + FA2

Description (problem / solution / changelog)

What does this PR do?

Adds .item() to max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max() in all vision attention modules that pass this value to flash_attn_varlen_func.

Context

On released versions (e.g. 4.52.4), using torch.compile + attn_implementation="flash_attention_2" crashes because max_seqlen is a 0-d tensor and the flash_attn C++ op expects int:

TorchRuntimeError: flash_attn::_flash_attn_varlen_forward()
Expected a value of type 'int' for argument 'max_seqlen_q'
but instead found type 'FakeTensor'.

On main, this is already handled downstream by _process_flash_attention_kwargs which converts via .item() when is_tracing() is True (as @JJJYmmm pointed out). So this change is defense-in-depth on main, but a necessary fix for released versions.

Adding .item() at the source is consistent with how modeling_flash_attention_utils.py documents the issue (lines 352-353):

# This is a limitation of flash attention API, as the function `flash_attn_varlen_func`
# requires `max_length_q`, `max_length_k` to be passed as `int` and not `torch.Tensor`.
max_length_q = max_length_q.item()

Note: Qwen3.5's Qwen3_5VisionAttention (line 1004) has the same pattern without .item() — it works on main only because of the downstream fix, not because of a different implementation.
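
The idea of normalizing the value before it reaches the flash-attn op can be sketched as a small defensive helper. This is an illustration only: as_python_int is a hypothetical name, and the code below is not the implementation of _process_flash_attention_kwargs in transformers. It assumes PyTorch is installed.

```python
import torch


def as_python_int(value):
    """Hypothetical helper: return a Python int from an int or a one-element tensor."""
    if isinstance(value, torch.Tensor):
        if value.numel() != 1:
            raise ValueError("expected a tensor with exactly one element")
        # .item() copies the single value out of the tensor as a Python scalar.
        return int(value.item())
    return int(value)


# Works for both the unfixed (0-d tensor) and the fixed (int) call sites.
from_tensor = as_python_int(torch.tensor(7))
from_int = as_python_int(5)
print(from_tensor, from_int)
```

A helper like this accepts either form, so call sites that already use .item() and call sites that do not end up passing the same type. Keep in mind that reading a scalar out of a tensor under torch.compile may introduce a graph break, depending on the Dynamo configuration.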

Affected models (19 files)

Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3.5, Qwen3.5-MoE, Qwen3-VL-MoE, Qwen2.5-Omni, Qwen3-Omni-MoE, GLM-4V, GLM-4V-MoE, GLM-Image, GLM-OCR, ERNIE-4.5-VL-MoE, PaddleOCR-VL, Video-LLaMA-3

Reproduction (on transformers ≤ 4.52.4)

import torch
torch.set_float32_matmul_precision("high")
from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    dtype=torch.bfloat16, attn_implementation="flash_attention_2",
).cuda().eval()
model = torch.compile(model, mode="max-autotune-no-cudagraphs")

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-Embedding-2B")
img = Image.new("RGB", (875, 1024), color=(128, 128, 128))
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Describe."}]},
    {"role": "user", "content": [{"type": "image", "image": img}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[img], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)  # crashes on ≤4.52, works on main

Fixes #44962

Changed files

  • src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py (modified, +1/-1)
  • src/transformers/models/glm4v/modeling_glm4v.py (modified, +1/-1)
  • src/transformers/models/glm4v_moe/modeling_glm4v_moe.py (modified, +1/-1)
  • src/transformers/models/glm_image/modeling_glm_image.py (modified, +1/-1)
  • src/transformers/models/glm_image/modular_glm_image.py (modified, +1/-1)
  • src/transformers/models/glm_ocr/modeling_glm_ocr.py (modified, +1/-1)
  • src/transformers/models/glm_ocr/modular_glm_ocr.py (modified, +1/-1)
  • src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py (modified, +1/-1)
  • src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py (modified, +2/-2)
  • src/transformers/models/qwen2_5_omni/modular_qwen2_5_omni.py (modified, +2/-2)
  • src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +1/-1)
  • src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (modified, +1/-1)
  • src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +1/-1)
  • src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +1/-1)
  • src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py (modified, +2/-2)
  • src/transformers/models/qwen3_vl/modeling_qwen3_vl.py (modified, +1/-1)
  • src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py (modified, +1/-1)
  • src/transformers/models/video_llama_3/modeling_video_llama_3.py (modified, +1/-1)
  • src/transformers/models/video_llama_3/modular_video_llama_3.py (modified, +1/-1)


Bug description

Qwen3VLVisionAttention (and Qwen2_5_VLVisionAttention) computes max_seqlen as a 0-d tensor:

# src/transformers/models/qwen3_vl/modeling_qwen3_vl.py, line 221
max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()

This is then passed to flash_attn_varlen_func via max_length_q / max_length_k, which expects int. During eager execution this works because PyTorch silently coerces the 0-d tensor. But under torch.compile, Dynamo traces it as a FakeTensor, and the flash_attn C++ op (flash_attn::_flash_attn_varlen_forward) rejects it:

TorchRuntimeError: flash_attn::_flash_attn_varlen_forward() Expected a value of type 'int'
for argument 'max_seqlen_q' but instead found type 'FakeTensor'.

Fix

Add .item() to convert the 0-d tensor to a Python int:

max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()

This is consistent with how transformers/modeling_flash_attention_utils.py already handles the same issue (line 354):

# This is a limitation of flash attention API, as the function `flash_attn_varlen_func`
# requires `max_length_q`, `max_length_k` to be passed as `int` and not `torch.Tensor`.
max_length_q = max_length_q.item()

Affected models

  • Qwen3VLVisionAttention (src/transformers/models/qwen3_vl/modeling_qwen3_vl.py, line 221)
  • Qwen2_5_VLVisionAttention (src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py, line 246)

Reproduction

import torch
torch.set_float32_matmul_precision("high")

from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-Embedding-2B", trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda().eval()
model = torch.compile(model, mode="max-autotune-no-cudagraphs")

img = Image.new("RGB", (875, 1024), color=(128, 128, 128))
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Describe."}]},
    {"role": "user", "content": [{"type": "image", "image": img}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[img], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)  # crashes

Environment

  • transformers: 4.52.4 (also confirmed on main @ 2026-03-24)
  • flash-attn: 2.8.3
  • torch: 2.7.1+cu128
  • GPU: H100

Extent analysis

Fix Plan

To fix the issue, you need to convert the 0-d tensor max_seqlen to a Python int by adding .item().

  • Modify the Qwen3VLVisionAttention and Qwen2_5_VLVisionAttention classes in the respective files:
    • src/transformers/models/qwen3_vl/modeling_qwen3_vl.py, line 221
    • src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py, line 246
max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
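
Since the same one-line pattern appears in many modeling files, a quick text scan can help locate occurrences that still lack .item(). The following is a rough sketch using only the standard library; find_unconverted_max_seqlen is a hypothetical helper name, and the regular expression assumes the exact spelling of the expression quoted above, so differently formatted code would be missed.

```python
import re

# Matches the max_seqlen assignment only when .max() is NOT followed by .item().
PATTERN = re.compile(
    r"max_seqlen\s*=\s*\(cu_seqlens\[1:\]\s*-\s*cu_seqlens\[:-1\]\)\.max\(\)(?!\.item\(\))"
)


def find_unconverted_max_seqlen(source: str) -> list[int]:
    """Return 1-based line numbers where max_seqlen is left as a 0-d tensor."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if PATTERN.search(line)
    ]


sample = (
    "hidden = x\n"
    "max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()\n"
    "max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()\n"
)
print(find_unconverted_max_seqlen(sample))
```

To check a real checkout, the same function could be applied to the text of each modeling_*.py file, for example via pathlib.Path.read_text().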

Verification

To verify that the fix worked, run the reproduction code again. The code should no longer crash with a TorchRuntimeError.

import torch
torch.set_float32_matmul_precision("high")

from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

# ... (rest of the reproduction code remains the same)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)  # should no longer crash

Extra Tips

  • Make sure to update the transformers library to the latest version to ensure you have the latest fixes and improvements.
  • When working with PyTorch and torch.compile, be aware of the differences in behavior between eager execution and compiled execution, and test your code thoroughly to catch any potential issues.
