transformers - ✅ (Solved) Fix transformers==5.3.0, qwen2.5-vl video input vision_position_ids seems to be wrong [1 pull request, 4 comments, 1 participant]

GitHub stats

huggingface/transformers#45381 · Fetched 2026-04-12 13:23:59

  • Comments: 4
  • Participants: 1
  • Timeline events: 11
  • Reactions: 0
  • Timeline (top): commented ×4, mentioned ×2, subscribed ×2, closed ×1

PR fix notes

PR #45400: Fix Qwen2.5VL temporal grid positions

Description (problem / solution / changelog)

What does this PR do?

Fixes https://github.com/huggingface/transformers/issues/45381. It is odd, though: I remember checking position ids by value in Qwen2.5 as well, to verify that the time interval works 🤔

Update: I know why. The integration test we have uses second_per_grid_ts = 0.083, which rounds to 0.0, so the multiplication is zero no matter what value we get for the vision positions. Great!
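The masking effect described here can be shown with a toy calculation. This is a minimal sketch, not the transformers implementation: `temporal_positions`, its parameters, and the default `tokens_per_second=1` are hypothetical, and it assumes, as the PR note suggests, that the per-grid time interval is truncated to an integer before it scales the frame indices.

```python
def temporal_positions(frame_indices, seconds_per_grid, tokens_per_second=1):
    """Scale per-frame temporal indices by an integer time interval.

    Hypothetical helper for illustration; tokens_per_second=1 is an assumed default.
    """
    interval = int(seconds_per_grid * tokens_per_second)  # 0.083 truncates to 0
    return [index * interval for index in frame_indices]


correct_indices = [0, 1, 2, 3]    # one temporal index per frame
collapsed_indices = [0, 0, 0, 0]  # the reported symptom: every frame shares one index

# With an interval of 0.083 both inputs produce the same output,
# so a value-based test cannot tell them apart.
print(temporal_positions(correct_indices, 0.083))
print(temporal_positions(collapsed_indices, 0.083))

# With an interval of 1.0 the collapsed indices become visible.
print(temporal_positions(correct_indices, 1.0))
print(temporal_positions(collapsed_indices, 1.0))
```

Under these assumptions, a regression test needs an interval that survives truncation; otherwise correct and collapsed indices are indistinguishable.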

For most models we didn't see any diff because each frame is separated by a timestamp and processed separately. Only the first two Qwen releases process all frames in bulk at once.

In any case, it is worth adding a fast test with expected positions; will do so.
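A fast test of this kind could compare the model's vision position ids against a small, hand-checkable reference. The sketch below builds such a reference in plain Python, assuming the usual (temporal, height, width) layout in which vision tokens are ordered frame by frame, then row by row. `expected_vision_positions` is a hypothetical helper, not a transformers function, and it ignores the text offset and any time-interval scaling.

```python
from itertools import product


def expected_vision_positions(grid_t, grid_h, grid_w):
    """Return (temporal, height, width) index lists for a grid_t x grid_h x grid_w token grid."""
    # product() iterates with the last axis changing fastest: frame, then row, then column.
    cells = list(product(range(grid_t), range(grid_h), range(grid_w)))
    temporal = [t for t, _, _ in cells]
    height = [h for _, h, _ in cells]
    width = [w for _, _, w in cells]
    return temporal, height, width


temporal, height, width = expected_vision_positions(grid_t=2, grid_h=2, grid_w=2)
print(temporal)  # [0, 0, 0, 0, 1, 1, 1, 1]
print(height)    # [0, 0, 1, 1, 0, 0, 1, 1]
print(width)     # [0, 1, 0, 1, 0, 1, 0, 1]
```

A grid this small keeps the expected values readable in the test file, which makes a regression on any single axis easy to spot.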

Changed files

  • src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py (modified, +11/-9)
  • src/transformers/models/glm46v/modeling_glm46v.py (modified, +20/-42)
  • src/transformers/models/glm4v/modeling_glm4v.py (modified, +20/-42)
  • src/transformers/models/glm4v/modular_glm4v.py (modified, +9/-90)
  • src/transformers/models/glm4v_moe/modeling_glm4v_moe.py (modified, +20/-42)
  • src/transformers/models/glm_image/modeling_glm_image.py (modified, +11/-9)
  • src/transformers/models/glm_ocr/modeling_glm_ocr.py (modified, +20/-42)
  • src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py (modified, +11/-9)
  • src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +11/-9)
  • src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (modified, +11/-9)
  • src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +11/-9)
  • src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +11/-9)
  • src/transformers/models/qwen3_vl/modeling_qwen3_vl.py (modified, +11/-9)
  • src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py (modified, +11/-9)
  • tests/models/glm4v/test_modeling_glm4v.py (modified, +46/-0)
  • tests/models/qwen2_5_vl/test_modeling_qwen2_5_vl.py (modified, +66/-36)
  • tests/models/qwen2_vl/test_modeling_qwen2_vl.py (modified, +52/-0)
  • tests/models/qwen3_vl/test_modeling_qwen3_vl.py (modified, +55/-0)


System Info

  • transformers == 5.3.0 (the bug seems to affect any transformers >= 5.3.0, including the current main branch)
  • qwen_vl_utils == 0.0.14
  • Python 3.12.4
  • CUDA 12.6

Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using this script

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# In Qwen2.5-VL, frame rate information is also passed to the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True, return_video_metadata=True)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=[[video_inputs[0][0]]],
    video_metadata=[[video_inputs[0][1]]],
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
# Move inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Expected behavior

The vision_position_ids output at https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py#L1177 is wrong: all video frames share the same position_temporal, and position_height is also wrong.
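One way to spot the reported symptom without knowing the exact expected values is to count the distinct positions on each axis, since the counts do not change when a constant text offset is added. The sketch below uses toy data rather than real model output, and `distinct_counts` is a hypothetical helper.

```python
def distinct_counts(temporal, height, width):
    """Count distinct position values on each axis (unaffected by a constant offset)."""
    return len(set(temporal)), len(set(height)), len(set(width))


# Toy 3-frame, 2x2 grid with an offset of 5, laid out frame by frame.
good = ([5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7], [5, 5, 6, 6] * 3, [5, 6] * 6)
# Same grid showing the reported symptom: all frames share one temporal position.
bad = ([5] * 12, [5, 5, 6, 6] * 3, [5, 6] * 6)

print(distinct_counts(*good))  # (3, 2, 2), matching the (t, h, w) grid
print(distinct_counts(*bad))   # (1, 2, 2), the temporal axis has collapsed
```

If the temporal count stays at 1 for a multi-frame video, the frames are not being distinguished in time, assuming the time interval itself does not round to 0.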

extent analysis

TL;DR

The incorrect vision_position_ids for Qwen2.5-VL video input come from the position computation in the transformers modeling code rather than from the reproduction script. PR #45400 (Fix Qwen2.5VL temporal grid positions) addresses it.

Guidance

  • Check whether your installed transformers build includes PR #45400; the reporter saw the bug on 5.3.0 and on the main branch at the time of filing.
  • Inspect the vision position ids the model computes for a video input: each frame should get its own temporal position instead of all frames sharing one.
  • When writing a regression check, use a time interval that does not round to 0; per the PR description, the existing integration test missed the bug for that reason.
  • Check video_kwargs and the video metadata for the frame rate you expect, since Qwen2.5-VL uses frame rate information to align positions with absolute time.
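To reason about the frame-rate side of this guidance, it helps to see how a frame rate could translate into a per-grid time interval. The sketch below assumes the interval is the temporal patch size divided by the sampled fps, with a temporal patch size of 2. Both are assumptions about the processor that this page does not confirm, and `seconds_per_temporal_grid` is a hypothetical helper.

```python
def seconds_per_temporal_grid(fps, temporal_patch_size=2):
    """Seconds of video covered by one temporal grid step (assumed formula)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return temporal_patch_size / fps


print(seconds_per_temporal_grid(1.0))             # 2.0, the fps used in the reproduction script
print(round(seconds_per_temporal_grid(24.0), 3))  # 0.083
```

Under this assumption, a 24 fps video gives roughly 0.083, which would be consistent with the interval the PR description mentions for the integration test, while the reproduction script's fps of 1.0 gives an interval large enough to expose wrong temporal positions.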

Example

According to the report, the reproduction script from the issue (a single local video sampled at fps 1.0) is enough to trigger the bug on affected versions.

Notes

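According to the PR description, most models showed no difference because each frame is separated by a timestamp and processed separately; only the first two Qwen releases process all frames in bulk. The PR nevertheless touches the modeling files of several related models (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3.5, GLM-4V, Ernie 4.5 VL, PaddleOCR-VL and others) and adds tests for glm4v, qwen2_vl, qwen2_5_vl and qwen3_vl.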

Recommendation

Apply the fix from PR #45400 rather than working around the problem in process_vision_info. The PR changes how the modeling code computes temporal grid positions, which is where the wrong vision_position_ids originate, so changes to the preprocessing script are unlikely to help.
