transformers - ✅(Solved) Fix transformers==5.3.0, qwen2.5-vl video input vision_position_ids seems to be wrong [1 pull request, 4 comments, 1 participant]

transformers · 2026-04-11 22:43:30

GitHub stats

huggingface/transformers#45381 • Fetched 2026-04-12 13:23:59

Author

bicheng-xu

Participants

bicheng-xu

Timeline (top)

commented ×4, mentioned ×2, subscribed ×2, closed ×1

PR fix notes

PR #45400: Fix Qwen2.5VL temporal grid positions

Repository: huggingface/transformers
Author: zucchini-nlp
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/45400

Description (problem / solution / changelog)

What does this PR do?

Fixes https://github.com/huggingface/transformers/issues/45381. It is odd, though: I remember checking position IDs by value in Qwen2.5 as well, to verify that the time interval works 🤔

Update: I know why. The integration test we have uses second_per_grid_ts = 0.083, which rounds to 0.0, so the product is zero no matter what values we get for the vision positions. Great!

For most models we didn't see any difference because each frame is separated by a timestamp and processed separately. Only the first two Qwen releases process all frames in bulk.

In any case, it is worth adding a fast test with expected positions; will do so.
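
The rounding effect described in the update can be illustrated with a small sketch. It is not the library's code: `temporal_positions`, the `frame_positions` argument, and the integer truncation step are assumptions made for illustration, and only the 0.083 interval comes from the PR description. It shows why a test with that interval cannot tell correct frame positions from collapsed ones.

```python
def temporal_positions(frame_positions, second_per_grid_ts):
    """Scale per-frame vision positions by an interval truncated to an integer.

    Hypothetical helper: the truncation mirrors the "rounds to 0.0" remark
    in the PR description, not the actual transformers implementation.
    """
    interval = int(second_per_grid_ts)  # 0.083 -> 0, 1.0 -> 1
    return [pos * interval for pos in frame_positions]


correct_frames = [0, 1, 2, 3]    # one distinct position per frame
collapsed_frames = [0, 0, 0, 0]  # symptom from the issue: frames share a position

# With the test's interval, both inputs give the same result, so the bug is hidden.
masked_correct = temporal_positions(correct_frames, 0.083)
masked_collapsed = temporal_positions(collapsed_frames, 0.083)

# With an interval that survives truncation, the two cases differ.
visible_correct = temporal_positions(correct_frames, 1.0)
visible_collapsed = temporal_positions(collapsed_frames, 1.0)
```

A fast test with explicit expected positions, as the PR author proposes, avoids depending on an interval that happens to zero out the product.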

Changed files

src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py (modified, +11/-9)
src/transformers/models/glm46v/modeling_glm46v.py (modified, +20/-42)
src/transformers/models/glm4v/modeling_glm4v.py (modified, +20/-42)
src/transformers/models/glm4v/modular_glm4v.py (modified, +9/-90)
src/transformers/models/glm4v_moe/modeling_glm4v_moe.py (modified, +20/-42)
src/transformers/models/glm_image/modeling_glm_image.py (modified, +11/-9)
src/transformers/models/glm_ocr/modeling_glm_ocr.py (modified, +20/-42)
src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py (modified, +11/-9)
src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +11/-9)
src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (modified, +11/-9)
src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +11/-9)
src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +11/-9)
src/transformers/models/qwen3_vl/modeling_qwen3_vl.py (modified, +11/-9)
src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py (modified, +11/-9)
tests/models/glm4v/test_modeling_glm4v.py (modified, +46/-0)
tests/models/qwen2_5_vl/test_modeling_qwen2_5_vl.py (modified, +66/-36)
tests/models/qwen2_vl/test_modeling_qwen2_vl.py (modified, +52/-0)
tests/models/qwen3_vl/test_modeling_qwen3_vl.py (modified, +55/-0)

System Info

transformers == 5.3.0 (the bug seems to affect any transformers >= 5.3.0, including the current main branch)
qwen_vl_utils == 0.0.14
Python 3.12.4
CUDA 12.6

Who can help?

@zucchini-nlp

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

I am using this script:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
# In Qwen2.5-VL, frame rate information is also passed to the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True, return_video_metadata=True
)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=[[video_inputs[0][0]]],
    video_metadata=[[video_inputs[0][1]]],
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
# Move inputs to the device chosen by device_map="auto" instead of hard-coding "cuda"
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Expected behavior

The vision_position_ids output at https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py#L1177 is wrong. All video frames share the same position_temporal value, and position_height is also wrong.
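
For orientation, here is a minimal sketch of what per-frame position IDs could look like for a video grid, assuming a (temporal, height, width) layout in which each frame gets its own temporal index and rows and columns are enumerated within the frame. `grid_position_ids` is a hypothetical helper, not the function at the linked line, and it ignores text offsets and time-interval scaling.

```python
def grid_position_ids(num_frames, height, width):
    """Return (temporal, height, width) index lists for a T x H x W token grid.

    Illustrative only: assumes row-major ordering of tokens within each frame.
    """
    temporal, rows, cols = [], [], []
    for t in range(num_frames):
        for h in range(height):
            for w in range(width):
                temporal.append(t)  # same value for every token of one frame
                rows.append(h)      # restarts at 0 for each frame
                cols.append(w)      # restarts at 0 for each row
    return temporal, rows, cols


temporal, rows, cols = grid_position_ids(num_frames=2, height=2, width=2)
```

Under this assumed layout, the symptom reported here would show up as a temporal list that holds a single value across all frames.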

extent analysis

TL;DR

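The incorrect vision_position_ids for Qwen2.5-VL video input were fixed by the merged PR #45400 ("Fix Qwen2.5VL temporal grid positions"). That PR changes only modeling and test files in transformers, so the fix lies in the model's position computation rather than in the process_vision_info function from qwen_vl_utils.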

Guidance

Check whether your installed transformers build includes PR #45400; the reporter saw the bug on 5.3.0 and on the main branch at the time of filing.
Inspect the vision position IDs directly for a multi-frame video: the reported symptom is that all frames share the same position_temporal value and that position_height is wrong.
Do not rely on the older integration test as evidence of correctness: per the PR description, it used a second_per_grid_ts of 0.083, which rounds to 0.0, so the product was zero whatever the vision positions were.
Expect the impact to be limited mainly to models that process all frames in bulk. Per the PR author, most models showed no difference because each frame is separated by a timestamp and processed separately; only the first two Qwen releases process frames in bulk.
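
One way to check for the reported symptom is to group the temporal position IDs by frame and see whether different frames ever receive different values. The sketch below works on a plain Python list; `temporal_ids` and `tokens_per_frame` are assumed inputs that you would extract from the model's position IDs yourself, and the helper name is hypothetical.

```python
def frames_share_temporal_position(temporal_ids, tokens_per_frame):
    """Return True if there are several frames and all carry the same temporal value."""
    if tokens_per_frame <= 0 or len(temporal_ids) % tokens_per_frame != 0:
        raise ValueError("temporal_ids must split evenly into frames")
    # Take the first token of each frame as that frame's temporal position.
    per_frame = temporal_ids[::tokens_per_frame]
    return len(per_frame) > 1 and len(set(per_frame)) == 1


# Two frames of four tokens each.
healthy = frames_share_temporal_position([0, 0, 0, 0, 1, 1, 1, 1], tokens_per_frame=4)
collapsed = frames_share_temporal_position([0, 0, 0, 0, 0, 0, 0, 0], tokens_per_frame=4)
```

A True result is only a hint: as the PR description notes, a very small time interval can itself round the temporal product to zero for every frame.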

Example

The only code in the issue is the reproduction script above. This page lists the files changed by PR #45400 but does not reproduce the diff itself.

Notes

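The issue was reported against Qwen2.5-VL, but the fix is broader: PR #45400 modifies 14 source files across 13 model directories (including qwen2_vl, qwen2_5_vl, qwen3_vl, qwen3_5, glm4v, ernie4_5_vl_moe, and paddleocr_vl) and 4 test files. The root cause went unnoticed because the existing integration test used a time interval that rounds to zero.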

Recommendation

Upgrade rather than work around: use a transformers version that includes the merged PR #45400 (this page does not state which release that is). Modifying process_vision_info is not indicated, because the merged fix changes the models' position computation and adds tests with expected positions.
