transformers - ✅(Solved) Fix Qwen2.5-VL get_rope_index scales still-image temporal position_ids by tokens_per_second in transformers 5.3.0 [1 pull requests, 1 comments, 2 participants]

transformers · 2026-04-08 18:20:48

GitHub stats

huggingface/transformers#45325 • Fetched 2026-04-09 07:50:44

Author: ayaan-fw
Participants: ayaan-fw, Kash6
Timeline (top): mentioned ×2, subscribed ×2, commented ×1, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: Fix Qwen2.5-VL temporal RoPE scaling applied to still images (https://github.com/huggingface/transformers/pull/45330)

PR fix notes

PR #45330: Fix Qwen2.5-VL temporal RoPE scaling applied to still images

Repository: huggingface/transformers
Author: Kash6
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45330

Description (problem / solution / changelog)

get_rope_index unconditionally applies tokens_per_second temporal scaling to both images and videos. For still images (modality_type == 1), this shifts the temporal position origin to start_position * tokens_per_second instead of start_position, creating a mismatch with height/width dimensions.

Only apply temporal scaling (tokens_per_second * second_per_grid_ts) for video inputs (modality_type == 2). Still images use time_interval=1, keeping the temporal origin aligned with height and width at start_position.

Qwen3-VL inherits this fix via super().get_rope_index().
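
The rule described above can be sketched in isolation. This is a minimal illustration, not the code from the PR: `select_time_interval` and the `IMAGE`/`VIDEO` constants are hypothetical names, and the only things taken from the PR text are the modality codes (1 for still images, 2 for videos) and the scaling expression.

```python
from typing import Optional

# Modality codes as described in the PR text: 1 = still image, 2 = video.
IMAGE, VIDEO = 1, 2


def select_time_interval(
    modality_type: int,
    tokens_per_second: float,
    second_per_grid_t: Optional[float] = None,
) -> float:
    """Hypothetical helper: pick the temporal step for one vision segment."""
    if modality_type == VIDEO:
        # The issue reports that a missing second_per_grid_ts defaults to 1.
        seconds = 1.0 if second_per_grid_t is None else second_per_grid_t
        return tokens_per_second * seconds
    # Still images have no temporal spacing, so the step stays at 1.
    return 1.0


print(select_time_interval(IMAGE, tokens_per_second=4))  # 1.0
print(select_time_interval(VIDEO, tokens_per_second=4))  # 4.0
print(select_time_interval(VIDEO, 4, 0.5))               # 2.0
```

Branching on the modality of each segment, rather than on which grid arguments were passed, keeps the behavior correct when one prompt mixes images and videos.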

What does this PR do?

Fixes #45325

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, we ask that new users do not submit pure code agent PRs at this time. You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

I confirm that this is not a pure code agent PR.

Who can review?

@zucchini-nlp @yonigozlan

Changed files

src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +6/-1)
src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py (modified, +6/-1)

RAW_BUFFER

System Info

transformers version: 5.3.0
Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.39
Python version: 3.12.3
Huggingface_hub version: 1.9.0
Safetensors version: 0.6.2
Accelerate version: 1.13.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.9.0a0+145a3a7bda.nv25.10 (CUDA)
Using distributed or parallel set-up in script?: No
Using GPU in script?: No
GPU type: NVIDIA H200

Who can help?

@zucchini-nlp @yonigozlan

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

import numpy as np  # needed for np.zeros below
from PIL import Image

from transformers import Qwen2_5_VLConfig, Qwen2_5_VLProcessor
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLModel

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"


class DummyQwen2_5VLModel:
    # Reuse the public HF implementation without loading full model weights.
    get_rope_index = Qwen2_5_VLModel.get_rope_index
    get_vision_position_ids = Qwen2_5_VLModel.get_vision_position_ids

    def __init__(self, config):
        self.config = config


processor = Qwen2_5_VLProcessor.from_pretrained(MODEL_ID)
model = DummyQwen2_5VLModel(Qwen2_5_VLConfig.from_pretrained(MODEL_ID))

image = Image.fromarray(np.zeros((448, 448, 3), dtype=np.uint8))

prompt = (
    "<|im_start|>system\nYou are helpful.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|> describe<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = processor(
    text=prompt,
    images=[image],
    videos=None,
    return_tensors="pt",
    return_mm_token_type_ids=True,
)

position_ids, rope_deltas = model.get_rope_index(
    input_ids=inputs["input_ids"],
    mm_token_type_ids=inputs["mm_token_type_ids"],
    image_grid_thw=inputs["image_grid_thw"],
)

if position_ids.dim() == 3:
    position_ids = position_ids.squeeze(1)

image_start = (inputs["mm_token_type_ids"][0] == 1).nonzero(as_tuple=False)[0].item()

temporal = position_ids[0, image_start : image_start + 8].tolist()
height = position_ids[1, image_start : image_start + 8].tolist()
width = position_ids[2, image_start : image_start + 8].tolist()

print("tokens_per_second =", model.config.vision_config.tokens_per_second)
print("image_start =", image_start)
print("temporal =", temporal)
print("height =", height)
print("width =", width)

assert height[0] == image_start
assert width[0] == image_start

# This is the behavior I think is wrong for still images:
assert temporal[0] == image_start, (
    f"Expected temporal[0] == image_start for a still image, "
    f"but got temporal[0]={temporal[0]}, image_start={image_start}, "
    f"tokens_per_second={model.config.vision_config.tokens_per_second}"
)

Observed behavior on my side with transformers==5.3.0:

tokens_per_second = 4
the first still-image temporal position starts at image_start * tokens_per_second
height and width still start at image_start

For example, if image_start = 13, the temporal IDs start at 52 while height/width start at 13.

I think the relevant code path is:

get_rope_index() uses the same branch for images and videos
when second_per_grid_ts is absent, it defaults to 1
time_interval = tokens_per_second * 1 is therefore applied to still images too
get_vision_position_ids() then multiplies the still-image temporal origin by that interval
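
The code path listed above can be mimicked with a toy model to make the arithmetic concrete. This is a sketch under stated assumptions, not the Hugging Face implementation: `toy_image_position_ids` is a hypothetical function, it handles a single frame only, and the row-major height/width layout is an assumption. The one behavior taken from the report is that the temporal origin gets multiplied by `time_interval`.

```python
def toy_image_position_ids(start_position, grid_h, grid_w, time_interval):
    """Toy single-frame model of the three M-RoPE axes (not the HF code)."""
    n_tokens = grid_h * grid_w
    # Reported behavior: the temporal origin is multiplied by time_interval.
    temporal = [start_position * time_interval] * n_tokens
    # Assumed layout: rows advance the height axis, columns the width axis.
    height = [start_position + r for r in range(grid_h) for _ in range(grid_w)]
    width = [start_position + c for _ in range(grid_h) for c in range(grid_w)]
    return temporal, height, width


tokens_per_second = 4
image_start = 13

# v5.3.0 path as reported: second_per_grid_ts defaults to 1 for images too.
t, h, w = toy_image_position_ids(image_start, 2, 2, tokens_per_second * 1)
print(t[0], h[0], w[0])  # 52 13 13

# Expected path for a still image: time_interval = 1.
t, h, w = toy_image_position_ids(image_start, 2, 2, 1)
print(t[0], h[0], w[0])  # 13 13 13
```

With `image_start = 13` and `tokens_per_second = 4`, the toy model reproduces the 52 versus 13 mismatch quoted in the observed behavior.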

Relevant source in v5.3.0:

get_rope_index: https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py#L1144-L1175
get_vision_position_ids: https://github.com/huggingface/transformers/blob/v5.3.0/src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py#L1021-L1073

Expected behavior

For still images, I would expect the temporal RoPE origin to stay anchored at the image start position, just like height and width.

In other words, for image inputs I would expect:

temporal IDs to start at start_position
height IDs to start at start_position
width IDs to start at start_position

and only videos should apply temporal scaling via second_per_grid_ts * tokens_per_second.

Right now in transformers==5.3.0, still images appear to inherit tokens_per_second scaling, which shifts the temporal origin even though a still image has no temporal spacing to encode.

extent analysis

TL;DR

The issue can be resolved in get_rope_index by applying temporal scaling (tokens_per_second * second_per_grid_ts) only to video inputs and using time_interval = 1 for still images. PR #45330 proposes exactly this change; at fetch time it was open and not yet merged.

Guidance

Review get_rope_index: when second_per_grid_ts is absent it defaults to 1, so time_interval = tokens_per_second * 1 is applied to still images as well as videos.
Gate the scaling on the modality: apply tokens_per_second * second_per_grid_ts only for videos (modality_type == 2) and pass time_interval = 1 for still images (modality_type == 1). With an interval of 1, get_vision_position_ids keeps the temporal origin at the image start position, so per the PR description that method should not need its own change.
Verify that the change does not introduce regressions in video processing by testing with video inputs.
Follow PR #45330 in the transformers repository rather than opening a duplicate pull request.

Example

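def get_rope_index(self, input_ids, mm_token_type_ids, image_grid_thw):
    # ... existing code ...
    # Sketch only: branch on the modality of the current vision segment,
    # not on whether image_grid_thw was passed (a batch can mix both).
    # Per the PR description: modality_type 1 = still image, 2 = video.
    if modality_type == 2:  # video: keep temporal scaling
        time_interval = tokens_per_second * second_per_grid_ts
    else:  # still image: no temporal spacing to encode
        time_interval = 1
    # ... existing code ...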

Notes

The report confirms the behavior only on transformers 5.3.0; other versions were not tested in the issue. Until PR #45330 is merged, applying the fix means patching the library source locally, which could have unintended consequences if not thoroughly tested, especially for video inputs.

Recommendation

Until a release containing the fix is available, apply a local workaround in get_rope_index as described above, so that still images use time_interval = 1 and only videos are scaled.
