transformers - ✅(Solved) Fix Qwen2.5-VL get_rope_index scales still-image temporal position_ids by tokens_per_second in transformers 5.3.0 [1 pull requests, 1 comments, 2 participants]

GitHub stats
huggingface/transformers#45325 · Fetched 2026-04-09 07:50:44
Comments: 1 · Participants: 2 · Timeline events: 8 · Reactions: 0
Timeline (top): mentioned ×2, subscribed ×2, commented ×1, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #45330: Fix Qwen2.5-VL temporal RoPE scaling applied to still images

Description (problem / solution / changelog)

get_rope_index unconditionally applies tokens_per_second temporal scaling to both images and videos. For still images (modality_type == 1), this shifts the temporal position origin to start_position * tokens_per_second instead of start_position, creating a mismatch with height/width dimensions.

Only apply temporal scaling (tokens_per_second * second_per_grid_ts) for video inputs (modality_type == 2). Still images use time_interval=1, keeping the temporal origin aligned with height and width at start_position.

Qwen3-VL inherits this fix via super().get_rope_index().
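
As a rough illustration of the rule in this description, the sketch below picks the temporal interval by modality. The helper name time_interval_for and its second_per_grid_t parameter are hypothetical; only the modality_type values (1 for image, 2 for video) and tokens_per_second come from the PR description. It is not the library's code.

```python
IMAGE, VIDEO = 1, 2  # modality_type values as named in the PR description


def time_interval_for(modality_type, tokens_per_second, second_per_grid_t=1.0):
    """Hypothetical helper: temporal step used for one vision block."""
    if modality_type == VIDEO:
        # Videos keep the temporal scaling.
        return tokens_per_second * second_per_grid_t
    # Still images have no temporal spacing to encode.
    return 1


print(time_interval_for(IMAGE, tokens_per_second=4))                         # 1
print(time_interval_for(VIDEO, tokens_per_second=4, second_per_grid_t=0.5))  # 2.0
```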

What does this PR do?

Fixes #45325

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by code agents. We are currently bottlenecked by our ability to review and respond to them. As a result, we ask that new users do not submit pure code agent PRs at this time. You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zucchini-nlp @yonigozlan

Changed files

  • src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +6/-1)
  • src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py (modified, +6/-1)

System Info

  • transformers version: 5.3.0
  • Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 1.9.0
  • Safetensors version: 0.6.2
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.0a0+145a3a7bda.nv25.10 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: No
  • GPU type: NVIDIA H200

Who can help?

@zucchini-nlp @yonigozlan

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import numpy as np
from PIL import Image

from transformers import Qwen2_5_VLConfig, Qwen2_5_VLProcessor
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLModel

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"


class DummyQwen2_5VLModel:
    # Reuse the public HF implementation without loading full model weights.
    get_rope_index = Qwen2_5_VLModel.get_rope_index
    get_vision_position_ids = Qwen2_5_VLModel.get_vision_position_ids

    def __init__(self, config):
        self.config = config


processor = Qwen2_5_VLProcessor.from_pretrained(MODEL_ID)
model = DummyQwen2_5VLModel(Qwen2_5_VLConfig.from_pretrained(MODEL_ID))

image = Image.fromarray(np.zeros((448, 448, 3), dtype=np.uint8))

prompt = (
    "<|im_start|>system\nYou are helpful.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|> describe<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = processor(
    text=prompt,
    images=[image],
    videos=None,
    return_tensors="pt",
    return_mm_token_type_ids=True,
)

position_ids, rope_deltas = model.get_rope_index(
    input_ids=inputs["input_ids"],
    mm_token_type_ids=inputs["mm_token_type_ids"],
    image_grid_thw=inputs["image_grid_thw"],
)

if position_ids.dim() == 3:
    position_ids = position_ids.squeeze(1)

image_start = (inputs["mm_token_type_ids"][0] == 1).nonzero(as_tuple=False)[0].item()

temporal = position_ids[0, image_start : image_start + 8].tolist()
height = position_ids[1, image_start : image_start + 8].tolist()
width = position_ids[2, image_start : image_start + 8].tolist()

print("tokens_per_second =", model.config.vision_config.tokens_per_second)
print("image_start =", image_start)
print("temporal =", temporal)
print("height =", height)
print("width =", width)

assert height[0] == image_start
assert width[0] == image_start

# This is the behavior I think is wrong for still images:
assert temporal[0] == image_start, (
    f"Expected temporal[0] == image_start for a still image, "
    f"but got temporal[0]={temporal[0]}, image_start={image_start}, "
    f"tokens_per_second={model.config.vision_config.tokens_per_second}"
)

Observed behavior on my side with transformers==5.3.0:

  • tokens_per_second = 4
  • the first still-image temporal position starts at image_start * tokens_per_second
  • height and width still start at image_start

For example, if image_start = 13, the temporal IDs start at 52 while height/width start at 13.

I think the relevant code path is:

  • get_rope_index() uses the same branch for images and videos
  • when second_per_grid_ts is absent, it defaults to 1
  • time_interval = tokens_per_second * 1 is therefore applied to still images too
  • get_vision_position_ids() then multiplies the still-image temporal origin by that interval
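
The four steps above can be traced with a small stand-alone sketch. It models the reported behavior only; the function name trace_still_image_origin is hypothetical and none of this is the library's actual code.

```python
def trace_still_image_origin(start_position, tokens_per_second, second_per_grid_ts=None):
    """Model the code path described in the report for a single still image."""
    # An absent second_per_grid_ts falls back to 1.
    second_per_grid_t = 1 if second_per_grid_ts is None else second_per_grid_ts
    # The shared image/video branch computes the interval for still images too.
    time_interval = tokens_per_second * second_per_grid_t
    # The temporal origin is multiplied by the interval (as reported),
    # while height and width stay at start_position.
    return {
        "temporal": start_position * time_interval,
        "height": start_position,
        "width": start_position,
    }


print(trace_still_image_origin(13, 4))
# {'temporal': 52, 'height': 13, 'width': 13}
```

With image_start = 13 and tokens_per_second = 4, this reproduces the 52 versus 13 mismatch from the report.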

Relevant source in v5.3.0:

Expected behavior

For still images, I would expect the temporal RoPE origin to stay anchored at the image start position, just like height and width.

In other words, for image inputs I would expect:

  • temporal IDs to start at start_position
  • height IDs to start at start_position
  • width IDs to start at start_position

and only videos should apply temporal scaling via second_per_grid_ts * tokens_per_second.

Right now in transformers==5.3.0, still images appear to inherit tokens_per_second scaling, which shifts the temporal origin even though a still image has no temporal spacing to encode.
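
A minimal check for this expectation might look like the sketch below. The helper origins_aligned is hypothetical, and the position lists are illustrative values built from the image_start = 13 example rather than real model output; only the first element of each list matters to the check.

```python
def origins_aligned(temporal, height, width, start_position):
    """True when all three RoPE axes start at the same position."""
    return temporal[0] == height[0] == width[0] == start_position


# Illustrative values: temporal origin scaled by tokens_per_second = 4.
print(origins_aligned([52, 52], [13, 13], [13, 14], start_position=13))  # False
# Illustrative values: temporal origin anchored at the image start.
print(origins_aligned([13, 13], [13, 13], [13, 14], start_position=13))  # True
```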

extent analysis

TL;DR

The issue is resolved by making get_rope_index apply temporal scaling (tokens_per_second * second_per_grid_ts) only to video inputs, so that still images use a time interval of 1 and their temporal origin stays at start_position. PR #45330 implements this change.

Guidance

  • In get_rope_index, distinguish still images (modality_type == 1) from videos (modality_type == 2) instead of sharing one branch. The default of 1 for an absent second_per_grid_ts does not prevent scaling, because tokens_per_second is still applied.
  • For still images, use time_interval = 1 so that the temporal RoPE origin computed by get_vision_position_ids stays at the image start position.
  • Test with video inputs to verify that the change introduces no regressions in video processing.
  • PR #45330 already proposes this change upstream, in modeling_qwen2_5_vl.py and modular_qwen2_5_vl.py.

Example

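def get_rope_index(self, input_ids, mm_token_type_ids, image_grid_thw):
    # Sketch only: names follow the PR description, not the exact source.
    # ... existing code ...
    if modality_type == 2:  # video: apply temporal scaling
        time_interval = tokens_per_second * second_per_grid_ts
    else:  # still image (modality_type == 1): no temporal spacing
        time_interval = 1
    # ... existing code ...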

Notes

The issue was reported against transformers 5.3.0; the report does not say whether other versions are affected. Until a release containing the fix is installed, resolving it means modifying the library's source code, which should be tested thoroughly, especially with video inputs.
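
To see whether an environment runs the version the report was filed against, a check along these lines may help. It only compares version strings; it does not determine whether a given version is actually affected. The names REPORTED_VERSION and matches_reported_version are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

REPORTED_VERSION = "5.3.0"  # version named in the issue report


def matches_reported_version(installed):
    """True if the installed version string is the one from the report."""
    # Drop any local build suffix such as "+cu121" before comparing.
    return installed is not None and installed.split("+")[0] == REPORTED_VERSION


try:
    installed = version("transformers")
except PackageNotFoundError:
    installed = None  # transformers is not installed in this environment

print("installed:", installed, "| matches report:", matches_reported_version(installed))
```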

Recommendation

Upgrade to a transformers release that includes PR #45330 once one is available. Until then, apply the change to get_rope_index described above as a local workaround.
