transformers - ✅(Solved) Fix [`bug`] v5.3.0 video input regression for `qwen2_5_vl`, `qwen3_vl`, `qwen3_5`, and `qwen3_5_moe` [1 pull requests, 3 comments, 2 participants]

tomaarsen · 2026-03-05T18:47:53Z



GitHub stats

Issue: huggingface/transformers#44479 (fetched 2026-04-08 00:28:14)
Author: tomaarsen
Participants: tomaarsen, zucchini-nlp
Timeline: commented ×3, closed ×1, cross-referenced ×1, labeled ×1

Error Message

Traceback (most recent call last):
  File "[sic]/sentence-transformers/demo_qwen_transformers.py", line 38, in <module>
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "[sic]/sentence-transformers/Lib/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/sentence-transformers/Lib/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/utils/generic.py", line 843, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1348, in forward
    position_ids = self.compute_3d_position_ids(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1242, in compute_3d_position_ids
    position_ids, rope_deltas = self.get_rope_index(
                                ^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1134, in get_rope_index
    position_ids[:, batch_idx, attention_mask[batch_idx].bool()] = llm_positions.to(position_ids.device)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [3, 25] cannot be broadcast to indexing result of shape [3, 23]

Fix Action

Fixed

Fixed by PR: :rotating_light: Unify 3D position ids (https://github.com/huggingface/transformers/pull/43972)

PR fix notes

PR #43972: :rotating_light: Unify 3D position ids

Repository: huggingface/transformers
Author: zucchini-nlp
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/43972

Description (problem / solution / changelog)

What does this PR do?

Following Ernie, we build 3D positions based on `mm_token_type_ids`, and the models will return them by default from `processor`.

We have a unified `get_vision_position` in the qwen2-vl model file; all other models copy it from there. The utility builds vision ids, as the name suggests, and the models are free to manipulate the result as they wish. In most cases, the only thing that changes is the presence of new modalities or kwargs.
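To make the idea concrete, here is a minimal pure-Python sketch of building 3D (temporal, height, width) position ids from per-token modality types. It is not the library's `get_vision_position`: the function name `build_3d_position_ids`, the 0/1 type encoding, and the per-run `(t, h, w)` grids are all assumptions made for illustration. It follows the general multimodal RoPE layout in which text tokens share one index across all three axes and vision tokens are indexed by their grid coordinates.

```python
from itertools import groupby


def build_3d_position_ids(token_types, grids):
    """Return [temporal, height, width] position lists for one sequence.

    token_types: list of ints, 0 for text and 1 for vision (assumed encoding).
    grids: one (t, h, w) tuple per contiguous vision run, in order (assumed input).
    """
    temporal, height, width = [], [], []
    next_pos = 0
    grid_iter = iter(grids)
    for kind, run in groupby(token_types):
        length = sum(1 for _ in run)
        if kind == 0:
            # Text tokens advance all three axes together.
            span = range(next_pos, next_pos + length)
            temporal.extend(span)
            height.extend(span)
            width.extend(span)
            next_pos += length
        else:
            t, h, w = next(grid_iter)
            if t * h * w != length:
                raise ValueError(
                    f"vision run has {length} tokens but grid {t}x{h}x{w} implies {t * h * w}"
                )
            # Vision tokens are indexed by their grid coordinates, offset by the run start.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        temporal.append(next_pos + ti)
                        height.append(next_pos + hi)
                        width.append(next_pos + wi)
            # Following text resumes after the largest index used by this run.
            next_pos += max(t, h, w)
    return [temporal, height, width]


# Two text tokens, a 1x2x2 vision run, then one text token.
example = build_3d_position_ids([0, 0, 1, 1, 1, 1, 0], [(1, 2, 2)])
```

Each returned row has exactly one entry per input token, which is the property the failing assignment in this issue depends on. Two adjacent vision runs would be merged by `groupby` in this sketch, so a real implementation needs explicit segment boundaries.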

Changed files

src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py (modified, +117/-77)
src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py (modified, +54/-114)
src/transformers/models/glm46v/modeling_glm46v.py (modified, +140/-104)
src/transformers/models/glm46v/processing_glm46v.py (modified, +19/-2)
src/transformers/models/glm4v/modeling_glm4v.py (modified, +140/-104)
src/transformers/models/glm4v/modular_glm4v.py (modified, +134/-151)
src/transformers/models/glm4v/processing_glm4v.py (modified, +19/-2)
src/transformers/models/glm4v_moe/modeling_glm4v_moe.py (modified, +140/-104)
src/transformers/models/glm4v_moe/modular_glm4v_moe.py (modified, +2/-0)
src/transformers/models/glm_image/modeling_glm_image.py (modified, +59/-10)
src/transformers/models/glm_image/modular_glm_image.py (modified, +3/-10)
src/transformers/models/glm_ocr/modeling_glm_ocr.py (modified, +140/-104)
src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py (modified, +140/-95)
src/transformers/models/paddleocr_vl/modular_paddleocr_vl.py (modified, +16/-2)
src/transformers/models/paddleocr_vl/processing_paddleocr_vl.py (modified, +15/-2)
src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +138/-108)
src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py (modified, +81/-108)
src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py (modified, +3/-2)
src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (modified, +140/-95)
src/transformers/models/qwen2_vl/processing_qwen2_vl.py (modified, +8/-1)
src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +155/-71)
src/transformers/models/qwen3_5/modular_qwen3_5.py (modified, +2/-0)
src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +155/-71)
src/transformers/models/qwen3_vl/modeling_qwen3_vl.py (modified, +155/-71)
src/transformers/models/qwen3_vl/modular_qwen3_vl.py (modified, +6/-126)
src/transformers/models/qwen3_vl/processing_qwen3_vl.py (modified, +8/-1)
src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py (modified, +155/-71)
src/transformers/models/qwen3_vl_moe/modular_qwen3_vl_moe.py (modified, +2/-0)
src/transformers/models/video_llama_3/modular_video_llama_3.py (modified, +7/-0)
src/transformers/models/video_llama_3/processing_video_llama_3.py (modified, +1/-0)
src/transformers/utils/auto_docstring.py (modified, +9/-0)
tests/models/glm46v/test_modeling_glm46v.py (modified, +4/-0)
tests/models/glm46v/test_processor_glm46v.py (modified, +1/-1)
tests/models/glm4v/test_modeling_glm4v.py (modified, +4/-0)
tests/models/glm4v/test_processor_glm4v.py (modified, +1/-1)
tests/models/glm4v_moe/test_modeling_glm4v_moe.py (modified, +4/-0)
tests/models/glm_ocr/test_modeling_glm_ocr.py (modified, +4/-0)
tests/models/qwen2_5_vl/test_modeling_qwen2_5_vl.py (modified, +9/-0)
tests/models/qwen2_5_vl/test_processing_qwen2_5_vl.py (modified, +1/-1)
tests/models/qwen2_vl/test_modeling_qwen2_vl.py (modified, +7/-0)
tests/models/qwen2_vl/test_processing_qwen2_vl.py (modified, +1/-1)
tests/models/qwen3_5/test_modeling_qwen3_5.py (modified, +5/-0)
tests/models/qwen3_vl/test_modeling_qwen3_vl.py (modified, +7/-0)
tests/models/qwen3_vl/test_processing_qwen3_vl.py (modified, +1/-1)
tests/models/qwen3_vl_moe/test_modeling_qwen3_vl_moe.py (modified, +7/-0)



System Info

transformers version: 5.3.0.dev0
Platform: Windows-10-10.0.26200-SP0
Python version: 3.11.6
Huggingface_hub version: 1.5.0
Safetensors version: 0.6.2
Accelerate version: 1.13.0.dev0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.10.0+cu128 (CUDA)
Using distributed or parallel set-up in script?: No
Using GPU in script?: No (issue persists with GPU and CPU)
GPU type: NVIDIA GeForce RTX 3090

Who can help?

@zucchini-nlp

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Here is a simple, tiny reproducer using a qwen3_vl tiny-random model:

from transformers import AutoModel, AutoProcessor

model_name = "tiny-random/qwen3-vl"
model = AutoModel.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "https://huggingface.co/datasets/tomaarsen/tiny-test/resolve/main/tiny_test_video_0.mp4",
                }
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "https://huggingface.co/datasets/tomaarsen/tiny-test/resolve/main/tiny_test_video_1.mp4",
                }
            ],
        }
    ],
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs)

Traceback (most recent call last):
  File "[sic]/sentence-transformers/demo_qwen_transformers.py", line 38, in <module>
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "[sic]/sentence-transformers/Lib/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/sentence-transformers/Lib/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/utils/generic.py", line 843, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1348, in forward
    position_ids = self.compute_3d_position_ids(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1242, in compute_3d_position_ids
    position_ids, rope_deltas = self.get_rope_index(
                                ^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1134, in get_rope_index
    position_ids[:, batch_idx, attention_mask[batch_idx].bool()] = llm_positions.to(position_ids.device)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [3, 25] cannot be broadcast to indexing result of shape [3, 23]

These are the relevant lines where it goes wrong (for qwen3_vl): https://github.com/huggingface/transformers/blob/e498b5bd273e638990cfc82d8c2177a9c5b67858/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py#L1121-L1122

The `attention_mask` and `position_ids` are correctly shaped (e.g. torch.Size([2, 23]) and torch.Size([3, 2, 23])), but `llm_positions` is consistently 2 tokens too long (25 instead of 23 in the traceback above).
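The failing line assigns `llm_positions` into the slots selected by `attention_mask[batch_idx].bool()`, so the number of computed positions must equal the number of unmasked tokens in that row. A minimal sketch of that check on plain Python lists follows; `find_length_mismatches` and its inputs are hypothetical names for illustration and are not part of `transformers`.

```python
def find_length_mismatches(attention_mask, positions_per_sample):
    """Compare unmasked-token counts with computed position counts.

    attention_mask: list of rows of 0/1 ints, one row per batch sample.
    positions_per_sample: list of position lists, one per batch sample
        (a stand-in for one axis of `llm_positions`).
    Returns a list of (batch_idx, expected, got) for every mismatching sample.
    """
    mismatches = []
    for batch_idx, (mask_row, positions) in enumerate(
        zip(attention_mask, positions_per_sample)
    ):
        expected = sum(mask_row)  # slots selected by the boolean mask
        got = len(positions)      # values the model tries to write
        if expected != got:
            mismatches.append((batch_idx, expected, got))
    return mismatches


# Sizes from the report: 23 unmasked tokens, 25 computed positions.
mask = [[1] * 23]
positions = [list(range(25))]
report = find_length_mismatches(mask, positions)
print(report)
```

A non-empty result points at the sample whose position count disagrees with its mask, which narrows the search to how the vision tokens for that sample were counted.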

Expected behavior

I would expect this to work as it did in v5.2.0. This is currently preventing inference on the VL-capable Qwen model with video inputs.

Tom Aarsen

extent analysis

Fix Plan

Update `transformers` to the latest version

The failing code lives in the `transformers` library itself (`modeling_qwen3_vl.py` is a built-in model file, not custom code shipped with the checkpoint). This page lists PR #43972 as the fix, so upgrading to a release that includes that PR should resolve the issue.

pip install --upgrade transformers
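Before and after upgrading, it can help to confirm which build is actually installed. The sketch below uses only the standard library; `installed_version` and `is_reported_build` are hypothetical helper names, and the only version string it knows about is `5.3.0.dev0`, the build named in the System Info section. The source does not state which release first contains the fix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_reported_build(version_string):
    """True if the version matches the dev build named in the issue report."""
    return version_string == "5.3.0.dev0"


current = installed_version("transformers")
if current is None:
    print("transformers is not installed")
elif is_reported_build(current):
    print(f"transformers {current}: this is the build from the report")
else:
    print(f"transformers {current}")
```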

Update `qwen3_vl` model to the latest version

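If the issue persists after updating `transformers`, reload the checkpoint and processor from the latest Hub revision. Note that `revision="main"` is already the default for `from_pretrained`, and because the error is raised in library code, this step is unlikely to help on its own.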

from transformers import AutoModel, AutoProcessor

model_name = "tiny-random/qwen3-vl"
model = AutoModel.from_pretrained(model_name, revision="main")
processor = AutoProcessor.from_pretrained(model_name, revision="main")

Check the `attention_mask` and `position_ids` shapes

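Confirm that the processor output is shaped consistently. Note that `position_ids` is computed inside the model's forward pass (see `compute_3d_position_ids` in the traceback), so it is not a key of `inputs`.

print(inputs["attention_mask"].shape)  # torch.Size([2, 23]) in the report above
print(inputs["input_ids"].shape)       # should match the attention_mask shape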

Update the `get_rope_index` method

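Patching `get_rope_index` in `modeling_qwen3_vl.py` locally is a last resort and is not the upstream fix. The snippet below is illustrative pseudocode only: it does not use the real `get_rope_index` signature, and because the `RuntimeError` is raised inside `get_rope_index`, an override that first calls `super()` would hit the same error before its own assignment runs.

# Illustrative pseudocode, not a working patch (see the caveats above).
def get_rope_index(self, llm_positions):
    position_ids, rope_deltas = super().get_rope_index(llm_positions)
    # This assignment would also fail: a [3, 25] value cannot broadcast into a [3, 2, 23] slice.
    position_ids[:, :, :llm_positions.shape[1]] = llm_positions.to(position_ids.device)
    return position_ids, rope_deltas

With `llm_positions` of shape [3, 25] and `position_ids` of shape [3, 2, 23], as reported above, the snippet cannot work as written. Upgrading to a release that includes PR #43972 is the fix this page lists.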
