transformers - ✅(Solved) Fix [`bug`] v5.3.0 video input regression for `qwen2_5_vl`, `qwen3_vl`, `qwen3_5`, and `qwen3_5_moe` [1 pull requests, 3 comments, 2 participants]

GitHub stats
huggingface/transformers#44479 (fetched 2026-04-08 00:28:14)
Comments: 3 · Participants: 2 · Timeline events: 8 · Reactions: 0
Timeline (top): commented ×3, closed ×1, cross-referenced ×1, labeled ×1

Error Message

Traceback (most recent call last):
  File "[sic]/sentence-transformers/demo_qwen_transformers.py", line 38, in <module>
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "[sic]/sentence-transformers/Lib/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/sentence-transformers/Lib/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/utils/generic.py", line 843, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1348, in forward
    position_ids = self.compute_3d_position_ids(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1242, in compute_3d_position_ids
    position_ids, rope_deltas = self.get_rope_index(
                                ^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1134, in get_rope_index
    position_ids[:, batch_idx, attention_mask[batch_idx].bool()] = llm_positions.to(position_ids.device)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [3, 25] cannot be broadcast to indexing result of shape [3, 23]

Fix Action

Fixed

PR fix notes

PR #43972: 🚨 Unify 3D position ids

Description (problem / solution / changelog)

What does this PR do?

Following Ernie, we build 3D positions based on mm_token_type_ids, which the processor returns by default.

We have a unified get_vision_position in the qwen2-vl model file; all other models just copy it from there. The utility builds vision position ids, as the name suggests, and the models are free to manipulate them further as they wish. In most cases, the only thing that changes is the presence of new modalities or kwargs.
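To make the idea of deriving 3D positions from token types more concrete, here is a simplified pure-Python sketch. It is not the code from the PR: the function name, the TEXT/VISION codes, and the rule for advancing the counter after a vision block are assumptions modelled on multimodal rotary position schemes, and it assumes vision runs are separated by at least one text token.

```python
from itertools import groupby

TEXT, VISION = 0, 1  # hypothetical token-type codes for this sketch


def build_3d_positions(token_types, grids):
    """Return (temporal, height, width) position lists for one sequence.

    token_types: one TEXT/VISION code per token.
    grids: one (t, h, w) tuple per vision run, in order of appearance.
    """
    grid_iter = iter(grids)
    temporal, height, width = [], [], []
    start = 0  # first free position id
    for kind, run in groupby(token_types):
        length = len(list(run))
        if kind == TEXT:
            # Text tokens advance all three axes together.
            ids = list(range(start, start + length))
            temporal += ids
            height += ids
            width += ids
            start += length
        else:
            t, h, w = next(grid_iter)
            if t * h * w != length:
                raise ValueError(f"grid {t}x{h}x{w} does not cover {length} vision tokens")
            # Vision tokens enumerate the grid, offset by the current start.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        temporal.append(start + ti)
                        height.append(start + hi)
                        width.append(start + wi)
            # Assumed rule: the next text token starts after the largest grid extent.
            start += max(t, h, w)
    return temporal, height, width


positions = build_3d_positions(
    [TEXT, TEXT, VISION, VISION, VISION, VISION, TEXT], [(1, 2, 2)]
)
```

Each of the three returned lists has one entry per token, which mirrors the leading dimension of 3 in the position_ids shapes quoted in the report. The explicit length check is the kind of invariant whose violation surfaces as the shape mismatch in the traceback.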

Changed files

  • src/transformers/models/ernie4_5_vl_moe/modeling_ernie4_5_vl_moe.py (modified, +117/-77)
  • src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py (modified, +54/-114)
  • src/transformers/models/glm46v/modeling_glm46v.py (modified, +140/-104)
  • src/transformers/models/glm46v/processing_glm46v.py (modified, +19/-2)
  • src/transformers/models/glm4v/modeling_glm4v.py (modified, +140/-104)
  • src/transformers/models/glm4v/modular_glm4v.py (modified, +134/-151)
  • src/transformers/models/glm4v/processing_glm4v.py (modified, +19/-2)
  • src/transformers/models/glm4v_moe/modeling_glm4v_moe.py (modified, +140/-104)
  • src/transformers/models/glm4v_moe/modular_glm4v_moe.py (modified, +2/-0)
  • src/transformers/models/glm_image/modeling_glm_image.py (modified, +59/-10)
  • src/transformers/models/glm_image/modular_glm_image.py (modified, +3/-10)
  • src/transformers/models/glm_ocr/modeling_glm_ocr.py (modified, +140/-104)
  • src/transformers/models/paddleocr_vl/modeling_paddleocr_vl.py (modified, +140/-95)
  • src/transformers/models/paddleocr_vl/modular_paddleocr_vl.py (modified, +16/-2)
  • src/transformers/models/paddleocr_vl/processing_paddleocr_vl.py (modified, +15/-2)
  • src/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py (modified, +138/-108)
  • src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py (modified, +81/-108)
  • src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py (modified, +3/-2)
  • src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (modified, +140/-95)
  • src/transformers/models/qwen2_vl/processing_qwen2_vl.py (modified, +8/-1)
  • src/transformers/models/qwen3_5/modeling_qwen3_5.py (modified, +155/-71)
  • src/transformers/models/qwen3_5/modular_qwen3_5.py (modified, +2/-0)
  • src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py (modified, +155/-71)
  • src/transformers/models/qwen3_vl/modeling_qwen3_vl.py (modified, +155/-71)
  • src/transformers/models/qwen3_vl/modular_qwen3_vl.py (modified, +6/-126)
  • src/transformers/models/qwen3_vl/processing_qwen3_vl.py (modified, +8/-1)
  • src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py (modified, +155/-71)
  • src/transformers/models/qwen3_vl_moe/modular_qwen3_vl_moe.py (modified, +2/-0)
  • src/transformers/models/video_llama_3/modular_video_llama_3.py (modified, +7/-0)
  • src/transformers/models/video_llama_3/processing_video_llama_3.py (modified, +1/-0)
  • src/transformers/utils/auto_docstring.py (modified, +9/-0)
  • tests/models/glm46v/test_modeling_glm46v.py (modified, +4/-0)
  • tests/models/glm46v/test_processor_glm46v.py (modified, +1/-1)
  • tests/models/glm4v/test_modeling_glm4v.py (modified, +4/-0)
  • tests/models/glm4v/test_processor_glm4v.py (modified, +1/-1)
  • tests/models/glm4v_moe/test_modeling_glm4v_moe.py (modified, +4/-0)
  • tests/models/glm_ocr/test_modeling_glm_ocr.py (modified, +4/-0)
  • tests/models/qwen2_5_vl/test_modeling_qwen2_5_vl.py (modified, +9/-0)
  • tests/models/qwen2_5_vl/test_processing_qwen2_5_vl.py (modified, +1/-1)
  • tests/models/qwen2_vl/test_modeling_qwen2_vl.py (modified, +7/-0)
  • tests/models/qwen2_vl/test_processing_qwen2_vl.py (modified, +1/-1)
  • tests/models/qwen3_5/test_modeling_qwen3_5.py (modified, +5/-0)
  • tests/models/qwen3_vl/test_modeling_qwen3_vl.py (modified, +7/-0)
  • tests/models/qwen3_vl/test_processing_qwen3_vl.py (modified, +1/-1)
  • tests/models/qwen3_vl_moe/test_modeling_qwen3_vl_moe.py (modified, +7/-0)


System Info

  • transformers version: 5.3.0.dev0
  • Platform: Windows-10-10.0.26200-SP0
  • Python version: 3.11.6
  • Huggingface_hub version: 1.5.0
  • Safetensors version: 0.6.2
  • Accelerate version: 1.13.0.dev0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cu128 (CUDA)
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: No (issue persists with GPU and CPU)
  • GPU type: NVIDIA GeForce RTX 3090

Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here is a simple, tiny reproducer using a qwen3_vl tiny-random model:

from transformers import AutoModel, AutoProcessor

model_name = "tiny-random/qwen3-vl"
model = AutoModel.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "https://huggingface.co/datasets/tomaarsen/tiny-test/resolve/main/tiny_test_video_0.mp4",
                }
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "https://huggingface.co/datasets/tomaarsen/tiny-test/resolve/main/tiny_test_video_1.mp4",
                }
            ],
        }
    ],
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs)

Traceback (most recent call last):
  File "[sic]/sentence-transformers/demo_qwen_transformers.py", line 38, in <module>
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "[sic]/sentence-transformers/Lib/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/sentence-transformers/Lib/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/utils/generic.py", line 843, in wrapper
    output = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1348, in forward
    position_ids = self.compute_3d_position_ids(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1242, in compute_3d_position_ids
    position_ids, rope_deltas = self.get_rope_index(
                                ^^^^^^^^^^^^^^^^^^^^
  File "[sic]/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 1134, in get_rope_index
    position_ids[:, batch_idx, attention_mask[batch_idx].bool()] = llm_positions.to(position_ids.device)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [3, 25] cannot be broadcast to indexing result of shape [3, 23]

These are the relevant lines where it goes wrong (for qwen3_vl): https://github.com/huggingface/transformers/blob/e498b5bd273e638990cfc82d8c2177a9c5b67858/src/transformers/models/qwen3_vl/modeling_qwen3_vl.py#L1121-L1122

The attention_mask and position_ids are correctly shaped (e.g. torch.Size([2, 23]) and torch.Size([3, 2, 23])), but llm_positions is consistently 2 tokens too large.
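The failing statement assigns through a boolean mask, so it needs exactly one computed position per attended token. The sketch below checks that invariant on plain lists; the helper name is hypothetical, and the numbers are taken from the traceback (23 attended tokens, 25 computed positions).

```python
def find_position_mismatches(attention_mask, positions_per_sample):
    """Compare attended-token counts with the number of computed positions.

    attention_mask: list of rows of 0/1 values (e.g. tensor.tolist()).
    positions_per_sample: computed position count for each row.
    Returns (batch_idx, attended, computed) for every row that disagrees.
    """
    mismatches = []
    for batch_idx, (row, computed) in enumerate(zip(attention_mask, positions_per_sample)):
        attended = sum(1 for value in row if value)
        if attended != computed:
            mismatches.append((batch_idx, attended, computed))
    return mismatches


# The failing row from the traceback: 23 attended tokens, 25 computed positions.
mask = [[1] * 23]
report = find_position_mismatches(mask, [25])
print(report)
```

When debugging inside the model, the per-row counts could be taken from attention_mask.tolist() and from the last dimension of llm_positions.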

Expected behavior

I would expect this to work as it did in v5.2.0. This is currently preventing inference on the VL-capable Qwen model with video inputs.

  • Tom Aarsen

extent analysis

Fix Plan

Update transformers to the latest version

The traceback ends inside transformers' own qwen3_vl modeling code (modeling_qwen3_vl.py), so the problem lies in the library rather than in the checkpoint or the calling script. This page lists PR #43972 as the fix, so upgrading to a release that includes it should resolve the error.

pip install --upgrade transformers

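Reload the qwen3_vl checkpoint from the latest revision

If the error persists after upgrading transformers, you can rule out a stale checkpoint by requesting the latest revision explicitly. Note that revision="main" is already the default and only selects which checkpoint files are downloaded; it does not change the modeling code, so this step is unlikely to help on its own.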

from transformers import AutoModel, AutoProcessor

model_name = "tiny-random/qwen3-vl"
model = AutoModel.from_pretrained(model_name, revision="main")
processor = AutoProcessor.from_pretrained(model_name, revision="main")

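Check the attention_mask and input_ids shapes

Confirm that the processor outputs are consistent with each other. position_ids is normally not among the processor outputs; per the traceback it is computed inside the model's forward pass (via compute_3d_position_ids and get_rope_index), where the report observed a shape of torch.Size([3, 2, 23]).

print(inputs["attention_mask"].shape)  # torch.Size([2, 23]) in the report
print(inputs["input_ids"].shape)       # expected to match attention_mask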

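Patch the position assignment locally (last resort)

If you cannot upgrade, a local patch around the failing assignment in get_rope_index can make the forward pass run. The sketch below is a hypothetical workaround, not the upstream fix: it trims llm_positions to the number of attended tokens, which silences the error but may yield incorrect positions, so use it for debugging only.

# Hypothetical helper, not part of transformers: trims surplus positions before assigning.
def assign_positions(position_ids, attention_mask, batch_idx, llm_positions):
    mask = attention_mask[batch_idx].bool()
    n_attended = int(mask.sum())  # number of slots the boolean mask selects
    position_ids[:, batch_idx, mask] = llm_positions[:, :n_attended].to(position_ids.device)
    return position_ids

In the reported case llm_positions has shape [3, 25] and position_ids has shape [3, 2, 23], so the last two positions on each axis would be dropped.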
