transformers: Qwen3.5-9B video input fails with apply_chat_template (Video features and video tokens do not match)

Q: Expected behavior

`apply_chat_template(tokenize=True)` should correctly insert video placeholder tokens so that `model.generate()` can match video features with video tokens.

transformers · 2026-05-30 04:59:58


Error Message

ValueError: Video features and video tokens do not match, tokens: 0, features: 3000

(Full traceback below)

Fix Action

Fixed

Fixed by PR: [Qwen3VL] Fix video token placeholder: use self.video_token instead of hardcoded "<|placeholder|>" (https://github.com/huggingface/transformers/pull/46296)
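
One plausible reading of the PR title is that the video slot in the prompt was expanded with a literal string instead of the tokenizer's video token, which would leave no video tokens for the model to match. The toy below is not the transformers code; it only illustrates that failure mode on plain strings. The function names and the "<|video_pad|>" token string are assumptions made for illustration.

```python
VIDEO_TOKEN = "<|video_pad|>"    # assumed video token string, for illustration only
PLACEHOLDER = "<|placeholder|>"  # literal named in the PR title


def expand_hardcoded(prompt: str, n_features: int) -> str:
    """Toy buggy path: the slot is expanded with a literal that is never mapped back."""
    return prompt.replace(VIDEO_TOKEN, PLACEHOLDER * n_features, 1)


def expand_with_video_token(prompt: str, n_features: int) -> str:
    """Toy fixed path: the slot is expanded with the video token itself."""
    return prompt.replace(VIDEO_TOKEN, VIDEO_TOKEN * n_features, 1)


prompt = f"<|vision_start|>{VIDEO_TOKEN}<|vision_end|>Describe the video."
buggy = expand_hardcoded(prompt, 4)
fixed = expand_with_video_token(prompt, 4)

print(buggy.count(VIDEO_TOKEN))  # 0: nothing left for the model to match against
print(fixed.count(VIDEO_TOKEN))  # 4: one token per feature row
```

In the toy, the buggy prompt contains four placeholder strings but zero video tokens, which is the same shape of mismatch as "tokens: 0" in the reported error.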


System Info

transformers version: 5.10.0.dev0
Platform: Linux-6.6.122+-x86_64-with-glibc2.35
Python version: 3.12.13
Huggingface_hub version: 1.10.1
Safetensors version: 0.7.0
Accelerate version: 1.13.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.10.0+cu128 (CUDA)
Using distributed or parallel set-up in script?: yes, model parallel
Using GPU in script?: yes
GPU type: Tesla T4

Who can help?

@yonigozlan @molbap @zucchini-nlp

Describe the bug

When using Qwen/Qwen3.5-9B with transformers for video understanding, processor.apply_chat_template(..., tokenize=True) does not insert the video placeholder tokens. The prompt ends up with 0 video tokens while the model computes 3000 video features, so the forward pass raises a mismatch error.
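
A quick way to confirm the mismatch before calling generate is to compare the number of video placeholder ids in input_ids with the number of feature rows expected from the video grid. The sketch below uses plain Python lists so it runs without the model. It assumes the Qwen-VL convention that each video contributes t * h * w / merge_size² rows; the helper names, the token id, and the default merge_size=2 are illustrative assumptions, not values read from Qwen3.5.

```python
from math import prod
from typing import Sequence


def count_video_tokens(input_ids: Sequence[Sequence[int]], video_token_id: int) -> int:
    """Count how many positions in the batch hold the video placeholder id."""
    return sum(row.count(video_token_id) for row in input_ids)


def expected_video_features(video_grid_thw: Sequence[Sequence[int]], merge_size: int = 2) -> int:
    """Feature rows after spatial merging: t * h * w // merge_size**2 per video (assumed convention)."""
    return sum(prod(thw) // merge_size**2 for thw in video_grid_thw)


# Toy data: one video with grid (t=2, h=4, w=6) and a prompt with no placeholders.
VIDEO_TOKEN_ID = 7  # hypothetical id; in practice look it up from the tokenizer
input_ids = [[1, 2, 3, 4]]
video_grid_thw = [[2, 4, 6]]

tokens = count_video_tokens(input_ids, VIDEO_TOKEN_ID)
features = expected_video_features(video_grid_thw)
print(f"tokens: {tokens}, features: {features}")  # tokens: 0, features: 12
```

With real inputs, the same comparison could be run on inputs["input_ids"].tolist() and inputs["video_grid_thw"].tolist(), with the id obtained through the tokenizer's convert_tokens_to_ids; whether those keys and that merge size apply to Qwen3.5 should be checked against the processor output.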

Error message

ValueError: Video features and video tokens do not match, tokens: 0, features: 3000

(Full traceback below)

Full traceback

[transformers] Setting `pad_token_id` to `eos_token_id`:248044 for open-end generation.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_58/3282355235.py in <cell line: 0>()
     52 
     53     with torch.no_grad():
---> 54         generated_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
     55 
     56     generated_ids_trimmed = [

/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py in decorate_context(*args, **kwargs)
    122         # pyrefly: ignore [bad-context-manager]
    123         with ctx_factory():
--> 124             return func(*args, **kwargs)
    125 
    126     return decorate_context

/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, custom_generate, **kwargs)
   2578 
   2579         # 9. Call generation mode
-> 2580         result = decoding_method(
   2581             self,
   2582             input_ids,

/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py in _sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
   2778 
   2779         prefill_consumed = False
-> 2780         outputs = self._prefill(
   2781             input_ids,
   2782             generation_config,

/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py in _prefill(self, input_ids, generation_config, model_kwargs, is_first_iteration)
   3824                 **model_kwargs,
   3825             )
-> 3826             return self(**model_inputs, return_dict=True)
   3827 
   3828         # Chunked prefill (for very large contexts)

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1774             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1775         else:
-> 1776             return self._call_impl(*args, **kwargs)
   1777 
   1778     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1785                 or _global_backward_pre_hooks or _global_backward_hooks
   1786                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1787             return forward_call(*args, **kwargs)
   1788 
   1789         result = None

/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py in new_forward(module, *args, **kwargs)
    190                 output = module._old_forward(*args, **kwargs)
    191         else:
--> 192             output = module._old_forward(*args, **kwargs)
    193         return module._hf_hook.post_forward(module, output)
    194 

/usr/local/lib/python3.12/dist-packages/transformers/utils/generic.py in wrapper(self, *args, **kwargs)
    901         if return_dict_passed is not None:
    902             return_dict = return_dict_passed
--> 903         output = func(self, *args, **kwargs)
    904         if not return_dict and not isinstance(output, tuple):
    905             output = output.to_tuple()

/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3_5/modeling_qwen3_5.py in forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, mm_token_type_ids, logits_to_keep, **kwargs)
   1816         """
   1817 
-> 1818         outputs = self.model(
   1819             input_ids=input_ids,
   1820             pixel_values=pixel_values,

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1774             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1775         else:
-> 1776             return self._call_impl(*args, **kwargs)
   1777 
   1778     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1785                 or _global_backward_pre_hooks or _global_backward_hooks
   1786                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1787             return forward_call(*args, **kwargs)
   1788 
   1789         result = None

/usr/local/lib/python3.12/dist-packages/transformers/utils/generic.py in wrapper(self, *args, **kwargs)
    901         if return_dict_passed is not None:
    902             return_dict = return_dict_passed
--> 903         output = func(self, *args, **kwargs)
    904         if not return_dict and not isinstance(output, tuple):
    905             output = output.to_tuple()

/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3_5/modeling_qwen3_5.py in forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, mm_token_type_ids, **kwargs)
   1581             video_embeds = video_outputs.pooler_output
   1582             video_embeds = torch.cat(video_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
-> 1583             _, video_mask = self.get_placeholder_mask(
   1584                 input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds
   1585             )

/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3_5/modeling_qwen3_5.py in get_placeholder_mask(self, input_ids, inputs_embeds, image_features, video_features)
   1481         special_video_mask = special_video_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
   1482         if video_features is not None:
-> 1483             torch_compilable_check(
   1484                 inputs_embeds[special_video_mask].numel() == video_features.numel(),
   1485                 f"Video features and video tokens do not match, tokens: {n_video_tokens}, features: {video_features.shape[0]}",

/usr/local/lib/python3.12/dist-packages/transformers/utils/import_utils.py in torch_compilable_check(cond, msg, error_type)
   1575         torch._check_tensor_all_with(error_type, cond, msg_callable)
   1576     else:
-> 1577         torch._check_with(error_type, cond, msg_callable)
   1578 
   1579 

/usr/local/lib/python3.12/dist-packages/torch/__init__.py in _check_with(error_type, cond, message)
   1712         message_evaluated = str(message())
   1713 
-> 1714     raise error_type(message_evaluated)
   1715 
   1716 

ValueError: Video features and video tokens do not match, tokens: 0, features: 3000

Additional context

The alternative path, process_vision_info followed by processor(text=..., videos=...), also fails, but with a different error: fps was expected to be an int and a list was passed.

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

code: https://www.kaggle.com/code/liuweiq/qwen3-5-video

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and its processor.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

# One user turn: a video followed by a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe the video."},
        ],
    }
]

# Render the chat template and tokenize in one step. This is where the
# video placeholder tokens are expected to be inserted.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

with torch.no_grad():
    # Raises ValueError here: tokens: 0, features: 3000
    output = model.generate(**inputs, max_new_tokens=128)

Expected behavior

apply_chat_template(tokenize=True) should correctly insert video placeholder tokens so that model.generate() can match video features with video tokens.
