transformers: Qwen3.5-9B video input fails with apply_chat_template ("Video features and video tokens do not match") [1 pull request]


Error Message

ValueError: Video features and video tokens do not match, tokens: 0, features: 3000

(Full traceback below)

Fix Action

Fixed


System Info

  • transformers version: 5.10.0.dev0
  • Platform: Linux-6.6.122+-x86_64-with-glibc2.35
  • Python version: 3.12.13
  • Huggingface_hub version: 1.10.1
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cu128 (CUDA)
  • Using distributed or parallel set-up in script?: yes, model parallel
  • Using GPU in script?: yes
  • GPU type: Tesla T4

Who can help?

@yonigozlan @molbap @zucchini-nlp

Describe the bug

When using Qwen/Qwen3.5-9B with transformers for video understanding, processor.apply_chat_template(..., tokenize=True) does not correctly insert video placeholder tokens into input_ids. During the model's forward pass, get_placeholder_mask then reports 0 video tokens against 3000 video features and raises the error below.
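
One way to confirm this before calling model.generate() is to count the video placeholder ids in input_ids and compare that count with the number of feature vectors implied by video_grid_thw. The sketch below is plain Python and rests on assumptions that are not stated in the issue: VIDEO_TOKEN_ID is a made-up value (the real id should be read from the processor or model config), and the feature count is taken to be t*h*w / merge_size^2 with merge_size=2, as in earlier Qwen-VL processors.

```python
from math import prod


def expected_video_features(video_grid_thw, merge_size=2):
    """Feature vectors implied by (t, h, w) grids, assuming spatial merging."""
    return sum(prod(grid) // (merge_size ** 2) for grid in video_grid_thw)


def count_placeholders(input_ids, video_token_id):
    """Number of positions in input_ids holding the video placeholder id."""
    return sum(1 for tok in input_ids if tok == video_token_id)


def check_video_alignment(input_ids, video_grid_thw, video_token_id, merge_size=2):
    """Return (tokens, features, match) for one tokenized sequence."""
    tokens = count_placeholders(input_ids, video_token_id)
    features = expected_video_features(video_grid_thw, merge_size)
    return tokens, features, tokens == features


VIDEO_TOKEN_ID = 9999  # hypothetical id, for illustration only

# A prompt with no placeholders and one video grid of 10 x 30 x 40 patches
# reproduces the numbers in the error message.
ids_without_video = [1, 2, 3, 4]
print(check_video_alignment(ids_without_video, [(10, 30, 40)], VIDEO_TOKEN_ID))
# (0, 3000, False)
```

With real inputs, the same comparison could be run on inputs["input_ids"][0].tolist() and inputs["video_grid_thw"].tolist(), provided both keys are present in the processor output.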

Error message

ValueError: Video features and video tokens do not match, tokens: 0, features: 3000

(Full traceback below)

Full traceback

[transformers] Setting `pad_token_id` to `eos_token_id`:248044 for open-end generation.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_58/3282355235.py in <cell line: 0>()
     52 
     53     with torch.no_grad():
---> 54         generated_ids = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
     55 
     56     generated_ids_trimmed = [

/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py in decorate_context(*args, **kwargs)
    122         # pyrefly: ignore [bad-context-manager]
    123         with ctx_factory():
--> 124             return func(*args, **kwargs)
    125 
    126     return decorate_context

/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py in generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, custom_generate, **kwargs)
   2578 
   2579         # 9. Call generation mode
-> 2580         result = decoding_method(
   2581             self,
   2582             input_ids,

/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py in _sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
   2778 
   2779         prefill_consumed = False
-> 2780         outputs = self._prefill(
   2781             input_ids,
   2782             generation_config,

/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py in _prefill(self, input_ids, generation_config, model_kwargs, is_first_iteration)
   3824                 **model_kwargs,
   3825             )
-> 3826             return self(**model_inputs, return_dict=True)
   3827 
   3828         # Chunked prefill (for very large contexts)

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1774             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1775         else:
-> 1776             return self._call_impl(*args, **kwargs)
   1777 
   1778     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1785                 or _global_backward_pre_hooks or _global_backward_hooks
   1786                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1787             return forward_call(*args, **kwargs)
   1788 
   1789         result = None

/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py in new_forward(module, *args, **kwargs)
    190                 output = module._old_forward(*args, **kwargs)
    191         else:
--> 192             output = module._old_forward(*args, **kwargs)
    193         return module._hf_hook.post_forward(module, output)
    194 

/usr/local/lib/python3.12/dist-packages/transformers/utils/generic.py in wrapper(self, *args, **kwargs)
    901         if return_dict_passed is not None:
    902             return_dict = return_dict_passed
--> 903         output = func(self, *args, **kwargs)
    904         if not return_dict and not isinstance(output, tuple):
    905             output = output.to_tuple()

/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3_5/modeling_qwen3_5.py in forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, mm_token_type_ids, logits_to_keep, **kwargs)
   1816         """
   1817 
-> 1818         outputs = self.model(
   1819             input_ids=input_ids,
   1820             pixel_values=pixel_values,

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py in _wrapped_call_impl(self, *args, **kwargs)
   1774             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1775         else:
-> 1776             return self._call_impl(*args, **kwargs)
   1777 
   1778     # torchrec tests the code consistency with the following code

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1785                 or _global_backward_pre_hooks or _global_backward_hooks
   1786                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1787             return forward_call(*args, **kwargs)
   1788 
   1789         result = None

/usr/local/lib/python3.12/dist-packages/transformers/utils/generic.py in wrapper(self, *args, **kwargs)
    901         if return_dict_passed is not None:
    902             return_dict = return_dict_passed
--> 903         output = func(self, *args, **kwargs)
    904         if not return_dict and not isinstance(output, tuple):
    905             output = output.to_tuple()

/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3_5/modeling_qwen3_5.py in forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, pixel_values, pixel_values_videos, image_grid_thw, video_grid_thw, mm_token_type_ids, **kwargs)
   1581             video_embeds = video_outputs.pooler_output
   1582             video_embeds = torch.cat(video_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
-> 1583             _, video_mask = self.get_placeholder_mask(
   1584                 input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds
   1585             )

/usr/local/lib/python3.12/dist-packages/transformers/models/qwen3_5/modeling_qwen3_5.py in get_placeholder_mask(self, input_ids, inputs_embeds, image_features, video_features)
   1481         special_video_mask = special_video_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
   1482         if video_features is not None:
-> 1483             torch_compilable_check(
   1484                 inputs_embeds[special_video_mask].numel() == video_features.numel(),
   1485                 f"Video features and video tokens do not match, tokens: {n_video_tokens}, features: {video_features.shape[0]}",

/usr/local/lib/python3.12/dist-packages/transformers/utils/import_utils.py in torch_compilable_check(cond, msg, error_type)
   1575         torch._check_tensor_all_with(error_type, cond, msg_callable)
   1576     else:
-> 1577         torch._check_with(error_type, cond, msg_callable)
   1578 
   1579 

/usr/local/lib/python3.12/dist-packages/torch/__init__.py in _check_with(error_type, cond, message)
   1712         message_evaluated = str(message())
   1713 
-> 1714     raise error_type(message_evaluated)
   1715 
   1716 

ValueError: Video features and video tokens do not match, tokens: 0, features: 3000

Additional context

  • Using process_vision_info + processor(text=..., videos=...) also fails with a different error (fps expected int but got list).
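
The second failure suggests that fps reaches the processor as a list where a single number is expected. A possible workaround, not confirmed anywhere in the issue, is to collapse the list into a scalar before passing it on. scalar_fps is a hypothetical helper, and the requirement that all videos share one fps value is an assumption of this sketch.

```python
def scalar_fps(fps):
    """Collapse a per-video fps list into one number (hypothetical helper)."""
    if isinstance(fps, (list, tuple)):
        if not fps:
            raise ValueError("fps list is empty")
        unique = set(fps)
        if len(unique) > 1:
            # A single scalar cannot represent videos sampled at different rates.
            raise ValueError(f"videos have different fps values: {sorted(unique)}")
        return fps[0]
    # Already a scalar: pass it through unchanged.
    return fps


print(scalar_fps([2.0]))  # 2.0
print(scalar_fps(2))      # 2
```

Whether the processor then accepts the value depends on the installed transformers version, so this only addresses the type mismatch named in the error.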

Reproduction

code: https://www.kaggle.com/code/liuweiq/qwen3-5-video

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and its processor.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

# One user turn: a local video followed by a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe the video."},
        ],
    }
]

# Render the chat template and tokenize in a single call.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
    # ↑ Error occurs here: tokens: 0, features: 3000

Screenshot: https://github.com/user-attachments/assets/644012d0-64d3-4d07-a938-27cf1f686ed5

Expected behavior

apply_chat_template(tokenize=True) should correctly insert video placeholder tokens so that model.generate() can match video features with video tokens.
