transformers - ✅(Solved) Fix apply_chat_template returns all-zero assistant_masks for multimodal inputs [2 pull requests, 7 comments, 5 participants]

transformers · 2026-03-08 05:03:12


huggingface/transformers#44521•Fetched 2026-04-08 00:27:54


Fix Action

Fixed

Fixed by PR: Fix crash in Qwen2_5_VLProcessor when using batched input with padding=False (https://github.com/huggingface/transformers/pull/44535)
Fixed by PR: Fix assistant_masks for multimodal inputs in apply_chat_template (https://github.com/huggingface/transformers/pull/44543)

PR fix notes

PR #44535: Fix crash in Qwen2_5_VLProcessor when using batched input with padding=False

Repository: huggingface/transformers
Author: Anakintano
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/44535

Description (problem / solution / changelog)

Problem

Qwen2_5_VLProcessor.apply_chat_template raises ValueError: setting an array element with a sequence when called with a batch of ≥2 conversations that include images under the default padding=False setting.

Root cause: mm_token_type_ids was built by calling np.array(text_inputs["input_ids"]) on a ragged list (variable-length sequences when padding=False). NumPy ≥ 1.24 rejects inhomogeneous shapes for this operation.

Fix

Iterate per-sequence instead of constructing a 2D array from a ragged list. Each ids_arr = np.array(ids) call receives a 1-D list, so the shape is always homogeneous.
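The per-sequence iteration described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: `IMAGE_TOKEN_ID` and `image_token_flags` are hypothetical names, and flagging positions equal to an image token id is only an assumption about what `mm_token_type_ids` encodes.

```python
import numpy as np

# Hypothetical token id standing in for an image placeholder token.
IMAGE_TOKEN_ID = 9


def image_token_flags(batch_input_ids, image_token_id=IMAGE_TOKEN_ID):
    """Return per-sequence 0/1 lists flagging image placeholder positions."""
    flags = []
    for ids in batch_input_ids:
        ids_arr = np.array(ids)  # 1-D input, so the shape is always homogeneous
        flags.append((ids_arr == image_token_id).astype(int).tolist())
    return flags


# Two sequences of different lengths, as produced with padding=False.
ragged = [[1, 9, 9, 2], [1, 9, 2, 3, 4, 5]]

try:
    np.array(ragged)  # recent NumPy versions raise ValueError here
except ValueError as err:
    print("ragged conversion failed:", err)

print(image_token_flags(ragged))
```

Working one sequence at a time avoids the 2-D conversion entirely, so the result does not depend on whether the batch was padded.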

Changed in both:

src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py
src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py (auto-generated copy, manually synced since make is unavailable on Windows)

Test

Added test_batched_apply_chat_template_no_padding in tests/models/qwen2_5_vl/test_processing_qwen2_5_vl.py to guard against regression.

Closes #44545

Changed files

src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py (modified, +8/-5)
src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py (modified, +8/-5)
tests/models/qwen2_5_vl/test_processing_qwen2_5_vl.py (modified, +37/-0)

PR #44543: Fix assistant_masks for multimodal inputs in apply_chat_template

Repository: huggingface/transformers
Author: umbilnm
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/44543

Description (problem / solution / changelog)

What does this PR do?

Fixes #44521

apply_chat_template with return_assistant_tokens_mask=True returns all-zero masks when multimodal inputs (images/videos) are present.

Root cause

generation_indices (character-level positions of assistant responses) are computed from the original prompt text rendered by Jinja, which contains a single placeholder token per image (e.g. one <|image_pad|>). However, the processor's __call__ expands each placeholder into N copies (based on image resolution), so offset_mapping returned by the tokenizer corresponds to the expanded text. The bisect_left lookup then fails to find the assistant span, and the mask stays all zeros.
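The misalignment can be reproduced with a toy example. The sketch below uses a hypothetical whitespace tokenizer (`toy_tokenize`) rather than a real one, and a made-up prompt; it only illustrates how a character span taken from the unexpanded text points at the wrong tokens once the placeholder has been expanded.

```python
import bisect
import re

PLACEHOLDER = "<|image_pad|>"


def toy_tokenize(text):
    """Hypothetical tokenizer: placeholder tokens or whitespace-separated words,
    each returned as (token, start_char, end_char)."""
    pattern = re.compile(re.escape(PLACEHOLDER) + r"|\S+")
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]


def span_tokens(text, char_span):
    """Look up the tokens covering char_span via bisect on token start offsets."""
    tokens = toy_tokenize(text)
    offset_starts = [start for _, start, _ in tokens]
    start_pos = bisect.bisect_left(offset_starts, char_span[0])
    end_pos = bisect.bisect_left(offset_starts, char_span[1])
    return [tok for tok, _, _ in tokens[start_pos:end_pos]]


original = "user: <|image_pad|> Describe it. assistant: A cat."
answer = "A cat."
start = original.index(answer)
generation_span = (start, start + len(answer))  # computed on the unexpanded text

# The processor expands one placeholder into N copies (N = 4 in this toy).
expanded = original.replace(PLACEHOLDER, PLACEHOLDER * 4)

print(span_tokens(original, generation_span))  # the assistant tokens
print(span_tokens(expanded, generation_span))  # a placeholder copy instead
```

In this toy the stale span lands on a placeholder copy; in the library, according to the report, the outcome is an all-zero mask.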

Fix

When multimodal inputs are present:

1. Tokenize the original (unexpanded) prompt separately to get offset_mapping aligned with generation_indices
2. Build the assistant mask on the original tokenization (where bisect_left works correctly)
3. Map the mask onto the expanded input_ids via two-pointer alignment: matching tokens get their mask value, extra expansion tokens get 0

This approach is generic and works for any multimodal processor that expands placeholder tokens, without requiring model-specific logic.

When no multimodal inputs are present, the original code path is used unchanged.
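A minimal sketch of the two-pointer alignment step, assuming the expanded sequence differs from the original only by extra copies of placeholder tokens. The function name `align_mask` and the toy token ids are hypothetical, and this is not the code from the PR.

```python
def align_mask(original_ids, original_mask, expanded_ids):
    """Project a mask built on the unexpanded tokens onto the expanded tokens.

    Tokens that match the next original token take its mask value; extra
    tokens introduced by placeholder expansion get 0.
    """
    expanded_mask = []
    i = 0  # pointer into original_ids
    for token_id in expanded_ids:
        if i < len(original_ids) and token_id == original_ids[i]:
            expanded_mask.append(original_mask[i])
            i += 1
        else:
            expanded_mask.append(0)
    return expanded_mask


# Toy ids: 9 is the image placeholder, 7 and 8 are the assistant tokens.
original_ids = [1, 9, 2, 3, 7, 8]
original_mask = [0, 0, 0, 0, 1, 1]
expanded_ids = [1, 9, 9, 9, 2, 3, 7, 8]  # the placeholder became three copies

print(align_mask(original_ids, original_mask, expanded_ids))
```

Because the placeholder itself is never part of an assistant span in this toy, it does not matter which of its copies is treated as the match; all of them receive 0.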

Tests

Added test_apply_chat_template_assistant_mask_with_image in test_processing_common.py. Verified on Qwen2.5-VL and Qwen3-VL:

Without fix: FAIL (mask is all zeros)
With fix: PASS (mask correctly marks assistant tokens)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

@zucchini-nlp (author of the original return_assistant_tokens_mask support in PR #38545, multimodal models)

Changed files

src/transformers/processing_utils.py (modified, +71/-21)
tests/test_processing_common.py (modified, +83/-0)



System Info

transformers==5.3.0

Who can help?

@ArthurZucker and @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor

messages = [
    dict(
        role="user",
        content=[
            dict(type="image", image="test.jpg"),
            dict(type="text", text="Describe the image above."),
        ],
    ),
    dict(
        role="assistant",
        content=[
            dict(type="text", text="The image above shows a cat sitting on a table."),
        ],
    ),
]

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B")
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    return_assistant_tokens_mask=True,
)

print(inputs["assistant_masks"])

Actual behavior

assistant_masks is all zeros.

Expected behavior

The tokens corresponding to the assistant response should be marked with 1 in assistant_masks.
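One way to check this expectation is to pull out the token ids flagged by the mask and inspect them. The helper below is a hypothetical sketch that works on plain Python lists; the ids in the example are made up.

```python
def assistant_token_ids(input_ids, assistant_mask):
    """Return the token ids whose mask entry is 1."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tid for tid, flag in zip(input_ids, assistant_mask) if flag == 1]


# Toy example: the last two tokens belong to the assistant response.
toy_ids = [101, 9, 9, 42, 77, 78]
toy_mask = [0, 0, 0, 0, 1, 1]
selected = assistant_token_ids(toy_ids, toy_mask)
print(selected)
```

With the tensors from the reproduction, one row of inputs["input_ids"] and inputs["assistant_masks"] would first need to be converted with .tolist(); decoding the selected ids with the processor's tokenizer should then give back the assistant text, and an empty result corresponds to the all-zero mask reported here.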

Suspected cause

I think the issue comes from a mismatch between generation_indices and the final tokenized prompt in multimodal cases.

In AutoProcessor.apply_chat_template, the assistant mask is computed roughly like this:

prompt, generation_indices = render_jinja_template(
    conversations=conversations,
    chat_template=chat_template,
    **template_kwargs,
    **special_tokens_map,
)

...

out = self(
    text=prompt,
    images=batch_images if images_exist else None,
    videos=batch_videos if videos_exist else None,
    audio=batch_audios if batch_audios else None,
    **kwargs,
)

...

offset_mapping = out.pop("offset_mapping")
input_ids = out["input_ids"]

for assistant_start_char, assistant_end_char in generation_indices[i]:
    start_pos = bisect.bisect_left(offset_starts, assistant_start_char)
    end_pos = bisect.bisect_left(offset_starts, assistant_end_char)

My understanding is:

generation_indices is computed from the rendered text prompt returned by render_jinja_template.
In multimodal processing, each placeholder such as <|image_pad|> is expanded into extra placeholder tokens later, in the processor/tokenizer path.
As a result, offset_mapping corresponds to the expanded multimodal text, while generation_indices still refers to the pre-expanded text.
The character spans are therefore misaligned, so the assistant span lookup fails and assistant_masks ends up all zeros.

extent analysis

Fix Plan

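1. Update `AutoProcessor` to handle multimodal cases correctly

The processor loaded through AutoProcessor must handle multimodal inputs so that generation_indices and offset_mapping refer to the same text. Either generation_indices is updated to match the expanded text, or the mask is computed on the unexpanded prompt and then mapped onto the expanded input_ids.

2. Update `apply_chat_template` method

apply_chat_template must compute assistant_masks with the multimodal case in mind: when images or videos are present, the character spans in generation_indices cannot be looked up directly in the offset_mapping of the expanded text.

3. Update `render_jinja_template` method

One option is for render_jinja_template to return the expanded text with the multimodal placeholder tokens already inserted, so that generation_indices is computed on the same text that is tokenized. Because the number of placeholder copies depends on the image resolution, this would require the processed image sizes at render time.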

Code Snippets

import bisect


def build_assistant_mask(offset_mapping, generation_indices):
    """Mark tokens that start inside an assistant character span.

    offset_mapping and generation_indices must refer to the same text,
    i.e. the prompt before image placeholders are expanded.
    """
    offset_starts = [start for start, _ in offset_mapping]
    assistant_masks = [0] * len(offset_mapping)

    for assistant_start_char, assistant_end_char in generation_indices:
        start_pos = bisect.bisect_left(offset_starts, assistant_start_char)
        end_pos = bisect.bisect_left(offset_starts, assistant_end_char)

        # Mark the assistant response tokens with 1 in assistant_masks
        assistant_masks[start_pos:end_pos] = [1] * (end_pos - start_pos)

    return assistant_masks


# Usage with toy offsets: the assistant span covers characters 10 to 20
offset_mapping = [(0, 4), (5, 9), (10, 15), (16, 20)]
print(build_assistant_mask(offset_mapping, [(10, 20)]))


