transformers - ✅(Solved) Fix [Gemma4] Bug: audio token missing newline separators in chat_template.jinja causes multimodal failure when image precedes audio [1 pull requests]

StepCodex · 2026-04-09T03:35:46Z



Root Cause

The Jinja template inserts \n\n around image tokens but not around audio tokens.

Fix Action

Fixed

Fixed by PR: [Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility (https://github.com/huggingface/transformers/pull/45257)

PR fix notes

PR #45257: [Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility

Repository: huggingface/transformers
Author: lucianommartins
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45257

Description (problem / solution / changelog)

[Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility

What does this PR do?

Rewrites the _patch_template_for_openai_tool_role() function in convert_gemma4_weights.py to fully support OpenAI Chat Completions tool-calling semantics for Gemma4 (E4B and 31B).

Chat template patcher

Forward-scan tool rendering: role: "tool" messages are skipped in the outer loop and rendered proactively as <|tool_response> blocks from the preceding assistant turn that issued the tool_calls
Turn suppression: Suppresses duplicate <|turn>model when consecutive assistant messages are separated only by tool messages (multi-round tool-call loops)
tool_call_id resolution: Matches tool results back to the originating tool_calls array by ID to resolve function names correctly (prevents response:unknown)
Content-parts robustness: Handles tool response content as both plain strings and OpenAI content-parts arrays ([{type: "text", text: "..."}])
format_tool_response_block macro: Injects a reusable macro to centralize tool response rendering (used by both legacy Gemma native tool_responses and OpenAI-style role: "tool" paths)
reasoning/reasoning_content support: Renders thinking fields as <|channel>thought blocks (compatible with vLLM, DeepSeek, and o1-style inference servers)
Legacy compat: Preserves native tool_responses on assistant messages (Google/Gemma format)
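The tool_call_id resolution and content-parts handling listed above can be illustrated in plain Python. This is only a sketch of the idea, not the PR's Jinja implementation: the helper names `flatten_content` and `resolve_tool_name` are hypothetical, and the message layout assumes the OpenAI Chat Completions format.

```python
from typing import Any


def flatten_content(content: Any) -> str:
    """Accept a plain string or an OpenAI content-parts array and return text."""
    if isinstance(content, str):
        return content
    # Content-parts form: [{"type": "text", "text": "..."}, ...]
    return "".join(
        part.get("text", "")
        for part in content or []
        if isinstance(part, dict) and part.get("type") == "text"
    )


def resolve_tool_name(messages: list[dict], tool_call_id: str) -> str:
    """Find the function name of the assistant tool call with the given ID."""
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for call in message.get("tool_calls") or []:
            if call.get("id") == tool_call_id:
                return call["function"]["name"]
    return "unknown"  # the fallback that ID matching is meant to avoid


# Hypothetical conversation used only for illustration.
messages = [
    {"role": "user", "content": "Weather in Paris?"},
    {
        "role": "assistant",
        "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}},
        ],
    },
    {"role": "tool", "tool_call_id": "call_1",
     "content": [{"type": "text", "text": "18 C, cloudy"}]},
]

tool_message = messages[-1]
name = resolve_tool_name(messages, tool_message["tool_call_id"])
text = flatten_content(tool_message["content"])
print(name, "->", text)
```

Without the ID lookup, a template has no reliable way to label the tool result with its function name, which is where the response:unknown rendering mentioned above would come from.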

Stop tokens (`eos_token_id`)

Removed <tool_call|> (etc_token) from the stop token list
Keeps only <eos> + <turn|> (eot_token)
Enables parallel tool calls without premature truncation after the first <tool_call|>; <turn|> still terminates the model turn correctly
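The effect of the stop-token change can be shown with a toy decoding loop. This is a minimal sketch under stated assumptions: the token stream is made up, the opening `<|tool_call>` marker and the `call:...` payloads are placeholders chosen by analogy with the tokens named above, and `generate_until_stop` is a hypothetical helper, not a transformers API.

```python
def generate_until_stop(stream: list[str], stop_tokens: set[str]) -> list[str]:
    """Return the tokens emitted before the first stop token (toy decoder loop)."""
    emitted = []
    for token in stream:
        if token in stop_tokens:
            break
        emitted.append(token)
    return emitted


# Hypothetical model output containing two parallel tool calls.
stream = [
    "<|tool_call>", "call:get_weather{...}", "<tool_call|>",
    "<|tool_call>", "call:get_time{...}", "<tool_call|>",
    "<turn|>",
]

old_stops = {"<eos>", "<turn|>", "<tool_call|>"}  # stop list before the PR
new_stops = {"<eos>", "<turn|>"}                  # stop list after the PR

before = generate_until_stop(stream, old_stops)  # halts at the first <tool_call|>
after = generate_until_stop(stream, new_stops)   # runs until <turn|>
print(len(before), len(after))
```

With `<tool_call|>` in the stop list, decoding halts as soon as the first call closes, so the second call is never produced; with only `<eos>` and `<turn|>`, both calls are emitted before the turn ends.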

Testing

Validated with 17 functional test scenarios across both E4B and 31B templates:

Simple chat, tool declarations, single/multi/parallel tool calls
Multi-round tool loops (exactly 1 <|turn>model emitted)
Legacy tool_responses, tool_call_id resolution, content-parts arrays
reasoning/reasoning_content field rendering
add_generation_prompt correctness, Jinja2 syntax validation
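The multi-round tool loop scenario (exactly one `<|turn>model` emitted) can be expressed as a small check. The sketch below encodes one reading of the turn-suppression rule in Python; `model_turn_openers` is a hypothetical helper, and the PR's actual logic lives in the Jinja template and may differ in edge cases such as back-to-back assistant messages with no tool message in between.

```python
def model_turn_openers(messages: list[dict]) -> int:
    """Count how many times a model turn would be opened under the suppression rule."""
    count = 0
    previous_role = None  # role of the last non-tool message seen
    for message in messages:
        role = message["role"]
        if role == "tool":
            continue  # tool results are rendered inside the assistant turn
        if role == "assistant" and previous_role != "assistant":
            count += 1  # a new model turn starts here
        previous_role = role
    return count


# Hypothetical two-round tool loop: the assistant calls tools twice, then answers.
tool_loop = [
    {"role": "user", "content": "Plan my trip."},
    {"role": "assistant", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "..."},
    {"role": "assistant", "tool_calls": [{"id": "call_2"}]},
    {"role": "tool", "tool_call_id": "call_2", "content": "..."},
    {"role": "assistant", "content": "Here is the plan."},
]

# Ordinary alternating chat for comparison.
plain_chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "Bye"},
    {"role": "assistant", "content": "Goodbye"},
]

print(model_turn_openers(tool_loop), model_turn_openers(plain_chat))
```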

Before submitting

[ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline, Pull Request section?
[ ] Was this discussed/approved via a GitHub issue or the forum?
[ ] Did you make sure to update the documentation with your changes?
[ ] Did you write any new necessary tests?

Who can review?

Models:

multimodal models: @zucchini-nlp

Library:

generate: @zucchini-nlp (visual-language models)

Changed files

src/transformers/models/gemma4/convert_gemma4_weights.py (modified, +269/-1)


Bug Description

In chat_template.jinja for Gemma4, the image token is wrapped in \n\n separators but the audio token is not:

```jinja
{%- elif item['type'] == 'image' -%}
    {{- '\n\n<|image|>\n\n' -}}   {#- has \n\n -#}
{%- elif item['type'] == 'audio' -%}
    {{- '<|audio|>' -}}            {#- missing \n\n -#}
```

As a result, when an image is placed before audio in the message content list, the model ends its turn immediately and produces no output.

Reproduction

```python
from transformers import AutoProcessor, AutoModelForMultimodalLM
import torch

# Placeholder inputs: point these at a local image and audio file.
IMAGE_PATH = "path/to/image.jpg"
AUDIO_PATH = "path/to/audio.wav"

processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-E2B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ❌ image before audio → model ends the turn immediately
messages_image_first = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": IMAGE_PATH},
            {"type": "audio", "audio": AUDIO_PATH},
            {"type": "text", "text": "Describe the image and audio."},
        ]
    }
]

# ✅ audio before image → works correctly
messages_audio_first = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": AUDIO_PATH},
            {"type": "image", "url": IMAGE_PATH},
            {"type": "text", "text": "Describe the image and audio."},
        ]
    }
]
```

Root Cause

The Jinja template inserts \n\n around image tokens but not around audio tokens.

Image-first token sequence (broken):

```
<|image>...<image|>\n\n<|audio>...<audio|>Describe the image and audio.
                                          ↑ audio token directly concatenated with text, no separator
```

Audio-first token sequence (correct):

```
<|audio>...<audio|>\n\n<|image>...<image|>\n\n Describe the image and audio.
                    ↑                      ↑ correct \n\n separators
```

Before the fix, the two orderings also produce different input_ids shapes:

```
audio-first shape: torch.Size([1, 738])
image-first shape: torch.Size([1, 737])  ← 1 token missing due to missing \n\n
```
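The concatenation problem can be reproduced without loading the model by mimicking the template branches in Python. This is a rough sketch: `render_content` is a hypothetical stand-in for the relevant part of the template, it assumes text parts are emitted as-is, and it ignores the rest of the chat template (turn markers, token expansion).

```python
def render_content(items: list[dict], fixed: bool) -> str:
    """Mimic the template's per-item branches for image, audio, and text parts."""
    out = []
    for item in items:
        if item["type"] == "image":
            out.append("\n\n<|image|>\n\n")
        elif item["type"] == "audio":
            # Buggy template emits the bare token; fixed template adds \n\n.
            out.append("\n\n<|audio|>\n\n" if fixed else "<|audio|>")
        elif item["type"] == "text":
            out.append(item["text"])  # assumption: text is emitted unchanged
    return "".join(out)


text = {"type": "text", "text": "Describe the image and audio."}
image_first = [{"type": "image"}, {"type": "audio"}, text]
audio_first = [{"type": "audio"}, {"type": "image"}, text]

for label, items in [("image-first", image_first), ("audio-first", audio_first)]:
    for fixed in (False, True):
        state = "fixed" if fixed else "buggy"
        print(f"{label} ({state}): {render_content(items, fixed)!r}")
```

In the buggy image-first rendering the audio token is immediately followed by the prompt text, while in the audio-first rendering the image branch happens to supply the separator. That asymmetry is why only one ordering fails.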
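Evidence

Top 10 next tokens with image-first (before fix):

```
'<turn|>': 0.7383   ← model immediately ends the turn
'<eos>':   0.1406
```

Top 10 next tokens with image-first (after fix):

```
'这张': 0.6797      ← "This (picture)"; model correctly starts generating
'好的': 0.2656      ← "OK"
```

Full generation with audio-first (before fix), translated from the model's Chinese output:

```
This picture shows a classroom scene. A female teacher wearing glasses stands behind the lectern...
In the audio, someone is shouting "Look look look at the girl"...
```

Full generation with image-first (before fix):

```
(empty, model outputs <turn|> immediately)
```

Full generation with image-first (after fix), translated from the model's Chinese output:

```
This picture shows a classroom scene with several students and a teacher...
The audio seems to be children having some kind of conversation or playing a game...
```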
Fix

In chat_template.jinja, change:

```jinja
{%- elif item['type'] == 'audio' -%}
    {{- '<|audio|>' -}}
```

to:

```jinja
{%- elif item['type'] == 'audio' -%}
    {{- '\n\n<|audio|>\n\n' -}}
```

After the fix, both orderings produce identical input_ids shapes and correct outputs.
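For trying the change locally without editing the file by hand, the same substitution can be applied to the template string. The sketch below is an assumption about how one might do that: `patch_audio_separator` is a hypothetical helper, and it only works if the loaded template contains the audio line with exactly the whitespace shown above. One possible use is to assign the result to `processor.chat_template` before calling `apply_chat_template`.

```python
# Template source fragments; raw strings keep the backslash-n escapes literal,
# exactly as they appear inside the Jinja file.
BUGGY = r"{{- '<|audio|>' -}}"
FIXED = r"{{- '\n\n<|audio|>\n\n' -}}"


def patch_audio_separator(template: str) -> str:
    """Replace the bare audio emission with the newline-wrapped form."""
    return template.replace(BUGGY, FIXED)


# Excerpt of the template as quoted in the bug description.
template = r"""{%- elif item['type'] == 'image' -%}
    {{- '\n\n<|image|>\n\n' -}}
{%- elif item['type'] == 'audio' -%}
    {{- '<|audio|>' -}}
"""

patched = patch_audio_separator(template)
print(patched)
```

The replacement is idempotent: the fixed line no longer contains the buggy fragment, so applying the helper twice leaves the template unchanged. If the fragment is not found, the template is returned as-is, so it is worth checking that the output actually differs from the input.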

Environment
transformers version: 5.5.0
Model: google/gemma-4-E2B-it

extent analysis

TL;DR

To fix the issue, add \n\n separators around the audio token in the chat_template.jinja file.

Guidance

Verify that the issue is caused by the missing \n\n separators around the audio token by checking the input_ids shapes and the model's output.
Update the chat_template.jinja file to include \n\n separators around the audio token, as shown in the fix section of the issue.
Test the model with both image-first and audio-first orderings to ensure that the issue is resolved and the model produces correct outputs.
Check the input_ids shapes to confirm that they are identical for both orderings after the fix.

Example

The corrected audio token template should look like this:

{%- elif item['type'] == 'audio' -%}
    {{- '\n\n<|audio|>\n\n' -}}

Notes

This fix assumes that the issue is caused by the missing \n\n separators around the audio token. If the issue persists after applying this fix, further investigation may be needed.

Recommendation

Apply the workaround by updating the chat_template.jinja file to include \n\n separators around the audio token, as this fix has been shown to resolve the issue and produce correct outputs.
