transformers - ✅(Solved) Fix [Gemma4] Bug: audio token missing newline separators in chat_template.jinja causes multimodal failure when image precedes audio [1 pull request]



Fix Action

Fixed

PR fix notes

PR #45257: [Gemma4] Fix chat template and stop tokens for OpenAI tool calling compatibility

Description (problem / solution / changelog)


What does this PR do?

Rewrites the _patch_template_for_openai_tool_role() function in convert_gemma4_weights.py to fully support OpenAI Chat Completions tool-calling semantics for Gemma4 (E4B and 31B).

Chat template patcher

  • Forward-scan tool rendering: role: "tool" messages are skipped in the outer loop and rendered proactively as <|tool_response> blocks from the preceding assistant turn that issued the tool_calls
  • Turn suppression: Suppresses duplicate <|turn>model when consecutive assistant messages are separated only by tool messages (multi-round tool-call loops)
  • tool_call_id resolution: Matches tool results back to the originating tool_calls array by ID to resolve function names correctly (prevents response:unknown)
  • Content-parts robustness: Handles tool response content as both plain strings and OpenAI content-parts arrays ([{type: "text", text: "..."}])
  • format_tool_response_block macro: Injects a reusable macro to centralize tool response rendering (used by both legacy Gemma native tool_responses and OpenAI-style role: "tool" paths)
  • reasoning/reasoning_content support: Renders thinking fields as <|channel>thought blocks (compatible with vLLM, DeepSeek, and o1-style inference servers)
  • Legacy compat: Preserves native tool_responses on assistant messages (Google/Gemma format)
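The tool_call_id resolution and content-parts handling described in the list above can be illustrated in plain Python. This is a minimal sketch, not the PR's Jinja implementation: the helper names `resolve_tool_name` and `flatten_content` are hypothetical, and the message layout assumes the usual OpenAI Chat Completions shape (`tool_calls[].id`, `tool_calls[].function.name`, and `tool_call_id` on the tool message).

```python
from typing import Any


def flatten_content(content: Any) -> str:
    """Return tool content as text, accepting a plain string or a content-parts list."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Keep only the text parts, e.g. [{"type": "text", "text": "..."}]
        return "".join(
            part.get("text", "")
            for part in content
            if isinstance(part, dict) and part.get("type") == "text"
        )
    return ""


def resolve_tool_name(messages: list, tool_call_id: str) -> str:
    """Look up the function name for a tool result by scanning assistant tool_calls."""
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for call in message.get("tool_calls") or []:
            if call.get("id") == tool_call_id:
                return call.get("function", {}).get("name", "unknown")
    # No matching call: this is the case that would render as response:unknown
    return "unknown"


# Hypothetical conversation used only for illustration
messages = [
    {"role": "user", "content": "Weather in Paris?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": [{"type": "text", "text": "18 C, cloudy"}],
    },
]

tool_message = messages[-1]
print(resolve_tool_name(messages, tool_message["tool_call_id"]))
print(flatten_content(tool_message["content"]))
```

Falling back to "unknown" mirrors the failure mode the PR says it prevents; a stricter variant could raise instead.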

Stop tokens (eos_token_id)

  • Removed <tool_call|> (etc_token) from the stop token list
  • Keeps only <eos> + <turn|> (eot_token)
  • Enables parallel tool calls without premature truncation after the first <tool_call|>; <turn|> still terminates the model turn correctly

Testing

Validated with 17 functional test scenarios across both E4B and 31B templates:

  • Simple chat, tool declarations, single/multi/parallel tool calls
  • Multi-round tool loops (exactly 1 <|turn>model emitted)
  • Legacy tool_responses, tool_call_id resolution, content-parts arrays
  • reasoning/reasoning_content field rendering
  • add_generation_prompt correctness, Jinja2 syntax validation

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Models:

  • multimodal models: @zucchini-nlp

Library:

  • generate: @zucchini-nlp (visual-language models)

Changed files

  • src/transformers/models/gemma4/convert_gemma4_weights.py (modified, +269/-1)

Bug Description

In chat_template.jinja for Gemma4, the image token has \n\n separators but the audio token does not:

{%- elif item['type'] == 'image' -%}
    {{- '\n\n<|image|>\n\n' -}}   ← has \n\n
{%- elif item['type'] == 'audio' -%}
    {{- '<|audio|>' -}}            ← missing \n\n
This causes the model to fail completely when image is placed before audio in the message content list.
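To make the asymmetry concrete, the two template branches can be imitated with plain string concatenation. This is a simplified stand-in for the Jinja template, not the template itself: the `render_content` helper is hypothetical, and it ignores everything else the real template does (turn markers, trimming, placeholder expansion).

```python
# Placeholder strings copied from the two template branches quoted above
BEFORE_FIX = {"image": "\n\n<|image|>\n\n", "audio": "<|audio|>"}
AFTER_FIX = {"image": "\n\n<|image|>\n\n", "audio": "\n\n<|audio|>\n\n"}


def render_content(items, placeholders):
    """Concatenate content items the way the image/audio/text branches would."""
    parts = []
    for item in items:
        if item["type"] == "text":
            parts.append(item["text"])
        else:
            parts.append(placeholders[item["type"]])
    return "".join(parts)


text = {"type": "text", "text": "Describe the image and audio."}
image_first = [{"type": "image"}, {"type": "audio"}, text]
audio_first = [{"type": "audio"}, {"type": "image"}, text]

before_image_first = render_content(image_first, BEFORE_FIX)
before_audio_first = render_content(audio_first, BEFORE_FIX)
after_image_first = render_content(image_first, AFTER_FIX)

# repr() makes the newline separators visible
print(repr(before_image_first))
print(repr(before_audio_first))
print(repr(after_image_first))
```

In the image-first case before the fix, the audio placeholder is followed directly by the text, which is the condition the issue identifies as the trigger.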

Reproduction
python
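from transformers import AutoProcessor, AutoModelForMultimodalLM
import torch

# Placeholder paths (not part of the original report): point these at a local image and audio file.
IMAGE_PATH = "path/to/image.jpg"
AUDIO_PATH = "path/to/audio.wav"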

processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-E2B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ❌ image before audio → model fails
messages_image_first = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": IMAGE_PATH},
            {"type": "audio", "audio": AUDIO_PATH},
            {"type": "text", "text": "Describe the image and audio."},
        ]
    }
]

# ✅ audio before image → works correctly
messages_audio_first = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": AUDIO_PATH},
            {"type": "image", "url": IMAGE_PATH},
            {"type": "text", "text": "Describe the image and audio."},
        ]
    }
]
Root Cause
The jinja template inserts \n\n around image tokens but not around audio tokens.

Image-first token sequence (broken):

<|image>...<image|>\n\n<|audio>...<audio|>Describe the image and audio.
                                          ↑ audio token directly concatenated with text, no separator
Audio-first token sequence (correct):

<|audio>...<audio|>\n\n<|image>...<image|>\n\n Describe the image and audio.
                    ↑                      ↑ correct \n\n separators
Note also that before the fix, the two orderings produce different input_ids shapes:

audio-first shape: torch.Size([1, 738])
image-first shape: torch.Size([1, 737])  ← 1 token missing due to missing \n\n
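A shape comparison like the one above could be scripted roughly as follows. This is a sketch that assumes the `processor`, `messages_audio_first`, and `messages_image_first` objects from the reproduction; the helper name `sequence_lengths` is hypothetical, and the keyword arguments are the commonly used ones for `apply_chat_template`, which may differ between transformers versions.

```python
def sequence_lengths(processor, conversations):
    """Map each conversation name to the token length of its rendered prompt."""
    lengths = {}
    for name, messages in conversations.items():
        inputs = processor.apply_chat_template(
            messages,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
            add_generation_prompt=True,
        )
        # input_ids has shape [batch, sequence]; keep the sequence dimension
        lengths[name] = inputs["input_ids"].shape[-1]
    return lengths


# Usage with the objects from the reproduction (requires the model files):
# lengths = sequence_lengths(
#     processor,
#     {"audio-first": messages_audio_first, "image-first": messages_image_first},
# )
# print(lengths)
```

If the two lengths differ for the same media and text, the template is rendering the orderings differently, which is the symptom reported here.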
Evidence
Top 10 next tokens with image-first (before fix):

'<turn|>': 0.7383   ← model immediately ends the turn
'<eos>':   0.1406
Top 10 next tokens with image-first (after fix):

'这张': 0.6797      ← "This (picture)"; model correctly starts generating
'好的': 0.2656      ← "Okay"
Full generation with audio-first (before fix):

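(model output, translated from Chinese)
This image shows a classroom scene. In the picture, a female teacher wearing glasses stands behind the podium...
In the audio, someone is shouting "Look look look at the girl"...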
Full generation with image-first (before fix):

(empty, model outputs <turn|> immediately)
Full generation with image-first (after fix):

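(model output, translated from Chinese)
This image shows a classroom scene, with several students and a teacher...
The audio content appears to be children engaged in some kind of conversation or game...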
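The next-token probabilities listed in this evidence come from a softmax over the model's final-position logits. The sketch below shows that computation in plain Python on toy numbers; the function name `top_k_next_tokens`, the logits, and the three-entry vocabulary are illustrative assumptions, not values from the report.

```python
import math


def top_k_next_tokens(logits, vocab, k=10):
    """Return the k most likely (token, probability) pairs from raw logits."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy logits, chosen only to illustrate the ranking
toy_vocab = ["<turn|>", "<eos>", "hello"]
toy_logits = [2.0, 1.0, 0.0]

for token, prob in top_k_next_tokens(toy_logits, toy_vocab, k=2):
    print(f"{token!r}: {prob:.4f}")
```

With a real model, the logits would be the last row of the output logits for the prompt, and the vocabulary lookup would go through the tokenizer.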
Fix
In chat_template.jinja, change:

jinja
{%- elif item['type'] == 'audio' -%}
    {{- '<|audio|>' -}}
to:

jinja
{%- elif item['type'] == 'audio' -%}
    {{- '\n\n<|audio|>\n\n' -}}
After the fix, both orderings produce identical input_ids shapes and correct outputs.
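For a locally downloaded copy of the template, the same edit can be applied with a small script. This is a sketch under the assumption that the audio branch appears exactly as quoted above; the function name `patch_audio_separator` and the file path in the usage comment are hypothetical.

```python
# Raw strings keep the backslash-n sequences literal, as they appear in the Jinja source
OLD_AUDIO = r"{{- '<|audio|>' -}}"
NEW_AUDIO = r"{{- '\n\n<|audio|>\n\n' -}}"


def patch_audio_separator(template: str) -> str:
    """Add newline separators around the audio token, leaving patched templates untouched."""
    if NEW_AUDIO in template:
        return template  # already patched
    if OLD_AUDIO not in template:
        raise ValueError("audio branch not found in the expected form")
    return template.replace(OLD_AUDIO, NEW_AUDIO)


# Small excerpt mirroring the branches quoted in the bug description
sample = (
    "{%- elif item['type'] == 'image' -%}\n"
    "    {{- '\\n\\n<|image|>\\n\\n' -}}\n"
    "{%- elif item['type'] == 'audio' -%}\n"
    "    {{- '<|audio|>' -}}\n"
)
patched = patch_audio_separator(sample)
print(patched)

# Usage on a file (path is a placeholder):
# from pathlib import Path
# path = Path("path/to/chat_template.jinja")
# path.write_text(patch_audio_separator(path.read_text(encoding="utf-8")), encoding="utf-8")
```

The function is idempotent, so running it twice on the same file does not double the separators.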

Environment
transformers version: 5.5.0
Model: google/gemma-4-E2B-it

extent analysis

TL;DR

To fix the issue, add \n\n separators around the audio token in the chat_template.jinja file.

Guidance

  • Verify that the issue is caused by the missing \n\n separators around the audio token by checking the input_ids shapes and the model's output.
  • Update the chat_template.jinja file to include \n\n separators around the audio token, as shown in the fix section of the issue.
  • Test the model with both image-first and audio-first orderings to ensure that the issue is resolved and the model produces correct outputs.
  • Check the input_ids shapes to confirm that they are identical for both orderings after the fix.

Example

The corrected audio token template should look like this:

{%- elif item['type'] == 'audio' -%}
    {{- '\n\n<|audio|>\n\n' -}}

Notes

This fix assumes that the issue is caused by the missing \n\n separators around the audio token. If the issue persists after applying this fix, further investigation may be needed.

Recommendation

Update chat_template.jinja to include \n\n separators around the audio token. The issue reports that this change resolves the failure and produces correct outputs for both orderings.
