transformers - 💡(How to fix) Fix Roundtrip Failure for Gemma Pipeline on "▁"

transformers · 2026-05-25 13:08:53

System Info

Arch Linux, Python 3.14.4, Transformers 5.9.0

Who can help?

@ArthurZucker @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

import transformers

pipeline = transformers.pipeline("text-generation", "google/gemma-4-E2B-it")
print(pipeline([{"role": "user", "content": "What is the difference between \" \" and \"▁\"?"}]))

This confuses the model:

It seems to perceive the user input as: What is the difference between " " and " "? It often comments that the user entered two identical strings.
When it repeats parts of the question, it outputs " " and " ".

This happens because GemmaTokenizer normalizes all spaces to "▁", and the decoder then turns every "▁" back into a space. A literal "▁" in the input therefore becomes indistinguishable from a space, so the model has no way of telling them apart.
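To make the collision concrete, here is a minimal toy model of the normalize/decode steps as described above. It is only a sketch: `toy_normalize` and `toy_decode` are hypothetical helpers written for illustration, not the actual GemmaTokenizer code, and they assume the normalizer is a plain space-to-"▁" replacement.

```python
# Toy model of the behavior described in this issue; NOT the real GemmaTokenizer.
SPACE_MARKER = "▁"  # U+2581, the SentencePiece-style space marker


def toy_normalize(text: str) -> str:
    """Replace every ASCII space with the marker (assumed normalizer behavior)."""
    return text.replace(" ", SPACE_MARKER)


def toy_decode(normalized: str) -> str:
    """Turn every marker back into an ASCII space (assumed decoder behavior)."""
    return normalized.replace(SPACE_MARKER, " ")


prompt = 'What is the difference between " " and "▁"?'
roundtrip = toy_decode(toy_normalize(prompt))

# A space and a literal marker normalize to the same string.
print(toy_normalize(" ") == toy_normalize(SPACE_MARKER))  # True
# The roundtrip yields two identical quoted strings, as the model reports.
print(roundtrip)  # What is the difference between " " and " "?
print(roundtrip == prompt)  # False
```

Under these assumptions the information is lost at the normalization step already, before the model sees any token, which matches the observed confusion.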

Expected behavior

The difference between " " and "▁" should be preserved. This works on the Google APIs:

key="<key>"
model="gemma-4-26b-a4b-it"
api="streamGenerateContent"
url="https://generativelanguage.googleapis.com/v1beta/models/$model:$api?key=$key"

cat << EOF > request.json
{
	"contents": [{ "role": "user", "parts": [{ "text": "What is the difference between \" \" and \"▁\"?" }] }],
	"generationConfig": { "thinkingConfig": { "thinkingLevel": "MINIMAL" } }
}
EOF

curl --header "Content-Type: application/json" --data @request.json "$url"

Here, the model is not confused and can apparently distinguish between " " and "▁". It is also able to output " " and "▁" as different characters.

It seems like the model can be set up in such a way that it correctly handles these characters, as evidenced by the Google implementation. Thus, I feel like the transformers pipeline should also handle this correctly.
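The expected behavior can be phrased as a roundtrip property: decoding the encoded text should give back the exact input. The sketch below checks that property for any encode/decode pair. The two stand-in codecs (`lossy_encode`, `lossless_encode`) are hypothetical and exist only so the example runs without downloading a model. They are not how transformers or the Google API work internally.

```python
from typing import Callable, List


def roundtrips(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    text: str,
) -> bool:
    """Return True if decode(encode(text)) reproduces text exactly."""
    return decode(encode(text)) == text


def lossy_encode(text: str) -> List[int]:
    """Stand-in that maps "▁" and " " to the same id, like the reported bug."""
    return [ord(" ") if ch == "▁" else ord(ch) for ch in text]


def lossless_encode(text: str) -> List[int]:
    """Stand-in that keeps "▁" and " " as distinct ids."""
    return [ord(ch) for ch in text]


def decode_chars(ids: List[int]) -> str:
    """Shared stand-in decoder: one id per character."""
    return "".join(chr(i) for i in ids)


prompt = 'What is the difference between " " and "▁"?'
print(roundtrips(lossy_encode, decode_chars, prompt))     # False
print(roundtrips(lossless_encode, decode_chars, prompt))  # True
```

To try it against the real tokenizer, one could load it with `transformers.AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")` and pass `lambda s: tok.encode(s, add_special_tokens=False)` and `tok.decode` as the two callables. That variant was not run here, so its result is not stated.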
