transformers - 💡(How to fix) Fix Roundtrip Failure for Gemma Pipeline on "▁"

System Info

Arch Linux, Python 3.14.4, Transformers 5.9.0

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import transformers

pipeline = transformers.pipeline("text-generation", "google/gemma-4-E2B-it")
print(pipeline([{"role": "user", "content": "What is the difference between \" \" and \"▁\"?"}]))

This causes the model to be confused:

  1. It seems to perceive the user input as 'What is the difference between " " and " "?' and often comments on the fact that the user entered two identical strings.
  2. When it repeats parts of the question, it outputs " " and " ".

This is due to the GemmaTokenizer normalizing all spaces to "▁" and then the decoder turning the "▁" back into spaces. Thus the model has no way of distinguishing them.
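The collapse described above can be sketched in a few lines of plain Python. This is a simplified stand-in for the normalize/decode steps, not the actual GemmaTokenizer code; the function names `normalize` and `decode` are hypothetical and exist only for illustration.

```python
# Simplified model of the behavior described in this issue.
# NOT the real GemmaTokenizer implementation; names are hypothetical.
SPACE = " "
META = "\u2581"  # "▁", the SentencePiece-style space marker


def normalize(text: str) -> str:
    """Pre-tokenization step: every ASCII space becomes the marker."""
    return text.replace(SPACE, META)


def decode(pieces: str) -> str:
    """Decoding step: every marker becomes an ASCII space."""
    return pieces.replace(META, SPACE)


prompt = 'What is the difference between " " and "\u2581"?'
normalized = normalize(prompt)
restored = decode(normalized)

# After normalization, the two quoted strings are already identical,
# so nothing downstream can tell them apart.
print(normalized)
print(restored)
print(restored == prompt)
```

Under this model, the literal "▁" typed by the user and the marker produced from a space share one representation, which matches the symptom that the model sees two identical strings.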

Expected behavior

The difference between " " and "▁" should be preserved. This works on the Google APIs:

key="<key>"
model="gemma-4-26b-a4b-it"
api="streamGenerateContent"
url="https://generativelanguage.googleapis.com/v1beta/models/$model:$api?key=$key"

cat << EOF > request.json
{
	"contents": [{ "role": "user", "parts": [{ "text": "What is the difference between \" \" and \"▁\"?" }] }],
	"generationConfig": { "thinkingConfig": { "thinkingLevel": "MINIMAL" } }
}
EOF

curl --header "Content-Type: application/json" --data @request.json "$url"

Here, the model is not confused and can apparently distinguish between " " and "▁". It is also able to output " " and "▁" as different characters.

It seems like the model can be set up in such a way that it correctly handles these characters, as evidenced by the Google implementation. Thus, I feel like the transformers pipeline should also handle this correctly.
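The expectation can also be phrased as a round-trip check. The sketch below is a hypothetical helper (`roundtrip_failures` is not part of transformers) that takes any pair of encode/decode callables and reports inputs that do not survive. The two stand-in codecs are assumptions made for illustration: one models the lossy space/"▁" handling described in this issue, the other is a trivially lossless codec for comparison.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

META = "\u2581"  # "▁"


def roundtrip_failures(
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    samples: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (original, decoded) pairs where decode(encode(s)) != s."""
    failures = []
    for sample in samples:
        decoded = decode(encode(sample))
        if decoded != sample:
            failures.append((sample, decoded))
    return failures


# Stand-in codecs (assumptions, not real tokenizers).
def lossy_encode(text: str) -> List[int]:
    # Spaces are rewritten to the marker before "tokenizing" to code points.
    return [ord(ch) for ch in text.replace(" ", META)]


def lossy_decode(ids: Sequence[int]) -> str:
    # Every marker is turned back into a space, including literal ones.
    return "".join(chr(i) for i in ids).replace(META, " ")


def lossless_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def lossless_decode(ids: Sequence[int]) -> str:
    return "".join(chr(i) for i in ids)


samples = ["hello world", 'between " " and "\u2581"', "no_spaces"]
print(roundtrip_failures(lossy_encode, lossy_decode, samples))
print(roundtrip_failures(lossless_encode, lossless_decode, samples))
```

To try this against the real tokenizer, one could load it with `transformers.AutoTokenizer.from_pretrained("google/gemma-4-E2B-it")` and pass `lambda s: tok.encode(s, add_special_tokens=False)` and `tok.decode` as the two callables. A reported failure there is not necessarily this issue, since decoding can differ from the input for other reasons as well.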
