transformers - Transformers version changes the tokenization [1 comment, 2 participants]

Q: Expected behavior

This tokenization should appear with the 5.X transformers version:

['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']

transformers, 2026-04-29 12:19:16

GitHub stats

huggingface/transformers#45701 • Fetched 2026-04-30 06:18:22

Author

PiRom1

Participants

PiRom1

zeel2104

Timeline (top)

subscribed ×3, mentioned ×2, commented ×1, labeled ×1


System Info

Python: 3.12.3
transformers: 5.7.0
torch       : 2.11.0+cu130
sentencepiece: 0.2.1

Platform: Linux-6.17.0-22-generic-x86_64-with-glibc2.39
Python version: 3.12.3
Huggingface_hub version: 0.36.2
Safetensors version: 0.7.0
Accelerate version: 1.12.0
Accelerate config: not found
DeepSpeed version: not installed
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: <fill in>
Using GPU in script?: <fill in>
GPU type: NVIDIA GeForce RTX 5060 Laptop GPU

Who can help?

@ArthurZucker @Cyrilvallez

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Tokenization differs according to the transformers version.

In 4.X: tokenization is normal (subword level)
In 5.X: tokenization is at the character level

Below is the toy script I ran in two separate virtual environments. The transformers version was the intended difference (5.7.0 vs 4.57.6); as the outputs show, the torch build also differs (2.11.0+cu130 vs 2.10.0+cu128):

# Imports grouped at the top; the script prints the library versions first.
import sentencepiece
import torch
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer

print("transformers:", transformers.__version__)
print("torch       :", torch.__version__)
print("sentencepiece:", sentencepiece.__version__)

MODEL_DIR = "almanach/camembertv2-base"

# The model is loaded as in the original script; only the tokenizer matters for this issue.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

# Tokenize one sentence and show both the ids and the corresponding tokens.
ids = tokenizer("This is a text example, You could have written anything else if necessary !", return_tensors="pt")["input_ids"]
print("Token IDs:", ids.tolist())
print("Tokens   :", tokenizer.convert_ids_to_tokens(ids[0]))

Outputs with transformers 5.7.0

transformers: 5.7.0
torch       : 2.11.0+cu130
sentencepiece: 0.2.1
Token IDs: [[1, 233, 59, 79, 80, 90, 233, 80, 90, 233, 72, 233, 91, 76, 95, 91, 233, 76, 95, 72, 84, 87, 83, 76, 19, 233, 64, 86, 92, 233, 74, 86, 92, 83, 75, 233, 79, 72, 93, 76, 233, 94, 89, 80, 91, 91, 76, 85, 233, 72, 85, 96, 91, 79, 80, 85, 78, 233, 76, 83, 90, 76, 233, 80, 77, 233, 85, 76, 74, 76, 90, 90, 72, 89, 96, 233, 8, 2]]
Tokens   : ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a', 'Ġ', 't', 'e', 'x', 't', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', ',', 'Ġ', 'Y', 'o', 'u', 'Ġ', 'c', 'o', 'u', 'l', 'd', 'Ġ', 'h', 'a', 'v', 'e', 'Ġ', 'w', 'r', 'i', 't', 't', 'e', 'n', 'Ġ', 'a', 'n', 'y', 't', 'h', 'i', 'n', 'g', 'Ġ', 'e', 'l', 's', 'e', 'Ġ', 'i', 'f', 'Ġ', 'n', 'e', 'c', 'e', 's', 's', 'a', 'r', 'y', 'Ġ', '!', '[SEP]']

Outputs with transformers 4.57.6

transformers: 4.57.6
torch       : 2.10.0+cu128
sentencepiece: 0.2.1
Token IDs: [[1, 13711, 5806, 72, 15532, 28108, 19, 9619, 26650, 13592, 94, 5152, 4954, 22118, 5259, 4985, 10723, 4766, 16562, 25007, 8346, 8, 2]]
Tokens   : ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']
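
A quick way to see what changed is to classify each token list by the subword marker it uses. The sketch below is only an illustration: describe_tokens is a hypothetical helper, not part of transformers, and the two lists are prefixes of the outputs reported above. It relies on two conventions: byte-level BPE tokenizers mark a leading space with 'Ġ', and WordPiece tokenizers mark word continuations with '##'.

```python
# Prefixes of the token lists reported above (first 11 and first 13 tokens).
TOKENS_5X = ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a']
TOKENS_4X = ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You',
             'could', 'have', 'w', '##rit', '##ten']

SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]")


def describe_tokens(tokens, special=SPECIAL_TOKENS):
    """Summarize which subword convention a token list appears to follow.

    Hypothetical helper for illustration only.
    """
    body = [t for t in tokens if t not in special]
    single = sum(1 for t in body if len(t) == 1)
    return {
        # 'Ġ' is the space marker used by byte-level BPE tokenizers.
        "byte_level_marker": any(t.startswith("Ġ") for t in body),
        # '##' is the continuation prefix used by WordPiece tokenizers.
        "wordpiece_marker": any(t.startswith("##") for t in body),
        # Share of non-special tokens that are a single character.
        "single_char_ratio": single / len(body) if body else 0.0,
    }


print("5.7.0 :", describe_tokens(TOKENS_5X))
print("4.57.6:", describe_tokens(TOKENS_4X))
```

On these prefixes, the 5.7.0 list consists entirely of single characters and carries the byte-level marker, while the 4.57.6 list carries the WordPiece prefix. That is consistent with a different tokenizer type being built for the checkpoint, rather than the same tokenizer running with different settings.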

Expected behavior

This tokenization:

['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']

Should appear with the 5.X transformers version.

extent analysis

TL;DR

A possible workaround is to pass use_fast=False to AutoTokenizer.from_pretrained and check whether the 4.X tokenization is restored. This is a suggestion to try, not a confirmed fix.

Guidance

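The tokenization differs between transformers 4.X and 5.X with the same model and the same sentencepiece version.
The 5.X output uses 'Ġ', the space marker of byte-level BPE tokenizers, while the 4.X output uses '##', the WordPiece continuation prefix. This suggests that 5.X builds a different kind of tokenizer for this checkpoint rather than applying different settings to the same one.
The use_fast parameter of AutoTokenizer.from_pretrained selects between the Rust-backed ("fast") and pure-Python ("slow") implementation where both exist. Setting use_fast=False is a workaround worth trying; it is not confirmed to restore the 4.X behavior.
Verify the tokenization output after making this change to ensure it matches the expected behavior.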
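
The verification step can be sketched as a small check that compares a tokenizer's output with the expected token list. check_tokenization and StubTokenizer are hypothetical names introduced here for illustration. The stub only stands in for a real tokenizer so that the sketch runs offline; the check itself uses just the call syntax and convert_ids_to_tokens, both of which appear in the reproduction script.

```python
def check_tokenization(tokenizer, text, expected_tokens):
    """Return (matches, actual_tokens) for one input sentence.

    Hypothetical helper: works with any object that can be called on a
    string to get {"input_ids": [...]} and has convert_ids_to_tokens().
    """
    ids = tokenizer(text)["input_ids"]
    actual = tokenizer.convert_ids_to_tokens(ids)
    return actual == expected_tokens, actual


class StubTokenizer:
    """Stand-in for a real tokenizer (assumption: whitespace splitting only)."""

    def __init__(self, vocab):
        self.vocab = vocab  # list of token strings; the index is the id

    def __call__(self, text):
        return {"input_ids": [self.vocab.index(w) for w in text.split()]}

    def convert_ids_to_tokens(self, ids):
        return [self.vocab[i] for i in ids]


stub = StubTokenizer(["This", "is", "a"])
matches, actual = check_tokenization(stub, "This is a", ["This", "is", "a"])
print(matches, actual)
```

With the real library, the same function would be called with the tokenizer loaded in the reproduction script, the sentence from the issue, and the expected list from the "Expected behavior" section.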

Example

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)

Notes

The use_fast parameter is not specific to 5.X: it also exists in transformers 4.X, where it defaults to True. Whether it still switches the tokenizer implementation in 5.7.0 should be checked against the 5.X documentation.
The sentencepiece version is the same in both cases (0.2.1), so it is unlikely to be the cause of the issue.

Recommendation

Try the workaround: pass use_fast=False when creating the AutoTokenizer instance, then compare the result with the expected tokenization. Because use_fast already defaulted to True in 4.X, a changed default alone does not explain the difference, so the workaround may not help. If it does not, pinning transformers to 4.57.6, which produced the expected output in the report, is the fallback until the regression is addressed upstream.
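
Until the cause is confirmed, a script can at least flag when it runs on a version line where the problem was reported. The sketch below is a minimal illustration under one assumption: the threshold (major version 5) comes only from the two versions tested in the report, 4.57.6 and 5.7.0. Both function names are hypothetical.

```python
def major_version(version_string):
    """Extract the major version number from a string such as '5.7.0'."""
    return int(version_string.split(".")[0])


def needs_tokenization_check(version_string, first_affected_major=5):
    """Return True when the version is on a line where the issue was reported.

    Assumption: 5.X is affected and 4.X is not, based only on the two
    versions tested in the report.
    """
    return major_version(version_string) >= first_affected_major


for version in ("4.57.6", "5.7.0"):
    print(version, needs_tokenization_check(version))
```

In a real script the argument would be transformers.__version__, and a True result would be the cue to run the tokenization check before using the model.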
