transformers - (How to fix) Tokenization changes with the transformers version [1 comment, 2 participants]

huggingface/transformers#45701 · Fetched 2026-04-30 06:18:22
System Info

Python: 3.12.3
transformers: 5.7.0
torch       : 2.11.0+cu130
sentencepiece: 0.2.1

  • Platform: Linux-6.17.0-22-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 0.36.2
  • Safetensors version: 0.7.0
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: <fill in>
  • Using GPU in script?: <fill in>
  • GPU type: NVIDIA GeForce RTX 5060 Laptop GPU

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Tokenization of the same input differs depending on the transformers version.

  • In 4.X: tokenization is normal (subword level)
  • In 5.X: tokenization is at the character level

Below is the toy script I ran in two separate virtual environments. The intended difference between them was the transformers version (5.7.0 vs 4.57.6); the torch build also differs, as the outputs show:

# Reproduction script: print the library versions, then tokenize one sentence.
import sentencepiece
import torch
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer

print("transformers:", transformers.__version__)
print("torch       :", torch.__version__)
print("sentencepiece:", sentencepiece.__version__)

MODEL_DIR = "almanach/camembertv2-base"

# The model is loaded as in the original report; only the tokenizer is exercised below.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

# Tokenize one sentence and show both the IDs and the corresponding tokens.
ids = tokenizer("This is a text example, You could have written anything else if necessary !", return_tensors="pt")["input_ids"]
print("Token IDs:", ids.tolist())
print("Tokens   :", tokenizer.convert_ids_to_tokens(ids[0]))

Outputs with transformers 5.7.0

transformers: 5.7.0
torch       : 2.11.0+cu130
sentencepiece: 0.2.1
Token IDs: [[1, 233, 59, 79, 80, 90, 233, 80, 90, 233, 72, 233, 91, 76, 95, 91, 233, 76, 95, 72, 84, 87, 83, 76, 19, 233, 64, 86, 92, 233, 74, 86, 92, 83, 75, 233, 79, 72, 93, 76, 233, 94, 89, 80, 91, 91, 76, 85, 233, 72, 85, 96, 91, 79, 80, 85, 78, 233, 76, 83, 90, 76, 233, 80, 77, 233, 85, 76, 74, 76, 90, 90, 72, 89, 96, 233, 8, 2]]
Tokens   : ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a', 'Ġ', 't', 'e', 'x', 't', 'Ġ', 'e', 'x', 'a', 'm', 'p', 'l', 'e', ',', 'Ġ', 'Y', 'o', 'u', 'Ġ', 'c', 'o', 'u', 'l', 'd', 'Ġ', 'h', 'a', 'v', 'e', 'Ġ', 'w', 'r', 'i', 't', 't', 'e', 'n', 'Ġ', 'a', 'n', 'y', 't', 'h', 'i', 'n', 'g', 'Ġ', 'e', 'l', 's', 'e', 'Ġ', 'i', 'f', 'Ġ', 'n', 'e', 'c', 'e', 's', 's', 'a', 'r', 'y', 'Ġ', '!', '[SEP]']

Outputs with transformers 4.57.6

transformers: 4.57.6
torch       : 2.10.0+cu128
sentencepiece: 0.2.1
Token IDs: [[1, 13711, 5806, 72, 15532, 28108, 19, 9619, 26650, 13592, 94, 5152, 4954, 22118, 5259, 4985, 10723, 4766, 16562, 25007, 8346, 8, 2]]
Tokens   : ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']
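
The difference between the two outputs can be quantified without loading any model. The sketch below is an illustration only: the helper name `summarize` and the shortened token lists are assumptions made for this example, with the tokens copied from the start of each output above. It counts single-character tokens and the two marker styles that appear: 'Ġ', the leading-space marker conventionally used by byte-level BPE tokenizers, and '##', the continuation prefix conventionally used by WordPiece.

```python
# Shortened token lists copied from the beginning of the two outputs above.
TOKENS_5X = ['[CLS]', 'Ġ', 'T', 'h', 'i', 's', 'Ġ', 'i', 's', 'Ġ', 'a', '[SEP]']
TOKENS_4X = ['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You',
             'could', 'have', 'w', '##rit', '##ten', '[SEP]']

# Special tokens are excluded so they do not skew the statistics.
SPECIAL = {'[CLS]', '[SEP]'}


def summarize(tokens):
    """Return simple statistics describing the granularity of a token list.

    Hypothetical helper for this illustration; not part of transformers.
    """
    body = [t for t in tokens if t not in SPECIAL]
    total = max(len(body), 1)  # avoid division by zero on empty input
    return {
        "count": len(body),
        "single_char_ratio": sum(len(t) == 1 for t in body) / total,
        "byte_level_markers": sum(t.startswith("Ġ") for t in body),
        "wordpiece_markers": sum(t.startswith("##") for t in body),
    }


if __name__ == "__main__":
    print("5.X:", summarize(TOKENS_5X))
    print("4.X:", summarize(TOKENS_4X))
```

In the 5.X sample every non-special token is a single character and the only markers are 'Ġ'; in the 4.X sample only '##' markers appear. This is consistent with the two versions applying different tokenization schemes to the same checkpoint, although the report itself does not identify the cause.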

Expected behavior

Transformers 5.X should produce the same tokenization as 4.57.6:

['[CLS]', 'This', 'is', 'a', 'text', 'example', ',', 'You', 'could', 'have', 'w', '##rit', '##ten', 'any', '##th', '##ing', 'el', '##se', 'if', 'necess', '##ary', '!', '[SEP]']

extent analysis

TL;DR

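With almanach/camembertv2-base, transformers 5.7.0 tokenizes at the character level while 4.57.6 tokenizes at the subword level. Passing use_fast=False to AutoTokenizer.from_pretrained is a suggested workaround, but the issue does not show it being tested; the only configuration shown to give the expected output is transformers 4.57.6.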

Guidance

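  • The model, input text, and sentencepiece version are identical in both runs; the intended change is the transformers version (4.57.6 vs 5.7.0).
  • The 5.X tokens carry the 'Ġ' space marker typical of byte-level BPE tokenizers, while the 4.X tokens use the '##' continuation prefix typical of WordPiece. This suggests that 5.X builds a different tokenization pipeline for this checkpoint, though the issue does not confirm the cause.
  • Passing use_fast=False to AutoTokenizer.from_pretrained is a candidate workaround, not a confirmed fix.
  • After any change, compare the new output with the 4.57.6 token IDs listed above to confirm that it matches the expected behavior.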
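
The verification step can be turned into a small regression check. The following is a minimal sketch, assuming the 4.57.6 token IDs from the report are treated as the reference; the names `EXPECTED_IDS`, `OBSERVED_5X_PREFIX`, and `first_mismatch` are introduced here for illustration and do not come from the issue or from transformers.

```python
# Reference IDs copied from the 4.57.6 output in the report.
EXPECTED_IDS = [1, 13711, 5806, 72, 15532, 28108, 19, 9619, 26650, 13592, 94,
                5152, 4954, 22118, 5259, 4985, 10723, 4766, 16562, 25007,
                8346, 8, 2]

# First few IDs copied from the 5.7.0 output in the report.
OBSERVED_5X_PREFIX = [1, 233, 59, 79, 80, 90, 233, 80, 90]


def first_mismatch(actual, expected):
    """Return the index of the first differing ID, or None if both lists match.

    Hypothetical helper for this illustration.
    """
    for index, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return index
    if len(actual) != len(expected):
        # One list is a strict prefix of the other.
        return min(len(actual), len(expected))
    return None


if __name__ == "__main__":
    print("self check  :", first_mismatch(EXPECTED_IDS, EXPECTED_IDS))
    print("5.7.0 output:", first_mismatch(OBSERVED_5X_PREFIX, EXPECTED_IDS))
```

With a real tokenizer, `actual` would be the list returned by `tokenizer(text)["input_ids"]` for the sentence used in the report. A result of `None` means the output matches the 4.57.6 reference.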

Example

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)

Notes

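  • The use_fast parameter is not new in 5.X: AutoTokenizer.from_pretrained also accepts it in the 4.X series, where it defaults to True. Whether it still changes which tokenizer is loaded in 5.7.0 is not established by the issue.
  • The sentencepiece version is the same in both cases, so it is unlikely to be the cause. The torch build differs (2.11.0+cu130 vs 2.10.0+cu128), but torch is used here only to wrap the IDs in a tensor via return_tensors="pt".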
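
To make the environment comparison explicit, the version lines printed by the two runs can be diffed. This is a small sketch; the dictionaries are transcribed from the outputs in the report, and `diff_envs` is a hypothetical helper name chosen for this example.

```python
# Versions transcribed from the two outputs in the report.
ENV_5X = {"transformers": "5.7.0", "torch": "2.11.0+cu130", "sentencepiece": "0.2.1"}
ENV_4X = {"transformers": "4.57.6", "torch": "2.10.0+cu128", "sentencepiece": "0.2.1"}


def diff_envs(a, b):
    """Return {package: (version_in_a, version_in_b)} for packages that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    for name, (left, right) in diff_envs(ENV_5X, ENV_4X).items():
        print(f"{name}: {left} vs {right}")
```

Only transformers and torch differ between the two environments; sentencepiece is identical.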

Recommendation

Try the workaround: pass use_fast=False when creating the AutoTokenizer instance, then verify the output against the 4.57.6 tokens. The issue does not confirm that the default of use_fast changed between 4.X and 5.X, so this may not restore the expected behavior. If the output is still at the character level, pinning transformers to 4.57.6, the version shown to work in the report, is the safer option until the issue is resolved.
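
If the same code has to run under both major versions, the extra argument can be applied conditionally. The sketch below is an assumption-laden illustration: `major_version` and `tokenizer_kwargs_for` are hypothetical helpers, and nothing in the issue confirms that use_fast=False restores the 4.X tokenization under 5.X.

```python
def major_version(version):
    """Return the leading integer of a version string such as '5.7.0'."""
    return int(version.split(".")[0])


def tokenizer_kwargs_for(version):
    """Choose extra keyword arguments for AutoTokenizer.from_pretrained.

    Assumption: the workaround is only wanted on 5.X and later, where the
    report shows character-level output. Its effectiveness is unverified.
    """
    if major_version(version) >= 5:
        return {"use_fast": False}
    return {}


if __name__ == "__main__":
    for v in ("4.57.6", "5.7.0"):
        print(v, "->", tokenizer_kwargs_for(v))
```

In a real script the result would be passed along as `AutoTokenizer.from_pretrained(MODEL_DIR, **tokenizer_kwargs_for(transformers.__version__))`, followed by a check of the tokens against the expected output.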
