transformers: AutoTokenizer produces wrong token IDs for OLMo2, HyperClovaX, DeepSeek-R1-Distill-Llama, Yi, and others (v5 regression)

System Info

transformers version: 5.8.0 (also reproduced on 5.0.0 through 5.7.0)
Platform: Linux-5.14.0-503.11.1.el9_5.x86_64-x86_64-with-glibc2.34
Python version: 3.12.13
Huggingface_hub version: 1.14.0
Safetensors version: 0.7.0
Tokenizers version: 0.22.2

Who can help?

@ArthurZucker and @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer, PreTrainedTokenizerFast

# --- OLMo2 (GPT2Tokenizer) ---
model_id = "allenai/OLMo-2-0425-1B"

tok_v5 = AutoTokenizer.from_pretrained(model_id)                # WRONG
tok_correct = PreTrainedTokenizerFast.from_pretrained(model_id)  # CORRECT

print(tok_v5.encode("2023", add_special_tokens=False))       # [508, 1419] <- wrong
print(tok_correct.encode("2023", add_special_tokens=False))  # [2366, 18]  <- correct

print(tok_v5.encode("650841823", add_special_tokens=False))      # [13655, 5833, 972, 1419] <- wrong
print(tok_correct.encode("650841823", add_special_tokens=False))  # [13655, 25496, 23848]    <- correct

# --- DeepSeek-R1-Distill-Llama (LlamaTokenizer) ---
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

tok_v5 = AutoTokenizer.from_pretrained(model_id)                # WRONG
tok_correct = PreTrainedTokenizerFast.from_pretrained(model_id)  # CORRECT

print(tok_v5.encode("2023", add_special_tokens=False))       # [508, 1419] <- wrong
print(tok_correct.encode("2023", add_special_tokens=False))  # [2366, 18]  <- correct

# --- HyperClovaX (GPT2Tokenizer) ---
model_id = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"

tok_v5 = AutoTokenizer.from_pretrained(model_id)                # WRONG
tok_correct = PreTrainedTokenizerFast.from_pretrained(model_id)  # CORRECT

print(tok_v5.encode("2023", add_special_tokens=False))       # [508, 1419] <- wrong
print(tok_correct.encode("2023", add_special_tokens=False))  # [2366, 18]  <- correct

Expected behavior

Follow-up to #45812 (Granite). Same root cause, additional affected model families identified.

To check the scale of this issue, I downloaded the tokenizers for all models in the Hugging Face top 1000 downloads + trending lists as of last week. Of the ~1700 unique models that could be loaded, 62 are affected. After the existing fixes (DeepSeek V3 via #44779, DeepSeek R1-Distill-Qwen via #45741), ~30 models remain broken, with a combined 3M+ downloads.
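A check of this kind could be automated roughly as follows. This is a minimal sketch, not the script used for the scan: `disagreeing_probes`, `check_model`, and `PROBES` are hypothetical names, and the probe strings are the two numeric inputs from the Reproduction section plus one arbitrary extra. The comparison helper takes plain encode callables so it can be exercised without downloading anything.

```python
from typing import Callable, Iterable, List

# Probe strings: the two numeric inputs come from the Reproduction section;
# "hello world" is an arbitrary extra probe (assumption).
PROBES = ["2023", "650841823", "hello world"]

Encode = Callable[[str], List[int]]


def disagreeing_probes(encode_a: Encode, encode_b: Encode,
                       probes: Iterable[str] = PROBES) -> List[str]:
    """Return the probes on which the two encoders produce different IDs."""
    return [text for text in probes if encode_a(text) != encode_b(text)]


def check_model(model_id: str) -> List[str]:
    """Compare AutoTokenizer against PreTrainedTokenizerFast for one repo.

    Needs transformers and network access; the import is lazy so that
    disagreeing_probes stays usable on its own.
    """
    from transformers import AutoTokenizer, PreTrainedTokenizerFast

    auto_tok = AutoTokenizer.from_pretrained(model_id)
    fast_tok = PreTrainedTokenizerFast.from_pretrained(model_id)
    return disagreeing_probes(
        lambda s: auto_tok.encode(s, add_special_tokens=False),
        lambda s: fast_tok.encode(s, add_special_tokens=False),
    )
```

An empty result only means the two loaders agree on these particular probes; a broader probe set would be needed to rule out a mismatch.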

As in #45812, the issue is the pre-tokenizer mismatch caused by the branching in AutoTokenizer. In my analysis only GPT2Tokenizer and LlamaTokenizer were affected, but any tokenizer class with a custom __init__ could potentially be affected as well.
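The pre-tokenizer mismatch can also be checked directly instead of through token IDs, by comparing the serialized tokenizer state. A minimal sketch, assuming both loaders expose a `tokenizers.Tokenizer` through `backend_tokenizer` (this holds for `PreTrainedTokenizerFast`; it is an assumption for whatever class `AutoTokenizer` returns in v5). `pre_tokenizer_of` and `same_pre_tokenizer` are hypothetical helper names, and the inputs below are toy stand-ins for real `tokenizer.json` contents.

```python
import json
from typing import Any, Optional


def pre_tokenizer_of(tokenizer_json: str) -> Optional[Any]:
    """Extract the "pre_tokenizer" section from a serialized tokenizer."""
    return json.loads(tokenizer_json).get("pre_tokenizer")


def same_pre_tokenizer(json_a: str, json_b: str) -> bool:
    """True when both serialized tokenizers configure the same pre-tokenizer."""
    return pre_tokenizer_of(json_a) == pre_tokenizer_of(json_b)


# Toy inputs standing in for real tokenizer.json contents (illustrative only).
metaspace = json.dumps({"pre_tokenizer": {"type": "Metaspace"}})
byte_level = json.dumps({"pre_tokenizer": {"type": "ByteLevel"}})

print(same_pre_tokenizer(metaspace, byte_level))  # False
```

With real tokenizers, the JSON strings would come from `tok.backend_tokenizer.to_str()` on each of the two loaded objects.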

Full list of affected models (as of last week): affected_models.jsonl.zip

Still broken (no fix in v5.8.0)

| model_type | v5 class | Example model | Downloads |
|---|---|---|---|
| `olmo2` | GPT2Tokenizer | `allenai/OLMo-2-0425-1B` | 83K |
| `hyperclovax` | GPT2Tokenizer | `naver-hyperclovax/HyperCLOVAX-SEED-Think-14B` | 39K |
| `granite` | GPT2Tokenizer | `ibm-granite/granite-4.1-8b` | 14K |
| `granitemoehybrid` | GPT2Tokenizer | `ibm-granite/granite-4.0-micro` | 400K |
| `llama` (DeepSeek distills) | LlamaTokenizer | `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` | 1.97M |
| `llama` (Yi) | LlamaTokenizer | `01-ai/Yi-34B-Chat` | 27K |
| `ernie` | LlamaTokenizer | `baidu/ERNIE-4.5-21B-A3B-Thinking` | 35K |
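Until a fix lands, the Reproduction section points to a workaround: load affected repos with `PreTrainedTokenizerFast` instead of `AutoTokenizer`. The sketch below is keyed by repo ID rather than by model_type, because affected and unaffected models can share `model_type="llama"`. `AFFECTED_REPOS`, `needs_fast_loader`, and `load_tokenizer` are hypothetical names; the set holds only the example repos from the table, not the full list. The Reproduction demonstrates the workaround for three of these repos, and applying it to the others is an assumption based on the shared root cause.

```python
# Example repos from the "Still broken" table; not exhaustive.
# The workaround is demonstrated in the Reproduction only for OLMo2,
# DeepSeek-R1-Distill-Llama and HyperClovaX; the rest are assumed to behave alike.
AFFECTED_REPOS = frozenset({
    "allenai/OLMo-2-0425-1B",
    "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B",
    "ibm-granite/granite-4.1-8b",
    "ibm-granite/granite-4.0-micro",
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "01-ai/Yi-34B-Chat",
    "baidu/ERNIE-4.5-21B-A3B-Thinking",
})


def needs_fast_loader(model_id: str) -> bool:
    """True if the repo is in the known-affected set."""
    return model_id in AFFECTED_REPOS


def load_tokenizer(model_id: str):
    """Load with PreTrainedTokenizerFast for affected repos, AutoTokenizer otherwise."""
    # Lazy import: needs transformers and network access when actually called.
    from transformers import AutoTokenizer, PreTrainedTokenizerFast

    loader = PreTrainedTokenizerFast if needs_fast_loader(model_id) else AutoTokenizer
    return loader.from_pretrained(model_id)
```

A hard-coded set like this goes stale quickly, so it is better suited to pinning a handful of repos in one project than to a general fix.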

Already fixed

| model_type | Fix | PR/Issue |
|---|---|---|
| `deepseek_v3` | Added to `MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS` | #44779 (v5.3.0) |
| `deepseek_v2` | Added to `MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS` | #44779 (v5.3.0) |
| `qwen2` (DeepSeek distills) | Routed through `Qwen2Tokenizer` | #45741 (v5.8.0) |

Problematic cases

DeepSeek-R1-Distill-Llama-8B and Yi-34B-Chat both have model_type="llama" and tokenizer_class: "LlamaTokenizerFast" in tokenizer_config.json. Both TOKENIZER_MAPPING_NAMES["llama"] ("LlamaTokenizer") and the hub class ("LlamaTokenizerFast") resolve to "LlamaTokenizer" after .removesuffix("Fast"), so no mismatch is detected and the override set is never consulted.

Neither mechanism can fix these models without breaking actual Llama models:

MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS: Adding "llama" would force ALL Llama models to TokenizersBackend, breaking Meta's Llama 3/4 models that genuinely use LlamaTokenizer's Metaspace pre-tokenizer.
TOKENIZER_MAPPING_NAMES: Changing ("llama", "LlamaTokenizer") to ("llama", "TokenizersBackend") has the same problem, because the mapping is keyed by model_type, which these models share with unaffected Llama models.
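Why no mismatch is detected can be illustrated with a toy model of the comparison. This is not the library's code: `classes_mismatch` is a hypothetical helper that reproduces only the `.removesuffix("Fast")` normalization described above, using the two class names quoted for these models. It needs Python 3.9+ for `str.removesuffix`.

```python
def classes_mismatch(mapping_class: str, hub_class: str) -> bool:
    """Toy version of the check: compare names after dropping a "Fast" suffix."""
    return mapping_class.removesuffix("Fast") != hub_class.removesuffix("Fast")


# Values quoted in the text for DeepSeek-R1-Distill-Llama-8B and Yi-34B-Chat.
mapping_class = "LlamaTokenizer"    # TOKENIZER_MAPPING_NAMES["llama"]
hub_class = "LlamaTokenizerFast"    # tokenizer_class in tokenizer_config.json

print(classes_mismatch(mapping_class, hub_class))  # False

# A hub class that still differs after normalization would be flagged.
print(classes_mismatch(mapping_class, "PreTrainedTokenizerFast"))  # True
```

Since both names normalize to the same string, nothing in the class names distinguishes these repos from an unaffected Llama model, which is why a fix keyed only by model_type or class name cannot separate them.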

Related Issues

All of these are the same bug class, in which a tokenizer class's __init__ discards tokenizer.json's pre-tokenizer in v5:

| Issue | Model | Status |
|---|---|---|
| #45812 | Granite (GPT2Tokenizer) | Open, PR #45813 pending |
| #45488 | DeepSeek V3/R1 (LlamaTokenizer hardcodes Metaspace) | Open |
| #44779 | DeepSeek V3 (added to override set) | Closed, fixed in v5.3.0 |
| #45741 | DeepSeek R1-Distill-Qwen (Qwen2 mapping) | Merged in v5.8.0 |
| #44462 | deepseek-coder | Closed |
| #43122 | MiniMax | Closed |
| #45701 | camembert v2 | Closed |
