transformers: AutoTokenizer produces wrong token IDs for OLMo2, HyperClovaX, DeepSeek-R1-Distill-Llama, Yi, and others (v5 regression)

Root Cause

Follow-up to #45812 (Granite). Same root cause, additional affected model families identified.

System Info

  • transformers version: 5.8.0 (also reproduced on 5.0.0 through 5.7.0)
  • Platform: Linux-5.14.0-503.11.1.el9_5.x86_64-x86_64-with-glibc2.34
  • Python version: 3.12.13
  • Huggingface_hub version: 1.14.0
  • Safetensors version: 0.7.0
  • Tokenizers version: 0.22.2

Who can help?

@ArthurZucker and @itazap

Reproduction

from transformers import AutoTokenizer, PreTrainedTokenizerFast

# --- OLMo2 (GPT2Tokenizer) ---
model_id = "allenai/OLMo-2-0425-1B"

tok_v5 = AutoTokenizer.from_pretrained(model_id)                # WRONG
tok_correct = PreTrainedTokenizerFast.from_pretrained(model_id)  # CORRECT

print(tok_v5.encode("2023", add_special_tokens=False))       # [508, 1419] <- wrong
print(tok_correct.encode("2023", add_special_tokens=False))  # [2366, 18]  <- correct

print(tok_v5.encode("650841823", add_special_tokens=False))      # [13655, 5833, 972, 1419] <- wrong
print(tok_correct.encode("650841823", add_special_tokens=False))  # [13655, 25496, 23848]    <- correct

# --- DeepSeek-R1-Distill-Llama (LlamaTokenizer) ---
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

tok_v5 = AutoTokenizer.from_pretrained(model_id)                # WRONG
tok_correct = PreTrainedTokenizerFast.from_pretrained(model_id)  # CORRECT

print(tok_v5.encode("2023", add_special_tokens=False))       # [508, 1419] <- wrong
print(tok_correct.encode("2023", add_special_tokens=False))  # [2366, 18]  <- correct

# --- HyperClovaX (GPT2Tokenizer) ---
model_id = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"

tok_v5 = AutoTokenizer.from_pretrained(model_id)                # WRONG
tok_correct = PreTrainedTokenizerFast.from_pretrained(model_id)  # CORRECT

print(tok_v5.encode("2023", add_special_tokens=False))       # [508, 1419] <- wrong
print(tok_correct.encode("2023", add_special_tokens=False))  # [2366, 18]  <- correct

Expected behavior

Follow-up to #45812 (Granite). Same root cause, additional affected model families identified.

To gauge the scale of the problem, I downloaded the tokenizers for every model in last week's Hugging Face top 1000 by downloads and top 1000 trending lists. Of the ~1700 unique models that could be loaded, 62 are affected. After the existing fixes (DeepSeek V3 via #44779, DeepSeek R1-Distill-Qwen via #45741), ~30 models remain broken, with a combined 3M+ downloads.

As before, the cause is a pre-tokenizer mismatch introduced by the class-selection branching in AutoTokenizer. In my analysis only GPT2Tokenizer and LlamaTokenizer were affected, but any tokenizer class with a custom __init__ could be.

Full list of affected models (as of last week): affected_models.jsonl.zip
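
A survey like the one described above could be structured roughly as follows. This is a sketch under assumptions, not the script used to produce the numbers in this report: `survey_tokenizers`, `load_pair`, and the report keys are hypothetical names, and the loader is passed in so the logic can be exercised without downloading anything.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]
# Hypothetical loader type: given a repo id, return (auto_encode, fast_encode).
PairLoader = Callable[[str], Tuple[Encoder, Encoder]]


def survey_tokenizers(
    model_ids: Iterable[str],
    load_pair: PairLoader,
    probes: Iterable[str],
) -> Dict[str, List[str]]:
    """Sort repo ids into affected / consistent / unloadable buckets."""
    report: Dict[str, List[str]] = {"affected": [], "consistent": [], "unloadable": []}
    probe_list = list(probes)
    for model_id in model_ids:
        try:
            encode_auto, encode_fast = load_pair(model_id)
        except Exception:
            # Gated, missing, or otherwise unloadable repos are counted separately.
            report["unloadable"].append(model_id)
            continue
        # A model is affected if any probe string yields different token IDs.
        differs = any(encode_auto(p) != encode_fast(p) for p in probe_list)
        report["affected" if differs else "consistent"].append(model_id)
    return report
```

With transformers, `load_pair` could wrap `AutoTokenizer.from_pretrained` and `PreTrainedTokenizerFast.from_pretrained` as in the reproduction, returning for each a callable such as `lambda s: tok.encode(s, add_special_tokens=False)`. Digit strings like "2023" make useful probes because they are where the IDs diverge in the reproduction.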

Still broken (no fix in v5.8.0)

| model_type | v5 class | Example model | Downloads |
| --- | --- | --- | --- |
| olmo2 | GPT2Tokenizer | allenai/OLMo-2-0425-1B | 83K |
| hyperclovax | GPT2Tokenizer | naver-hyperclovax/HyperCLOVAX-SEED-Think-14B | 39K |
| granite | GPT2Tokenizer | ibm-granite/granite-4.1-8b | 14K |
| granitemoehybrid | GPT2Tokenizer | ibm-granite/granite-4.0-micro | 400K |
| llama (DeepSeek distills) | LlamaTokenizer | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 1.97M |
| llama (Yi) | LlamaTokenizer | 01-ai/Yi-34B-Chat | 27K |
| ernie | LlamaTokenizer | baidu/ERNIE-4.5-21B-A3B-Thinking | 35K |

Already fixed

| model_type | Fix | PR/Issue |
| --- | --- | --- |
| deepseek_v3 | Added to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS | #44779 (v5.3.0) |
| deepseek_v2 | Added to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS | #44779 (v5.3.0) |
| qwen2 (DeepSeek distills) | Routed through Qwen2Tokenizer | #45741 (v5.8.0) |

Problematic cases

DeepSeek-R1-Distill-Llama-8B and Yi-34B-Chat both have model_type="llama" and tokenizer_class: "LlamaTokenizerFast" in tokenizer_config.json. Both TOKENIZER_MAPPING_NAMES["llama"] ("LlamaTokenizer") and the hub class ("LlamaTokenizerFast") resolve to "LlamaTokenizer" after .removesuffix("Fast"), so no mismatch is detected and the override set is never consulted.
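
The comparison described above can be illustrated with a simplified stand-in. This is a sketch of the described logic only, not the actual transformers source: `hub_class_mismatches` is a hypothetical helper, and the one-entry `MAPPING` dict stands in for the real `TOKENIZER_MAPPING_NAMES`. It needs Python 3.9+ for `str.removesuffix`.

```python
# Stand-in for the relevant entry of TOKENIZER_MAPPING_NAMES (illustration only).
MAPPING = {"llama": "LlamaTokenizer"}


def hub_class_mismatches(model_type: str, hub_tokenizer_class: str) -> bool:
    """True if the hub class differs from the mapped class once a trailing 'Fast' is dropped."""
    mapped = MAPPING[model_type].removesuffix("Fast")
    hub = hub_tokenizer_class.removesuffix("Fast")
    return mapped != hub


# Both affected checkpoints declare tokenizer_class "LlamaTokenizerFast",
# which normalizes to the same name as the mapped class.
print(hub_class_mismatches("llama", "LlamaTokenizerFast"))  # False
```

Because the check returns False, nothing downstream has a reason to consult the override set, which matches the behavior reported for these two checkpoints.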

Neither mechanism can fix these models without breaking actual Llama models:

  • MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS: Adding "llama" would force ALL Llama models to TokenizersBackend, breaking Meta's Llama 3/4 models that genuinely use LlamaTokenizer's Metaspace pre-tokenizer.
  • TOKENIZER_MAPPING_NAMES: Changing ("llama", "LlamaTokenizer") to ("llama", "TokenizersBackend") has the same problem: the mapping is keyed by model_type, which these models share with unaffected Llama models.
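
A toy lookup shows why a selector keyed by model_type cannot separate these checkpoints. The sketch is illustrative only: the first two entries are taken from the cases above, `example-org/unaffected-llama` is a made-up placeholder for a Llama checkpoint that should keep LlamaTokenizer, and `forced_to_backend` is a hypothetical function, not a transformers API.

```python
# repo id -> model_type; the last entry is a placeholder, not a real repository.
CHECKPOINTS = {
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B": "llama",
    "01-ai/Yi-34B-Chat": "llama",
    "example-org/unaffected-llama": "llama",
}


def forced_to_backend(repo_id: str, override_model_types: set) -> bool:
    """Mimic an override set that only ever sees the checkpoint's model_type."""
    return CHECKPOINTS[repo_id] in override_model_types


# With an empty override set, no checkpoint is redirected.
without_override = {r: forced_to_backend(r, set()) for r in CHECKPOINTS}
# Adding "llama" redirects every checkpoint, including the unaffected one.
with_override = {r: forced_to_backend(r, {"llama"}) for r in CHECKPOINTS}

print(without_override)
print(with_override)
```

Every checkpoint gets the same answer in both cases, so no choice of model_type entries can select only the two affected repositories. Separating them would require some other signal, such as the repo id or the contents of tokenizer.json.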

Related Issues

All of these belong to the same bug class: in v5, the tokenizer class's __init__ discards the pre-tokenizer defined in tokenizer.json.

| Issue | Model | Status |
| --- | --- | --- |
| #45812 | Granite (GPT2Tokenizer) | Open, PR #45813 pending |
| #45488 | DeepSeek V3/R1 (LlamaTokenizer hardcodes Metaspace) | Open |
| #44779 | DeepSeek V3 (added to override set) | Closed, fixed in v5.3.0 |
| #45741 | DeepSeek R1-Distill-Qwen (Qwen2 mapping) | Merged in v5.8.0 |
| #44462 | deepseek-coder | Closed |
| #43122 | MiniMax | Closed |
| #45701 | camembert v2 | Closed |
