transformers - ✅(Solved) Fix LlamaTokenizer in v5 overrides tokenizer.json's ByteLevel pre-tokenizer with Metaspace, silently breaks DeepSeek V3/R1 family [1 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
huggingface/transformers#45488Fetched 2026-04-18 05:51:50
View on GitHub
Comments
1
Participants
2
Timeline
8
Reactions
0
Author
Timeline (top)
mentioned ×3subscribed ×3commented ×1cross-referenced ×1

Root Cause

DeepSeek's tokenizer_config.json sets "tokenizer_class": "LlamaTokenizerFast", but tokenizer.json is ByteLevel BPE (GPT-2 style, not SentencePiece). In v5, LlamaTokenizer.__init__ unconditionally installs a Metaspace pre-tokenizer, overwriting whatever tokenizer.json declared. Since the vocab has no -prefixed tokens, the metaspace marker drops and spaces vanish.

Loading the same file via tokenizers.Tokenizer.from_file(...) or PreTrainedTokenizerFast.from_pretrained(...) works — the bug is specifically in LlamaTokenizer's __init__.

Fix Action

Workaround

from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("deepseek-ai/DeepSeek-V3")

PR fix notes

PR #4: [https://nvbugs/6074784][fix] reload ByteLevel tokenizer via PreTrainedTokenizerFast for LlamaTokenizer

Description (problem / solution / changelog)

Description

Fixes the DeepSeek-V3-Lite accuracy failures (25%→64% on GSM8K) in the Transformers 5.3 upgrade effort (DeepSeekV3Lite_accuracy, 35 failures).

Root cause (tokenizer, not model config): DeepSeek-V3-Lite's tokenizer_config.json declares tokenizer_class: LlamaTokenizer, but its tokenizer.json is actually a ByteLevel BPE (GPT-style, not SentencePiece). In Transformers 5.x, LlamaTokenizer.__init__ unconditionally installs a Metaspace pre-tokenizer, silently replacing the ByteLevel one from tokenizer.json. Spaces then get stripped from every prompt ("hello world""helloworld"), the model sees garbage, and GSM8K collapses from ~63.7% to ~26%.

Direct verification:

  • AutoTokenizer.from_pretrained(..., trust_remote_code=True).decode(encode("hello world"))"helloworld" (broken)
  • PreTrainedTokenizerFast.from_pretrained(...).decode(encode("hello world"))"hello world" (correct)

This is distinct from the earlier DeepSeek V3 rope_scaling fixes (9e38e95a2 / a647dbe35 / f207585ff). The accuracy number being similar (25% vs 64%) was a red herring — the RoPE path is already correct; the tokenizer was the actual culprit.

The entire DeepSeek V3 / R1 family ships the same pattern and is affected. Filed upstream as huggingface/transformers#45488.

Fix: After AutoTokenizer.from_pretrained, detect the mismatch — loaded tokenizer has Metaspace pre-tokenizer but tokenizer.json declares ByteLevel — and reload via PreTrainedTokenizerFast, which respects tokenizer.json verbatim. Generic and targeted: triggers only on this bug scenario; Llama 3.x and genuine SentencePiece tokenizers are unaffected.

Test Coverage

Verified locally on GB300 (SM 10.3):

TestAccuracy (before)Accuracy (after)ThresholdStatus
TestDeepSeekV3Lite::test_nvfp4[moe_backend=CUTLASS-mtp_nextn=0-fp8kv=False-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False]26.1%>60%60.51%✅ PASS
TestDeepSeekV3Lite::test_bfloat16[mtp_nextn=0-attention_dp=False-cuda_graph=False-overlap_scheduler=False-torch_compile=False-enable_chunked_prefill=False-v2_kv_cache=True](broken)65.087%61.54%✅ PASS

The fix should clear all 35 DeepSeekV3Lite_accuracy failures listed in the nvbug since they all load the same DeepSeek tokenizer through the same code path. Sanity-checked Llama-3.1-8B-Instruct still loads unchanged (TokenizersBackend, Sequence pre-tokenizer, round-trip OK) — fix has no effect on correctly-configured tokenizers.

PR Checklist

  • PR description clearly explains what and why.
  • PR Follows TRT-LLM CODING GUIDELINES to the best of my knowledge.
  • Test cases: relied on existing accuracy tests; no new test added since the fix is a narrow runtime workaround for an upstream regression (tracked at huggingface/transformers#45488).
  • No new dependencies.

Changed files

  • tensorrt_llm/tokenizer/tokenizer.py (modified, +63/-0)

Code Example

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")
print(repr(tok.decode(tok.encode("hello world", add_special_tokens=False))))
# 'helloworld'  — expected 'hello world'

---

from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("deepseek-ai/DeepSeek-V3")
RAW_BUFFERClick to expand / collapse

System info

  • transformers: 5.3.0
  • tokenizers: 0.22.2
  • Python: 3.12 / Linux

Who can help?

@ArthurZucker

Reproduction

Tokenizer-only, ~7 MB download:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")
print(repr(tok.decode(tok.encode("hello world", add_special_tokens=False))))
# 'helloworld'  — expected 'hello world'

Expected vs actual

v4.x / raw tokenizers.Tokenizer.from_filev5.x AutoTokenizer
round-trip of "hello world""hello world""helloworld"
backend_tokenizer.pre_tokenizerSequence([..., ByteLevel]) (from tokenizer.json)Metaspace(replacement="▁", split=False)

Root cause

DeepSeek's tokenizer_config.json sets "tokenizer_class": "LlamaTokenizerFast", but tokenizer.json is ByteLevel BPE (GPT-2 style, not SentencePiece). In v5, LlamaTokenizer.__init__ unconditionally installs a Metaspace pre-tokenizer, overwriting whatever tokenizer.json declared. Since the vocab has no -prefixed tokens, the metaspace marker drops and spaces vanish.

Loading the same file via tokenizers.Tokenizer.from_file(...) or PreTrainedTokenizerFast.from_pretrained(...) works — the bug is specifically in LlamaTokenizer's __init__.

Affected models

All models shipping tokenizer_class: Llama*Tokenizer + a ByteLevel tokenizer.json. Confirmed broken on all 5 DeepSeek V3/R1 upstream repos: DeepSeek-V3, DeepSeek-V3.1, DeepSeek-V3.2-Exp, DeepSeek-R1, DeepSeek-R1-0528.

Note: LlamaTokenizer is LlamaTokenizerFast in v5, so either class name in tokenizer_config.json triggers the bug.

Downstream impact

Catastrophic silent accuracy loss on any HF-based inference stack. For example, GSM8K on DeepSeek-V3-Lite drops ~63.7% → ~26.1% purely from this tokenizer regression.

Suggested fix

In LlamaTokenizer.__init__, only install the default Metaspace pre-tokenizer when tokenizer.json is absent (the "train from scratch" path). When loading an existing tokenizer.json, trust its pre_tokenizer.

Workaround

from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("deepseek-ai/DeepSeek-V3")

extent analysis

TL;DR

The most likely fix is to modify the LlamaTokenizer.__init__ method to only install the default Metaspace pre-tokenizer when tokenizer.json is absent.

Guidance

  • Verify the issue by checking if the model's tokenizer_config.json sets "tokenizer_class": "LlamaTokenizerFast" and if the tokenizer.json is ByteLevel BPE.
  • To mitigate the issue, use PreTrainedTokenizerFast.from_pretrained instead of AutoTokenizer.from_pretrained as a temporary workaround.
  • Check if the model is one of the affected models listed, which include DeepSeek-V3, DeepSeek-V3.1, DeepSeek-V3.2-Exp, DeepSeek-R1, and DeepSeek-R1-0528.
  • Be aware that this bug can cause catastrophic silent accuracy loss on any HF-based inference stack.

Example

from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("deepseek-ai/DeepSeek-V3")
print(repr(tok.decode(tok.encode("hello world", add_special_tokens=False))))
# Expected output: 'hello world'

Notes

This issue is specific to the LlamaTokenizer class in version 5 of the transformers library and only affects models with a ByteLevel tokenizer.json. The suggested fix requires modifying the LlamaTokenizer.__init__ method.

Recommendation

Apply the workaround by using PreTrainedTokenizerFast.from_pretrained instead of AutoTokenizer.from_pretrained until the LlamaTokenizer.__init__ method is modified to correctly handle existing tokenizer.json files.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

transformers - ✅(Solved) Fix LlamaTokenizer in v5 overrides tokenizer.json's ByteLevel pre-tokenizer with Metaspace, silently breaks DeepSeek V3/R1 family [1 pull requests, 1 comments, 2 participants]