transformers - ✅(Solved) Fix Regression in Kimi-K2.5 tokenizer from 5.3.0 to 5.4.0: incorrect codec handling and misleading fix_mistral_regex warning [1 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
huggingface/transformers#45356Fetched 2026-04-11 06:12:17
View on GitHub
Comments
1
Participants
2
Timeline
8
Reactions
1
Timeline (top)
mentioned ×2subscribed ×2commented ×1cross-referenced ×1

Error Message

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

Fix Action

Fixed

PR fix notes

PR #45359: Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError

Description (problem / solution / changelog)

Fixes #45356

Summary

  • Remove kimi_k25 from MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS: its remote TikTokenTokenizer is the only correct backend — the model has no tokenizer.json, and its added_tokens_decoder has non-sequential IDs (gaps + [UNK]/[PAD] at 163838/163839) that TokenizersBackend.add_tokens() can't reproduce, causing every token after ID 163588 to be assigned wrong IDs.
  • Fix _patch_mistral_regex: use tokenizer.pre_tokenizer instead of tokenizer.backend_tokenizer.pre_tokenizer — the method receives the raw tokenizers.Tokenizer, which doesn't have .backend_tokenizer.

Test plan

  • AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True).decode([163607]) returns '</think>'
  • Roundtrip encode/decode of <think>hello</think> works

Changed files

  • src/transformers/models/auto/tokenization_auto.py (modified, +0/-1)
  • src/transformers/tokenization_utils_tokenizers.py (modified, +3/-3)

Code Example

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
print(t.decode([163607]))

---

'</think>'

---

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
print(t.decode([163607]))

---

''

---

The tokenizer you are loading from 'moonshotai/Kimi-K2.5' with an incorrect regex pattern...
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

---

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    fix_mistral_regex=True,
)

---

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'
RAW_BUFFERClick to expand / collapse

System Info

  • OS: Linux
  • Python: 3.10.12
  • Model/tokenizer: moonshotai/Kimi-K2.5
  • trust_remote_code=True

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

There seems to be a regression for the Kimi-K2.5 tokenizer between transformers==5.3.0 and 5.4.0+.

In 5.3.0, the tokenizer handles the </think> token correctly.

In 5.4.0 and later, this behavior breaks, and the warning shown by Transformers suggests using fix_mistral_regex=True, but following that suggestion triggers another error for this tokenizer.

Reproduction

Works in 5.3.0

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
print(t.decode([163607]))

Output:

'</think>'

Broken in 5.4.0+

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
print(t.decode([163607]))

Output:

''

At the same time, Transformers emits this warning:

The tokenizer you are loading from 'moonshotai/Kimi-K2.5' with an incorrect regex pattern...
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

But following the warning also fails

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    fix_mistral_regex=True,
)

This raises:

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

Expected behavior

Expected behavior

  • No regression in special-token handling between 5.3.0 and 5.4.0+
  • t.decode([163607]) should still return '</think>'
  • The warning should not recommend fix_mistral_regex=True for a tokenizer path that will immediately fail

Actual behavior

  • </think> handling is broken in 5.4.0+
  • The new warning points users to fix_mistral_regex=True, but that path crashes for this tokenizer

extent analysis

TL;DR

Downgrade to transformers==5.3.0 to avoid the regression in special-token handling for the Kimi-K2.5 tokenizer.

Guidance

  • Verify the issue by running the provided reproduction code in both 5.3.0 and 5.4.0+ to confirm the regression.
  • Check the Transformers library documentation for any known issues or updates related to the Kimi-K2.5 tokenizer and the fix_mistral_regex=True flag.
  • Consider opening an issue with the Transformers library maintainers to report the regression and the failure of the recommended fix_mistral_regex=True workaround.
  • If possible, test the tokenizer with other versions of the Transformers library to identify the exact version where the regression occurred.

Example

No code example is provided as the issue already includes clear reproduction code.

Notes

The provided information suggests a regression in the Kimi-K2.5 tokenizer between transformers==5.3.0 and 5.4.0+, but the recommended workaround fails. Downgrading to 5.3.0 may be the most straightforward solution until the issue is resolved.

Recommendation

Apply workaround: Downgrade to transformers==5.3.0 to maintain the correct handling of the </think> token until a fix is available for the regression in 5.4.0+.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  • No regression in special-token handling between 5.3.0 and 5.4.0+
  • t.decode([163607]) should still return '</think>'
  • The warning should not recommend fix_mistral_regex=True for a tokenizer path that will immediately fail

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING