- No regression in special-token handling between `5.3.0` and `5.4.0+` - `t.decode([163607])` should still return `' '` - The warning should not recommend `fix_mistral_regex=True` for a tokenizer path that will immediately fail

transformers - ✅(Solved) Fix Regression in Kimi-K2.5 tokenizer from 5.3.0 to 5.4.0: incorrect codec handling and misleading fix_mistral_regex warning [1 pull requests, 1 comments, 2 participants]

Lander-Hatsune · 2026-04-10T09:42:26Z

[transformers] PR 45359: Fix Kimi-K2.5 tokenizer regression and patch mistral regex AttributeError - Repository: huggingface/transformers - Author: ArthurZucke… # PR #45359: Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError - Repository: huggingface/transformers - Author: ArthurZucker - State: open | merged: False - Link: https://github.com/huggingface/transformers/pull/45359 ## Description (problem / solution / changelog) Fixes #45356 ## Summary - Remove `kimi_k25` from `MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS`: its remote `TikTokenTokenizer` is the only correct backend — the model has no `tokenizer.json`, and its `added_tokens_decoder` has non-sequential IDs (gaps + `[UNK]`/`[PAD]` at 163838/163839) that `TokenizersBackend.add_tokens()` can't reproduce, causing every token after ID 163588 to be assigned wrong IDs. - Fix `_patch_mistral_regex`: use `tokenizer.pre_tokenizer` instead of `tokenizer.backend_tokenizer.pre_tokenizer` — the method receives the raw `tokenizers.Tokenizer`, which doesn't have `.backend_tokenizer`. ## Test plan - [x] `AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True).decode([163607])` returns `' '` - [x] Roundtrip encode/decode of ` hello ` works ## Changed files - `src/transformers/models/auto/tokenization_auto.py` (modified, +0/-1) - `src/transformers/tokenization_utils_tokenizers.py` (modified, +3/-3) ## Fixed - Fixed by PR: Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError (https://github.com/huggingface/transformers/pull/45359) ### System Info - OS: Linux - Python: 3.10.12 - Model/tokenizer: `moonshotai/Kimi-K2.5` - `trust_remote_code=True` ### Who can help? @ArthurZucker and @itazap ### Information - [ ] The official example scripts - [x] My own modified scripts ### Tasks - [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...) - [ ] My own task or dataset (give details below) ### Reproduction There seems to be a regression for the Kimi-K2.5 tokenizer between `transformers==5.3.0` and `5.4.0+`. In `5.3.0`, the tokenizer handles the ` ` token correctly. In `5.4.0` and later, this behavior breaks, and the warning shown by Transformers suggests using `fix_mistral_regex=True`, but following that suggestion triggers another error for this tokenizer. ### Reproduction #### Works in 5.3.0 ```python from transformers import AutoTokenizer t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True) print(t.decode([163607])) ``` Output: ```python ' ' ``` #### Broken in 5.4.0+ ```python from transformers import AutoTokenizer t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True) print(t.decode([163607])) ``` Output: ```python '' ``` At the same time, Transformers emits this warning: ```text The tokenizer you are loading from 'moonshotai/Kimi-K2.5' with an incorrect regex pattern... You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue. ``` #### But following the warning also fails ```python from transformers import AutoTokenizer t = AutoTokenizer.from_pretrained( "moonshotai/Kimi-K2.5", trust_remote_code=True, fix_mistral_regex=True, ) ``` This raises: ```text AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer' ``` ### Expected behavior ### Expected behavior - No regression in special-token handling between `5.3.0` and `5.4.0+` - `t.decode([163607])` should still return `' '` - The warning should not recommend `fix_mistral_regex=True` for a tokenizer path that will immediately fail ### Actual behavior - ` ` handling is broken in `5.4.0+` - The new warning points users to `fix_mistral_regex=True`, but that path crashes for this tokenizer

transformers2026-04-10 09:42:26

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

huggingface/transformers#45356•Fetched 2026-04-11 06:12:17

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Lander-Hatsune

Participants

ArthurZucker

Lander-Hatsune

Timeline (top)

mentioned ×2subscribed ×2commented ×1cross-referenced ×1

Error Message

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

Fix Action

Fixed

Fixed by PR: Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError (https://github.com/huggingface/transformers/pull/45359)

PR fix notes

PR #45359: Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError

Repository: huggingface/transformers
Author: ArthurZucker
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45359

Description (problem / solution / changelog)

Fixes #45356

Summary

Remove kimi_k25 from MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS: its remote TikTokenTokenizer is the only correct backend — the model has no tokenizer.json, and its added_tokens_decoder has non-sequential IDs (gaps + [UNK]/[PAD] at 163838/163839) that TokenizersBackend.add_tokens() can't reproduce, causing every token after ID 163588 to be assigned wrong IDs.
Fix _patch_mistral_regex: use tokenizer.pre_tokenizer instead of tokenizer.backend_tokenizer.pre_tokenizer — the method receives the raw tokenizers.Tokenizer, which doesn't have .backend_tokenizer.

Test plan

AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True).decode([163607]) returns '</think>'
Roundtrip encode/decode of <think>hello</think> works

Changed files

src/transformers/models/auto/tokenization_auto.py (modified, +0/-1)
src/transformers/tokenization_utils_tokenizers.py (modified, +3/-3)

Code Example

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
print(t.decode([163607]))

---

'</think>'

---

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
print(t.decode([163607]))

---

''

---

The tokenizer you are loading from 'moonshotai/Kimi-K2.5' with an incorrect regex pattern...
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

---

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    fix_mistral_regex=True,
)

---

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

RAW_BUFFERClick to expand / collapse

System Info

OS: Linux
Python: 3.10.12
Model/tokenizer: moonshotai/Kimi-K2.5
trust_remote_code=True

Who can help?

@ArthurZucker and @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

There seems to be a regression for the Kimi-K2.5 tokenizer between transformers==5.3.0 and 5.4.0+.

In 5.3.0, the tokenizer handles the </think> token correctly.

In 5.4.0 and later, this behavior breaks, and the warning shown by Transformers suggests using fix_mistral_regex=True, but following that suggestion triggers another error for this tokenizer.

Reproduction

Works in 5.3.0

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
print(t.decode([163607]))

Output:

'</think>'

Broken in 5.4.0+

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
print(t.decode([163607]))

Output:

''

At the same time, Transformers emits this warning:

The tokenizer you are loading from 'moonshotai/Kimi-K2.5' with an incorrect regex pattern...
You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

But following the warning also fails

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    fix_mistral_regex=True,
)

This raises:

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

Expected behavior

No regression in special-token handling between 5.3.0 and 5.4.0+
t.decode([163607]) should still return '</think>'
The warning should not recommend fix_mistral_regex=True for a tokenizer path that will immediately fail

Actual behavior

</think> handling is broken in 5.4.0+
The new warning points users to fix_mistral_regex=True, but that path crashes for this tokenizer

extent analysis

TL;DR

Downgrade to transformers==5.3.0 to avoid the regression in special-token handling for the Kimi-K2.5 tokenizer.

Guidance

Verify the issue by running the provided reproduction code in both 5.3.0 and 5.4.0+ to confirm the regression.
Check the Transformers library documentation for any known issues or updates related to the Kimi-K2.5 tokenizer and the fix_mistral_regex=True flag.
Consider opening an issue with the Transformers library maintainers to report the regression and the failure of the recommended fix_mistral_regex=True workaround.
If possible, test the tokenizer with other versions of the Transformers library to identify the exact version where the regression occurred.

Example

No code example is provided as the issue already includes clear reproduction code.

Notes

The provided information suggests a regression in the Kimi-K2.5 tokenizer between transformers==5.3.0 and 5.4.0+, but the recommended workaround fails. Downgrading to 5.3.0 may be the most straightforward solution until the issue is resolved.

Recommendation

Apply workaround: Downgrade to transformers==5.3.0 to maintain the correct handling of the </think> token until a fix is available for the regression in 5.4.0+.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

No regression in special-token handling between 5.3.0 and 5.4.0+
t.decode([163607]) should still return '</think>'
The warning should not recommend fix_mistral_regex=True for a tokenizer path that will immediately fail

#agent setup #task chaining #parallel task #integration issue #index setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

transformers - ✅(Solved) Fix Regression in Kimi-K2.5 tokenizer from 5.3.0 to 5.4.0: incorrect codec handling and misleading fix_mistral_regex warning [1 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fixed

PR fix notes

PR #45359: Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError

Description (problem / solution / changelog)

Summary

Test plan

Changed files

Code Example

System Info

Who can help?

Information

Tasks

Reproduction

Reproduction

Works in 5.3.0

Broken in 5.4.0+

But following the warning also fails

Expected behavior

Expected behavior

Actual behavior

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING