transformers - Our `tinker-cookbook` CI broke: `list_repo_files` should forward the `revision` argument

Error Message

ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one.

Root Cause

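In PreTrainedTokenizerBase._from_pretrained, the call to list_repo_files does not forward the revision kwarg, so the returned file list always reflects the repo's main branch even when the caller pinned a specific revision. As a result, the fallback to tiktoken.model/tokenizer.model/tekken.json (added in #42299) misfires whenever main and the pinned revision disagree on which tokenizer files exist, and loading fails while trying to fetch a tokenizer.json that only exists on main.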

Fix Action

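Forward revision to list_repo_files in PreTrainedTokenizerBase._from_pretrained, matching the convention already used for list_repo_templates in the same function. With that change, the pinned revision loads cleanly (TikTokenTokenizer, the slow remote-code class, picked up via the existing tiktoken.model fallback).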

Code Example

from transformers import AutoTokenizer

# Old revision: no tokenizer.json in this commit, just tiktoken.model + tokenization_kimi.py
AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.6",
    revision="5a49d036ab7472b7d5912ded487150ec1358c11d",
    trust_remote_code=True,
)

---

ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one.


System Info

In PreTrainedTokenizerBase._from_pretrained, the call to list_repo_files does not forward the revision kwarg, so the returned file list always reflects the repo's main branch even when the caller pinned a specific revision. This makes the surrounding "fall back to tiktoken.model/tokenizer.model/tekken.json if tokenizer.json is not on the repo" logic (added in #42299) misfire whenever main and the pinned revision disagree on which tokenizer files exist.

The fix is to forward revision to list_repo_files, matching the convention used three lines above for list_repo_templates in the same function.
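The effect of dropping versus forwarding revision can be illustrated with a small offline sketch. This is not the transformers implementation: FAKE_REPO, fake_list_repo_files, pick_tokenizer_file, resolve_buggy, and resolve_fixed are hypothetical names, the repo id and "pinned-sha" are placeholders, and the in-memory file lists only mirror the layout described in this issue.

```python
from typing import Optional

# Hypothetical in-memory stand-in for a Hub repo (file lists per revision).
# The layouts mirror the ones described in this issue; they are illustrative only.
FAKE_REPO = {
    "main": ["tokenizer.json"],
    "pinned-sha": ["tiktoken.model", "tokenization_kimi.py"],
}

# Fallback candidates named in the issue text.
FALLBACK_FILES = ("tiktoken.model", "tokenizer.model", "tekken.json")


def fake_list_repo_files(repo_id: str, revision: Optional[str] = None) -> list:
    """Stand-in for a repo listing call: defaults to main when revision is None."""
    return FAKE_REPO[revision or "main"]


def pick_tokenizer_file(files: list) -> Optional[str]:
    """Prefer tokenizer.json, otherwise use the first fallback file present."""
    if "tokenizer.json" in files:
        return "tokenizer.json"
    return next((name for name in FALLBACK_FILES if name in files), None)


def resolve_buggy(repo_id: str, revision: Optional[str] = None) -> Optional[str]:
    # revision is not forwarded, so the listing always reflects main.
    return pick_tokenizer_file(fake_list_repo_files(repo_id))


def resolve_fixed(repo_id: str, revision: Optional[str] = None) -> Optional[str]:
    # revision is forwarded, so the listing matches the pinned commit.
    return pick_tokenizer_file(fake_list_repo_files(repo_id, revision=revision))


buggy = resolve_buggy("example-org/example-model", revision="pinned-sha")
fixed = resolve_fixed("example-org/example-model", revision="pinned-sha")
print("without forwarding:", buggy)
print("with forwarding:", fixed)
```

In the buggy variant the caller is told to expect a tokenizer.json that does not exist at the pinned commit, which is the mismatch described above.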

This issue broke our CI for tinker-cookbook: https://github.com/thinking-machines-lab/tinker-cookbook/actions/runs/25682484078/job/75485176003

With moonshotai/Kimi-K2.6, main was rewritten on 2026-05-11 to a tokenizer.json + PreTrainedTokenizerFast layout, but the older sha 5a49d036... still uses tiktoken.model + a remote-code slow tokenizer:

from transformers import AutoTokenizer

# Old revision: no tokenizer.json in this commit, just tiktoken.model + tokenization_kimi.py
AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.6",
    revision="5a49d036ab7472b7d5912ded487150ec1358c11d",
    trust_remote_code=True,
)

On the current transformers main branch this raises:

ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one.

With revision forwarded to list_repo_files, it loads cleanly (TikTokenTokenizer, the slow remote-code class, picked up via the existing tiktoken.model fallback).

Who can help?

@ArthurZucker @itazap (tokenizers; Arthur is the author of #42299)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

# Old revision: no tokenizer.json in this commit, just tiktoken.model + tokenization_kimi.py
AutoTokenizer.from_pretrained(
    "moonshotai/Kimi-K2.6",
    revision="5a49d036ab7472b7d5912ded487150ec1358c11d",
    trust_remote_code=True,
)

Expected behavior

AutoTokenizer.from_pretrained should successfully load the tokenizer at the pinned revision, using the tiktoken.model + remote-code TikTokenTokenizer that exist at that sha, and not raise ValueError because it tried to fetch a tokenizer.json that only exists on main.
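To check whether a repo is exposed to this problem, one option is to compare the tokenizer files visible on main with those at the pinned revision. The helper below is a rough sketch under assumptions: tokenizer_files_at and layout_mismatch are hypothetical names, and the lister is injected as a callable so the example runs offline against a fake.

```python
from typing import Callable, Dict, List, Optional, Set

# Tokenizer file names mentioned in this issue.
TOKENIZER_FILES = ("tokenizer.json", "tiktoken.model", "tokenizer.model", "tekken.json")


def tokenizer_files_at(list_files: Callable, repo_id: str, revision: Optional[str]) -> Set[str]:
    """Return the tokenizer-related files visible at a given revision."""
    return {name for name in list_files(repo_id, revision=revision) if name in TOKENIZER_FILES}


def layout_mismatch(list_files: Callable, repo_id: str, pinned_revision: str) -> Dict[str, List[str]]:
    """Report tokenizer files present on only one of main / the pinned revision."""
    on_main = tokenizer_files_at(list_files, repo_id, "main")
    on_pinned = tokenizer_files_at(list_files, repo_id, pinned_revision)
    return {
        "only_on_main": sorted(on_main - on_pinned),
        "only_on_pinned": sorted(on_pinned - on_main),
    }


# Offline demo with a fake lister; the file layouts are illustrative.
def fake_lister(repo_id: str, revision: Optional[str] = None) -> List[str]:
    layouts = {
        "main": ["tokenizer.json"],
        "pinned-sha": ["tiktoken.model", "tokenization_kimi.py"],
    }
    return layouts[revision or "main"]


report = layout_mismatch(fake_lister, "example-org/example-model", "pinned-sha")
print(report)
```

With huggingface_hub installed and network access, passing huggingface_hub.list_repo_files as list_files should work the same way, since that function accepts a revision keyword argument. A non-empty report indicates that main and the pinned revision disagree on tokenizer layout.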
