transformers - ✅(Solved) Fix AutoTokenizer ignores tokenizer.json from the repository [2 pull requests, 3 comments, 2 participants]

GitHub stats
huggingface/transformers#44462 · Fetched 2026-04-08 00:28:19
Comments: 3 · Participants: 2 · Timeline: 13 · Reactions: 3
Timeline (top): commented ×3, mentioned ×3, subscribed ×3, cross-referenced ×2

Fix Action

Fixed

PR fix notes

PR #322: Make transformer==5.2.0 to avoid 5.3.0 tokenizer bug

Description (problem / solution / changelog)

Motivation

Related issue in transformers: https://github.com/huggingface/transformers/issues/44462

Upstream fix: https://github.com/huggingface/transformers/pull/44463

Changed files

  • pyproject.toml (modified, +1/-1)


System Info

  • transformers version: 5.3.0
  • Python version: 3.10.12
  • Huggingface_hub version: 1.5.0

Who can help?

@ArthurZucker and @itazap

Reproduction

AutoTokenizer does not build the tokenizer from the tokenizer.json in the repository. Saving the loaded tokenizer produces a different tokenizer.json file.

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
hf_tokenizer.save_pretrained("hf_deepseek_tokenizer/")

Attached files: tokenizer_original.json, tokenizer_saved.json

Original normalizer/pre-tokenizer:

  "normalizer": {
    "type": "Sequence",
    "normalizers": []
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[\r\n]"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{L}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{P}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "[一-龥ࠀ-一가-퟿]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }

Saved normalizer/pre-tokenizer:

  "normalizer": null,
  "pre_tokenizer": {
    "type": "Metaspace",
    "replacement": "▁",
    "prepend_scheme": "always",
    "split": false
  },

Expected behavior

Original tokenizer.json should be used to instantiate the tokenizer.
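
One way to check this expectation is to compare only the sections that changed in the report. The following is a minimal sketch, not part of the original issue: the helper name differing_sections is hypothetical, and the two dictionaries are trimmed-down stand-ins for the original and saved files rather than the full configurations.

```python
import json


def differing_sections(original: dict, saved: dict,
                       keys=("normalizer", "pre_tokenizer")) -> list:
    """Return the names of the sections whose contents differ."""
    return [k for k in keys if original.get(k) != saved.get(k)]


# Trimmed-down stand-ins for the two files in the report (not the full configs).
original = json.loads(
    '{"normalizer": {"type": "Sequence", "normalizers": []}, '
    '"pre_tokenizer": {"type": "Sequence", "pretokenizers": []}}'
)
saved = json.loads(
    '{"normalizer": null, '
    '"pre_tokenizer": {"type": "Metaspace", "replacement": "\\u2581", '
    '"prepend_scheme": "always", "split": false}}'
)

# Lists every section that no longer matches the original file.
print(differing_sections(original, saved))
```

With real files, the two dictionaries would come from json.load on the repository's tokenizer.json and on the file written by save_pretrained. An empty list means both sections were preserved.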

extent analysis

Fix Plan

Update transformers to the latest version

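The issue was reported against transformers 5.3.0, and the upstream fix is https://github.com/huggingface/transformers/pull/44463. Upgrading to a release that includes the fix should resolve it. Projects that cannot upgrade can pin transformers==5.2.0, as PR #322 above does.

Update transformers to version 5.5.1 or later

pip install --upgrade "transformers>=5.5.1"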
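
Before re-running the reproduction, it can help to confirm which version is actually installed in the active environment. This is a rough sketch that assumes only the version numbers mentioned on this page (5.3.0 as affected, 5.5.1 as the suggested minimum). The helpers parse_release and is_at_least are hypothetical names, and the parsing deliberately ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn '5.3.0' or '5.3.0.dev0' into (5, 3, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(installed: str, minimum: str = "5.5.1") -> bool:
    """True when the installed release is the minimum version or newer."""
    return parse_release(installed) >= parse_release(minimum)


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if is_at_least(installed) else "older than 5.5.1"
        print(installed, status)
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "5.10.0" as older than "5.5.1".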

Verify the fix

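Run the following code to verify that the fix worked. It downloads the repository's tokenizer.json and compares the two sections that differed in the report:

import json

from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "deepseek-ai/deepseek-coder-6.7b-instruct"

# Load the tokenizer and save it locally, as in the reproduction.
hf_tokenizer = AutoTokenizer.from_pretrained(repo_id)
hf_tokenizer.save_pretrained("hf_deepseek_tokenizer/")

# Fetch the repository's original tokenizer.json for comparison.
original_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json")
with open(original_path, "r", encoding="utf-8") as f:
    original_tokenizer = json.load(f)
with open("hf_deepseek_tokenizer/tokenizer.json", "r", encoding="utf-8") as f:
    saved_tokenizer = json.load(f)

# The bug shows up in these two sections, so compare them directly.
for key in ("normalizer", "pre_tokenizer"):
    print(key, original_tokenizer[key] == saved_tokenizer[key])

Both lines should print True if the original tokenizer.json is being used.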

Extra Tips

  • Always check for updates to the transformers library to ensure you have the latest bug fixes.
  • If you're using a custom tokenizer, make sure to save it with the correct configuration to avoid issues like this.
