transformers - ✅(Solved) Fix AutoTokenizer ignores tokenizer.json from the repository [2 pull requests, 3 comments, 2 participants]

Q: Expected behavior

Original `tokenizer.json` should be used to instantiate the tokenizer.

transformers2026-03-05 12:04:31

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

huggingface/transformers#44462•Fetched 2026-04-08 00:28:19

View on GitHub

Comments

Participants

Timeline

Reactions

Author

apaniukov

Participants

apaniukov

itazap

Timeline (top)

commented ×3mentioned ×3subscribed ×3cross-referenced ×2

Fix Action

Fixed

Fixed by PR: Make transformer==5.2.0 to avoid 5.3.0 tokenizer bug (https://github.com/ROCm/ATOM/pull/322)

PR fix notes

PR #322: Make transformer==5.2.0 to avoid 5.3.0 tokenizer bug

Repository: ROCm/ATOM
Author: ZhangLirong-amd
State: closed | merged: True
Link: https://github.com/ROCm/ATOM/pull/322

Description (problem / solution / changelog)

Motivation

related issue in transformers https://github.com/huggingface/transformers/issues/44462

fix https://github.com/huggingface/transformers/pull/44463

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Changed files

pyproject.toml (modified, +1/-1)

Code Example

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
hf_tokenizer.save_pretrained("hf_deepseek_tokenizer/")

---

"normalizer": {
    "type": "Sequence",
    "normalizers": []
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[\r\n]"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{L}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{P}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "[一-龥ࠀ-一가-퟿]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }

---

"normalizer": null,
  "pre_tokenizer": {
    "type": "Metaspace",
    "replacement": "▁",
    "prepend_scheme": "always",
    "split": false
  },

RAW_BUFFERClick to expand / collapse

System Info

transformers version: 5.3.0
Python version: 3.10.12
Huggingface_hub version: 1.5.0

Who can help?

@ArthurZucker and @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

AutoTokenizer doesn't load tokenizer based on tokenizer.json from the repository. Saving this tokenizer produces different tokenizer.json file.

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
hf_tokenizer.save_pretrained("hf_deepseek_tokenizer/")

tokenizer_original.json tokenizer_saved.json

Original normalizer/pre-tokenizer:

  "normalizer": {
    "type": "Sequence",
    "normalizers": []
  },
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[\r\n]"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{L}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "\\s?\\p{P}+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Split",
        "pattern": {
          "Regex": "[一-龥ࠀ-一가-퟿]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  }

Saved normalizer/pre-tokenizer:

  "normalizer": null,
  "pre_tokenizer": {
    "type": "Metaspace",
    "replacement": "▁",
    "prepend_scheme": "always",
    "split": false
  },

Expected behavior

Original tokenizer.json should be used to instantiate the tokenizer.

extent analysis

Fix Plan

Update `transformers` to the latest version

The issue is likely due to a bug in the transformers library. Upgrading to the latest version should resolve the issue.

Update `transformers` to version 5.5.1 or later

pip install --upgrade transformers

Verify the fix

Run the following code to verify that the fix worked:

from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
hf_tokenizer.save_pretrained("hf_deepseek_tokenizer/")

# Compare the original and saved tokenizer.json files
import json
with open("tokenizer_original.json", "r") as f:
    original_tokenizer = json.load(f)
with open("hf_deepseek_tokenizer/tokenizer.json", "r") as f:
    saved_tokenizer = json.load(f)

print(original_tokenizer == saved_tokenizer)

This code should print True if the fix worked.

Extra Tips

Always check for updates to the transformers library to ensure you have the latest bug fixes.
If you're using a custom tokenizer, make sure to save it with the correct configuration to avoid issues like this.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

Original tokenizer.json should be used to instantiate the tokenizer.

#api #ssr #installation #tensor shape #autograd error #API rate limit #retriever error #indexing error #inference speed #output truncation

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

transformers - ✅(Solved) Fix AutoTokenizer ignores tokenizer.json from the repository [2 pull requests, 3 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fixed

PR fix notes

PR #322: Make transformer==5.2.0 to avoid 5.3.0 tokenizer bug

Description (problem / solution / changelog)

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Changed files

Code Example

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

extent analysis

Fix Plan

Update transformers to the latest version

Update transformers to version 5.5.1 or later

Verify the fix

Extra Tips

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING

Update `transformers` to the latest version

Update `transformers` to version 5.5.1 or later