transformers - ✅(Solved) Fix [BUG] tokenizer.save_pretrained: tokenizer_class in tokenizer_config.json doesn't match the original [1 pull requests, 5 comments, 6 participants]

transformers · 2026-02-26 11:37:49

GitHub stats

huggingface/transformers#44297 • Fetched 2026-04-08 00:29:15

Timeline (top): commented ×5, subscribed ×4, cross-referenced ×3, mentioned ×3

Fix Action

Fixed

Fixed by PR: fix(tokenization): preserve original tokenizer_class in save_pretrained (https://github.com/huggingface/transformers/pull/44427)

PR fix notes

PR #44427: fix(tokenization): preserve original tokenizer_class in save_pretrained

Repository: huggingface/transformers
Author: Jaredw2289-svg
State: closed | merged: False
Link: https://github.com/huggingface/transformers/pull/44427

Description (problem / solution / changelog)

Fixes #44297

Problem

tokenizer.save_pretrained() overwrites tokenizer_class in tokenizer_config.json with the current wrapper class (e.g. PreTrainedTokenizerFast) instead of preserving the original class from the loaded config (e.g. Qwen2Tokenizer). This breaks round-trip loading for models like Qwen3.5.
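
To see the mismatch described above, one option is to compare the `tokenizer_class` field in the original `tokenizer_config.json` with the one written by `save_pretrained()`. The sketch below assumes both sets of files are available as local directories; the helper names `read_tokenizer_class` and `tokenizer_class_changed` are hypothetical and not part of transformers.

```python
import json
from pathlib import Path


def read_tokenizer_class(directory):
    """Return the tokenizer_class recorded in <directory>/tokenizer_config.json, or None."""
    config_path = Path(directory) / "tokenizer_config.json"
    with config_path.open(encoding="utf-8") as handle:
        return json.load(handle).get("tokenizer_class")


def tokenizer_class_changed(original_dir, saved_dir):
    """Return True when the saved config records a different class than the original."""
    return read_tokenizer_class(original_dir) != read_tokenizer_class(saved_dir)
```

For example, `tokenizer_class_changed("original", "output")` would return `True` when the saved config names the wrapper class instead of the original one. The directory names here are placeholders.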

Fix

In tokenization_utils_base.py, when building the config dict to save, check if the original tokenizer_class is present and preserve it rather than overwriting with the current class name.
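
The preservation rule can be sketched in isolation as follows. This is not the code from the PR: `resolve_tokenizer_class` and `build_config_to_save` are hypothetical names, and the sketch assumes the loaded config is available as a plain dict.

```python
def resolve_tokenizer_class(loaded_config, current_class_name):
    """Pick the tokenizer_class value to write into tokenizer_config.json.

    loaded_config: dict parsed from the tokenizer_config.json that was loaded (may be empty).
    current_class_name: name of the Python class that is doing the saving.
    """
    original = loaded_config.get("tokenizer_class")
    # Preserve the original class when the loaded config recorded one;
    # otherwise fall back to the class that is doing the saving.
    return original if original else current_class_name


def build_config_to_save(loaded_config, current_class_name):
    """Return a copy of the config with tokenizer_class resolved as above."""
    config = dict(loaded_config)
    config["tokenizer_class"] = resolve_tokenizer_class(loaded_config, current_class_name)
    return config
```

With this rule, a config loaded with `tokenizer_class` set to `Qwen2Tokenizer` keeps that value even when the saving class is `PreTrainedTokenizerFast`; the current class name is used only when the loaded config had no such field.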

Testing

Added test verifying tokenizer_class is preserved after save/reload cycle.
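
A save/reload test of this kind might look like the sketch below. It is not the test added in the PR. To stay runnable offline it uses `StubTokenizer`, a hypothetical stand-in that only reads and writes `tokenizer_config.json`; a real test would presumably exercise an actual tokenizer class instead.

```python
import json
import tempfile
import unittest
from pathlib import Path


class StubTokenizer:
    """Hypothetical stand-in for a tokenizer; not a transformers class."""

    def __init__(self, config):
        self.config = dict(config)

    @classmethod
    def from_pretrained(cls, directory):
        path = Path(directory) / "tokenizer_config.json"
        return cls(json.loads(path.read_text(encoding="utf-8")))

    def save_pretrained(self, directory):
        out = Path(directory)
        out.mkdir(parents=True, exist_ok=True)
        config = dict(self.config)
        # Keep the loaded tokenizer_class; only fill it in when it is absent.
        config.setdefault("tokenizer_class", type(self).__name__)
        (out / "tokenizer_config.json").write_text(json.dumps(config), encoding="utf-8")


class TokenizerClassRoundTripTest(unittest.TestCase):
    def test_tokenizer_class_survives_save_and_reload(self):
        with tempfile.TemporaryDirectory() as tmp:
            source = Path(tmp) / "source"
            source.mkdir()
            (source / "tokenizer_config.json").write_text(
                json.dumps({"tokenizer_class": "Qwen2Tokenizer"}), encoding="utf-8"
            )
            tokenizer = StubTokenizer.from_pretrained(source)
            tokenizer.save_pretrained(Path(tmp) / "output")
            reloaded = StubTokenizer.from_pretrained(Path(tmp) / "output")
            self.assertEqual(reloaded.config["tokenizer_class"], "Qwen2Tokenizer")
```

The assertion checks the reloaded value rather than the in-memory one, so it fails if the save step rewrites the field.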

Changed files

src/transformers/tokenization_utils_base.py (modified, +14/-6)
tests/tokenization/test_tokenization_utils.py (modified, +29/-0)

Code Example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3.5-27B')
tokenizer.save_pretrained('output')


extent analysis

Fix Plan

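Fix Name: Preserve the Original `tokenizer_class`

Step 1: Rule Out Model Size

`tokenizer.save_pretrained()` writes only tokenizer files such as tokenizer_config.json, not model weights, so the size of the 27B model has no bearing on this bug.

Step 2: Call `save_pretrained` Without `save_total_limit`

`save_total_limit` is a `TrainingArguments` option that caps how many checkpoints the `Trainer` keeps. It is not a parameter of `tokenizer.save_pretrained()` and does not affect what is written to tokenizer_config.json. Save the tokenizer with the output directory alone:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3.5-27B')
tokenizer.save_pretrained('output')  # no checkpoint-related arguments

Step 3: Leave `save_on_each_node` to `TrainingArguments`

`save_on_each_node` is likewise a `TrainingArguments` option for multi-node training, not a `save_pretrained` parameter. It does not change the `tokenizer_class` value that gets saved.

Step 4: Compare `tokenizer_class` Before and After Saving

Open tokenizer_config.json in the original model repository and in the output directory. According to the issue, the saved file records the wrapper class (e.g. PreTrainedTokenizerFast) where the original records another class (e.g. Qwen2Tokenizer).

Step 5: Restore the Original Value

PR #44427 describes the library-side change: keep the original tokenizer_class when building the config to save. If your installed version still shows the mismatch, one possible workaround is to edit the saved tokenizer_config.json by hand and set tokenizer_class back to the original value.

Verification

Reload the tokenizer from the output directory and confirm that tokenizer_class in the saved tokenizer_config.json matches the value in the original file.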
