transformers - ✅(Solved) Fix [BUG] tokenizer.save_pretrained: tokenizer_class in tokenizer_config.json doesn't match the original [1 pull requests, 5 comments, 6 participants]

GitHub stats
huggingface/transformers#44297 (fetched 2026-04-08 00:29:15)
Comments: 5
Participants: 6
Timeline events: 17
Reactions: 0
Timeline (top): commented ×5, subscribed ×4, cross-referenced ×3, mentioned ×3

Fix Action

Fixed

PR fix notes

PR #44427: fix(tokenization): preserve original tokenizer_class in save_pretrained

Description (problem / solution / changelog)

Fixes #44297

Problem

tokenizer.save_pretrained() overwrites tokenizer_class in tokenizer_config.json with the current wrapper class (e.g. PreTrainedTokenizerFast) instead of preserving the original class from the loaded config (e.g. Qwen2Tokenizer). This breaks round-trip loading for models like Qwen3.5.

Fix

In tokenization_utils_base.py, when building the config dict to save, check if the original tokenizer_class is present and preserve it rather than overwriting with the current class name.
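The preservation logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from PR #44427: the helper name `resolve_tokenizer_class` and the use of a plain dict for the loaded config are hypothetical, and the class names are the examples given in the problem description.

```python
from typing import Any, Dict


def resolve_tokenizer_class(loaded_config: Dict[str, Any], current_class_name: str) -> str:
    """Pick the tokenizer_class value to write on save (hypothetical helper)."""
    original = loaded_config.get("tokenizer_class")
    # Prefer the class name recorded in the loaded config, if there is one.
    if original:
        return original
    # Otherwise fall back to the name of the class doing the saving.
    return current_class_name


# Config loaded from a checkpoint that recorded its original class.
loaded = {"tokenizer_class": "Qwen2Tokenizer"}
preserved = resolve_tokenizer_class(loaded, "PreTrainedTokenizerFast")

# Config with no recorded class: the current class name is used.
fallback = resolve_tokenizer_class({}, "PreTrainedTokenizerFast")
```

In this sketch the current class name is only a fallback, so a tokenizer loaded without a recorded class still gets a tokenizer_class entry on save.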

Testing

Added test verifying tokenizer_class is preserved after save/reload cycle.

Changed files

  • src/transformers/tokenization_utils_base.py (modified, +14/-6)
  • tests/tokenization/test_tokenization_utils.py (modified, +29/-0)

Code Example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3.5-27B')
tokenizer.save_pretrained('output')

System Info

[Image: tokenizer_class in the original tokenizer_config.json]

->

[Image: tokenizer_class in tokenizer_config.json after save_pretrained]


extent analysis

Fix Plan

Fix Name: Proper Tokenizer Saving

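Step 1: Rule Out Model Size

Model size is not a factor here. tokenizer.save_pretrained() writes only the tokenizer files (including tokenizer_config.json), not the 27B model weights, so the save itself completes. The bug is the tokenizer_class value written to tokenizer_config.json.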

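Step 2: Do Not Pass save_total_limit to save_pretrained

save_total_limit is a TrainingArguments option that caps how many checkpoints Trainer keeps. It is not a documented argument of tokenizer.save_pretrained() and has no bearing on tokenizer_class, so call save_pretrained without it.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3.5-27B')
tokenizer.save_pretrained('output')  # writes tokenizer files only; no checkpoint limit applies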

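Step 3: Do Not Pass save_on_each_node to save_pretrained

save_on_each_node is also a TrainingArguments option; it controls where Trainer writes checkpoints during multi-node training. It is not a documented argument of tokenizer.save_pretrained() and does not change the tokenizer_class that gets written.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3.5-27B')
tokenizer.save_pretrained('output')  # same call in single-node and distributed setups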

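Step 4: Rule Out Disk Space

Disk space is unlikely to be the cause. The tokenizer files are small compared with the model weights, and a full disk would produce a write error rather than a changed tokenizer_class.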

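Step 5: Apply the Actual Fix

Where the files are stored (local disk, AWS S3, Google Cloud Storage) does not change the tokenizer_class that save_pretrained writes. The fix is the change in PR #44427, which preserves the original tokenizer_class from the loaded config instead of overwriting it with the current class name.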

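Verification

Verify that the tokenizer is saved correctly by opening tokenizer_config.json in the output directory and checking that tokenizer_class matches the value in the original tokenizer_config.json (for example, Qwen2Tokenizer rather than PreTrainedTokenizerFast).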
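A check on the saved files can be sketched as a direct comparison of the tokenizer_class field in two tokenizer_config.json files. The sketch below uses only the standard library. The helper names are hypothetical, and the demo writes stand-in config files to a temporary directory instead of downloading a real checkpoint.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_tokenizer_class(directory: str) -> Optional[str]:
    """Return tokenizer_class from <directory>/tokenizer_config.json, or None if absent."""
    config_path = Path(directory) / "tokenizer_config.json"
    with config_path.open(encoding="utf-8") as f:
        return json.load(f).get("tokenizer_class")


def tokenizer_class_preserved(original_dir: str, saved_dir: str) -> bool:
    """True when both directories record the same tokenizer_class."""
    return read_tokenizer_class(original_dir) == read_tokenizer_class(saved_dir)


def _write_config(directory: Path, tokenizer_class: str) -> None:
    # Demo-only helper: writes a stand-in tokenizer_config.json.
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "tokenizer_config.json").write_text(
        json.dumps({"tokenizer_class": tokenizer_class}), encoding="utf-8"
    )


# Self-contained demo with stand-in files instead of a real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    _write_config(root / "original", "Qwen2Tokenizer")
    _write_config(root / "overwritten", "PreTrainedTokenizerFast")
    _write_config(root / "preserved", "Qwen2Tokenizer")

    buggy_matches = tokenizer_class_preserved(str(root / "original"), str(root / "overwritten"))
    fixed_matches = tokenizer_class_preserved(str(root / "original"), str(root / "preserved"))
```

For a real check, point original_dir at a local copy of the original tokenizer files and saved_dir at the 'output' directory passed to save_pretrained.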
