transformers - ✅ (Solved) Fix: Current version also does not load "cjvt/sleng-bert" [1 pull request, 14 comments, 6 participants]

GitHub stats
huggingface/transformers#44488 · Fetched 2026-04-08 00:28:06
Comments: 14 · Participants: 6 · Timeline: 34 · Reactions: 0
Timeline (top): commented ×14, mentioned ×8, subscribed ×8, closed ×1

Error Message

>>> from transformers import AutoTokenizer
>>> bert_tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
Traceback (most recent call last):
  (earlier frames omitted; the full traceback is under Reproduction)
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/camembert/tokenization_camembert.py", line 118, in __init__
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/camembert/tokenization_camembert.py", line 118, in <genexpr>
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
                               ^^^^^^^^
ValueError: too many values to unpack (expected 2)

Fix Action

Fixed

PR fix notes

PR #44800: fix: handle dict vocab in CamembertTokenizer for tokenizer.json (#44488)

Description (problem / solution / changelog)

Summary

Fixes #44488

CamembertTokenizer raised ValueError: too many values to unpack (expected 2) when loading models like cjvt/sleng-bert that provide vocab as a dict {token: id} from tokenizer.json (BPE format). The tokenizer expected a list of (token, score) tuples for Unigram.

Root cause

When AutoTokenizer.from_pretrained loads a model with tokenizer.json, convert_to_native_format passes vocab as a dict. CamembertTokenizer assumed list format and unpacked (tok, _) = token_string, causing the error.
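The failure can be reproduced without transformers. The following is a minimal sketch that assumes only what the traceback shows: the expression on line 118 iterates enumerate(self._vocab) and unpacks each element as (tok, _). The vocabulary contents and the function name find_unk_index are made up for illustration.

```python
# Hypothetical miniature vocabularies; the real ones come from the model files.
unigram_vocab = [("<s>", 0.0), ("<unk>", 0.0), ("▁world", -2.0)]  # list of (token, score)
bpe_vocab = {"<s>": 0, "<unk>": 1, "▁world": 2}                   # dict {token: id}


def find_unk_index(vocab, unk_token="<unk>"):
    # Same expression as line 118 of tokenization_camembert.py in the traceback.
    return next((i for i, (tok, _) in enumerate(vocab) if tok == str(unk_token)), 0)


# List input: every element is a 2-tuple, so the unpacking succeeds.
print(find_unk_index(unigram_vocab))

try:
    find_unk_index(bpe_vocab)
except ValueError as err:
    # Iterating a dict yields its keys, so the string "<s>" (3 characters)
    # is unpacked into two names, which raises ValueError.
    print(f"ValueError: {err}")
```

The dict case fails on the very first key, because a token string is treated as a sequence of characters rather than as a (token, score) pair.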

Fix

Handle dict vocab by converting to list of (token, 0.0) tuples in id order before passing to Unigram.
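A minimal sketch of that conversion, assuming the dict maps each token to an integer id as described in the Summary. The helper name dict_vocab_to_unigram_list is hypothetical and is not taken from PR #44800; the actual patch may differ.

```python
def dict_vocab_to_unigram_list(vocab):
    """Return vocab as a list of (token, score) tuples ordered by token id.

    Hypothetical helper for illustration; not the code from the PR.
    """
    if isinstance(vocab, dict):
        # Sort by id and give every token a placeholder score of 0.0.
        # List position i matches token id i only if ids are contiguous from 0.
        return [(token, 0.0) for token, _id in sorted(vocab.items(), key=lambda kv: kv[1])]
    # Already (token, score) pairs: normalise to tuples and pass through.
    return [(token, float(score)) for token, score in vocab]


bpe_vocab = {"▁world": 2, "<unk>": 1, "<s>": 0}  # made-up vocab, deliberately unordered
as_list = dict_vocab_to_unigram_list(bpe_vocab)
print(as_list)

# The lookup from the traceback now works because every element is a 2-tuple.
unk_index = next((i for i, (tok, _) in enumerate(as_list) if tok == "<unk>"), 0)
print(unk_index)
```

Sorting by id rather than relying on dict insertion order keeps the list position of each token aligned with its id, which is what the unk_index lookup depends on.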

Testing

  • Added test_camembert_tokenizer_with_dict_vocab in test_tokenization_camembert.py
  • Manually verified CamembertTokenizer(vocab=dict_from_sleng_bert) loads successfully

Changed files

  • src/transformers/models/camembert/tokenization_camembert.py (modified, +14/-2)
  • tests/models/camembert/test_tokenization_camembert.py (modified, +22/-0)

System Info

broken config:

Python 3.13.5
tokenizers 0.22.2
transformers 5.2.0
torch 2.7.1+cu118

working config:

Python 3.13.5
tokenizers 0.22.1
transformers 4.57.1
torch 2.8.0+cu129

Who can help?

@ArthurZucker @Cyrilvallez

Reproduction

>>> from transformers import AutoTokenizer
>>> bert_tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
Traceback (most recent call last):
  File "<python-input-2>", line 1, in <module>
    bert_tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 749, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~^
        resolved_vocab_files,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/camembert/tokenization_camembert.py", line 118, in __init__
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/camembert/tokenization_camembert.py", line 118, in <genexpr>
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
                               ^^^^^^^^
ValueError: too many values to unpack (expected 2)

Expected behavior

Loading the model would be great! The older version of transformers works fine.

extent analysis

Fix Plan

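Upgrade transformers, or pin the working versions

The traceback ends inside transformers (models/camembert/tokenization_camembert.py), and PR #44800 fixes the bug there, which points to transformers 5.2.0 rather than tokenizers 0.22.2 as the cause. Downgrading tokenizers alone is therefore unlikely to help. The preferred fix is to upgrade to a transformers release that includes PR #44800. Until then, pinning the working configuration reported in the issue (transformers 4.57.1 with tokenizers 0.22.1) is a temporary workaround.

Steps

  1. Install the working versions: Run pip install "transformers==4.57.1" "tokenizers==0.22.1"
  2. Verify the installation: Run pip show transformers tokenizers to confirm the installed versions.

Example Code

import subprocess
import sys

# Pin the "working config" versions reported in the issue.
# Call pip through the current interpreter; pip.main() is not available in pip 10 and later.
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "transformers==4.57.1", "tokenizers==0.22.1",
])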

Verify the Fix

  1. Re-run the original code: Run python -c "from transformers import AutoTokenizer; bert_tokenizer = AutoTokenizer.from_pretrained('cjvt/sleng-bert')"
  2. Check for errors: If the code runs without errors, the fix is successful.
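Because the working and broken configurations differ in several packages, it can help to record the installed versions before and after the change. This is a small sketch using only the standard library; the function name installed_versions is an illustrative choice, and the package list mirrors the System Info section.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "tokenizers", "torch")):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Comparing this output against the "working config" and "broken config" lists shows at a glance which combination the current environment matches.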

Extra Tips

  • Always check the version of dependencies before reporting issues.
  • Downgrading dependencies can be a temporary solution, but it's essential to investigate the root cause and upgrade dependencies as soon as possible.
  • Consider using a virtual environment to manage dependencies and avoid conflicts.
