transformers - ✅ (Solved) Fix Current version also does not load "cjvt/sleng-bert" [1 pull request, 14 comments, 6 participants]

transformers · 2026-03-06 08:36:44


GitHub stats

huggingface/transformers#44488•Fetched 2026-04-08 00:28:06


Participants

michaelanderson01826-glitch

Rocketknight1

Timeline (top)

commented ×14, mentioned ×8, subscribed ×8, closed ×1

Error Message

>>> from transformers import AutoTokenizer
>>> bert_tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
Traceback (most recent call last):
  File "<python-input-2>", line 1, in <module>
    bert_tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 749, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~^
        resolved_vocab_files,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/camembert/tokenization_camembert.py", line 118, in __init__
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/camembert/tokenization_camembert.py", line 118, in <genexpr>
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
                               ^^^^^^^^
ValueError: too many values to unpack (expected 2)

Fix Action

Fixed

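Linked PR: fix: handle dict vocab in CamembertTokenizer for tokenizer.json (#44488) (https://github.com/huggingface/transformers/pull/44800). The PR notes below list it as closed and not merged, so treat it as a proposed fix rather than the change that shipped.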

PR fix notes

PR #44800: fix: handle dict vocab in CamembertTokenizer for tokenizer.json (#44488)

Repository: huggingface/transformers
Author: aayushbaluni
State: closed | merged: False
Link: https://github.com/huggingface/transformers/pull/44800

Description (problem / solution / changelog)

Summary

Fixes #44488

CamembertTokenizer raised ValueError: too many values to unpack (expected 2) when loading models like cjvt/sleng-bert that provide vocab as a dict {token: id} from tokenizer.json (BPE format). The tokenizer expected a list of (token, score) tuples for Unigram.

Root cause

When AutoTokenizer.from_pretrained loads a model with tokenizer.json, convert_to_native_format passes vocab as a dict. CamembertTokenizer assumed list format and unpacked (tok, _) = token_string, causing the error.
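The failure mode can be reproduced without transformers at all. The sketch below copies the shape of the expression from the traceback into a standalone function (find_unk_index is a hypothetical name, and both vocabularies are made-up toy data) and feeds it first a list of (token, score) pairs, then a dict. Iterating a dict yields its keys, so the inner unpacking is applied to a token string rather than a pair.

```python
def find_unk_index(vocab, unk_token="<unk>"):
    # Same expression shape as tokenization_camembert.py line 118 in the traceback.
    return next((i for i, (tok, _) in enumerate(vocab) if tok == str(unk_token)), 0)


# Toy data, not taken from any real model.
list_vocab = [("<s>", 0.0), ("<unk>", 0.0), ("hello", -1.5)]
dict_vocab = {"<s>": 0, "<unk>": 1, "hello": 2}

print(find_unk_index(list_vocab))  # 1

try:
    # enumerate(dict) yields (index, key); unpacking the key "<s>" into
    # (tok, _) fails because the string has three characters, not two.
    find_unk_index(dict_vocab)
except ValueError as exc:
    print("ValueError:", exc)
```

Note that a token of exactly two characters would unpack without an error, so the exception depends on which token happens to be iterated first.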

Fix

Handle dict vocab by converting to list of (token, 0.0) tuples in id order before passing to Unigram.
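A minimal sketch of the conversion described above, not the code from the PR: vocab_to_unigram_pairs is a hypothetical helper name, and the sample dict is toy data. It assumes the dict maps token to id and that a placeholder score of 0.0 is acceptable, as the PR description states.

```python
def vocab_to_unigram_pairs(vocab):
    """Hypothetical helper: normalise a vocab to a list of (token, score) pairs."""
    if isinstance(vocab, dict):
        # Sort by id so each token's list position matches its original id.
        ordered = sorted(vocab.items(), key=lambda item: item[1])
        # Scores are unknown for a {token: id} vocab, so use 0.0 as a placeholder.
        return [(token, 0.0) for token, _ in ordered]
    # Already a sequence of (token, score) pairs: return a plain list copy.
    return list(vocab)


dict_vocab = {"hello": 2, "<s>": 0, "<unk>": 1}
pairs = vocab_to_unigram_pairs(dict_vocab)
print(pairs)  # [('<s>', 0.0), ('<unk>', 0.0), ('hello', 0.0)]
```

List positions only equal the original ids when the ids are contiguous and start at 0; a vocab with gaps would need extra handling.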

Testing

Added test_camembert_tokenizer_with_dict_vocab in test_tokenization_camembert.py
Manually verified CamembertTokenizer(vocab=dict_from_sleng_bert) loads successfully

Made with Cursor

Changed files

src/transformers/models/camembert/tokenization_camembert.py (modified, +14/-2)
tests/models/camembert/test_tokenization_camembert.py (modified, +22/-0)



System Info

broken config:

Python 3.13.5
tokenizers 0.22.2
transformers 5.2.0
torch 2.7.1+cu118

working config:

Python 3.13.5
tokenizers 0.22.1
transformers 4.57.1
torch 2.8.0+cu129
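To compare an environment against the two configurations above, the installed versions can be read with the standard library. This is a small sketch; installed_versions is a hypothetical helper name, and the package list simply mirrors the ones reported in this issue.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "tokenizers", "torch")):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print("Python", platform.python_version())
for name, ver in installed_versions().items():
    print(name, ver or "not installed")
```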

Who can help?

@ArthurZucker @Cyrilvallez

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

>>> from transformers import AutoTokenizer
>>> bert_tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
Traceback (most recent call last):
  File "<python-input-2>", line 1, in <module>
    bert_tokenizer = AutoTokenizer.from_pretrained("cjvt/sleng-bert")
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 749, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~^
        resolved_vocab_files,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/camembert/tokenization_camembert.py", line 118, in __init__
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
  File "/nlp/scr/horatio/miniconda3/lib/python3.13/site-packages/transformers/models/camembert/tokenization_camembert.py", line 118, in <genexpr>
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
                               ^^^^^^^^
ValueError: too many values to unpack (expected 2)

Expected behavior

Loading the model would be great! The older version of transformers works fine.

extent analysis

Fix Plan

Downgrade transformers

The traceback and the PR notes above both point at CamembertTokenizer in transformers 5.2.0 receiving a dict vocab from tokenizer.json, not at the tokenizers patch release. Until you are on a transformers release that handles this case, reinstalling the reporter's working configuration (transformers 4.57.1 with tokenizers 0.22.1) should work around the failure.

Steps

Install the working versions: Run pip install "transformers==4.57.1" "tokenizers==0.22.1"
Verify the installation: Run pip show transformers tokenizers to confirm both versions.

Example Code

import subprocess
import sys

# pip.main() is not a supported API; call pip through the current
# interpreter so the packages land in the active environment.
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "transformers==4.57.1", "tokenizers==0.22.1",
])

Verify the Fix

Re-run the original code: Run python -c "from transformers import AutoTokenizer; bert_tokenizer = AutoTokenizer.from_pretrained('cjvt/sleng-bert')"
Check for errors: If the code runs without errors, the fix is successful.
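The check above can also be wrapped in a small function that reports the exception instead of printing a full traceback. This is a sketch: check_tokenizer_loads is a hypothetical helper, and the loader is passed in as an argument so the function itself does not import transformers.

```python
def check_tokenizer_loads(model_id, loader):
    """Call loader(model_id) and return (succeeded, detail)."""
    try:
        loader(model_id)
    except Exception as exc:  # report any failure rather than re-raising
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


# Intended use with the model from this issue (requires transformers and
# network access, so it is left commented out here):
#   from transformers import AutoTokenizer
#   print(check_tokenizer_loads("cjvt/sleng-bert", AutoTokenizer.from_pretrained))
```

On an affected install the second element of the result should carry the same ValueError message as the traceback in this issue.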

Extra Tips

Always check the version of dependencies before reporting issues.
Downgrading dependencies can be a temporary solution, but it's essential to investigate the root cause and upgrade dependencies as soon as possible.
Consider using a virtual environment to manage dependencies and avoid conflicts.
