transformers - 💡(How to fix) transformers >= 5.0.0 fails loading tokenizer for EMBEDDIA/est-roberta [6 comments, 3 participants]

GitHub stats
huggingface/transformers#44991 · Fetched 2026-04-08 01:26:13
Comments: 6 · Participants: 3 · Timeline events: 10 · Reactions: 0
Timeline (top): commented ×6, closed ×1, labeled ×1, mentioned ×1

Code Example

>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/est-roberta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 749, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\models\camembert\tokenization_camembert.py", line 118, in __init__
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\models\camembert\tokenization_camembert.py", line 118, in <genexpr>
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
                               ^^^^^^^^
ValueError: too many values to unpack (expected 2)

System Info

  • transformers version: 5.3.0
  • Platform: Windows-11-10.0.26200-SP0
  • Python version: 3.12.13
  • Huggingface_hub version: 1.7.2
  • Safetensors version: 0.7.0
  • Accelerate version: not installed
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.11.0+cpu (NA)
  • Using distributed or parallel set-up in script?: no

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

AutoTokenizer fails to load the tokenizer of the model "EMBEDDIA/est-roberta". The problem seems to be related to the tokenizer API changes introduced in Transformers v5, as loading works fine in v4 (I tested it on transformers 4.57.6).

>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/est-roberta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 749, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\models\camembert\tokenization_camembert.py", line 118, in __init__
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Programmid\Miniconda3\envs\py312_transformers_problem\Lib\site-packages\transformers\models\camembert\tokenization_camembert.py", line 118, in <genexpr>
    unk_index = next((i for i, (tok, _) in enumerate(self._vocab) if tok == str(unk_token)), 0)
                               ^^^^^^^^
ValueError: too many values to unpack (expected 2)

Expected behavior

Loading the tokenizer succeeds gracefully :)

extent analysis

Fix Plan

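The traceback ends in tokenization_camembert.py, where CamembertTokenizer.__init__ unpacks each entry of self._vocab as a two-item (token, score) pair. At least one entry loaded for EMBEDDIA/est-roberta has more than two items, so the unpacking raises ValueError. Because loading works on transformers 4.57.6, pinning to a v4 release is the only workaround the report itself confirms; on v5, try a different tokenizer loading path.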
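To make the failure mode concrete, here is a minimal sketch in plain Python that reuses the expression from line 118 of the traceback. The three vocabulary shapes are hypothetical illustrations; the issue does not say which shape the EMBEDDIA/est-roberta files actually produce.

```python
def find_unk_index(vocab, unk_token="<unk>"):
    # Same expression as the failing line in the traceback:
    # every entry of `vocab` must unpack into exactly two values.
    return next((i for i, (tok, _) in enumerate(vocab) if tok == str(unk_token)), 0)


def try_shape(label, vocab):
    # Report either the index found or the ValueError raised.
    try:
        return f"{label}: unk_index={find_unk_index(vocab)}"
    except ValueError as exc:
        return f"{label}: ValueError: {exc}"


# Hypothetical vocabulary shapes (assumptions, not taken from the model files).
pairs = [("<s>", 0.0), ("<unk>", 0.0), ("tere", -3.2)]   # (token, score) pairs
triples = [("<s>", 0.0, 0), ("<unk>", 0.0, 1)]           # three items per entry
mapping = {"<s>": 0, "<unk>": 1, "tere": 2}              # iterating yields string keys

for label, vocab in [("pairs", pairs), ("triples", triples), ("mapping", mapping)]:
    print(try_shape(label, vocab))
```

Only the list of pairs unpacks cleanly. A three-item entry fails, and so does a dict, because iterating a dict yields its string keys and a key such as "<s>" is itself a sequence of more than two characters.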

  • Check the model card for EMBEDDIA/est-roberta to find the tokenizer class it expects.
  • Try passing use_fast=False to AutoTokenizer to request the slow tokenizer. The issue does not confirm that this avoids the error on v5.

from transformers import AutoTokenizer

# Request the slow tokenizer instead of the default
tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/est-roberta", use_fast=False)

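Alternatively, you can try loading a specific tokenizer class directly instead of going through AutoTokenizer, for example RobertaTokenizer. The issue does not confirm that this class matches the tokenizer files shipped with the model:

from transformers import RobertaTokenizer

# Load the tokenizer with an explicit class
tokenizer = RobertaTokenizer.from_pretrained("EMBEDDIA/est-roberta")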
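Since none of the options above is confirmed to work, it can help to try them in order and keep the error from each attempt. The following is a sketch only: `first_working_loader` and `est_roberta_loaders` are hypothetical helper names, and the `transformers` import is deferred so the helper itself has no dependencies.

```python
def first_working_loader(loaders):
    """Try each (label, zero-argument callable) in order.

    Returns (label, result) for the first callable that succeeds and
    raises RuntimeError listing every failure if none does.
    """
    errors = []
    for label, load in loaders:
        try:
            return label, load()
        except Exception as exc:  # broad on purpose: we want every failure recorded
            errors.append(f"{label}: {type(exc).__name__}: {exc}")
    raise RuntimeError("all loaders failed:\n" + "\n".join(errors))


def est_roberta_loaders(model_id="EMBEDDIA/est-roberta"):
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoTokenizer, RobertaTokenizer

    return [
        ("AutoTokenizer", lambda: AutoTokenizer.from_pretrained(model_id)),
        ("AutoTokenizer(use_fast=False)",
         lambda: AutoTokenizer.from_pretrained(model_id, use_fast=False)),
        ("RobertaTokenizer", lambda: RobertaTokenizer.from_pretrained(model_id)),
    ]


# Usage (requires transformers and network access):
# label, tokenizer = first_working_loader(est_roberta_loaders())
# print("loaded via", label)
```

If every attempt fails, the collected messages show whether each path hits the same ValueError or a different problem, which is useful detail to add to the issue.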

Verification

To verify that the fix worked, you can try to load the tokenizer and use it to encode a sentence:

input_text = "This is a test sentence."
inputs = tokenizer(input_text, return_tensors="pt")

print(inputs)

If the tokenizer loads successfully and encodes the sentence without errors, the fix has worked.
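A slightly stricter check is to decode the ids again and compare the result with the input by eye. This is a sketch under assumptions: `smoke_test` is a hypothetical helper, the Estonian sample sentence is arbitrary, and the decoded text need not match the input exactly, since tokenizers may normalize whitespace or casing.

```python
def smoke_test(tokenizer, text="Tere, maailm!"):
    """Encode `text`, decode it again, and return both for inspection."""
    encoded = tokenizer(text)                      # plain Python lists, no tensors needed
    ids = list(encoded["input_ids"])
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return {"num_tokens": len(ids), "ids": ids, "decoded": decoded}


# Usage, with `tokenizer` loaded as in the fix plan:
# report = smoke_test(tokenizer)
# print(report["num_tokens"], repr(report["decoded"]))
```

An empty id list, or a decoded string made up mostly of unknown-token markers, would suggest that the tokenizer loaded but with the wrong vocabulary.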

Extra Tips

  • Make sure to check the model's documentation for the recommended tokenizer and configuration.
  • If you're using a custom model, ensure that the tokenizer is compatible with the model's architecture.
  • You can also try to update the transformers library to the latest version to see if the issue is resolved.
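Before upgrading or pinning, it is worth confirming which release is actually installed, since the report ties the failure to v5 and the success to 4.57.6. A minimal sketch using only the standard library; `major_version` and `installed_transformers_version` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    # "5.3.0" -> 5, "4.57.6" -> 4
    return int(version_string.split(".")[0])


def installed_transformers_version():
    # Returns the installed version string, or None if the package is absent.
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is None:
    print("transformers is not installed in this environment")
elif major_version(installed) >= 5:
    print(f"transformers {installed}: the release line where the issue was reported")
else:
    print(f"transformers {installed}: a v4 release, where the report says loading works")
```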
