transformers - ✅ (Solved) Fix: _patch_mistral_regex crashes with AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer' when loading a Mistral tokenizer with fix_mistral_regex=True [2 pull requests, 1 participant]

GitHub stats
huggingface/transformers#45081Fetched 2026-04-08 01:45:21
Comments: 0 · Participants: 1 · Timeline: 6 · Reactions: 0
Timeline (top): mentioned ×2, subscribed ×2, cross-referenced ×1, labeled ×1

Error Message

Traceback (most recent call last):
  File "/mnt/RAPID/tmp/repro_mistral_regex_bug.py", line 23, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
        "mistralai/Mistral-Nemo-Instruct-2407",
        trust_remote_code=True,
        fix_mistral_regex=True,
    )
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_from_name(tokenizer_config_class).from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path, *inputs, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~^
        resolved_vocab_files,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 477, in __init__
    self._tokenizer = self._patch_mistral_regex(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^
        self._tokenizer,
        ^^^^^^^^^^^^^^^^
    ...<3 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 1363, in _patch_mistral_regex
    current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

Root Cause

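`_patch_mistral_regex` is handed `self._tokenizer`, which is already the raw `tokenizers.Tokenizer` (Rust) object, but the method reads `tokenizer.backend_tokenizer.pre_tokenizer`. The `.backend_tokenizer` property exists only on the Python-level wrapper (`PreTrainedTokenizerFast` / `TokenizersBackend`), so the lookup raises `AttributeError`.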

Fix Action

Fix / Workaround

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    trust_remote_code=True,
    fix_mistral_regex=True,
)


In `tokenization_utils_tokenizers.py`, `_patch_mistral_regex` is called from `__init__` as:

```python
# line ~477
self._tokenizer = self._patch_mistral_regex(
    self._tokenizer,   # <-- this is a raw tokenizers.Tokenizer (Rust object)
    ...
)
```

PR fix notes

PR #45086: fix AttributeError in _patch_mistral_regex

Description (problem / solution / changelog)

The function accesses backend_tokenizer.pre_tokenizer, but the tokenizer passed in is already the raw Rust object, so it should access pre_tokenizer directly. Fixes #45081.

Changed files

  • src/transformers/tokenization_utils_tokenizers.py (modified, +3/-3)

PR #45317: Fix AttributeError in _patch_mistral_regex when fix_mistral_regex=True

Description (problem / solution / changelog)

Fixes #45081

Problem

Loading a Mistral tokenizer with fix_mistral_regex=True crashes because _patch_mistral_regex receives a raw tokenizers.Tokenizer but tries to access .backend_tokenizer.pre_tokenizer on it — that attribute only exists on PreTrainedTokenizerFast.

Fix

Removed the .backend_tokenizer indirection (3 lines) since the tokenizer passed in is already the backend tokenizer.

AI tools helped locate the bug. I reviewed and understood the fix.

Before submitting

  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Answer: Issue #45081

Who can review?

@ArthurZucker @itazap (tokenizers)

Changed files

  • src/transformers/tokenization_utils_tokenizers.py (modified, +3/-3)
  • tests/models/auto/test_tokenization_auto.py (modified, +21/-0)

Code Example

import transformers

print(f"transformers version: {transformers.__version__}")
print("Loading Mistral tokenizer with fix_mistral_regex=True ...")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    trust_remote_code=True,
    fix_mistral_regex=True,
)

---

# line ~477
self._tokenizer = self._patch_mistral_regex(
    self._tokenizer,   # <-- this is a raw tokenizers.Tokenizer (Rust object)
    ...
)

---

current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer  # BUG

---

# line 1363 — change:
current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
# to:
current_pretokenizer = tokenizer.pre_tokenizer

System Info

  • transformers version: 5.4.0
  • Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.35
  • Python version: 3.13.5
  • Huggingface_hub version: 1.8.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Using distributed or parallel set-up in script?: <fill in>
  • Using GPU in script?: <fill in>
  • GPU type: NVIDIA A100-PCIE-40GB

Who can help?

@ArthurZucker @itazap

Reproduction

import transformers

print(f"transformers version: {transformers.__version__}")
print("Loading Mistral tokenizer with fix_mistral_regex=True ...")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    trust_remote_code=True,
    fix_mistral_regex=True,
)

Traceback (most recent call last):
  File "/mnt/RAPID/tmp/repro_mistral_regex_bug.py", line 23, in <module>
    tokenizer = AutoTokenizer.from_pretrained(
        "mistralai/Mistral-Nemo-Instruct-2407",
        trust_remote_code=True,
        fix_mistral_regex=True,
    )
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_from_name(tokenizer_config_class).from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path, *inputs, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1721, in from_pretrained
    return cls._from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~^
        resolved_vocab_files,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_base.py", line 1910, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 477, in __init__
    self._tokenizer = self._patch_mistral_regex(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~^
        self._tokenizer,
        ^^^^^^^^^^^^^^^^
    ...<3 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/root/miniconda3/envs/myconda/lib/python3.13/site-packages/transformers/tokenization_utils_tokenizers.py", line 1363, in _patch_mistral_regex
    current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'

Expected behavior

fix_mistral_regex=True should successfully replace the incorrect pre-tokenizer regex pattern in the Mistral tokenizer without raising any error.


Root cause analysis and suggested fix

In tokenization_utils_tokenizers.py, _patch_mistral_regex is called from __init__ as:

# line ~477
self._tokenizer = self._patch_mistral_regex(
    self._tokenizer,   # <-- this is a raw tokenizers.Tokenizer (Rust object)
    ...
)

Inside _patch_mistral_regex, line 1363 then does:

current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer  # BUG

But tokenizer here is already self._tokenizer — the raw Rust tokenizers.Tokenizer object. The .backend_tokenizer property exists on the Python-level PreTrainedTokenizerFast / TokenizersBackend wrapper, not on the underlying Rust object itself.

Fix: access .pre_tokenizer directly, since the argument is already the backend tokenizer:

# line 1363 — change:
current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
# to:
current_pretokenizer = tokenizer.pre_tokenizer

And correspondingly update any write-back in the same method that goes through .backend_tokenizer to use the object directly.
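
The wrapper-versus-backend distinction can be reproduced without downloading a model. The sketch below uses two hypothetical stand-in classes (`FakeBackend` and `FakeWrapper`, which are not part of transformers or tokenizers) that mirror only the attribute layout described above. It illustrates the failure mode and is not the library's code.

```python
class FakeBackend:
    """Stand-in for the raw tokenizers.Tokenizer: exposes pre_tokenizer only."""

    def __init__(self, pre_tokenizer):
        self.pre_tokenizer = pre_tokenizer


class FakeWrapper:
    """Stand-in for the Python-level wrapper that owns a backend object."""

    def __init__(self, backend):
        self._tokenizer = backend

    @property
    def backend_tokenizer(self):
        return self._tokenizer


def read_pretokenizer_buggy(tokenizer):
    # Mirrors the failing line: assumes a wrapper was passed in.
    return tokenizer.backend_tokenizer.pre_tokenizer


def read_pretokenizer_fixed(tokenizer):
    # Mirrors the proposed change: the argument is already the backend.
    return tokenizer.pre_tokenizer


backend = FakeBackend(pre_tokenizer="old-regex")
wrapper = FakeWrapper(backend)

# The buggy accessor works on a wrapper but fails on the backend itself.
assert read_pretokenizer_buggy(wrapper) == "old-regex"
try:
    read_pretokenizer_buggy(backend)
except AttributeError as exc:
    error_message = str(exc)
else:
    error_message = None

# The fixed accessor works on the backend, which is what __init__ passes.
fixed_value = read_pretokenizer_fixed(backend)
print(error_message)
print(fixed_value)
```

The same `AttributeError` shape appears because the backend stand-in, like the Rust object, simply has no `backend_tokenizer` attribute to traverse.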

extent analysis

Fix Plan

Until the installed release includes the fix from the PRs above, the crash can be worked around by patching tokenization_utils_tokenizers.py locally:

  • Open tokenization_utils_tokenizers.py in the transformers installation that your script actually imports (in the traceback above, the site-packages/transformers directory of the active environment).
  • Find the _patch_mistral_regex method.
  • Change the line current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer to current_pretokenizer = tokenizer.pre_tokenizer.
  • Update any write-back lines in the same method that go through .backend_tokenizer so that they use the object directly.

Example code snippet:

# Before
current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer

# After
current_pretokenizer = tokenizer.pre_tokenizer

Make sure to update all occurrences of .backend_tokenizer in the _patch_mistral_regex method.
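
One way to confirm that no occurrence was missed is to scan the method's source text for the leftover attribute. The helper below is a minimal sketch: `find_occurrences` and the `SAMPLE` text are hypothetical and written for this example; the sample is not the real body of `_patch_mistral_regex`.

```python
def find_occurrences(source, needle=".backend_tokenizer"):
    """Return (line_number, stripped_line) for every line containing needle."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if needle in line
    ]


# Hypothetical method text with one read and one write-back through the wrapper.
SAMPLE = '''\
def _patch_mistral_regex(tokenizer):
    current_pretokenizer = tokenizer.backend_tokenizer.pre_tokenizer
    tokenizer.backend_tokenizer.pre_tokenizer = current_pretokenizer
    return tokenizer
'''

hits_before = find_occurrences(SAMPLE)
hits_after = find_occurrences(SAMPLE.replace(".backend_tokenizer", ""))

for number, text in hits_before:
    print(f"line {number}: {text}")
print("remaining after edit:", len(hits_after))
```

On a real installation the source text could be read from the edited file, or obtained with `inspect.getsource` on the method, and passed to the same helper.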

Verification

To verify that the fix worked, run your original code again:

import transformers

print(f"transformers version: {transformers.__version__}")
print("Loading Mistral tokenizer with fix_mistral_regex=True ...")

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",
    trust_remote_code=True,
    fix_mistral_regex=True,
)

If the fix is successful, the code should run without raising any errors.
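
To distinguish this specific crash from unrelated failures during verification, the load can be wrapped in a small check. This is a sketch under assumptions: `classify_load` is a hypothetical helper, and the two loaders used here only simulate the outcomes so the example runs without transformers installed.

```python
def classify_load(load):
    """Run a zero-argument loader and report whether the known crash occurred."""
    try:
        load()
    except AttributeError as exc:
        if "backend_tokenizer" in str(exc):
            return "still broken"
        raise  # a different AttributeError is not this bug
    return "ok"


def _broken_loader():
    # Simulates the reported failure without importing transformers.
    raise AttributeError(
        "'tokenizers.Tokenizer' object has no attribute 'backend_tokenizer'"
    )


status_before = classify_load(_broken_loader)
status_after = classify_load(lambda: object())
print(status_before, status_after)
```

With transformers installed, the loader passed in would be a lambda that wraps the `AutoTokenizer.from_pretrained(...)` call from the reproduction.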

Extra Tips

  • Prefer upgrading transformers to a release that contains the fix, if one is available, over keeping a hand-edited copy.
  • If you are using a virtual environment, make sure the edit is made in the environment your script runs in.
  • A manual edit to an installed package is overwritten when transformers is reinstalled or upgraded with pip, so reapply it (or confirm the new version already includes the fix) after any reinstall.
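
To check which environment and which file would actually be edited, the active interpreter and the module's location can be printed. The sketch below uses only the standard library; `locate_module` is a hypothetical helper name, and it returns None when the package is not installed in the current environment.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package of a dotted name is missing.
        return None
    return spec.origin if spec is not None else None


print("interpreter:", sys.executable)
print("target file:", locate_module("transformers.tokenization_utils_tokenizers"))
```

If the printed path is not inside the environment you expect, the edit would land in a copy your script never imports.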


