transformers - ✅(Solved) Fix [BUG] MLukeTokenizer fails with AttributeError on tasks [1 pull requests, 1 participants]

Q: Expected behavior

→ Entity classification task should pass without `AttributeError`.

transformers · 2026-02-28 19:58:16


GitHub stats

huggingface/transformers#44361 • Fetched 2026-04-08 00:29:00


Author

harshaljanjani

Participants

harshaljanjani

Timeline (top)

closed ×1, cross-referenced ×1, labeled ×1

Error Message

from transformers import MLukeTokenizer

try:
    tokenizer = MLukeTokenizer.from_pretrained(
        "studio-ousia/mluke-base", task="entity_classification"
    )
    sentence = "Japanese is an East Asian language spoken by about 128 million people, primarily in Japan."
    span = (15, 34)
    encoding = tokenizer(sentence, entity_spans=[span])
    print(tokenizer.decode(encoding["input_ids"], spaces_between_special_tokens=False))
except Exception as e:
    print(e)

Fix Action

Fixed

Fixed by PR: fix(tokenizer): Fix MLukeTokenizer AttributeError post-v5 refactor (https://github.com/huggingface/transformers/pull/44362)

PR fix notes

PR #44362: fix(tokenizer): Fix MLukeTokenizer AttributeError post-v5 refactor

Repository: huggingface/transformers
Author: harshaljanjani
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/44362

Description (problem / solution / changelog)

What does this PR do?

The following failing MLukeTokenizer use case was identified and fixed in this PR:

→ MIGRATION_GUIDE_V5.md states that v5 renamed additional_special_tokens to extra_special_tokens internally as part of the tokenizer refactor, but tokenization_mluke.py (amongst other instances) still references self.additional_special_tokens_ids, which is the only remaining call site under the old name that needs fixing :)

→ For more details on reproducing the bug and the output screenshots, please visit the linked issue!

Fixes #44361.

Changed files

src/transformers/models/mluke/tokenization_mluke.py (modified, +4/-6)

System Info

transformers version: 5.0.0.dev0
Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python version: 3.12.3
huggingface_hub version: 1.3.2
safetensors version: 0.7.0
accelerate version: 1.12.0
Accelerate config: not installed
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
GPU type: NVIDIA L4
NVIDIA driver version: 550.90.07
CUDA version: 12.4
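Since the report is specific to the v5 line, a quick first step is to check which transformers version is installed. The sketch below uses only the standard library; `installed_version` and `major_version` are illustrative helper names. A version string alone cannot tell you whether a particular dev build already contains the fix.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version: Optional[str]) -> Optional[int]:
    """Extract the leading integer of a version string such as '5.0.0.dev0'."""
    if not version:
        return None
    head = version.split(".", 1)[0]
    return int(head) if head.isdigit() else None


version = installed_version("transformers")
print("transformers:", version, "| major:", major_version(version))
```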

Reproduction

from transformers import MLukeTokenizer

try:
    tokenizer = MLukeTokenizer.from_pretrained(
      "studio-ousia/mluke-base", task="entity_classification"
    )
    sentence = "Japanese is an East Asian language spoken by about 128 million people, primarily in Japan."
    span = (15, 34)
    encoding = tokenizer(sentence, entity_spans=[span])
    print(tokenizer.decode(encoding["input_ids"], spaces_between_special_tokens=False))
except Exception as e:
    print(e)
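The `entity_spans` argument takes character-based (start, end) offsets into the sentence, so the hard-coded `(15, 34)` can be sanity-checked without loading the tokenizer. A small sketch, where `find_span` is a hypothetical helper and not part of transformers:

```python
def find_span(text: str, mention: str) -> tuple[int, int]:
    """Return the (start, end) character offsets of the first occurrence of `mention`."""
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"{mention!r} not found in text")
    return start, start + len(mention)


sentence = "Japanese is an East Asian language spoken by about 128 million people, primarily in Japan."
span = find_span(sentence, "East Asian language")
print(span, repr(sentence[span[0]:span[1]]))  # (15, 34) 'East Asian language'
```

This confirms that the span in the reproduction marks the mention "East Asian language".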

MIGRATION_GUIDE_V5.md states that v5 renamed additional_special_tokens to extra_special_tokens internally as part of the tokenizer refactor; but tokenization_mluke.py (amongst other instances) still references self.additional_special_tokens_ids, which is the only remaining call site under the old name that needs fixing :)

Current Output: (posted as a screenshot in the linked issue; not captured here)

Expected behavior

→ Entity classification task should pass without AttributeError.

Output After the Fix: (posted as a screenshot in the linked issue; not captured here)

extent analysis

Problem Summary

MLukeTokenizer raises an AttributeError during entity classification because tokenization_mluke.py still reads self.additional_special_tokens_ids, an attribute name that predates the v5 tokenizer refactor.

Root Cause Analysis

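The root cause is a stale reference rather than a naming conflict: per MIGRATION_GUIDE_V5.md, v5 renamed additional_special_tokens to extra_special_tokens internally, but tokenization_mluke.py was not updated and still accesses the old name through self.additional_special_tokens_ids.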
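The failure mode can be reproduced in isolation with a toy class hierarchy. This is an illustration only: `RenamedBase`, `StaleSubclass`, `UpdatedSubclass`, and `call_outcome` are hypothetical names, and the classes are not the transformers implementation.

```python
class RenamedBase:
    """Hypothetical base class after a rename: only the new attribute name exists."""

    def __init__(self):
        self.extra_special_tokens = ["<ent>"]  # placeholder value


class StaleSubclass(RenamedBase):
    def special_tokens(self):
        # Still written against the pre-rename name, so this lookup fails.
        return self.additional_special_tokens


class UpdatedSubclass(RenamedBase):
    def special_tokens(self):
        # Uses the post-rename name provided by the base class.
        return self.extra_special_tokens


def call_outcome(obj):
    """Return ("ok", value) on success or ("AttributeError", message) on failure."""
    try:
        return "ok", obj.special_tokens()
    except AttributeError as exc:
        return "AttributeError", str(exc)


print(call_outcome(StaleSubclass()))
print(call_outcome(UpdatedSubclass()))
```

The stale subclass fails only when the method runs, which is why the error appears at tokenization time and not when the class is imported.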

Fix Plan

To fix this issue, either update tokenization_mluke.py so that it no longer relies on the old attribute name, or install a transformers build that includes the merged PR #44362.

Step-by-Step Solution

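Update the MLukeTokenizer class (if you are patching transformers yourself):
- Open src/transformers/models/mluke/tokenization_mluke.py in the transformers repository.
- Replace the references to self.additional_special_tokens_ids with the v5 equivalent. The migration guide gives the renamed attribute as extra_special_tokens; the exact replacement used upstream is in PR #44362 (+4/-6 lines in this file).
- Commit the changes and push them to your fork.
Update your environment (if you only use the library):
- Install a transformers build that includes PR #44362.
- Leave the MLukeTokenizer.from_pretrained(...) call unchanged. No extra argument is needed; in particular, use_auth_token only concerns Hub authentication and has no effect on this error.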
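Before and after patching, it can help to list every remaining reference to the old name in a source tree. The sketch below is a rough, grep-like scan; `find_references` is a hypothetical helper, and the word-boundary match means a search for one identifier does not also match longer identifiers that contain it.

```python
import re
from pathlib import Path


def find_references(root, identifier):
    """Return (path, line_number, line) for each line under `root` that mentions `identifier`."""
    pattern = re.compile(rf"\b{re.escape(identifier)}\b")
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Scans the current directory; point this at a transformers checkout to use it.
    for path, lineno, line in find_references(".", "additional_special_tokens_ids"):
        print(f"{path}:{lineno}: {line}")
```

In a transformers checkout, a narrower call such as `find_references("src/transformers/models/mluke", "additional_special_tokens_ids")` should return no hits once the old name has been replaced.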

Example Code

from transformers import MLukeTokenizer

try:
    # Same call as in the reproduction: the fix lives in the library, not in user code.
    tokenizer = MLukeTokenizer.from_pretrained(
        "studio-ousia/mluke-base", task="entity_classification"
    )
    sentence = "Japanese is an East Asian language spoken by about 128 million people, primarily in Japan."
    span = (15, 34)  # character offsets of "East Asian language"
    encoding = tokenizer(sentence, entity_spans=[span])
    print(tokenizer.decode(encoding["input_ids"], spaces_between_special_tokens=False))
except Exception as e:
    print(e)

Verification

To verify the fix, run the script again. The AttributeError should be gone, and the script should print the decoded input instead of an error message.
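The reproduction script catches every exception and prints only its message, which can make it hard to tell whether the AttributeError is gone or has been replaced by a different failure. A small wrapper, sketched here with the hypothetical name `run_check`, keeps the two cases apart and preserves the traceback:

```python
import traceback


def run_check(fn):
    """Run `fn` and classify the outcome as passed, attribute_error, or other_error."""
    try:
        result = fn()
    except AttributeError:
        return "attribute_error", traceback.format_exc()
    except Exception:
        return "other_error", traceback.format_exc()
    return "passed", repr(result)


def _stale_lookup():
    # Stand-in for a call that still hits a removed attribute.
    return object().additional_special_tokens_ids


status, detail = run_check(_stale_lookup)
print(status)
print(detail)
```

To use it for this issue, move the tokenizer calls from the reproduction into a function and pass that function to `run_check`; a status of `passed` corresponds to the expected behavior.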
