transformers: [BUG] add_special_tokens=True doesn't add BOS/EOS tokens for microsoft/mdeberta-v3-base tokenizer in transformers >=5.0 (solved; 2 pull requests, 1 participant)

GitHub stats
huggingface/transformers#44568 (fetched 2026-04-08 00:27:36)
Comments: 0 · Participants: 1 · Timeline events: 9 · Reactions: 0
Timeline (top): cross-referenced ×2, mentioned ×2, subscribed ×2, closed ×1

In transformers >=5.0, add_special_tokens=True doesn't add special tokens for microsoft/mdeberta-v3-base tokenizer. This is a regression from v4.x.

Root Cause

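Per PR #44570: in transformers v5, DebertaV2Tokenizer was rewritten to use TokenizersBackend, but the post_processor that adds [CLS]/[SEP] was never configured, so add_special_tokens=True silently returns input IDs without special tokens. In v4 these tokens were added at the Python level by build_inputs_with_special_tokens(), which v5 removed.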

Fix Action

Fixed

PR fix notes

PR #44570: Fix missing post_processor in DebertaV2Tokenizer causing no special t…

Description (problem / solution / changelog)

What does this PR do?

In transformers v5, DebertaV2Tokenizer was rewritten to use TokenizersBackend, but the post_processor responsible for adding [CLS]/[SEP] tokens was never set. This causes add_special_tokens=True to silently produce output without special tokens for models like microsoft/mdeberta-v3-base.

The root cause: in v4, special tokens were added via build_inputs_with_special_tokens() (Python-level). In v5, this method was removed in favor of the Rust-level post_processor on the tokenizers backend, but for DebertaV2Tokenizer this processor was never configured. Other tokenizers like BertTokenizer set it correctly.

The fix adds a TemplateProcessing post-processor after super().__init__(), matching the pattern used by BertTokenizer and the template previously defined in DebertaV2Converter.post_processor().
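
The effect of a missing post-processor can be reproduced in isolation with the tokenizers library. The sketch below is not the PR's diff: it uses a toy WordLevel vocabulary (hypothetical, chosen so that [CLS]=1 and [SEP]=2 as in the issue) and assumes a BERT-style template. It only illustrates that add_special_tokens=True has nothing to add until a TemplateProcessing post-processor is attached.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocabulary (hypothetical); ids 1 and 2 mirror the [CLS]/[SEP] ids from the issue.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3, "hello": 4, "world": 5}

tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# No post-processor yet: requesting special tokens changes nothing.
ids_without = tok.encode("hello", add_special_tokens=True).ids

# Attach a BERT-style template (an assumption for this sketch).
tok.post_processor = TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

ids_with = tok.encode("hello", add_special_tokens=True).ids
pair_ids = tok.encode("hello", "world").ids

print(ids_without)  # [4]
print(ids_with)     # [1, 4, 2]
print(pair_ids)     # [1, 4, 2, 5, 2]
```

The first print corresponds to the broken v5 behavior described above, the second to the v4-compatible output.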

Fixes #44568

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. #44568
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @itazap — tokenizers

Changed files

  • src/transformers/models/deberta_v2/tokenization_deberta_v2.py (modified, +13/-1)

PR #44618: fix: Add BOS/EOS tokens by default for DeBERTa v2 tokenizer

Description (problem / solution / changelog)

Summary

Set add_bos_token=True and add_eos_token=True by default in DebertaV2Tokenizer to fix the regression where add_special_tokens=True doesn't add BOS/EOS tokens for microsoft/mdeberta-v3-base tokenizer in transformers >=5.0.

Root Cause

In transformers v5, the tokenizer refactoring changed the default behavior for adding BOS/EOS tokens. The DebertaV2Tokenizer class was not setting the default values for add_bos_token and add_eos_token, causing these tokens to not be added even when add_special_tokens=True.

Fix

Added default values in DebertaV2Tokenizer.__init__():

  • self._add_bos_token = True
  • self._add_eos_token = True
  • self.update_post_processor() to apply the changes

Testing

  • Verified fix works with microsoft/mdeberta-v3-base and microsoft/deberta-v2-xlarge models
  • Added unit tests test_bos_token_with_add_bos_token_true and test_eos_token_with_add_eos_token_true

Results

Model                         Before      After
microsoft/mdeberta-v3-base    [124394]    [1, 124394, 2]
microsoft/deberta-v2-xlarge   [11496]     [1, 11496, 2]

Fixes #44568

Changed files

  • src/transformers/models/deberta_v2/tokenization_deberta_v2.py (modified, +7/-0)
  • tests/models/deberta_v2/test_tokenization_deberta_v2.py (modified, +36/-0)

Code Example

from transformers import AutoTokenizer
models = [
    "microsoft/mdeberta-v3-base",
    "FacebookAI/roberta-base", 
    "bert-base-uncased"
]
for model_name in models:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    result = tokenizer("hello", add_special_tokens=True)
    print(f"{model_name}: input_ids={result['input_ids']}")

---

Expected (v4.48.0 - Working correctly)
- microsoft/mdeberta-v3-base: [1, 124394, 2] (CLS, hello, SEP)
- FacebookAI/roberta-base: [0, 42891, 2] (<s>, hello, </s>)
- bert-base-uncased: [101, 7592, 102] (CLS, hello, SEP)
Actual (v5.2.0 - Broken for mdeberta only)
- microsoft/mdeberta-v3-base: [124394] ← NO special tokens!
- FacebookAI/roberta-base: [0, 42891, 2] ← Works
- bert-base-uncased: [101, 7592, 102] ← Works

System Info

Version Details

  • Working version: transformers==4.48.0
  • Broken versions: transformers==5.0.0, 5.1.0, 5.2.0, 5.3.0
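
A project that must run on several transformers versions could guard against the versions listed above. This is a minimal sketch: the helper name classify_version is hypothetical, and the version sets contain only what the report states, so any other version is deliberately reported as "unknown" rather than guessed.

```python
# Versions taken from the issue report; nothing else is assumed.
KNOWN_WORKING = {"4.48.0"}
KNOWN_BROKEN = {"5.0.0", "5.1.0", "5.2.0", "5.3.0"}


def classify_version(version: str) -> str:
    """Return 'working', 'broken', or 'unknown' for a transformers version string."""
    if version in KNOWN_BROKEN:
        return "broken"
    if version in KNOWN_WORKING:
        return "working"
    return "unknown"


print(classify_version("5.2.0"))   # broken
print(classify_version("4.48.0"))  # working
print(classify_version("4.40.0"))  # unknown
```

In practice the argument would be transformers.__version__, and an "unknown" result should be followed by an actual tokenization check.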

Environment

  • transformers: 5.2.0
  • tokenizers: 0.22.2
  • Python: 3.12
  • Platform: Linux

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Description

In transformers >=5.0, add_special_tokens=True doesn't add special tokens for microsoft/mdeberta-v3-base tokenizer. This is a regression from v4.x.

Reproduction

from transformers import AutoTokenizer
models = [
    "microsoft/mdeberta-v3-base",
    "FacebookAI/roberta-base", 
    "bert-base-uncased"
]
for model_name in models:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    result = tokenizer("hello", add_special_tokens=True)
    print(f"{model_name}: input_ids={result['input_ids']}")

Additional Notes

  • The issue is MODEL-SPECIFIC, not general to all tokenizers
  • Only microsoft/mdeberta-v3-base is affected
  • The tokenizer has correct bos_token_id=1 and eos_token_id=2 values
  • This appears to be related to DeBERTa v3's SentencePiece-based tokenizer and the v5 tokenizer redesign

Expected Behavior

The behavior should be consistent across v4.x and v5.x for backward compatibility.

Expected behavior

Expected (v4.48.0 - Working correctly)
- microsoft/mdeberta-v3-base: [1, 124394, 2] (CLS, hello, SEP)
- FacebookAI/roberta-base: [0, 42891, 2] (<s>, hello, </s>)
- bert-base-uncased: [101, 7592, 102] (CLS, hello, SEP)
Actual (v5.2.0 - Broken for mdeberta only)
- microsoft/mdeberta-v3-base: [124394] ← NO special tokens!
- FacebookAI/roberta-base: [0, 42891, 2] ← Works
- bert-base-uncased: [101, 7592, 102] ← Works
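
The comparison above can be turned into a small regression check. The sketch below hard-codes the IDs reported in this issue instead of loading any model; the function name find_regressions is hypothetical.

```python
# input_ids for "hello" as reported in the issue.
EXPECTED_V4 = {
    "microsoft/mdeberta-v3-base": [1, 124394, 2],
    "FacebookAI/roberta-base": [0, 42891, 2],
    "bert-base-uncased": [101, 7592, 102],
}
ACTUAL_V5 = {
    "microsoft/mdeberta-v3-base": [124394],
    "FacebookAI/roberta-base": [0, 42891, 2],
    "bert-base-uncased": [101, 7592, 102],
}


def find_regressions(expected, actual):
    """Return the sorted model names whose actual ids differ from the expected ids."""
    return sorted(name for name, ids in expected.items() if actual.get(name) != ids)


regressed = find_regressions(EXPECTED_V4, ACTUAL_V5)
print(regressed)  # ['microsoft/mdeberta-v3-base']
```

To run it against a live install, ACTUAL_V5 would be filled from the reproduction script's output.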

extent analysis

Fix Plan

Fix Name

Restore [CLS]/[SEP] special tokens for DebertaV2Tokenizer in transformers >=5.0

Fix Steps

1. Downgrade transformers to a working version (e.g., 4.48.0)

pip install transformers==4.48.0

2. Alternatively, if you must stay on an affected 5.x version, add the [CLS]/[SEP] IDs manually

from transformers import AutoTokenizer

# Only the DeBERTa v2/v3 tokenizer needs this workaround;
# the RoBERTa and BERT tokenizers already add their special tokens.
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")

# Encode without special tokens so they can never be added twice,
# then wrap the IDs with [CLS] and [SEP] ourselves.
ids = tokenizer("hello", add_special_tokens=False)["input_ids"]
ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]

# Note: attention_mask and token_type_ids, if used, must be extended to match.
print(f"microsoft/mdeberta-v3-base: input_ids={ids}")

3. (Optional) Wrap the manual workaround in a small helper class

from transformers import AutoTokenizer

class CustomMDEBertaTokenizer:
    """Thin wrapper that adds [CLS]/[SEP] IDs itself instead of relying on the post-processor."""

    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def encode(self, text):
        # Encode without special tokens, then wrap the IDs manually.
        ids = self.tokenizer(text, add_special_tokens=False)["input_ids"]
        return [self.tokenizer.cls_token_id] + ids + [self.tokenizer.sep_token_id]

# Usage
tokenizer = CustomMDEBertaTokenizer("microsoft/mdeberta-v3-base")
result = tokenizer.encode("hello")
print(result)

Verification

  • Run the reproduction script with the custom tokenizer class
  • Verify that the output for microsoft/mdeberta-v3-base is [1, 124394, 2], matching the v4.48.0 behavior
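
For code that has to behave the same on fixed and unfixed versions, the wrapping step can be made idempotent so special tokens are never duplicated. This is a minimal sketch for single sequences only; the helper name ensure_special_tokens is hypothetical, and the IDs 1 and 2 are the values reported for microsoft/mdeberta-v3-base.

```python
def ensure_special_tokens(input_ids, cls_id, sep_id):
    """Return a copy of input_ids that starts with cls_id and ends with sep_id."""
    ids = list(input_ids)
    if not ids or ids[0] != cls_id:
        ids.insert(0, cls_id)
    if len(ids) < 2 or ids[-1] != sep_id:
        ids.append(sep_id)
    return ids


# Broken v5 output gets wrapped; correct output is left untouched.
print(ensure_special_tokens([124394], cls_id=1, sep_id=2))        # [1, 124394, 2]
print(ensure_special_tokens([1, 124394, 2], cls_id=1, sep_id=2))  # [1, 124394, 2]
```

The check looks only at the first and last ID, so it does not handle sentence pairs, and any attention mask has to be extended separately.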
