transformers - ✅(Solved) Fix [BUG] add_special_tokens=True doesn't add BOS/EOS tokens for microsoft/mdeberta-v3-base tokenizer in transformers >=5.0 [2 pull requests, 1 participant]

Q: Expected behavior

```
Expected (v4.48.0 - Working correctly)
- microsoft/mdeberta-v3-base: [1, 124394, 2] (CLS, hello, SEP)
- FacebookAI/roberta-base: [0, 42891, 2] (<s>, hello, </s>)
- bert-base-uncased: [101, 7592, 102] (CLS, hello, SEP)
Actual (v5.2.0 - Broken for mdeberta only)
- microsoft/mdeberta-v3-base: [124394] ← NO special tokens!
- FacebookAI/roberta-base: [0, 42891, 2] ← Works
- bert-base-uncased: [101, 7592, 102] ← Works
```

transformers • 2026-03-10 11:43:59

GitHub stats

huggingface/transformers#44568 • Fetched 2026-04-08 00:27:36

Author: Abdullahaml1
Participants: Abdullahaml1
Timeline (top): cross-referenced ×2, mentioned ×2, subscribed ×2, closed ×1

In transformers >=5.0, add_special_tokens=True doesn't add special tokens for microsoft/mdeberta-v3-base tokenizer. This is a regression from v4.x.

Root Cause

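Per PR #44570: in transformers v5, DebertaV2Tokenizer was rewritten to use TokenizersBackend, but the post_processor that adds [CLS]/[SEP] was never set, so add_special_tokens=True silently returns ids without special tokens. In v4 the tokens came from the Python-level build_inputs_with_special_tokens(), which v5 removed. This is a regression from v4.x.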

Fix Action

Two fixes were proposed; per the PR notes below, neither is recorded as merged.

Proposed in PR (open): Fix missing post_processor in DebertaV2Tokenizer causing no special t… (https://github.com/huggingface/transformers/pull/44570)
Proposed in PR (closed, not merged): fix: Add BOS/EOS tokens by default for DeBERTa v2 tokenizer (https://github.com/huggingface/transformers/pull/44618)

PR fix notes

PR #44570: Fix missing post_processor in DebertaV2Tokenizer causing no special t…

Repository: huggingface/transformers
Author: umbilnm
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/44570

Description (problem / solution / changelog)

What does this PR do?

In transformers v5, DebertaV2Tokenizer was rewritten to use TokenizersBackend, but the post_processor responsible for adding [CLS]/[SEP] tokens was never set. This causes add_special_tokens=True to silently produce output without special tokens for models like microsoft/mdeberta-v3-base.

The root cause: in v4, special tokens were added via build_inputs_with_special_tokens() (Python-level). In v5, this method was removed in favor of the Rust-level post_processor on the tokenizers backend, but for DebertaV2Tokenizer this processor was never configured. Other tokenizers like BertTokenizer set it correctly.

The fix adds a TemplateProcessing post-processor after super().__init__(), matching the pattern used by BertTokenizer and the template previously defined in DebertaV2Converter.post_processor().
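
To make the effect of a missing post-processor concrete, here is a minimal sketch using the `tokenizers` library directly. It is not the PR's code: the toy WordLevel vocabulary, the ids and the exact template strings are assumptions chosen for illustration, and the real DeBERTa tokenizer uses a different model and vocabulary.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocabulary (assumption): ids 1 and 2 mirror the CLS/SEP ids in the report.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3, "hello": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Without a post-processor nothing is added, even though encode() defaults
# to add_special_tokens=True.
before = tok.encode("hello").ids

# Attach a "[CLS] A [SEP]" template, the kind of processor the PR describes.
tok.post_processor = TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
after = tok.encode("hello").ids

print(before)  # [4]
print(after)   # [1, 4, 2]
```

The same flag produces different output only because the backend has a template to apply, which matches the behaviour the issue reports for v5.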

Fixes #44568

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. #44568
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@ArthurZucker @itazap — tokenizers

Changed files

src/transformers/models/deberta_v2/tokenization_deberta_v2.py (modified, +13/-1)

PR #44618: fix: Add BOS/EOS tokens by default for DeBERTa v2 tokenizer

Repository: huggingface/transformers
Author: yunhaoli24
State: closed | merged: False
Link: https://github.com/huggingface/transformers/pull/44618

Description (problem / solution / changelog)

Summary

Set add_bos_token=True and add_eos_token=True by default in DebertaV2Tokenizer to fix the regression where add_special_tokens=True doesn't add BOS/EOS tokens for microsoft/mdeberta-v3-base tokenizer in transformers >=5.0.

Root Cause

In transformers v5, the tokenizer refactoring changed the default behavior for adding BOS/EOS tokens. The DebertaV2Tokenizer class was not setting the default values for add_bos_token and add_eos_token, causing these tokens to not be added even when add_special_tokens=True.

Fix

Added default values in DebertaV2Tokenizer.__init__():

- self._add_bos_token = True
- self._add_eos_token = True
- self.update_post_processor() to apply the changes
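
As a rough illustration of how two boolean flags like these could drive the post-processing template, here is a small sketch. `build_templates` is a hypothetical helper, not a transformers function and not the PR's implementation; the template syntax follows the `tokenizers` TemplateProcessing format, and the pair layout is an assumption.

```python
def build_templates(bos, eos, add_bos=True, add_eos=True):
    """Return (single, pair) template strings for the given flags.

    Hypothetical helper for illustration only.
    """
    prefix = f"{bos}:0 " if add_bos else ""
    suffix = f" {eos}:0" if add_eos else ""
    single = f"{prefix}$A:0{suffix}"
    # Assumed layout: the second segment uses type id 1 and only gets a
    # trailing separator.
    pair = f"{single} $B:1" + (f" {eos}:1" if add_eos else "")
    return single, pair


single, pair = build_templates("[CLS]", "[SEP]")
print(single)  # [CLS]:0 $A:0 [SEP]:0
print(pair)    # [CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1

# With both flags off the template adds nothing, which is the symptom
# described in the issue.
print(build_templates("[CLS]", "[SEP]", add_bos=False, add_eos=False)[0])  # $A:0
```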

Testing

Verified fix works with microsoft/mdeberta-v3-base and microsoft/deberta-v2-xlarge models
Added unit tests test_bos_token_with_add_bos_token_true and test_eos_token_with_add_eos_token_true

Results

Model	Before	After
`microsoft/mdeberta-v3-base`	`[124394]`	`[1, 124394, 2]`
`microsoft/deberta-v2-xlarge`	`[11496]`	`[1, 11496, 2]`

Fixes #44568

Changed files

src/transformers/models/deberta_v2/tokenization_deberta_v2.py (modified, +7/-0)
tests/models/deberta_v2/test_tokenization_deberta_v2.py (modified, +36/-0)


System Info

Version Details

Working version: transformers==4.48.0
Broken versions: transformers==5.0.0, 5.1.0, 5.2.0, 5.3.0
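
A quick way to tell whether an environment falls in the reported range is to check the installed version. The sketch below uses only the standard library; `is_affected` is a hypothetical helper, and treating every release from 5.0 onward as affected is an assumption based on the versions listed above (a later release may already contain a fix).

```python
from importlib import metadata


def is_affected(version: str) -> bool:
    """Return True for releases in the range the issue reports as broken."""
    major = int(version.split(".")[0])
    # Assumption: 5.0.0 through 5.3.0 are affected per the report; anything
    # newer is flagged too, so treat the result as a hint, not a guarantee.
    return major >= 5


def installed_transformers_version():
    """Return the installed transformers version, or None if it is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


version = installed_transformers_version()
if version is not None and is_affected(version):
    print(f"transformers {version} may drop [CLS]/[SEP] for DeBERTa v2/v3 tokenizers")
```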

Environment

transformers: 5.2.0
tokenizers: 0.22.2
Python: 3.12
Platform: Linux

Who can help?

@ArthurZucker and @itazap

Reproduction

Description

In transformers >=5.0, add_special_tokens=True doesn't add special tokens for microsoft/mdeberta-v3-base tokenizer. This is a regression from v4.x.

Reproduction

from transformers import AutoTokenizer

# Compare the affected tokenizer against two that still behave correctly
models = [
    "microsoft/mdeberta-v3-base",
    "FacebookAI/roberta-base",
    "bert-base-uncased",
]
for model_name in models:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # add_special_tokens=True should wrap the ids in CLS/SEP (or <s>/</s>)
    result = tokenizer("hello", add_special_tokens=True)
    print(f"{model_name}: input_ids={result['input_ids']}")

Additional Notes

The issue is MODEL-SPECIFIC, not general to all tokenizers
Only microsoft/mdeberta-v3-base is affected
The tokenizer has correct bos_token_id=1 and eos_token_id=2 values
This appears to be related to DeBERTa v3's SentencePiece-based tokenizer and the v5 tokenizer redesign

Expected Behavior

The behavior should be consistent across v4.x and v5.x for backward compatibility.

Expected behavior

Expected (v4.48.0 - Working correctly)
- microsoft/mdeberta-v3-base: [1, 124394, 2] (CLS, hello, SEP)
- FacebookAI/roberta-base: [0, 42891, 2] (<s>, hello, </s>)
- bert-base-uncased: [101, 7592, 102] (CLS, hello, SEP)
Actual (v5.2.0 - Broken for mdeberta only)
- microsoft/mdeberta-v3-base: [124394] ← NO special tokens!
- FacebookAI/roberta-base: [0, 42891, 2] ← Works
- bert-base-uncased: [101, 7592, 102] ← Works
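
The expected and actual outputs above can be compared mechanically. This sketch hard-codes the ids from the report, so it needs no model download; `find_regressions` is a hypothetical helper name.

```python
# Outputs reported for transformers 4.48.0
EXPECTED = {
    "microsoft/mdeberta-v3-base": [1, 124394, 2],
    "FacebookAI/roberta-base": [0, 42891, 2],
    "bert-base-uncased": [101, 7592, 102],
}

# Outputs reported for transformers 5.2.0
ACTUAL_V5 = {
    "microsoft/mdeberta-v3-base": [124394],
    "FacebookAI/roberta-base": [0, 42891, 2],
    "bert-base-uncased": [101, 7592, 102],
}


def find_regressions(expected, actual):
    """Return the model names whose ids differ from the expected ids."""
    return sorted(name for name, ids in expected.items() if actual.get(name) != ids)


print(find_regressions(EXPECTED, ACTUAL_V5))  # ['microsoft/mdeberta-v3-base']
```

To check a live environment, `ACTUAL_V5` could be replaced with the ids printed by the reproduction script.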

extent analysis

Fix Plan

Fix Name

Restore [CLS]/[SEP] special tokens for DebertaV2Tokenizer (e.g. microsoft/mdeberta-v3-base) in transformers >=5.0

Fix Steps

1. Downgrade transformers to a working version (e.g., 4.48.0)

pip install transformers==4.48.0

2. If you cannot downgrade, add the special token ids manually

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
# tokenizer.add_special_tokens() only registers named tokens; it does not
# restore the missing post-processing, so wrap the ids by hand instead.
# Encode without special tokens so the result is the same on v4 and v5.
ids = tokenizer("hello", add_special_tokens=False)["input_ids"]
# [CLS] tokens [SEP], using the ids the tokenizer already knows
input_ids = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
# Note: rebuild attention_mask / token_type_ids to the new length if needed
print(f"input_ids={input_ids}")

3. (Optional) Create a wrapper class that adds the tokens only when they are missing

from transformers import AutoTokenizer

class CustomMDEBertaTokenizer:
    """Wraps a tokenizer and adds CLS/SEP ids when the backend omits them."""

    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def encode(self, text):
        ids = self.tokenizer(text, add_special_tokens=True)["input_ids"]
        cls_id = self.tokenizer.cls_token_id
        sep_id = self.tokenizer.sep_token_id
        # Wrap only if the tokenizer did not already add the tokens,
        # so the class also behaves correctly on v4 or a fixed v5 release
        if not ids or ids[0] != cls_id:
            ids = [cls_id] + ids
        if ids[-1] != sep_id:
            ids = ids + [sep_id]
        return ids

# Usage
tokenizer = CustomMDEBertaTokenizer("microsoft/mdeberta-v3-base")
result = tokenizer.encode("hello")
print(result)

Verification

Run the reproduction script, or the custom tokenizer class from step 3, against microsoft/mdeberta-v3-base
Verify that "hello" encodes to [1, 124394, 2], matching the v4.48.0 output
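
The verification can also be expressed as a structural check that does not depend on the exact token id of "hello". `is_wrapped` is a hypothetical helper; the ids 1 and 2 are the bos_token_id and eos_token_id values quoted in the issue.

```python
def is_wrapped(input_ids, cls_id, sep_id):
    """True when the sequence starts with cls_id and ends with sep_id."""
    return (
        len(input_ids) >= 2
        and input_ids[0] == cls_id
        and input_ids[-1] == sep_id
    )


# Ids reported for microsoft/mdeberta-v3-base
print(is_wrapped([1, 124394, 2], cls_id=1, sep_id=2))  # True  (v4.48.0 output)
print(is_wrapped([124394], cls_id=1, sep_id=2))        # False (v5.2.0 output)
```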
