transformers - 💡(How to fix) Fix "rmihaylov/bert-base-bg" model has pad and unk tokens outside the tokenizer vocab_size [5 comments, 3 participants]

transformers 2026-03-02 21:49:00


huggingface/transformers#44402•Fetched 2026-04-08 00:28:45



Fix Action

Fix / Workaround

Patching the tokenizer myself by setting pad_token and unk_token to IDs in the expected range makes the processing work, but that solution is extremely hacky.

Ideally the model itself would be patched with correct PAD and UNK indices / embeddings, the tokenizers / transformers packages could compensate for this error, or maybe the model just needs to be deprecated or removed until it is properly usable.



System Info

python: 3.13.5

torch: 2.7.1+cu118

transformers: 5.2.0

tokenizers: 0.22.2

Who can help?

@ArthurZucker @Cyrilvallez

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

This is using the Stanza (1.11.1) integration with transformers. Any task using the "rmihaylov/bert-base-bg" or "rmihaylov/bert-base-theseus-bg" models, such as training on a BG Universal Dependencies dataset, fails with the following error.

When tokenizing multiple sentences of BG text using the model "rmihaylov/bert-base-bg", the <pad> and <unk> tokens have IDs (119547 and 119548) that fall outside the expected dimensions of the embedding, since the tokenizer reports vocab_size=119547. This results in a crash when the transformer embeds the input text at the bottom layer.
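The same kind of failure can be reproduced in isolation on CPU, where PyTorch raises a readable IndexError instead of a device-side assert. The following is a minimal sketch using a toy embedding table; the sizes are illustrative stand-ins, not the model's real dimensions.

```python
import torch
from torch import nn

# Toy stand-in: a table with 8 rows accepts IDs 0..7 only.
# (The real tokenizer reports vocab_size=119547 and pads with ID 119547.)
num_rows = 8
embedding = nn.Embedding(num_rows, 4)

in_range = torch.tensor([2, 5, 3])
out_of_range = torch.tensor([2, 5, 3, num_rows])  # last ID equals num_rows

# In-range IDs embed normally.
print(embedding(in_range).shape)

try:
    embedding(out_of_range)
    lookup_failed = False
except IndexError as err:
    # On CPU the lookup fails with an IndexError; on CUDA the issue
    # reports it as "device-side assert triggered".
    lookup_failed = True
    print(f"lookup failed: {err}")
```

Running the failing batch on CPU first is often the quickest way to confirm that an out-of-range ID, rather than a CUDA problem, is the cause.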

Result of loading the tokenizer:

TokenizersBackend(name_or_path='rmihaylov/bert-base-bg', vocab_size=119547, model_max_length=512, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '<unk>', 'sep_token': '[SEP]', 'pad_token': '<pad>', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, added_tokens_decoder={
        2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        119547: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        119548: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Example token IDs and attention mask from this model after processing multiple lines at once. Note that the padded tokens (attention mask 0) have ID 119547.

tensor([     2,     57,  38910,     86,   1525,     36,    243,   2474,    518,
           392,     19,      8,  38020,     21,   3312, 100681,     32,    985,
            27,    153,  23214,    402,    701,   4275,   1448,     19,     20,
            51,    207,   2995,     21,  11080,     40,   1506,     25,   1710,
            19,     20,   7970,     21,   3061,     19,      7,  37849,   1144,
            22,   3738,     19,      7,     25,   5558,  25814,   3230,     62,
          2824,  39769,  33752,     19,     20,     51,  10517,     26,  16506,
            31,   1506,    304,   6080,     19,     20,  32034,     21,  84886,
            27,    153,   2702,   8380,     19,     20,     51,     24,   2663,
            25,    316,     22,    877,   3438,    443,     19,      9,      3,
        119547, 119547, 119547, 119547, 119547], device='cuda:0')
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       device='cuda:0')

This leads to an exception / assertion:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [79,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
  File ".../miniconda3/lib/python3.13/site-packages/torch/nn/modules/sparse.py", line 190, in forward
    return F.embedding(
           ~~~~~~~~~~~^
        input,
        ^^^^^^
    ...<5 lines>...
        self.sparse,
        ^^^^^^^^^^^^
    )
    ^
  File ".../miniconda3/lib/python3.13/site-packages/torch/nn/functional.py", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered

Patching the tokenizer myself by setting pad_token and unk_token to IDs in the expected range makes the processing work, but that solution is extremely hacky.
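A related stopgap that leaves the tokenizer untouched is to remap out-of-range IDs in the batch just before it reaches the model. This is only a sketch: the helper name `remap_out_of_range` and the choice of `fallback_id=0` are hypothetical, and the embedding size of 119547 is assumed to equal the reported vocab_size.

```python
import torch

def remap_out_of_range(input_ids: torch.Tensor, num_embeddings: int, fallback_id: int) -> torch.Tensor:
    """Return a copy of input_ids with every ID >= num_embeddings replaced."""
    if not 0 <= fallback_id < num_embeddings:
        raise ValueError("fallback_id must itself be a valid embedding row")
    replacement = torch.full_like(input_ids, fallback_id)
    return torch.where(input_ids >= num_embeddings, replacement, input_ids)

# A shortened version of the padded sequence shown in the issue.
ids = torch.tensor([2, 57, 38910, 3, 119547, 119547])

# fallback_id=0 is an arbitrary placeholder, not a known pad ID for this model.
fixed = remap_out_of_range(ids, num_embeddings=119547, fallback_id=0)
print(fixed.tolist())
```

Padded positions carry attention mask 0, so the row chosen for them should not influence the unmasked tokens. Remapping <unk> this way is lossy, because unknown words then share an embedding with whatever token owns the fallback ID.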

Expected behavior

extent analysis

Fix Plan

Patch the Tokenizer

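One workaround is to point the tokenizer's pad and unk tokens at IDs that the embedding matrix actually contains. This reuses existing vocabulary entries for padding and unknown words, so treat it as a stopgap rather than a real fix.

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("rmihaylov/bert-base-bg")

# Assuming the embedding matrix has vocab_size rows, valid IDs are
# 0 .. vocab_size - 1; <pad> and <unk> sit at vocab_size and vocab_size + 1.
vocab_size = tokenizer.vocab_size

# Pick two in-range IDs to stand in for pad and unk (an arbitrary choice)
# and look up the token strings stored at those IDs.
pad_standin = tokenizer.convert_ids_to_tokens(vocab_size - 1)
unk_standin = tokenizer.convert_ids_to_tokens(vocab_size - 2)

# Reassign the special tokens; pad_token_id and unk_token_id follow from
# these strings. This may not change the ID the backend tokenizer emits
# for unknown words, so check the resulting input IDs before relying on it.
tokenizer.pad_token = pad_standin
tokenizer.unk_token = unk_standin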
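Whichever patch is applied, it helps to confirm afterwards that no special token still points past the end of the embedding matrix. The check below is a sketch over plain Python values copied from the tokenizer dump in the issue; the function name is hypothetical, and `num_embeddings=119547` assumes the embedding matrix has exactly vocab_size rows.

```python
def special_tokens_out_of_range(special_ids: dict, num_embeddings: int) -> dict:
    """Return the special tokens whose ID is not a valid embedding row."""
    return {
        token: token_id
        for token, token_id in special_ids.items()
        if not 0 <= token_id < num_embeddings
    }

# Special-token IDs as printed in the tokenizer repr from the issue.
special_ids = {"[CLS]": 2, "[SEP]": 3, "[MASK]": 4, "<pad>": 119547, "<unk>": 119548}

bad = special_tokens_out_of_range(special_ids, num_embeddings=119547)
print(bad)
```

With live objects, the mapping could likely be built with `dict(zip(tokenizer.all_special_tokens, tokenizer.all_special_ids))` and the row count read from `model.get_input_embeddings().num_embeddings`. An empty result means every special token has a row.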

Update the Model Embeddings

Alternatively, we can grow the model's embedding matrix so that the <pad> and <unk> IDs have rows of their own. The new rows are untrained, so <unk> carries no learned meaning unless the model is fine-tuned afterwards.

import torch
from transformers import AutoTokenizer, BertModel

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("rmihaylov/bert-base-bg")
model = BertModel.from_pretrained("rmihaylov/bert-base-bg")

# Resize the word embeddings to cover every tokenizer ID, including the
# added <pad> and <unk> tokens (a no-op if the sizes already match)
model.resize_token_embeddings(len(tokenizer))

# Get the word-embedding layer (model.embeddings itself has no .weight)
embedding_layer = model.get_input_embeddings()

# Zero the pad and unk rows; in-place edits of a parameter need no_grad
with torch.no_grad():
    embedding_layer.weight[tokenizer.pad_token_id].zero_()
    embedding_layer.weight[tokenizer.unk_token_id].zero_()
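To make the idea of extending the embedding matrix concrete, here is a PyTorch-only sketch of growing a table while keeping the trained rows. The helper `grow_embedding` is hypothetical and simplified (it ignores `padding_idx`, for example); it illustrates the technique and is not how transformers implements resizing.

```python
import torch
from torch import nn

def grow_embedding(old: nn.Embedding, new_num_rows: int) -> nn.Embedding:
    """Return a larger embedding whose first rows are copied from `old`."""
    if new_num_rows < old.num_embeddings:
        raise ValueError("new_num_rows must not be smaller than the old table")
    new = nn.Embedding(
        new_num_rows,
        old.embedding_dim,
        device=old.weight.device,
        dtype=old.weight.dtype,
    )
    with torch.no_grad():
        new.weight.zero_()                          # added rows start at zero
        new.weight[: old.num_embeddings] = old.weight  # keep trained rows
    return new

# Toy sizes: 5 trained rows plus 2 extra rows for tokens like <pad> and <unk>.
old = nn.Embedding(5, 3)
new = grow_embedding(old, 7)
print(new.weight.shape)
```

After growing, IDs 5 and 6 in this toy example look up valid (zero) rows instead of raising, which mirrors what IDs 119547 and 119548 would need in the real model.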

Update the Tokenizers and Transformers Packages

We can also update the tokenizers and transformers packages to compensate for this error. However, this would require a more significant change to the packages themselves.
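As an illustration of what a library-side safeguard might look like, the sketch below attaches a forward pre-hook to an embedding layer that rejects out-of-range IDs with a readable error before the lookup runs. This is an assumption about one possible design, not something the tokenizers or transformers packages are known to do; the hook name and toy sizes are hypothetical.

```python
import torch
from torch import nn

def check_ids_hook(module: nn.Embedding, args: tuple) -> None:
    """Raise a clear error if any input ID has no row in the embedding."""
    input_ids = args[0]
    if input_ids.numel() == 0:
        return
    # Reading min/max forces a device sync on CUDA, so this costs some speed.
    min_id = int(input_ids.min())
    max_id = int(input_ids.max())
    if min_id < 0 or max_id >= module.num_embeddings:
        raise ValueError(
            f"input IDs span [{min_id}, {max_id}] but the embedding only has "
            f"rows 0..{module.num_embeddings - 1}"
        )

# Toy table with 8 rows; the hook runs before every forward call.
embedding = nn.Embedding(8, 4)
handle = embedding.register_forward_pre_hook(check_ids_hook)

print(embedding(torch.tensor([0, 7])).shape)  # valid IDs pass through
```

Calling `handle.remove()` detaches the check again. On a real model the hook could be registered on the module returned by `model.get_input_embeddings()`.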

Deprecate or Remove the Model

As a last resort, we can deprecate or remove the model until it is properly usable.
