transformers - 💡(How to fix) Fix "rmihaylov/bert-base-bg" model has pad and unk tokens outside the tokenizer vocab_size [5 comments, 3 participants]

GitHub stats
huggingface/transformers#44402 (fetched 2026-04-08 00:28:45)
Comments: 5 · Participants: 3 · Timeline: 17 · Reactions: 0
Timeline (top): commented ×5, mentioned ×5, subscribed ×5, closed ×1

System Info

python: 3.13.5

torch: 2.7.1+cu118

transformers: 5.2.0

tokenizers: 0.22.2

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This is using the Stanza (1.11.1) integration with transformers. Any task using the "rmihaylov/bert-base-bg" or "rmihaylov/bert-base-theseus-bg" models, such as training on a BG Universal Dependencies dataset, fails with the following error.

When tokenizing multiple sentences of BG text using the model "rmihaylov/bert-base-bg", the <pad> and <unk> tokens are outside the expected dimensions of the embedding: their IDs (119547 and 119548) are at or beyond the tokenizer's vocab_size of 119547. This results in a crash when the transformer embeds the input text at the bottom layer.

Result of loading the tokenizer:

TokenizersBackend(name_or_path='rmihaylov/bert-base-bg', vocab_size=119547, model_max_length=512, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '<unk>', 'sep_token': '[SEP]', 'pad_token': '<pad>', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, added_tokens_decoder={
        2: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        4: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        119547: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        119548: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
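
The mismatch can be read directly off the tokenizer output above. The following is a minimal sketch in plain Python that uses only the IDs printed there. It assumes the embedding matrix has exactly `vocab_size` rows, which the issue implies but does not state, and `find_out_of_range` is a hypothetical helper name, not part of transformers.

```python
# Added tokens and vocab_size copied from the tokenizer repr in the issue.
ADDED_TOKENS = {
    2: "[CLS]",
    3: "[SEP]",
    4: "[MASK]",
    119547: "<pad>",
    119548: "<unk>",
}
VOCAB_SIZE = 119547


def find_out_of_range(added_tokens, num_embeddings):
    """Return the added tokens whose ID is not a valid embedding row index.

    Valid row indices are 0 .. num_embeddings - 1. Assumption: the model's
    embedding matrix has as many rows as the tokenizer's vocab_size.
    """
    return {
        token_id: token
        for token_id, token in added_tokens.items()
        if not 0 <= token_id < num_embeddings
    }


bad_tokens = find_out_of_range(ADDED_TOKENS, VOCAB_SIZE)
for token_id, token in sorted(bad_tokens.items()):
    print(f"{token!r} has ID {token_id}, but valid IDs end at {VOCAB_SIZE - 1}")
```

With a loaded tokenizer and model, the same comparison could be made between the tokenizer's special token IDs and the number of rows in the model's input embedding.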

Example token IDs and attention mask from this model after processing multiple lines at once. Note that the padded positions (attention mask 0) have ID 119547, the out-of-range <pad> ID.

tensor([     2,     57,  38910,     86,   1525,     36,    243,   2474,    518,
           392,     19,      8,  38020,     21,   3312, 100681,     32,    985,
            27,    153,  23214,    402,    701,   4275,   1448,     19,     20,
            51,    207,   2995,     21,  11080,     40,   1506,     25,   1710,
            19,     20,   7970,     21,   3061,     19,      7,  37849,   1144,
            22,   3738,     19,      7,     25,   5558,  25814,   3230,     62,
          2824,  39769,  33752,     19,     20,     51,  10517,     26,  16506,
            31,   1506,    304,   6080,     19,     20,  32034,     21,  84886,
            27,    153,   2702,   8380,     19,     20,     51,     24,   2663,
            25,    316,     22,    877,   3438,    443,     19,      9,      3,
        119547, 119547, 119547, 119547, 119547], device='cuda:0')
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       device='cuda:0')

This leads to an exception / assertion:

/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1553: indexSelectLargeIndex: block: [79,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
  File ".../miniconda3/lib/python3.13/site-packages/torch/nn/modules/sparse.py", line 190, in forward
    return F.embedding(
           ~~~~~~~~~~~^
        input,
        ^^^^^^
    ...<5 lines>...
        self.sparse,
        ^^^^^^^^^^^^
    )
    ^
  File ".../miniconda3/lib/python3.13/site-packages/torch/nn/functional.py", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
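
Because the device-side assert only reports the failed bounds check, it can help to validate token IDs before they reach the embedding layer. The following is a hedged sketch in plain Python. `check_input_ids` is a hypothetical helper, the sample IDs are the tail of the tensor shown above, and the row count of 119547 is an assumption taken from the tokenizer's `vocab_size`.

```python
def check_input_ids(batch, num_embeddings):
    """Raise a readable error if any token ID is not a valid embedding row.

    `batch` is a list of sequences of ints (for a 2-D tensor,
    `tensor.tolist()` gives this shape). Valid IDs are
    0 .. num_embeddings - 1.
    """
    for row, ids in enumerate(batch):
        for col, token_id in enumerate(ids):
            if not 0 <= token_id < num_embeddings:
                raise ValueError(
                    f"token ID {token_id} at position ({row}, {col}) is outside "
                    f"the embedding range [0, {num_embeddings})"
                )


# Tail of the padded sequence from the issue: ..., 9, [SEP]=3, then <pad>=119547.
sample_batch = [[443, 19, 9, 3, 119547, 119547]]

try:
    check_input_ids(sample_batch, num_embeddings=119547)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

A check like this fails on the host with the offending ID and position, instead of surfacing later as an opaque CUDA error.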

Patching the tokenizer myself by setting the pad_token and unk_token to some ID in the expected range makes the processing work, but that solution is extremely hacky.
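
The same idea can also be applied after tokenization instead of inside the tokenizer. The sketch below is not the author's patch: it is a plain Python illustration in which `remap_padding` is a hypothetical helper, `fallback_id=0` is an arbitrary in-range choice, and the row count of 119547 is assumed from the tokenizer's `vocab_size`.

```python
def remap_padding(input_ids, attention_mask, num_embeddings, fallback_id=0):
    """Replace out-of-range IDs at masked positions with an in-range ID.

    Positions with attention mask 0 are padding, so the substituted ID only
    has to be a valid embedding row. An out-of-range ID at an attended
    position (for example <unk>) is real input, so it is reported instead
    of being replaced silently.
    """
    if not 0 <= fallback_id < num_embeddings:
        raise ValueError("fallback_id must itself be a valid embedding row")
    remapped = []
    for token_id, mask in zip(input_ids, attention_mask):
        if 0 <= token_id < num_embeddings:
            remapped.append(token_id)
        elif mask == 0:
            remapped.append(fallback_id)
        else:
            raise ValueError(f"attended token ID {token_id} is out of range")
    return remapped


# Tail of the sequence and attention mask shown in the issue.
ids = [443, 19, 9, 3, 119547, 119547]
mask = [1, 1, 1, 1, 0, 0]
patched = remap_padding(ids, mask, num_embeddings=119547)
print(patched)
```

This only covers padding. An out-of-range <unk> at an attended position still needs a real fix in the tokenizer or the model, which is why the helper raises rather than guessing a replacement.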

Expected behavior

Ideally the model itself would be patched with correct PAD and UNK indices / embeddings, the tokenizers / transformers packages could compensate for this error, or maybe the model just needs to be deprecated or removed until it is properly usable.

extent analysis

Fix Plan

Patch the Tokenizer

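One workaround is to patch the tokenizer so that the pad and unk tokens map to IDs inside the model's vocabulary. This is the approach the issue author describes as hacky: it keeps the model from crashing, but the reused tokens may not have been trained as padding or unknown-word markers.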

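from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("rmihaylov/bert-base-bg")

# The tokenizer reports vocab_size=119547, but <pad> and <unk> were added at
# IDs 119547 and 119548, i.e. outside the range [0, vocab_size).
vocab_size = tokenizer.vocab_size

# Workaround: reuse in-range tokens as pad/unk. IDs 0 and 1 are an assumption;
# check which tokens they map to before relying on this.
tokenizer.pad_token = tokenizer.convert_ids_to_tokens(0)
tokenizer.unk_token = tokenizer.convert_ids_to_tokens(1)

# Note: this changes the Python-side special-token mapping; the backend
# tokenizer model may still emit its own unknown token.

# Confirm that both IDs now fall inside the embedding range
assert tokenizer.pad_token_id < vocab_size
assert tokenizer.unk_token_id < vocab_size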

Update the Model Embeddings

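Alternatively, we can resize the model's word embedding matrix so that it has a row for every token ID the tokenizer can produce, including <pad> and <unk>. The added rows are newly initialized rather than trained, so fine-tuning may be needed before they carry useful information.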

from transformers import AutoTokenizer, BertModel

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("rmihaylov/bert-base-bg")
model = BertModel.from_pretrained("rmihaylov/bert-base-bg")

# Get the word embedding layer and its current number of rows
embedding_layer = model.get_input_embeddings()
num_rows = embedding_layer.weight.shape[0]

# len(tokenizer) counts added tokens such as <pad> and <unk>;
# grow the embedding matrix if any token ID would fall outside it
if len(tokenizer) > num_rows:
    model.resize_token_embeddings(len(tokenizer))

Update the Tokenizers and Transformers Packages

We can also update the tokenizers and transformers packages to compensate for this error. However, this would require a more significant change to the packages themselves.

Deprecate or Remove the Model

As a last resort, we can deprecate or remove the model until it is properly usable.
