→ `encoding["labels"]` should return a list in which subword tokens are masked with the [default ignore_index](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) (`-100`) in `nn.CrossEntropyLoss` → `encoding["input_ids"].shape` should return the expected `torch.Size()`.

transformers - ✅(Solved) Fix [BUG] LayoutLMv2Tokenizer crashes on NER inputs and batched padding/truncation [2 pull requests, 1 comment, 2 participants]

transformers • 2026-02-20 19:58:01

GitHub stats

huggingface/transformers#44186 • Fetched 2026-04-08 00:29:58

Author

harshaljanjani

Participants

harshaljanjani

nightcityblade

Timeline (top)

cross-referenced ×2, mentioned ×2, subscribed ×2, closed ×1

Error Message

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
words = ["Total", "Amount", ":", "$1,234.56"]
boxes = [[100, 200, 300, 250], [310, 200, 450, 250], [460, 200, 480, 250], [490, 200, 650, 250]]
word_labels = [0, 0, 0, 1]

try:
    encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
    print(encoding["labels"])
except Exception as e:
    print(e)

Fix Action

Fixed

Fixed by PR: fix(models): Fix LayoutLMv2 NER crash and broken batched truncation/padding (https://github.com/huggingface/transformers/pull/44187)
Fixed by PR: fix(layoutlmv2): store only_label_first_subword attribute in tokenizer (https://github.com/huggingface/transformers/pull/44204)

PR fix notes

PR #44187: fix(models): Fix LayoutLMv2 NER crash and broken batched truncation/padding

Repository: huggingface/transformers
Author: harshaljanjani
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/44187

Description (problem / solution / changelog)

What does this PR do?

The following issues were identified and fixed in this PR:

→ The NER/token classification issue and the downstream bug uncovered in the batched preprocessing use case with LayoutLMv2Tokenizer.

→ Reasoning: The NER use case makes it apparent that the error is hit at this line in LayoutLMv2Tokenizer, which is missing self.only_label_first_subword. PretrainedTokenizerBase doesn't create self.only_label_first_subword either, so this must be added along with the other custom attributes, and this directly resolved the NER use case as shown in the screenshot.

→ For the second fix: any padding="max_length" or truncation=True call without an explicit max_length arg compares self.model_max_length > LARGE_INTEGER (1e20), which in this case evaluates to True (since model_max_length falls back to VERY_LARGE_INTEGER), and both get translated into no-ops. Sequences in the batch are then of different lengths and can't be tensorized, and the misleading ValueError tells the user to add padding=True and truncation=True, but they already did? The fix restores model_max_length=512. I confirmed both the base and large model configs have max_position_embeddings=512, so 512 as a default is correct, and followed the same pattern as MarianTokenizer and TapasTokenizer :)
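The fallback described in the second fix can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: `resolve_strategies` is a hypothetical helper written for illustration, not the library's function, and the two constants simply mirror the values named in the PR description.

```python
# Simplified model of the max_length fallback; not the transformers source.
LARGE_INTEGER = int(1e20)
VERY_LARGE_INTEGER = int(1e30)


def resolve_strategies(padding, truncation, max_length, model_max_length):
    """Return (effective_padding, effective_truncation, effective_max_length)."""
    if max_length is None:
        if model_max_length > LARGE_INTEGER:
            # No usable limit is known, so both requests silently become no-ops.
            if padding == "max_length":
                padding = False
            truncation = False
        else:
            # A real limit exists, so it is used as the target length.
            max_length = model_max_length
    return padding, truncation, max_length


# Unpatched tokenizer: model_max_length fell back to VERY_LARGE_INTEGER.
broken = resolve_strategies("max_length", True, None, VERY_LARGE_INTEGER)
# Patched tokenizer: model_max_length restored to 512.
fixed = resolve_strategies("max_length", True, None, 512)
# An explicit max_length sidesteps the fallback either way.
explicit = resolve_strategies("max_length", True, 512, VERY_LARGE_INTEGER)

print(broken)
print(fixed)
print(explicit)
```

In this toy model the unpatched case ends up with padding and truncation both disabled, which is why the batch stays ragged even though the caller asked for both.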

Originally removed in #42894, just wanted to double-check if this was intentional; happy to adjust this fix if I’ve missed something :)

Fixes #44186.

Before both fixes applied:

Attribute fix resolves NER; batched use case still fails (feel free to cross-check; the errors are reproducible):

After both fixes applied; NER + batched use case work (feel free to cross-check):

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you fix any necessary existing tests?

Changed files

src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py (modified, +3/-0)

PR #44204: fix(layoutlmv2): store only_label_first_subword attribute in tokenizer

Repository: huggingface/transformers
Author: nightcityblade
State: closed | merged: False
Link: https://github.com/huggingface/transformers/pull/44204

Description (problem / solution / changelog)

What does this PR do?

Fixes #44186

LayoutLMv2Tokenizer.__init__ passes only_label_first_subword to super().__init__() but never stores it as self.only_label_first_subword. This causes an AttributeError when word_labels is passed for NER token classification tasks, since _batch_encode_plus references self.only_label_first_subword at line 661.

The fix adds self.only_label_first_subword = only_label_first_subword to __init__, matching the pattern used by both LayoutXLMTokenizer and UdopTokenizer.

One-line change

self.only_label_first_subword = only_label_first_subword
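The failure mode behind this one-line change can be reproduced without the library. The classes below are hypothetical stand-ins (they are not LayoutLMv2Tokenizer or its base class); the sketch only assumes that the parent constructor accepts the keyword without assigning it on the instance.

```python
class BaseTokenizer:
    """Hypothetical base class: accepts extra kwargs but does not set them as attributes."""

    def __init__(self, **kwargs):
        self.init_kwargs = dict(kwargs)


class BrokenTokenizer(BaseTokenizer):
    def __init__(self, only_label_first_subword=True, **kwargs):
        # Forwarded to the parent, but never assigned on self.
        super().__init__(only_label_first_subword=only_label_first_subword, **kwargs)

    def encode_labels(self):
        # Reads the attribute the way the encoding path does.
        return self.only_label_first_subword


class FixedTokenizer(BrokenTokenizer):
    def __init__(self, only_label_first_subword=True, **kwargs):
        super().__init__(only_label_first_subword=only_label_first_subword, **kwargs)
        # The one-line fix: keep the value as an instance attribute.
        self.only_label_first_subword = only_label_first_subword


try:
    BrokenTokenizer().encode_labels()
    broken_error = None
except AttributeError as err:
    broken_error = str(err)

fixed_value = FixedTokenizer().encode_labels()
print(broken_error)
print(fixed_value)
```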

Before this PR

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
tokenizer(words, boxes=boxes, word_labels=word_labels)
# AttributeError: 'LayoutLMv2Tokenizer' object has no attribute 'only_label_first_subword'

After this PR

NER tokenization with word_labels works as expected, producing correct label alignment with subword masking.
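What "label alignment with subword masking" means can be sketched in a few lines. This is an illustration, not the tokenizer's code: `align_labels` is a hypothetical helper, the subword counts are made up rather than real tokenizer output, and masking the two special-token positions with -100 is an assumption of the sketch.

```python
IGNORE_INDEX = -100  # default ignore_index of nn.CrossEntropyLoss


def align_labels(word_labels, subwords_per_word, only_label_first_subword=True):
    """Expand word-level labels to token-level labels."""
    token_labels = []
    for label, n_subwords in zip(word_labels, subwords_per_word):
        if only_label_first_subword:
            # First subword keeps the label, the rest are masked out of the loss.
            token_labels.extend([label] + [IGNORE_INDEX] * (n_subwords - 1))
        else:
            # Every subword inherits the word's label.
            token_labels.extend([label] * n_subwords)
    # Assumption: one special token at each end, both ignored by the loss.
    return [IGNORE_INDEX] + token_labels + [IGNORE_INDEX]


# Four words, the last one split into three subwords (made-up split).
word_labels = [0, 0, 0, 1]
subwords_per_word = [1, 1, 1, 3]
aligned = align_labels(word_labels, subwords_per_word)
print(aligned)
```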

Changed files

src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py (modified, +1/-0)

Code Example

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
words = ["Total", "Amount", ":", "$1,234.56"]
boxes = [[100, 200, 300, 250], [310, 200, 450, 250], [460, 200, 480, 250], [490, 200, 650, 250]]
word_labels = [0, 0, 0, 1]

try:
    encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
    print(encoding["labels"])
except Exception as e:
    print(e)

---

from transformers import LayoutLMv2Processor
from datasets import load_dataset
import textwrap

try:
    processor = LayoutLMv2Processor.from_pretrained(
        "microsoft/layoutlmv2-base-uncased",
        apply_ocr=False
    )
    dataset = load_dataset("nielsr/funsd", split="train")
    images = [img.convert("RGB") for img in dataset["image"]]
    words = list(dataset["words"])
    boxes = list(dataset["bboxes"])
    word_labels = list(dataset["ner_tags"])
    encoding = processor(
        images,
        words,
        boxes=boxes,
        word_labels=word_labels,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print(encoding["input_ids"].shape)
except Exception as e:
    print("\n".join(textwrap.wrap(str(e), width=160)))

System Info

transformers version: 5.0.0.dev0
Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python version: 3.12.3
huggingface_hub version: 1.3.2
safetensors version: 0.7.0
accelerate version: 1.12.0
Accelerate config: not installed
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
GPU type: NVIDIA L4
NVIDIA driver version: 550.90.07
CUDA version: 12.4

Who can help?

@zucchini-nlp (multimodal model) @ArthurZucker (tokenizer)

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

NER use case:

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
words = ["Total", "Amount", ":", "$1,234.56"]
boxes = [[100, 200, 300, 250], [310, 200, 450, 250], [460, 200, 480, 250], [490, 200, 650, 250]]
word_labels = [0, 0, 0, 1]

try:
    encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
    print(encoding["labels"])
except Exception as e:
    print(e)

Batched training data prep with truncation/padding:

from transformers import LayoutLMv2Processor
from datasets import load_dataset
import textwrap

try:
    processor = LayoutLMv2Processor.from_pretrained(
        "microsoft/layoutlmv2-base-uncased",
        apply_ocr=False
    )
    dataset = load_dataset("nielsr/funsd", split="train")
    images = [img.convert("RGB") for img in dataset["image"]]
    words = list(dataset["words"])
    boxes = list(dataset["bboxes"])
    word_labels = list(dataset["ner_tags"])
    encoding = processor(
        images,
        words,
        boxes=boxes,
        word_labels=word_labels,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print(encoding["input_ids"].shape)
except Exception as e:
    print("\n".join(textwrap.wrap(str(e), width=160)))

LayoutLMv2Tokenizer crashes with an AttributeError when word_labels is passed for NER token classification. In a different use case, calling the processor with padding="max_length" and truncation=True raises a downstream ValueError asking the user to set those same flags, despite both already being set correctly (more details in the PR; the screenshots there show what happens after the first attribute issue is fixed but before the second fix is made).

Current Repro Output:

Expected behavior

→ encoding["labels"] should return a list in which subword tokens are masked with the default ignore_index (-100) in nn.CrossEntropyLoss

→ encoding["input_ids"].shape should return the expected torch.Size().
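Why the second expectation depends on padding and truncation actually taking effect can be shown with plain lists. The helpers below are hypothetical and the token ids are arbitrary; the sketch assumes a target length of 512, matching the model_max_length discussed in the PR.

```python
def pad_or_truncate(batch, max_length, pad_id=0):
    """Force every sequence to exactly max_length tokens."""
    return [seq[:max_length] + [pad_id] * max(0, max_length - len(seq)) for seq in batch]


def shape_of(batch):
    """Return (rows, cols) for a rectangular batch, or raise if it is ragged."""
    lengths = {len(seq) for seq in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, lengths={sorted(lengths)}")
    return (len(batch), lengths.pop())


# Three sequences of different lengths, one longer than 512 tokens.
ragged = [[101, 7, 8, 102], [101, 7, 102], list(range(600))]

try:
    shape_of(ragged)
    ragged_error = None
except ValueError as err:
    ragged_error = str(err)

rect = pad_or_truncate(ragged, max_length=512)
print(ragged_error)
print(shape_of(rect))
```

When padding and truncation are silently dropped, the batch stays in the ragged state, which is the situation the misleading ValueError reports.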

Request to the Reviewers

I see a few unsolicited attempts to fix the issue, even though a PR had already been linked to it previously. Please refer to 44187 for the original bug fix attempt; thank you!

extent analysis

Fix Plan

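1. Update transformers version

Upgrade to a transformers build that includes PR #44187. The issue was reported against 5.0.0.dev0, so pinning an older 4.x release will not pick up the fix; installing from the main branch is one way to get it.

pip install --upgrade git+https://github.com/huggingface/transformers.git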

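2. Update LayoutLMv2Tokenizer

Load the tokenizer as usual; no extra loading flags are needed. On a build with the fix, only_label_first_subword is stored on the tokenizer. On a build without it, setting the attribute by hand is a possible workaround (assuming True is the intended default).

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
# Workaround for unpatched builds only; a patched tokenizer already has the attribute.
if not hasattr(tokenizer, "only_label_first_subword"):
    tokenizer.only_label_first_subword = True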

3. Update LayoutLMv2Processor

Load the processor the same way as in the reproduction. It wraps the same tokenizer, so it picks up the fix from the upgraded package.

from transformers import LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",
    apply_ocr=False,
)

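4. Keep word_labels in the LayoutLMv2Tokenizer call

word_labels is supported and is what produces encoding["labels"]; the crash came from the missing only_label_first_subword attribute, not from the argument itself. Dropping word_labels would leave the encoding without a "labels" entry.

try:
    encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
    print(encoding["labels"])
except Exception as e:
    print(e)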

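5. Pass padding and truncation to LayoutLMv2Processor

With the fix applied, model_max_length is restored to 512, so padding="max_length" and truncation=True take effect without further arguments. On a build without the fix, passing max_length explicitly avoids the no-op fallback described in PR #44187. Keep word_labels in the call so that labels are produced.

encoding = processor(
    images,
    words,
    boxes=boxes,
    word_labels=word_labels,
    padding="max_length",
    truncation=True,
    max_length=512,  # only needed when model_max_length has not been restored
    return_tensors="pt",
)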

Verification

Run the NER use case with the updated LayoutLMv2Tokenizer and verify that encoding["labels"] returns a list with subword tokens masked with the default ignore index (-100).
Run the batched training data prep use case and verify that encoding["input_ids"].shape returns the expected torch.Size().
