transformers - ✅(Solved) Fix [BUG] LayoutLMv2Tokenizer crashes on NER inputs and batched padding/truncation [2 pull requests, 1 comments, 2 participants]

GitHub stats
huggingface/transformers#44186 (fetched 2026-04-08 00:29:58)
Comments: 1
Participants: 2
Timeline: 10
Reactions: 0
Timeline (top): cross-referenced ×2, mentioned ×2, subscribed ×2, closed ×1

Error Message

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
words = ["Total", "Amount", ":", "$1,234.56"]
boxes = [[100, 200, 300, 250], [310, 200, 450, 250], [460, 200, 480, 250], [490, 200, 650, 250]]
word_labels = [0, 0, 0, 1]

try:
    encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
    print(encoding["labels"])
except Exception as e:
    print(e)

Fix Action

Fixed

PR fix notes

PR #44187: fix(models): Fix LayoutLMv2 NER crash and broken batched truncation/padding

Description (problem / solution / changelog)

What does this PR do?

The following issues were identified and fixed in this PR:

→ The NER/token classification issue, plus a downstream bug uncovered in the batched preprocessing use case with LayoutLMv2Tokenizer.

→ Reasoning: the NER use case shows that the error is hit at a line in LayoutLMv2Tokenizer that reads self.only_label_first_subword, which is never set. PreTrainedTokenizerBase doesn't create self.only_label_first_subword either, so it must be added along with the other custom attributes. This directly resolved the NER use case, as shown in the screenshot.

→ Second fix: any padding="max_length" or truncation=True call without an explicit max_length argument checks self.model_max_length > LARGE_INTEGER (1e20). Here that evaluates to True, because model_max_length falls back to VERY_LARGE_INTEGER, so both options are turned into no-ops. The sequences in the batch then have different lengths and can't be tensorized, and the resulting ValueError misleadingly tells the user to add padding=True and truncation=True even though both are already set. The fix restores model_max_length=512. I confirmed that both the base and large model configs have max_position_embeddings=512, so 512 is the correct default; this follows the same pattern as MarianTokenizer and TapasTokenizer :)
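To make the second fix easier to follow, here is a minimal sketch of the comparison described above. The function resolve_max_length is a hypothetical, simplified stand-in and not the library's actual code; only the constant names and the 1e20 threshold come from the PR description, and the 1e30 value for VERY_LARGE_INTEGER is an assumption.

```python
# LARGE_INTEGER (1e20) is named in the PR description.
LARGE_INTEGER = int(1e20)
# Assumed value; any number above LARGE_INTEGER behaves the same here.
VERY_LARGE_INTEGER = int(1e30)


def resolve_max_length(padding, truncation, max_length, model_max_length):
    """Hypothetical, simplified stand-in for the strategy resolution.

    With no explicit max_length and an effectively unset model_max_length,
    the requested padding and truncation are silently dropped.
    """
    if max_length is None:
        if model_max_length > LARGE_INTEGER:
            # No usable limit is known, so both requests become no-ops.
            if padding == "max_length":
                padding = False
            if truncation:
                truncation = False
        else:
            # A real limit exists, so it becomes the target length.
            max_length = model_max_length
    return padding, truncation, max_length


# Before the fix: model_max_length fell back to VERY_LARGE_INTEGER.
before = resolve_max_length("max_length", True, None, VERY_LARGE_INTEGER)
# After the fix: model_max_length defaults to 512.
after = resolve_max_length("max_length", True, None, 512)
print(before)  # (False, False, None)
print(after)   # ('max_length', True, 512)
```

In the "before" case the caller asked for padding and truncation but effectively gets neither, which is why the later error message about missing flags is misleading.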

Originally removed in #42894, just wanted to double-check if this was intentional; happy to adjust this fix if I’ve missed something :)

Fixes #44186.

Before both fixes applied:

[Screenshot: https://github.com/user-attachments/assets/fab48b55-d38c-4507-a5a9-6afd09fcdaa9]

Attribute fix resolves NER; batched use case still fails (feel free to cross-check; the errors are reproducible):

[Screenshot: https://github.com/user-attachments/assets/19e2bfd7-0c0d-4ade-8763-eb4d48872a8a]

After both fixes applied; NER + batched use case work (feel free to cross-check):

[Screenshot: https://github.com/user-attachments/assets/05c41d74-1c1b-4b0d-8e3c-bf56457291a2]


Changed files

  • src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py (modified, +3/-0)

PR #44204: fix(layoutlmv2): store only_label_first_subword attribute in tokenizer

Description (problem / solution / changelog)

What does this PR do?

Fixes #44186

LayoutLMv2Tokenizer.__init__ passes only_label_first_subword to super().__init__() but never stores it as self.only_label_first_subword. This causes an AttributeError when word_labels is passed for NER token classification tasks, since _batch_encode_plus references self.only_label_first_subword at line 661.

The fix adds self.only_label_first_subword = only_label_first_subword to __init__, matching the pattern used by both LayoutXLMTokenizer and UdopTokenizer.

One-line change

self.only_label_first_subword = only_label_first_subword
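The failure mode can be reproduced without the library. The sketch below uses hypothetical stand-in classes (Base, BrokenTokenizer, FixedTokenizer), not the real tokenizer hierarchy, and assumes only what the PR states: the argument is forwarded to the parent constructor but never stored on the instance.

```python
class Base:
    """Hypothetical parent: accepts extra kwargs but does not set them as attributes."""

    def __init__(self, **kwargs):
        self.init_kwargs = kwargs


class BrokenTokenizer(Base):
    """Forwards the argument to the parent and never stores it on self."""

    def __init__(self, only_label_first_subword=True, **kwargs):
        super().__init__(only_label_first_subword=only_label_first_subword, **kwargs)

    def encode_labels(self):
        # Reads an attribute that __init__ never created.
        return self.only_label_first_subword


class FixedTokenizer(BrokenTokenizer):
    """Same constructor plus the one-line fix."""

    def __init__(self, only_label_first_subword=True, **kwargs):
        super().__init__(only_label_first_subword=only_label_first_subword, **kwargs)
        self.only_label_first_subword = only_label_first_subword


try:
    BrokenTokenizer().encode_labels()
    broken_error = None
except AttributeError as err:
    broken_error = str(err)

print(broken_error)
print(FixedTokenizer().encode_labels())
```

The value still reaches the parent through init_kwargs in both classes; what differs is whether later method calls can read it from self.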

Before this PR

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
tokenizer(words, boxes=boxes, word_labels=word_labels)
# AttributeError: 'LayoutLMv2Tokenizer' object has no attribute 'only_label_first_subword'

After this PR

NER tokenization with word_labels works as expected, producing correct label alignment with subword masking.

Changed files

  • src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py (modified, +1/-0)


System Info

  • transformers version: 5.0.0.dev0
  • Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • huggingface_hub version: 1.3.2
  • safetensors version: 0.7.0
  • accelerate version: 1.12.0
  • Accelerate config: not installed
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
  • GPU type: NVIDIA L4
  • NVIDIA driver version: 550.90.07
  • CUDA version: 12.4

Who can help?

@zucchini-nlp (multimodal model) @ArthurZucker (tokenizer)

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

NER use case:

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
words = ["Total", "Amount", ":", "$1,234.56"]
boxes = [[100, 200, 300, 250], [310, 200, 450, 250], [460, 200, 480, 250], [490, 200, 650, 250]]
word_labels = [0, 0, 0, 1]

try:
    encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
    print(encoding["labels"])
except Exception as e:
    print(e)

Batched training data prep with truncation/padding:

from transformers import LayoutLMv2Processor
from datasets import load_dataset
import textwrap

try:
    processor = LayoutLMv2Processor.from_pretrained(
        "microsoft/layoutlmv2-base-uncased",
        apply_ocr=False
    )
    dataset = load_dataset("nielsr/funsd", split="train")
    images = [img.convert("RGB") for img in dataset["image"]]
    words = list(dataset["words"])
    boxes = list(dataset["bboxes"])
    word_labels = list(dataset["ner_tags"])
    encoding = processor(
        images,
        words,
        boxes=boxes,
        word_labels=word_labels,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print(encoding["input_ids"].shape)
except Exception as e:
    print("\n".join(textwrap.wrap(str(e), width=160)))

LayoutLMv2Tokenizer crashes with an AttributeError when word_labels is passed for NER token classification. In a separate use case, calling the processor with padding="max_length" and truncation=True raises a downstream ValueError asking the user to set those same flags, despite both being set correctly. More details are in the PR; its screenshots show what happens after the first attribute issue is fixed but before the second fix is made.
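As a rough illustration of why the second failure surfaces at tensor conversion, the sketch below shows a ragged batch and what padding plus truncation to a fixed length is meant to produce. The helper names and token IDs are made up, and the truncation is deliberately naive: a real tokenizer also preserves special tokens and pads boxes and labels alongside the input IDs.

```python
def is_rectangular(batch):
    """True when every sequence has the same length, as a tensor requires."""
    return len({len(seq) for seq in batch}) <= 1


def pad_and_truncate(batch, max_length, pad_id=0):
    """Naive sketch: right-pad with pad_id, then cut every sequence to max_length."""
    return [(seq + [pad_id] * max_length)[:max_length] for seq in batch]


# Made-up token IDs for two documents of different lengths.
batch = [[101, 2561, 102], [101, 2561, 3815, 1024, 102]]
fixed = pad_and_truncate(batch, max_length=4)

print(is_rectangular(batch))  # False
print(is_rectangular(fixed))  # True
print(fixed)                  # [[101, 2561, 102, 0], [101, 2561, 3815, 1024]]
```

When padding and truncation are silently skipped, the batch stays in the first, ragged state, and converting it with return_tensors="pt" fails.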

Current Repro Output:

[Screenshot: https://github.com/user-attachments/assets/4311018a-3fc5-4e5a-89c0-46a4b25d0387]

Expected behavior

encoding["labels"] should return a list in which subword tokens are masked with the default ignore_index (-100) of nn.CrossEntropyLoss.

encoding["input_ids"].shape should return the expected torch.Size().
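The expected label masking can be sketched in plain Python. The align_labels helper and the word_ids mapping below are hypothetical: the subword split is invented for illustration and is not the real tokenizer's output for these words.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels, only_label_first_subword=True):
    """Expand word-level labels to token level.

    word_ids gives, for each token, the index of the word it came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens never contribute to the loss.
            labels.append(IGNORE_INDEX)
        elif word_id != previous or not only_label_first_subword:
            # First subword of a word, or every subword when masking is off.
            labels.append(word_labels[word_id])
        else:
            # Later subwords of the same word are masked.
            labels.append(IGNORE_INDEX)
        previous = word_id
    return labels


# Invented split: words 0-2 stay whole, word 3 becomes three subword tokens.
word_ids = [None, 0, 1, 2, 3, 3, 3, None]
masked = align_labels(word_ids, [0, 0, 0, 1])
unmasked = align_labels(word_ids, [0, 0, 0, 1], only_label_first_subword=False)
print(masked)    # [-100, 0, 0, 0, 1, -100, -100, -100]
print(unmasked)  # [-100, 0, 0, 0, 1, 1, 1, -100]
```

This is the behavior that the only_label_first_subword attribute controls, which is why its absence breaks exactly the word_labels path.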

Request to the Reviewers

I see a few unsolicited attempts to fix the issue, even though a PR had already been linked to it previously. Please refer to 44187 for the original bug fix attempt; thank you!

extent analysis

Fix Plan

1. Update transformers version

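Install a transformers build that includes the fix for #44186 (see PR #44187 and PR #44204 above). The issue was reported against 5.0.0.dev0, so pinning an older release such as 4.26.3 does not address it. Once a release containing the fix is published, upgrading picks it up:

pip install --upgrade transformers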

2. Update LayoutLMv2Tokenizer

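Load the tokenizer as usual; from_tf concerns model weights and has no bearing on this bug. On a build that still lacks the fix, setting the missing attribute by hand after loading is a possible stopgap. This is an untested assumption based on the one-line change in PR #44204, and True is assumed to be the intended default.

from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
# Assumed stopgap only: mirror the attribute that the fix stores in __init__.
tokenizer.only_label_first_subword = True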

3. Update LayoutLMv2Processor

Load the processor without from_tf=True, which does not apply here.

from transformers import LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",
    apply_ocr=False,
)

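4. Keep word_labels in the LayoutLMv2Tokenizer call

word_labels is supported; the crash came from the missing only_label_first_subword attribute, not from the argument. Dropping word_labels would also remove the "labels" key from the encoding.

try:
    encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
    print(encoding["labels"])
except Exception as e:
    print(e)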

5. Update padding and truncation in LayoutLMv2Processor

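On a build without the fix, pass max_length explicitly so that padding="max_length" and truncation=True are not silently dropped. The value 512 matches max_position_embeddings for the base and large checkpoints, per PR #44187.

encoding = processor(
    images,
    words,
    boxes=boxes,
    word_labels=word_labels,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)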

Verification

  1. Run the NER use case with the updated LayoutLMv2Tokenizer and verify that encoding["labels"] returns a list with subword tokens masked with the default ignore index (-100).
  2. Run the batched training data preparation and verify that encoding["input_ids"].shape returns the expected torch.Size().
