transformers - ✅(Solved) Fix Inconsistent tokenization and BLEU scores between AutoTokenizer and NllbTokenizerFast [1 pull request, 3 comments, 4 participants]

GitHub stats
huggingface/transformers#44993 (fetched 2026-04-08 01:30:59)
Comments: 3 · Participants: 4 · Timeline events: 12 · Reactions: 1
Timeline (top): subscribed ×4, commented ×3, mentioned ×3, cross-referenced ×1

PR fix notes

PR #45078: throw error when conversion required

Description (problem / solution / changelog)

fixes fallback https://github.com/huggingface/transformers/issues/44993

Changed files

  • src/transformers/models/auto/tokenization_auto.py (modified, +14/-9)
  • tests/models/auto/test_tokenization_auto.py (modified, +19/-0)


System Info

  • transformers version: 5.0.0
  • Platform: macOS-26.3.1-arm64-arm-64bit
  • Python version: 3.10.19
  • PyTorch version: 2.10.0

Information

I've been evaluating facebook/nllb-200-distilled-600M across 36 different language pairs and ran into a significant discrepancy depending on which tokenizer class is instantiated.

When using NllbTokenizerFast versus AutoTokenizer, the resulting BLEU scores are drastically different for the exact same generation parameters.

For example:

  • swe_Latn -> fra_Latn: Drops from ~43.35 BLEU (Fast) to ~9.02 BLEU (Auto).
  • spa_Latn -> fra_Latn: Jumps from ~33.97 BLEU (Fast) to ~53.25 BLEU (Auto).

To understand the massive gap in BLEU scores, I inspected the raw token outputs. I noticed that AutoTokenizer completely ignores the src_lang argument and drops the routing prefix.

However, when testing this on a second machine, both AutoTokenizer and NllbTokenizerFast produced the exact same output. After comparing the environments, I realized the only variable was the presence of the sentencepiece library:

  • With sentencepiece installed: AutoTokenizer fails to prepend the src_lang token and appends an <unk> token at the end
  • Without sentencepiece: AutoTokenizer and NllbTokenizerFast produce the same tokens
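Since the behavior depends on whether sentencepiece is importable, it can help to record that fact alongside any evaluation run. The following is a minimal sketch using only the standard library; the helper names `has_module` and `describe_environment` are illustrative and not part of transformers.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def describe_environment() -> dict:
    """Report which tokenizer-related packages are importable in this environment."""
    packages = ("sentencepiece", "tokenizers", "transformers")
    return {name: has_module(name) for name in packages}


if __name__ == "__main__":
    for package, present in describe_environment().items():
        print(f"{package}: {'installed' if present else 'missing'}")
```

Logging this next to the BLEU scores makes it easier to tell whether two machines disagree because of the environment rather than the model.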

BLEU Score Heatmaps

Here is the side-by-side comparison of the 36 language pairs.

[Figures: BLEU heatmap for NllbTokenizerFast (left) and BLEU heatmap for AutoTokenizer (right); images not reproduced here.]

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer, NllbTokenizerFast

model_name = "facebook/nllb-200-distilled-600M"

# Load the same checkpoint through both entry points
tokenizer_auto = AutoTokenizer.from_pretrained(model_name)
tokenizer_fast = NllbTokenizerFast.from_pretrained(model_name)

sample_text = "i måndags meddelade forskare från stanford university school of medicine att man tagit fram ett nytt diagnostiskt verktyg..."

# Set the source language on both tokenizers
tokenizer_fast.src_lang = "swe_Latn"
tokenizer_auto.src_lang = "swe_Latn"

# src_lang is also passed per call, as in the original report
inputs_auto = tokenizer_auto(sample_text, src_lang="swe_Latn", return_tensors="pt")
inputs_fast = tokenizer_fast(sample_text, src_lang="swe_Latn", return_tensors="pt")

# Print IDs and tokens so the language prefix and the final token can be compared
print("NllbTokenizerFast")
print("Input IDs:", inputs_fast['input_ids'][0].tolist())
print("Tokens:", tokenizer_fast.convert_ids_to_tokens(inputs_fast['input_ids'][0]))

print("\n AutoTokenizer")
print("Input IDs:", inputs_auto['input_ids'][0].tolist())
print("Tokens:", tokenizer_auto.convert_ids_to_tokens(inputs_auto['input_ids'][0]))

Expected behavior

With sentencepiece installed

NllbTokenizerFast: Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2] Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']

AutoTokenizer: Input IDs: [30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2, 3] Tokens: ['▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>', '<unk>']

Without sentencepiece installed (the results I expected)

NllbTokenizerFast: Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2] Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']

AutoTokenizer: Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2] Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']
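The two failure symptoms in the outputs above (no language prefix, stray trailing `<unk>`) can be checked mechanically. Below is a small sketch in plain Python that works on the ID lists copied from this report. The function name `diagnose_nllb_ids` is hypothetical, and the IDs 256167 (`swe_Latn`), 2 (`</s>`) and 3 (`<unk>`) are taken from the printed output rather than looked up from the tokenizer.

```python
SWE_LATN_ID = 256167  # 'swe_Latn', as printed in the report
EOS_ID = 2            # '</s>', as printed in the report
UNK_ID = 3            # '<unk>', as printed in the report

# Token IDs of the sentence itself, without prefix or end markers
BODY = [30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651,
        13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288,
        55723, 30650, 5536, 25424, 138458, 5733]

fast_ids = [SWE_LATN_ID] + BODY + [EOS_ID]    # NllbTokenizerFast output
auto_ids = BODY + [EOS_ID, UNK_ID]            # AutoTokenizer with sentencepiece


def diagnose_nllb_ids(ids, lang_id, eos_id=EOS_ID, unk_id=UNK_ID):
    """Return a list of human-readable problems found in an encoded source sequence."""
    problems = []
    if not ids or ids[0] != lang_id:
        problems.append("missing src_lang prefix")
    if ids and ids[-1] == unk_id:
        problems.append("trailing <unk>")
    elif ids and ids[-1] != eos_id:
        problems.append("does not end with </s>")
    return problems


print("fast:", diagnose_nllb_ids(fast_ids, SWE_LATN_ID))
print("auto:", diagnose_nllb_ids(auto_ids, SWE_LATN_ID))
```

In a real pipeline the IDs would presumably come from the tokenizer itself (for example via `convert_tokens_to_ids`) rather than from constants.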

extent analysis

Fix Plan

To resolve the discrepancy in BLEU scores between NllbTokenizerFast and AutoTokenizer, AutoTokenizer must prepend the src_lang token and must not append an <unk> token at the end.

The issue appears only when the sentencepiece library is installed. Possible steps, roughly in order of preference:

  • Update transformers once a release includes PR #45078, which addresses the fallback by throwing an error when conversion is required.
  • Instantiate NllbTokenizerFast directly; in the report above it produced the correct tokens both with and without sentencepiece.
  • Uninstall the sentencepiece library if it's not required for other parts of the project.
  • As a temporary workaround, manually prepend the src_lang token and remove the trailing <unk> token from the output of AutoTokenizer.

Here's an example code snippet that demonstrates the temporary workaround:

from transformers import AutoTokenizer, NllbTokenizerFast

model_name = "facebook/nllb-200-distilled-600M"
src_lang = "swe_Latn"

tokenizer_auto = AutoTokenizer.from_pretrained(model_name)
tokenizer_fast = NllbTokenizerFast.from_pretrained(model_name)

sample_text = "i måndags meddelade forskare från stanford university school of medicine att man tagit fram ett nytt diagnostiskt verktyg..."

# Reference output from NllbTokenizerFast, which was correct in both environments
tokenizer_fast.src_lang = src_lang
inputs_fast = tokenizer_fast(sample_text, src_lang=src_lang, return_tensors="pt")
input_ids_fast = inputs_fast['input_ids'][0].tolist()

# Temporary workaround for AutoTokenizer
tokenizer_auto.src_lang = src_lang
inputs_auto = tokenizer_auto(sample_text, src_lang=src_lang, return_tensors="pt")
input_ids_auto = inputs_auto['input_ids'][0].tolist()

# Look up the IDs instead of hard-coding them (swe_Latn is 256167 in the output above)
src_lang_id = tokenizer_fast.convert_tokens_to_ids(src_lang)
unk_id = tokenizer_fast.unk_token_id

# Patch only when the prefix is missing, so a correctly behaving tokenizer is left untouched
if input_ids_auto[0] != src_lang_id:
    if input_ids_auto[-1] == unk_id:
        input_ids_auto = input_ids_auto[:-1]  # drop the stray trailing <unk>
    input_ids_auto = [src_lang_id] + input_ids_auto

print("NllbTokenizerFast")
print("Input IDs:", input_ids_fast)
print("Tokens:", tokenizer_fast.convert_ids_to_tokens(input_ids_fast))

print("\n AutoTokenizer (with workaround)")
print("Input IDs:", input_ids_auto)
print("Tokens:", tokenizer_fast.convert_ids_to_tokens(input_ids_auto))

Verification

To verify that the fix worked, compare the output of NllbTokenizerFast and AutoTokenizer (with the temporary workaround) to ensure that they produce the same tokens and input IDs.
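One way to make that comparison precise is to report the first position where the two ID sequences differ, rather than eyeballing long lists. This is a sketch in plain Python; `first_mismatch` is an illustrative helper, not a transformers function, and the short ID lists in the usage lines are abbreviated stand-ins for real tokenizer output.

```python
from itertools import zip_longest


def first_mismatch(expected, actual):
    """Return (index, expected_item, actual_item) for the first difference, or None if equal.

    A missing item (when one sequence is shorter) is reported as None.
    """
    for index, (exp, act) in enumerate(zip_longest(expected, actual)):
        if exp != act:
            return index, exp, act
    return None


# Abbreviated example: correct output vs. output lacking the prefix and ending in <unk>
reference = [256167, 30, 3471, 2]
broken = [30, 3471, 2, 3]

print(first_mismatch(reference, reference))
print(first_mismatch(reference, broken))
```

A result of `None` means the sequences are identical; anything else pinpoints where the workaround or the tokenizer still diverges.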

Extra Tips

  • Make sure to update the transformers library to the latest version to ensure that any known issues are resolved.
  • If the sentencepiece library is required for other parts of the project, consider using a virtual environment to isolate the dependencies and avoid conflicts.
  • When using AutoTokenizer, always verify that the src_lang token is the first token of the encoded input.
