transformers - ✅(Solved) Fix Inconsistent tokenization and BLEU scores between AutoTokenizer and NllbTokenizerFast [1 pull request, 3 comments, 4 participants]

transformers · 2026-03-25 12:37:55

GitHub stats

huggingface/transformers#44993 • Fetched 2026-04-08 01:30:59

Timeline (top): subscribed ×4, commented ×3, mentioned ×3, cross-referenced ×1

PR fix notes

PR #45078: throw error when conversion required

Repository: huggingface/transformers
Author: itazap
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45078

Description (problem / solution / changelog)

fixes fallback https://github.com/huggingface/transformers/issues/44993

Changed files

src/transformers/models/auto/tokenization_auto.py (modified, +14/-9)
tests/models/auto/test_tokenization_auto.py (modified, +19/-0)

System Info

transformers version: 5.0.0
Platform: macOS-26.3.1-arm64-arm-64bit
Python version: 3.10.19
PyTorch version: 2.10.0

Information

I've been evaluating facebook/nllb-200-distilled-600M across 36 different language pairs and ran into a significant discrepancy depending on which tokenizer class is instantiated.

When using NllbTokenizerFast versus AutoTokenizer, the resulting BLEU scores are drastically different for the exact same generation parameters.

For example:

swe_Latn -> fra_Latn: Drops from ~43.35 BLEU (Fast) to ~9.02 BLEU (Auto).
spa_Latn -> fra_Latn: Jumps from ~33.97 BLEU (Fast) to ~53.25 BLEU (Auto).

To understand the massive gap in BLEU scores, I inspected the raw token outputs. I noticed that AutoTokenizer completely ignores the src_lang argument and drops the routing prefix.

However, when testing this on a second machine, both AutoTokenizer and NllbTokenizerFast produced the exact same output. After comparing the environments, I realized the only variable was the presence of the sentencepiece library:

With sentencepiece installed: AutoTokenizer fails to prepend the src_lang token and appends an <unk> token at the end
Without sentencepiece: AutoTokenizer and NllbTokenizerFast produce the same tokens
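
Since the behavior appears to depend on whether sentencepiece is present, it may help to record that fact next to any BLEU numbers. The following is a minimal sketch using only the standard library; `package_version` and `describe_tokenizer_env` are hypothetical helper names, not part of transformers.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def describe_tokenizer_env() -> dict:
    """Collect the environment facts that differed between the two machines."""
    return {
        # True when `import sentencepiece` would succeed in this interpreter
        "sentencepiece_importable": importlib.util.find_spec("sentencepiece") is not None,
        "sentencepiece_version": package_version("sentencepiece"),
        "transformers_version": package_version("transformers"),
    }


if __name__ == "__main__":
    print(describe_tokenizer_env())
```

Logging this dictionary at the start of an evaluation run makes it easier to tell afterwards which of the two tokenization behaviors a given score was produced under.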

BLEU Score Heatmaps

Here is the side-by-side comparison of the 36 language pairs.

[Figure not preserved: BLEU score heatmaps in two panels, `NllbTokenizerFast` and `AutoTokenizer`.]

Who can help?

@ArthurZucker @itazap

Reproduction

from transformers import AutoTokenizer, NllbTokenizerFast

model_name = "facebook/nllb-200-distilled-600M"

# Load the same checkpoint through both entry points
tokenizer_auto = AutoTokenizer.from_pretrained(model_name)
tokenizer_fast = NllbTokenizerFast.from_pretrained(model_name)

sample_text = "i måndags meddelade forskare från stanford university school of medicine att man tagit fram ett nytt diagnostiskt verktyg..."

# Set the source language on both tokenizers
tokenizer_fast.src_lang = "swe_Latn"
tokenizer_auto.src_lang = "swe_Latn"

inputs_auto = tokenizer_auto(sample_text, src_lang="swe_Latn", return_tensors="pt")
inputs_fast = tokenizer_fast(sample_text, src_lang="swe_Latn", return_tensors="pt")

# Print IDs and tokens side by side to compare the two tokenizers
print("NllbTokenizerFast")
print("Input IDs:", inputs_fast['input_ids'][0].tolist())
print("Tokens:", tokenizer_fast.convert_ids_to_tokens(inputs_fast['input_ids'][0]))

print("\n AutoTokenizer")
print("Input IDs:", inputs_auto['input_ids'][0].tolist())
print("Tokens:", tokenizer_auto.convert_ids_to_tokens(inputs_auto['input_ids'][0]))

Expected behavior

With `sentencepiece` installed

NllbTokenizerFast:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']

AutoTokenizer:
Input IDs: [30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2, 3]
Tokens: ['▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>', '<unk>']

Without `sentencepiece` installed (For me the expected results)

NllbTokenizerFast:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']

AutoTokenizer:
Input IDs: [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
Tokens: ['swe_Latn', '▁i', '▁må', 'nda', 'gs', '▁medde', 'lade', '▁forsk', 'are', '▁från', '▁stan', 'ford', '▁university', '▁school', '▁of', '▁medicine', '▁att', '▁man', '▁tagit', '▁fram', '▁ett', '▁nytt', '▁diag', 'nost', 'iskt', '▁verkt', 'yg', '</s>']

extent analysis

Fix Plan

To resolve the discrepancy in BLEU scores between NllbTokenizerFast and AutoTokenizer, we need to ensure that AutoTokenizer correctly prepends the src_lang token and does not append an <unk> token at the end.
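
The two conditions named above (language code first, no trailing <unk>) can be checked at runtime before any generation is run. The sketch below is one possible guard; `check_nllb_encoding` is a hypothetical helper, and it assumes the tokenizer object exposes `src_lang`, `convert_tokens_to_ids`, and `unk_token_id` and returns a flat list of IDs when called on a single string without `return_tensors`.

```python
def check_nllb_encoding(tokenizer, text, src_lang):
    """Return a list of problems found in tokenizer(text); an empty list means none.

    Assumption: `tokenizer` behaves like a Hugging Face tokenizer for the calls used here.
    """
    tokenizer.src_lang = src_lang
    input_ids = list(tokenizer(text)["input_ids"])
    problems = []

    # The source language code is expected to be the first input ID.
    lang_id = tokenizer.convert_tokens_to_ids(src_lang)
    if not input_ids or input_ids[0] != lang_id:
        problems.append(f"missing {src_lang} prefix (expected first ID {lang_id})")

    # A sequence should end with the end-of-sequence token, never with <unk>.
    unk_id = getattr(tokenizer, "unk_token_id", None)
    if unk_id is not None and input_ids and input_ids[-1] == unk_id:
        problems.append("trailing <unk> token")

    return problems
```

Calling `check_nllb_encoding(tokenizer_auto, sample_text, "swe_Latn")` once after loading and failing fast on a non-empty result would keep a misconfigured tokenizer from silently skewing BLEU scores.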

The issue arises when the sentencepiece library is installed. To fix this, we can try the following steps:

1. Uninstall the sentencepiece library if it's not required for other parts of the project.
2. If the sentencepiece library is necessary, update the transformers library to the latest version, as this issue might be resolved in newer versions.
3. As a temporary workaround, manually prepend the src_lang token and remove the <unk> token from the output of AutoTokenizer.

Here's an example code snippet that demonstrates the temporary workaround:

from transformers import AutoTokenizer, NllbTokenizerFast

model_name = "facebook/nllb-200-distilled-600M"

tokenizer_auto = AutoTokenizer.from_pretrained(model_name)
tokenizer_fast = NllbTokenizerFast.from_pretrained(model_name)

sample_text = "i måndags meddelade forskare från stanford university school of medicine att man tagit fram ett nytt diagnostiskt verktyg..."

src_lang = "swe_Latn"

# Temporary workaround for AutoTokenizer
inputs_auto = tokenizer_auto(sample_text, return_tensors="pt")
input_ids_auto = inputs_auto['input_ids'][0].tolist()
tokens_auto = tokenizer_auto.convert_ids_to_tokens(input_ids_auto)

# Look up the IDs from the fast tokenizer instead of hard-coding them
# (in the outputs above, 'swe_Latn' is 256167 and '<unk>' is 3)
src_lang_id = tokenizer_fast.convert_tokens_to_ids(src_lang)
unk_id = tokenizer_fast.unk_token_id

# Remove the trailing <unk> token only if it is actually there
if input_ids_auto and input_ids_auto[-1] == unk_id:
    input_ids_auto = input_ids_auto[:-1]
    tokens_auto = tokens_auto[:-1]

# Prepend the src_lang token only if it is missing
if not input_ids_auto or input_ids_auto[0] != src_lang_id:
    input_ids_auto = [src_lang_id] + input_ids_auto
    tokens_auto = [src_lang] + tokens_auto

print("NllbTokenizerFast")
tokenizer_fast.src_lang = src_lang
inputs_fast = tokenizer_fast(sample_text, src_lang=src_lang, return_tensors="pt")
print("Input IDs:", inputs_fast['input_ids'][0].tolist())
print("Tokens:", tokenizer_fast.convert_ids_to_tokens(inputs_fast['input_ids'][0]))

print("\n AutoTokenizer (with workaround)")
print("Input IDs:", input_ids_auto)
print("Tokens:", tokens_auto)

Verification

To verify that the fix worked, compare the output of NllbTokenizerFast and AutoTokenizer (with the temporary workaround) to ensure that they produce the same tokens and input IDs.
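
One way to make this comparison concrete is to diff the two ID sequences rather than eyeball them. The sketch below uses the standard library's `difflib.SequenceMatcher` on the ID lists reported under "Expected behavior" for the case with sentencepiece installed; `diff_ids` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def diff_ids(expected, actual):
    """Return (operation, expected_slice, actual_slice) for every region that differs."""
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Input IDs copied from the outputs reported with sentencepiece installed
fast_ids = [256167, 30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2]
auto_ids = [30, 3471, 6486, 10056, 117348, 14909, 11507, 463, 13861, 11651, 13056, 181155, 26958, 452, 150992, 1763, 492, 207809, 9520, 3288, 55723, 30650, 5536, 25424, 138458, 5733, 2, 3]

for operation, expected_part, actual_part in diff_ids(fast_ids, auto_ids):
    print(operation, expected_part, actual_part)
# delete [256167] []
# insert [] [3]
```

An empty result from `diff_ids` means the two tokenizers agree on the sample; after applying the workaround, the list passed as `actual` would be `input_ids_auto`.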

Extra Tips

Make sure to update the transformers library to the latest version to ensure that any known issues are resolved.
If the sentencepiece library is required for other parts of the project, consider using a virtual environment to isolate the dependencies and avoid conflicts.
When using AutoTokenizer, always verify that the src_lang token is
