transformers - 💡(How to fix) Fix Regression in v5.x: `ProcessorMixin._load_tokenizer_from_pretrained` forces subfolder for non-primary sub-tokenizers, breaking repos that put tokenizer files at root [1 pull requests]


ProcessorMixin._load_tokenizer_from_pretrained in v5.x unconditionally appends the sub-processor attribute name as a subfolder when loading a non-primary tokenizer:

# src/transformers/processing_utils.py, ~L1457-1467
is_primary = sub_processor_type == "tokenizer"
if is_primary:
    tokenizer = AutoTokenizer.from_pretrained(path, subfolder=subfolder, **kwargs)
else:
    tokenizer_subfolder = os.path.join(subfolder, sub_processor_type) if subfolder else sub_processor_type
    tokenizer = AutoTokenizer.from_pretrained(path, subfolder=tokenizer_subfolder, **kwargs)

This breaks loading of processors whose tokenizer files live at the repo root but whose sub-processor attribute is named anything other than tokenizer — e.g. UniversalActionProcessor from physical-intelligence/fast, which uses bpe_tokenizer as the attribute name.
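To make the effect concrete, here is a minimal standalone sketch that mirrors only the path logic of the snippet above. The function name resolve_tokenizer_subfolder is hypothetical and does not exist in transformers; it only reproduces the branch shown in the excerpt.

```python
import os


def resolve_tokenizer_subfolder(sub_processor_type: str, subfolder: str = "") -> str:
    """Return the subfolder the quoted v5.x branch would pass to AutoTokenizer.

    Hypothetical helper for illustration; it copies the branch from the
    excerpt above and is not part of the transformers API.
    """
    if sub_processor_type == "tokenizer":
        # Primary tokenizer: the caller's subfolder is used unchanged.
        return subfolder
    # Any other attribute name is appended as an extra path component.
    return os.path.join(subfolder, sub_processor_type) if subfolder else sub_processor_type


# The primary tokenizer resolves to the repo root when no subfolder is given.
print(repr(resolve_tokenizer_subfolder("tokenizer")))
# A differently named attribute is pushed into a directory of the same name.
print(repr(resolve_tokenizer_subfolder("bpe_tokenizer")))
```

With no subfolder argument, the attribute named tokenizer resolves to the root, while bpe_tokenizer resolves to a bpe_tokenizer directory that the repo in this report does not contain.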

Under transformers v4.x this case worked because the loader could fall back to root.

Root Cause

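In v5.x the non-primary branch always passes a subfolder derived from the attribute name and never retries at the repo root. Under transformers v4.x this case worked because the loader could fall back to root.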

Fix Action

Fixed

Repro

# transformers==5.2.0
from transformers import AutoProcessor
proc = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)

This fails with ValueError: Couldn't instantiate the backend tokenizer ... (message truncated). Calling AutoTokenizer.from_pretrained("physical-intelligence/fast") directly, with no subfolder argument, works, which confirms that the forced subfolder is the cause.

Suggested fix

When the subfolder lookup returns no files (or all .no_exist), fall back to loading from root with a deprecation warning. This restores v4.x behavior while preserving the v5 intent of supporting multiple sub-tokenizers in subfolders.
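A rough sketch of this fallback, not the merged fix: load_with_root_fallback and its load_fn parameter are hypothetical names, and load_fn stands in for a loader such as AutoTokenizer.from_pretrained. Catching OSError and ValueError is an assumption based on the error reported in this issue, and FutureWarning is used here as the deprecation signal.

```python
import warnings


def load_with_root_fallback(load_fn, path, sub_processor_type, subfolder="", **kwargs):
    """Try the attribute-named subfolder first, then fall back to the root.

    Hypothetical helper: `load_fn` is any callable with a
    `load_fn(path, subfolder=..., **kwargs)` signature.
    """
    nested = f"{subfolder}/{sub_processor_type}" if subfolder else sub_processor_type
    try:
        return load_fn(path, subfolder=nested, **kwargs)
    except (OSError, ValueError):
        # Assumption: a missing subfolder surfaces as one of these errors.
        warnings.warn(
            f"No tokenizer files found under '{nested}'; falling back to "
            f"'{subfolder or '<root>'}'. This fallback may be removed later.",
            FutureWarning,
        )
        return load_fn(path, subfolder=subfolder, **kwargs)


def fake_loader(path, subfolder="", **kwargs):
    """Stand-in loader that only has tokenizer files at the repo root."""
    if subfolder:
        raise ValueError("Couldn't instantiate the backend tokenizer")
    return f"tokenizer loaded from {path} (root)"


print(load_with_root_fallback(fake_loader, "some-org/some-repo", "bpe_tokenizer"))
```

The stand-in loader keeps the example runnable offline; with the real library, load_fn would be the tokenizer loader and the repo id would come from the caller.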

The cleanest fix is probably to try root first and only escalate to subfolder if the attribute-named subdirectory exists in the repo file listing.
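The listing-based variant could look roughly like the sketch below. choose_tokenizer_subfolder is a hypothetical helper, and the file lists are made-up examples; in practice the listing might come from something like huggingface_hub.list_repo_files, which is an assumption about how the fix would obtain it.

```python
import posixpath


def choose_tokenizer_subfolder(repo_files, sub_processor_type, subfolder=""):
    """Pick the attribute-named subfolder only if the repo actually has it.

    Hypothetical helper: `repo_files` is a list of repo-relative paths that
    use forward slashes, as in a Hub file listing.
    """
    candidate = posixpath.join(subfolder, sub_processor_type) if subfolder else sub_processor_type
    prefix = candidate + "/"
    if any(name.startswith(prefix) for name in repo_files):
        # The repo ships a dedicated directory for this sub-tokenizer.
        return candidate
    # Otherwise keep the caller's subfolder (the root when it is empty).
    return subfolder


# Made-up listings: one repo keeps tokenizer files at the root, one nests them.
root_layout = ["tokenizer.json", "tokenizer_config.json", "processor_config.json"]
nested_layout = ["processor_config.json", "bpe_tokenizer/tokenizer.json"]

print(repr(choose_tokenizer_subfolder(root_layout, "bpe_tokenizer")))
print(repr(choose_tokenizer_subfolder(nested_layout, "bpe_tokenizer")))
```

Deciding from the listing avoids a failed load attempt, at the cost of one extra listing call per processor load.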
