transformers - 💡(How to fix) Fix Path Traversal in Sharded Checkpoint Loader via Unsanitized `weight_map` Entries in `*.index.json`




System Info

Details

The vulnerable code is in `get_checkpoint_shard_files` in `hub.py`. When loading a sharded checkpoint from a local directory, the function reads an index JSON file and extracts shard filenames from the `weight_map` field without any validation:

    with open(index_filename) as f:
        index = json.loads(f.read())

    shard_filenames = sorted(set(index["weight_map"].values()))

These filenames are then joined directly to the model directory path:

    if os.path.isdir(pretrained_model_name_or_path):
        shard_filenames = [os.path.join(pretrained_model_name_or_path, subfolder, f) for f in shard_filenames]
        return shard_filenames, sharded_metadata

There is no check for:

  • Path traversal sequences (`..`)
  • Absolute path prefixes (`/`)
  • Symbolic links
  • Whether the resolved paths remain within the model directory

The returned file paths are passed back to the caller (`_get_resolved_checkpoint_files` in `modeling_utils.py`), which uses them to load model weights, effectively enabling reads of arbitrary files the process has access to.
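The effect of the unvalidated join can be seen with the standard library alone. The sketch below is illustrative and is not the library code: `join_shards` is a hypothetical stand-in for the list comprehension quoted above, and the output assumes a POSIX system.

```python
import os


def join_shards(model_dir, subfolder, shard_filenames):
    """Hypothetical stand-in for the unvalidated join quoted in the issue."""
    return [os.path.join(model_dir, subfolder, name) for name in shard_filenames]


model_dir = "/tmp/malicious_model"
paths = join_shards(
    model_dir,
    "",
    [
        "model-00001-of-00002.safetensors",  # ordinary flat shard name
        "../../etc/passwd",                  # relative traversal
        "/etc/hostname",                     # absolute path: join() discards model_dir
    ],
)

for path in paths:
    # normpath collapses ".." lexically, which makes the effective target visible.
    # It does not follow symlinks; os.path.realpath would be needed for that.
    print(f"{path!r} -> {os.path.normpath(path)!r}")
```

Two separate behaviours are at work here: `..` components walk out of the directory, and an absolute component makes `os.path.join` drop everything that precedes it.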

Why existing guards are insufficient

The caller `_get_resolved_checkpoint_files` only validates that the index file itself exists on disk (via `os.path.isfile` on the `.safetensors.index.json` path). It does not inspect or sanitize the contents of the index file before calling `get_checkpoint_shard_files`. An attacker-controlled directory need only contain a valid index JSON file to satisfy this check.

The `cached_files` function (called for non-local/Hub models) does include file existence checks, but the local directory branch in `get_checkpoint_shard_files` returns immediately after `os.path.join`, so `cached_files` is never reached for local paths.

https://github.com/huggingface/transformers/blob/ba06e3fbdf355c363ac067ebcda210017e90a852/src/transformers/utils/hub.py#L836

Who can help?

@Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

PoC

Step 1: Create a malicious model directory

    mkdir -p /tmp/malicious_model

Step 2: Create a crafted index file

Write the following to /tmp/malicious_model/model.safetensors.index.json:

    {
      "metadata": { "total_size": 1000 },
      "weight_map": {
        "model.layer.weight": "../../etc/passwd",
        "model.embed.weight": "../../etc/hostname"
      }
    }

Step 3: Trigger the vulnerability

    from transformers import AutoModel

    # The loading pipeline will:
    #   1. Find model.safetensors.index.json in the local directory
    #   2. Set is_sharded = True
    #   3. Call get_checkpoint_shard_files, which will return:
    #      ["/tmp/malicious_model/../../etc/passwd", "/tmp/malicious_model/../../etc/hostname"]
    #      These resolve to /etc/passwd and /etc/hostname
    model = AutoModel.from_pretrained("/tmp/malicious_model")
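The path computation in the PoC can also be checked without importing transformers. The following is a minimal sketch that assumes only the read, deduplicate, and join steps quoted in this issue. It builds the crafted directory in a temporary location, so the traversal targets differ from /etc/passwd depending on where the temporary directory lives, and it only computes paths without opening any of them.

```python
import json
import os
import tempfile

# Build a throwaway model directory that contains nothing but a crafted index.
model_dir = tempfile.mkdtemp(prefix="malicious_model_")
index_path = os.path.join(model_dir, "model.safetensors.index.json")
crafted_index = {
    "metadata": {"total_size": 1000},
    "weight_map": {
        "model.layer.weight": "../../etc/passwd",
        "model.embed.weight": "../../etc/hostname",
    },
}
with open(index_path, "w") as f:
    json.dump(crafted_index, f)

# The only guard described in the issue: the index file exists.
print("index exists:", os.path.isfile(index_path))

# Same steps as the quoted loader code: read, deduplicate, join.
with open(index_path) as f:
    index = json.load(f)
shard_filenames = sorted(set(index["weight_map"].values()))
joined = [os.path.join(model_dir, "", name) for name in shard_filenames]

# Flag every joined path whose resolved location is outside the model directory.
base = os.path.realpath(model_dir)
escaped = [p for p in joined if os.path.commonpath([base, os.path.realpath(p)]) != base]

for p in joined:
    print(p, "->", os.path.realpath(p))
print("paths outside the model directory:", len(escaped))
```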

Expected behavior

Observe the result

`get_checkpoint_shard_files` returns the traversed paths without error. The downstream model loading code will attempt to open and read these files as tensor data. While the files may fail to deserialize as valid safetensors, the file contents are accessed by the process, and depending on error handling, logging, or exception messages, data may be exposed.
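To illustrate how a failed parse could still expose data, here is a deliberately naive sketch. `naive_read_header` is hypothetical and is not the safetensors library or the transformers loading code; it assumes only the published safetensors layout (an 8-byte little-endian header length followed by a JSON header) and an error handler that echoes the bytes it could not parse.

```python
import json
import os
import struct
import tempfile


def naive_read_header(path):
    """Hypothetical loader (NOT the safetensors library) with a leaky error message."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) < 8:
            raise ValueError(f"file too short: {prefix!r}")
        # safetensors files start with the header length as a little-endian u64.
        (header_len,) = struct.unpack("<Q", prefix)
        # Cap the read so a bogus length cannot exhaust memory.
        header = f.read(min(header_len, 1024))
    try:
        return json.loads(header)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        # The leak: bytes from the target file are copied into the exception text.
        raise ValueError(f"invalid header in {path}: {header[:64]!r}") from exc


# Stand-in for a sensitive text file reached through a traversed shard path.
fd, secret_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"root:x:0:0:root:/root:/bin/bash\n")

try:
    naive_read_header(secret_path)
except ValueError as exc:
    print("error message:", exc)
```

Whether the real loading path surfaces file contents this way depends on its actual error handling, which this sketch does not model.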

A more targeted attack could point shard paths at:

  • Other users' cached model files in ~/.cache/huggingface/
  • API tokens stored in ~/.cache/huggingface/token
  • Application configuration or secrets files
  • Any file readable by the process

Vulnerability type: Arbitrary file read via path traversal (CWE-20 / CWE-22)

Who is affected:

  • Any user or automated system that loads models from untrusted local directories using `from_pretrained` or any code path that invokes `get_checkpoint_shard_files`.
  • ML pipelines and platforms that accept user-uploaded model directories (e.g., evaluation platforms, model hosting services, shared compute environments).
  • Developers who download and load models from sources outside the Hugging Face Hub without additional validation.

Attack prerequisites:

  • The attacker must be able to provide a local directory (or a directory downloaded/extracted from an untrusted source) that the victim passes to `from_pretrained`.
  • No authentication or special privileges are required beyond the ability to place files on the filesystem.

Recommended fix: Sanitize all filenames extracted from the `weight_map` before constructing paths:

  • Reject any filename containing `..` components or absolute path prefixes.
  • After joining paths, validate that the resolved path (via `os.path.realpath`) remains within the expected model directory.
  • Consider rejecting filenames with path separator characters entirely, since shard files should be flat names like model-00001-of-00003.safetensors.

Example fix:

    import os

    if os.path.isdir(pretrained_model_name_or_path):
        base_dir = os.path.realpath(os.path.join(pretrained_model_name_or_path, subfolder))
        safe_paths = []
        for f in shard_filenames:
            full_path = os.path.realpath(os.path.join(base_dir, f))
            if not full_path.startswith(base_dir + os.sep):
                raise ValueError(
                    f"Shard filename '{f}' in the checkpoint index resolves outside "
                    f"the model directory. This may indicate a malicious index file."
                )
            safe_paths.append(full_path)
        return safe_paths, sharded_metadata
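A stricter variant of the example fix, combining all three recommendations, might look like the sketch below. `resolve_shard_paths` is a hypothetical standalone helper, not an existing transformers function. It rejects any name that is not flat before joining, then compares whole path components with `os.path.commonpath` rather than a string prefix, which also covers a base directory that is the filesystem root.

```python
import os


def resolve_shard_paths(model_dir, subfolder, shard_filenames):
    """Hypothetical helper: return absolute shard paths or raise ValueError."""
    base_dir = os.path.realpath(os.path.join(model_dir, subfolder))
    safe_paths = []
    for name in shard_filenames:
        # Shard names are expected to be flat, so reject absolute paths,
        # anything containing a separator, and the special names "", ".", "..".
        if os.path.isabs(name) or os.path.basename(name) != name or name in ("", ".", ".."):
            raise ValueError(f"Shard filename {name!r} is not a plain file name.")
        # realpath follows symlinks, so a link inside the directory that points
        # elsewhere is caught by the containment check below.
        full_path = os.path.realpath(os.path.join(base_dir, name))
        if os.path.commonpath([base_dir, full_path]) != base_dir:
            raise ValueError(f"Shard filename {name!r} resolves outside the model directory.")
        safe_paths.append(full_path)
    return safe_paths


# Usage: a flat name passes, a traversal entry is rejected.
print(resolve_shard_paths("/tmp/model", "", ["model-00001-of-00003.safetensors"]))
try:
    resolve_shard_paths("/tmp/model", "", ["../../etc/passwd"])
except ValueError as exc:
    print("rejected:", exc)
```

Rejecting separators outright is the more conservative choice. If nested shard layouts ever need to be supported, the flat-name check could be dropped while keeping the `realpath` plus `commonpath` containment check.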


