llamaIndex - ✅(Solved) Fix Arbitrary file read via ImageDocument.metadata["file_path"] in image_documents_to_base64 [2 pull requests, 4 comments, 3 participants]

GitHub stats
run-llama/llama_index#21512 · Fetched 2026-05-01 05:33:18
Comments: 4 · Participants: 3 · Timeline: 17 · Reactions: 0
Timeline (top): commented ×4 · mentioned ×4 · subscribed ×4 · cross-referenced ×2

image_documents_to_base64 in llama-index-core opens any path supplied via ImageDocument.metadata["file_path"] and base64-encodes the bytes — with no allow-listed root, no symlink check, and no MIME validation. Applications that build ImageDocument instances from user-influenced data (a common pattern: tool arguments, JSON request bodies, deserialized documents) effectively expose an arbitrary-file-read primitive whose output is forwarded to the configured multimodal LLM.

Root Cause

In llama-index-core, the metadata["file_path"] branch of image_documents_to_base64 passes the path to encode_image after only an os.path.isfile check, and encode_image then opens the file and base64-encodes its bytes. There is no allow-listed root, no symlink check, and no MIME validation. The is_image_pil check in ImageDocument.__init__ covers only the image_path= constructor argument, and metadata is a free-form dict, so nothing validates the value. Applications that build ImageDocument instances from user-influenced data (tool arguments, JSON request bodies, deserialized documents) therefore expose an arbitrary-file-read primitive whose output is forwarded to the configured multimodal LLM.

Fix Action

Fixed

PR fix notes

PR #21514: fix(security): Prevent arbitrary file read in image_documents_to_base64 (#21512)

Description (problem / solution / changelog)

Summary

Fixes a critical arbitrary file read vulnerability in image_documents_to_base64() that allows reading any file the process has access to via ImageDocument.metadata["file_path"].

Vulnerability

image_documents_to_base64 in llama-index-core opens any path supplied via ImageDocument.metadata["file_path"] and base64-encodes the bytes — with no allow-listed root, no symlink check, and no MIME validation. Applications that build ImageDocument instances from user-influenced data effectively expose an arbitrary-file-read primitive.

Attack vector:

doc = ImageDocument(metadata={"file_path": "/etc/passwd"})
encoded = image_documents_to_base64([doc])  # Returns base64 of /etc/passwd!

Fix

  1. _validate_image_path() — New security validation function that:

    • Rejects symlinks (prevents symlink-based path traversal)
    • Validates file content is actually an image (MIME check via filetype)
    • Enforces 50MB file size limit (prevents DoS)
    • Rejects null bytes in paths (prevents null-byte injection)
    • Validates file is a regular file (not device/socket)
  2. Path validation on all input paths — Both image_path and metadata["file_path"] are validated before reading

  3. Deprecation warning: metadata["file_path"] usage now emits a DeprecationWarning

  4. URL image validation — URL-fetched images are validated for correct MIME type

  5. Comprehensive test suite — 27 test cases covering all security scenarios
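The checks listed for _validate_image_path() can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the names validate_image_path, sniff_image_mime, and MAX_IMAGE_BYTES and the magic-byte table are assumptions, and the PR notes say the actual MIME check relies on the filetype package.

```python
import os
import stat
from typing import Optional

# Hypothetical limit mirroring the 50MB cap described in the PR notes.
MAX_IMAGE_BYTES = 50 * 1024 * 1024

# Leading magic bytes for a few common image formats (illustrative, not exhaustive).
_IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_image_mime(header: bytes) -> Optional[str]:
    """Return a MIME type if the header matches a known image signature."""
    for signature, mime in _IMAGE_SIGNATURES.items():
        if header.startswith(signature):
            return mime
    return None


def validate_image_path(path: str) -> str:
    """Raise ValueError unless path is a small, regular, non-symlink image file."""
    if "\x00" in path:
        raise ValueError("null byte in path")
    # lstat() does not follow symlinks, so a link is reported as a link.
    # A missing file raises FileNotFoundError here.
    info = os.lstat(path)
    if stat.S_ISLNK(info.st_mode):
        raise ValueError("symlinks not allowed")
    if not stat.S_ISREG(info.st_mode):
        raise ValueError("not a regular file")
    if info.st_size > MAX_IMAGE_BYTES:
        raise ValueError("file exceeds size limit")
    with open(path, "rb") as handle:
        header = handle.read(16)
    if sniff_image_mime(header) is None:
        raise ValueError("file content is not a recognized image")
    return path
```

There is a window between the lstat() call and the open() call in which the file could be swapped, so a sketch like this narrows the exposure rather than closing it completely.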

Testing

All 27 tests pass:

  • 10 path validation tests (symlinks, MIME, size, null bytes, etc.)
  • 3 encode_image tests
  • 8 image_documents_to_base64 tests (including CRITICAL /etc/passwd rejection)
  • 3 URL fetching tests
  • 3 MIME type tests

Fixes run-llama/llama_index#21512

Changed files

  • .github/ISSUE_TEMPLATE/config.yml (removed, +0/-8)
  • .github/ISSUE_TEMPLATE/docs-form.yml (removed, +0/-24)
  • .github/ISSUE_TEMPLATE/feature-form.yml (removed, +0/-31)
  • .github/ISSUE_TEMPLATE/issue-form.yml (removed, +0/-37)
  • .github/ISSUE_TEMPLATE/question-form.yml (removed, +0/-25)
  • .github/dependabot.yml (removed, +0/-14)
  • .github/pull_request_template.md (removed, +0/-46)
  • .github/workflows/build_package.yml (removed, +0/-43)
  • .github/workflows/close_new_integration_prs.yml (removed, +0/-64)
  • .github/workflows/codeql.yml (removed, +0/-81)
  • .github/workflows/core-typecheck.yml (removed, +0/-25)
  • .github/workflows/coverage_check.yml (removed, +0/-45)
  • .github/workflows/issue_classifier.yml (removed, +0/-27)
  • .github/workflows/lint.yml (removed, +0/-23)
  • .github/workflows/llama_dev_tests.yml (removed, +0/-29)
  • .github/workflows/pre_release.yml (removed, +0/-62)
  • .github/workflows/publish_sub_package.yml (removed, +0/-58)
  • .github/workflows/release.yml (removed, +0/-137)
  • .github/workflows/stale_bot.yml (removed, +0/-19)
  • .github/workflows/sync-docs.yml (removed, +0/-83)
  • .github/workflows/unit_test.yml (removed, +0/-70)
  • .gitignore (removed, +0/-49)
  • .pre-commit-config.yaml (removed, +0/-120)
  • .readthedocs.yaml (removed, +0/-26)
  • CHANGELOG.md (removed, +0/-14319)
  • CITATION.cff (removed, +0/-10)
  • CODE_OF_CONDUCT.md (removed, +0/-128)
  • CONTRIBUTING.md (removed, +0/-157)
  • LICENSE (removed, +0/-21)
  • Makefile (removed, +0/-32)
  • README.md (removed, +0/-217)
  • RELEASE_HEAD.md (removed, +0/-1)
  • SECURITY.md (removed, +0/-80)
  • STALE.md (removed, +0/-65)
  • docs.config.mjs (removed, +0/-19)
  • docs/.gitignore (removed, +0/-26)
  • docs/DOCS_README.md (removed, +0/-85)
  • docs/api_reference/api_reference/_static/assets/LlamaLogoBrowserTab.png (removed, +0/-0)
  • docs/api_reference/api_reference/_static/assets/LlamaSquareBlack.svg (removed, +0/-18)
  • docs/api_reference/api_reference/agent/index.md (removed, +0/-5)
  • docs/api_reference/api_reference/callbacks/agentops.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/aim.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/argilla.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/arize_phoenix.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/honeyhive.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/index.md (removed, +0/-13)
  • docs/api_reference/api_reference/callbacks/langfuse.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/literalai.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/llama_debug.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/openinference.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/opik.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/promptlayer.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/token_counter.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/uptrain.md (removed, +0/-3)
  • docs/api_reference/api_reference/callbacks/wandb.md (removed, +0/-3)
  • docs/api_reference/api_reference/chat_engines/condense_plus_context.md (removed, +0/-3)
  • docs/api_reference/api_reference/chat_engines/condense_question.md (removed, +0/-3)
  • docs/api_reference/api_reference/chat_engines/context.md (removed, +0/-3)
  • docs/api_reference/api_reference/chat_engines/index.md (removed, +0/-1)
  • docs/api_reference/api_reference/chat_engines/simple.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/adapter.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/alephalpha.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/alibabacloud_aisearch.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/anyscale.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/autoembeddings.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/azure_inference.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/azure_openai.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/baseten.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/bedrock.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/clarifai.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/clip.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/cloudflare_workersai.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/cohere.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/dashscope.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/databricks.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/deepinfra.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/elasticsearch.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/fastembed.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/fireworks.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/gaudi.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/gigachat.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/google_genai.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/heroku.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/huggingface.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/huggingface_api.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/huggingface_openvino.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/huggingface_optimum.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/huggingface_optimum_intel.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/ibm.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/index.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/instructor.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/ipex_llm.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/isaacus.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/jinaai.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/langchain.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/litellm.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/llamafile.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/llm_rails.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/mistralai.md (removed, +0/-3)
  • docs/api_reference/api_reference/embeddings/mixedbreadai.md (removed, +0/-3)

PR #21516: fix(security): add is_image_pil validation for metadata file_path

Description (problem / solution / changelog)

Description

Fixes an arbitrary file read vulnerability in image_documents_to_base64() where metadata["file_path"] was read without image validation.

Without this fix, an attacker could pass any file path (e.g. /etc/passwd, ~/.aws/credentials) via ImageDocument(metadata={"file_path": "/etc/passwd"}) and get it base64-encoded and forwarded to the LLM.

Fix: Added is_image_pil() validation on metadata["file_path"] — mirrors existing validation on image_path parameter.
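The idea behind this PR, applying the same image check to metadata["file_path"] that already guards image_path, can be illustrated without the library. In the sketch below, FakeImageDoc, pick_local_image, and the is_image predicate are hypothetical stand-ins; the actual change is described as calling is_image_pil() inside generic_utils.py, and that code is not reproduced here.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class FakeImageDoc:
    """Hypothetical stand-in for ImageDocument with only the fields used here."""

    image_path: Optional[str] = None
    metadata: Dict[str, str] = field(default_factory=dict)


def pick_local_image(
    doc: FakeImageDoc, is_image: Callable[[str], bool]
) -> Optional[str]:
    """Return a local path to encode, or None if no candidate passes validation."""
    candidates = [doc.image_path, doc.metadata.get("file_path")]
    for candidate in candidates:
        # Both sources go through the same existence and image checks.
        if candidate and os.path.isfile(candidate) and is_image(candidate):
            return candidate
    return None
```

Passing the predicate in as an argument keeps the sketch independent of any particular image library; a caller could supply a Pillow-based check or a magic-byte check.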

Fixes #21512

New Package?

  • Yes
  • No

Version Bump?

  • Yes
  • No

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • I added new unit tests to cover this change

Added tests:

  • test_metadata_file_path_non_image_rejected
  • test_metadata_file_path_valid_image_encoded
  • Fixed existing test to mock is_image_pil
  • All 13 tests passing locally

Suggested Checklist:

  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective
  • New and existing unit tests pass locally with my changes

Changed files

  • llama-index-core/llama_index/core/multi_modal_llms/generic_utils.py (modified, +2/-1)
  • llama-index-core/tests/multi_modal_llms/test_generic_utils.py (modified, +34/-2)
  • llama-index-integrations/embeddings/llama-index-embeddings-adapter/llama_index/embeddings/adapter/utils.py (modified, +1/-0)
  • llama-index-integrations/embeddings/llama-index-embeddings-adapter/tests/test_embeddings_adapter.py (modified, +27/-0)

Code Example

for image_document in image_documents:
    if image_document.image:
        image_encodings.append(image_document.image)
    elif image_document.image_path and os.path.isfile(image_document.image_path):
        image_encodings.append(encode_image(image_document.image_path))
    elif (
        "file_path" in image_document.metadata
        and image_document.metadata["file_path"] != ""
        and os.path.isfile(image_document.metadata["file_path"])
    ):
        image_encodings.append(encode_image(image_document.metadata["file_path"]))
    elif image_document.image_url:
        response = requests.get(image_document.image_url, timeout=(60, 60))
        ...

---

from pathlib import Path

ALLOWED_ROOT = Path("/var/app/uploads").resolve()

def _safe_image_path(p: str) -> Path:
    candidate = Path(p)
    # Check before resolving: resolve() follows symlinks, so a check afterwards never fires.
    if candidate.is_symlink():
        raise ValueError("symlinks not allowed")
    resolved = candidate.resolve(strict=True)
    # The resolved path must be ALLOWED_ROOT itself or live beneath it.
    if ALLOWED_ROOT not in resolved.parents and resolved != ALLOWED_ROOT:
        raise ValueError("image path escapes allowed directory")
    return resolved

---

from llama_index.core.schema import ImageDocument
from llama_index.core.multi_modal_llms.generic_utils import image_documents_to_base64

# No image_path, no image_url — only metadata['file_path'].
# This branch is not gated by is_image_pil() validation.
doc = ImageDocument(metadata={"file_path": "/etc/passwd"})

encoded = image_documents_to_base64([doc])
print(encoded)  # base64 of /etc/passwd

---
RAW_BUFFER

Bug Description

Summary

image_documents_to_base64 in llama-index-core opens any path supplied via ImageDocument.metadata["file_path"] and base64-encodes the bytes — with no allow-listed root, no symlink check, and no MIME validation. Applications that build ImageDocument instances from user-influenced data (a common pattern: tool arguments, JSON request bodies, deserialized documents) effectively expose an arbitrary-file-read primitive whose output is forwarded to the configured multimodal LLM.

Affected code

llama-index-core/llama_index/core/multi_modal_llms/generic_utils.py, lines 65-86:

for image_document in image_documents:
    if image_document.image:
        image_encodings.append(image_document.image)
    elif image_document.image_path and os.path.isfile(image_document.image_path):
        image_encodings.append(encode_image(image_document.image_path))
    elif (
        "file_path" in image_document.metadata
        and image_document.metadata["file_path"] != ""
        and os.path.isfile(image_document.metadata["file_path"])
    ):
        image_encodings.append(encode_image(image_document.metadata["file_path"]))
    elif image_document.image_url:
        response = requests.get(image_document.image_url, timeout=(60, 60))
        ...

encode_image then performs an unguarded open(image_path, "rb") and base64-encodes the bytes.
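To see why an existence check alone is not enough, the following sketch imitates the pre-fix pattern with a stand-in function. encode_file_unguarded is a hypothetical name, not the library's encode_image; it only reproduces the "isfile, then open and base64-encode" sequence described above, with a temporary file playing the role of a real secret.

```python
import base64
import os
import tempfile


def encode_file_unguarded(path: str) -> str:
    """Stand-in for the vulnerable pattern: any readable file is encoded."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("utf-8")


# A small file stands in for a sensitive file such as a credentials file.
with tempfile.NamedTemporaryFile("wb", suffix=".env", delete=False) as tmp:
    tmp.write(b"API_KEY=not-a-real-key\n")
    secret_path = tmp.name

encoded = encode_file_unguarded(secret_path)
# Base64 is reversible, so whoever receives the payload can recover the file.
recovered = base64.b64decode(encoded)
os.remove(secret_path)
```

Nothing in this flow inspects the file's content, which is why the issue asks for validation before the bytes are read.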

metadata is a free-form dict on Document, so no validator gates it. The is_image_pil check on ImageDocument.__init__ only fires for the image_path= constructor argument, not for metadata. The library's own tests use this construction pattern (tests/multi_modal_llms/test_generic_utils.py:67 instantiates ImageDocument(metadata={"file_path": "test.jpg"})).

Impact

In RAG / multimodal-agent applications that ingest user-influenced input into ImageDocument (or that allow tool calls to construct one), an attacker can read arbitrary files the process has access to — /etc/passwd, ~/.aws/credentials, /var/run/secrets/kubernetes.io/serviceaccount/token, application .env files, etc. The bytes are returned in image_encodings and assigned to ImageDocument.image, then forwarded to the LLM provider — and typically echoed back to the user when the model is asked to "describe this image."

Suggested fix

Restrict accepted paths to an explicit allow-listed root, reject symlinks, and validate MIME/size before encoding:

from pathlib import Path

ALLOWED_ROOT = Path("/var/app/uploads").resolve()

def _safe_image_path(p: str) -> Path:
    candidate = Path(p)
    # Check before resolving: resolve() follows symlinks, so a check afterwards never fires.
    if candidate.is_symlink():
        raise ValueError("symlinks not allowed")
    resolved = candidate.resolve(strict=True)
    # The resolved path must be ALLOWED_ROOT itself or live beneath it.
    if ALLOWED_ROOT not in resolved.parents and resolved != ALLOWED_ROOT:
        raise ValueError("image path escapes allowed directory")
    return resolved

A more defensive option: stop honoring metadata["file_path"] as a path source entirely, and require callers to set image_path (which goes through the existing is_image_pil check) or pre-encoded image.

Scope note

I'm aware the project's security policy classifies path-traversal-from-untrusted-paths as out of scope for the Huntr bounty (treated as application-layer responsibility). Filing this as a public bug rather than as a security report — many users may not realize that metadata["file_path"] bypasses the constructor's image-validation check.

Version

0.14.21

Steps to Reproduce

from llama_index.core.schema import ImageDocument
from llama_index.core.multi_modal_llms.generic_utils import image_documents_to_base64

# No image_path, no image_url — only metadata['file_path'].
# This branch is not gated by is_image_pil() validation.
doc = ImageDocument(metadata={"file_path": "/etc/passwd"})

encoded = image_documents_to_base64([doc])
print(encoded)  # base64 of /etc/passwd

The same primitive works against ~/.aws/credentials, /var/run/secrets/kubernetes.io/serviceaccount/token, application .env files — anything the process can read.

In a real deployment this is reachable any time user input flows into an ImageDocument's metadata (tool-call arguments, JSON request bodies, deserialized documents, third-party feeds). The encoded bytes end up in ImageDocument.image via set_base64_and_mimetype_for_image_docs and are forwarded to the multimodal LLM provider, where they are typically echoed back when the model is asked to describe the "image."

Relevant Logs/Tracebacks

Scanned automatically with https://github.com/etairl/Probus

extent analysis

TL;DR

Restrict accepted paths to an explicit allow-listed root and validate MIME/size before encoding to prevent arbitrary file read vulnerabilities.

Guidance

  • Identify and restrict access to sensitive files and directories that could be exposed through the image_documents_to_base64 function.
  • Implement a validation mechanism to ensure that only allowed file paths are processed, using techniques such as allow-listing or strict input validation.
  • Consider disabling the use of metadata["file_path"] as a path source and require callers to set image_path or pre-encoded image instead.
  • Review and update the project's security policy to include path-traversal-from-untrusted-paths as a potential security risk.
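For the third guidance item, an application that cannot change the library could drop path-like keys from untrusted metadata before it ever builds an ImageDocument. The sketch below is a caller-side assumption, not a library feature: strip_path_keys and BLOCKED_METADATA_KEYS are hypothetical names, and the set of keys to block depends on the application.

```python
from typing import Any, Dict, Mapping

# Hypothetical block-list; "file_path" is the key discussed in this issue.
BLOCKED_METADATA_KEYS = frozenset({"file_path"})


def strip_path_keys(untrusted: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the metadata without keys that name local files."""
    return {
        key: value
        for key, value in untrusted.items()
        if key not in BLOCKED_METADATA_KEYS
    }


# Example: metadata taken from a JSON request body.
request_metadata = {"file_path": "/etc/passwd", "caption": "holiday photo"}
safe_metadata = strip_path_keys(request_metadata)
```

The sanitized dict can then be passed as metadata alongside a validated image_path or a pre-encoded image, so the file-reading branch is never reached with attacker-chosen input.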

Example

from pathlib import Path

ALLOWED_ROOT = Path("/var/app/uploads").resolve()

def _safe_image_path(p: str) -> Path:
    candidate = Path(p)
    # Check before resolving: resolve() follows symlinks, so a check afterwards never fires.
    if candidate.is_symlink():
        raise ValueError("symlinks not allowed")
    resolved = candidate.resolve(strict=True)
    # The resolved path must be ALLOWED_ROOT itself or live beneath it.
    if ALLOWED_ROOT not in resolved.parents and resolved != ALLOWED_ROOT:
        raise ValueError("image path escapes allowed directory")
    return resolved
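The example hard-codes /var/app/uploads. A variant that takes the root as a parameter is easier to exercise in tests. The sketch below assumes Python 3.9+ for Path.is_relative_to, and the name resolve_under_root is hypothetical. It tests for a symlink before resolving, because resolve() follows links and the resolved path no longer shows that a link was involved.

```python
import tempfile
from pathlib import Path


def resolve_under_root(path: str, root: Path) -> Path:
    """Resolve path and require that it stays inside root."""
    candidate = Path(path)
    # is_symlink() inspects the final component without following it.
    if candidate.is_symlink():
        raise ValueError("symlinks not allowed")
    resolved = candidate.resolve(strict=True)  # raises if the file is missing
    if not resolved.is_relative_to(root.resolve()):
        raise ValueError("image path escapes allowed directory")
    return resolved


# Usage with a throwaway directory standing in for the upload root.
base = Path(tempfile.mkdtemp()).resolve()
uploads = base / "uploads"
uploads.mkdir()
(uploads / "cat.png").write_bytes(b"\x89PNG\r\n\x1a\n")
(base / "secret.txt").write_text("top secret")

inside = resolve_under_root(str(uploads / "cat.png"), uploads)
try:
    # ".." walks out of the upload root, so this should be rejected.
    resolve_under_root(str(uploads / ".." / "secret.txt"), uploads)
    escaped = True
except ValueError:
    escaped = False
```

Symlinks in intermediate directories are not rejected outright here; they are followed by resolve(), and the containment check then applies to wherever they lead.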

Notes

The provided fix is a suggested solution and may require additional modifications to ensure compatibility with the existing codebase. It is essential to thoroughly test and validate any changes to prevent unintended consequences.

Recommendation

Apply the suggested fix to restrict accepted paths and validate MIME/size before encoding, as it provides a more secure and defensive approach to preventing arbitrary file read vulnerabilities.
