llamaIndex - ✅(Solved) Fix [Refactor]: Deduplicate SimpleDirectoryReader.load_file and aload_file [1 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

SimpleDirectoryReader.load_file and SimpleDirectoryReader.aload_file in llama-index-core/llama_index/core/readers/file/base.py repeat the same branching flow (file readers vs raw read, metadata, filename_as_id, error handling). Both call sites include # TODO: make this less redundant (around lines 590 and 660). Consolidate the shared logic so fixes stay in one place and sync/async paths do not drift.

Error Message

SimpleDirectoryReader.load_file and SimpleDirectoryReader.aload_file in llama-index-core/llama_index/core/readers/file/base.py repeat the same branching flow (file readers vs raw read, metadata, filename_as_id, error handling). Both call sites include # TODO: make this less redundant (around lines 590 and 660). Consolidate the shared logic so fixes stay in one place and sync/async paths do not drift. Today, sync vs async error handling differs: load_file wraps failures as Exception("Error loading file") from the original exception; aload_file re-raises the original. The refactor should preserve each public entry point’s behavior unless the PR explicitly documents a deliberate alignment and adds tests.

Root Cause

SimpleDirectoryReader.load_file and SimpleDirectoryReader.aload_file in llama-index-core/llama_index/core/readers/file/base.py repeat the same branching flow (file readers vs raw read, metadata, filename_as_id, error handling). Both call sites include # TODO: make this less redundant (around lines 590 and 660). Consolidate the shared logic so fixes stay in one place and sync/async paths do not drift.

PR fix notes

PR #21433: fix(core): deduplicate SimpleDirectoryReader file loading logic #21428

Description (problem / solution / changelog)

Summary

Fixes #21428

  • Refactored SimpleDirectoryReader.load_file and SimpleDirectoryReader.aload_file in llama-index-core/llama_index/core/readers/file/base.py to share one implementation of branching, metadata handling, and filename_as_id logic.
  • Removed duplicated sync/async flow previously flagged by TODO comments.
  • Added async error-path tests in llama-index-core/tests/readers/file/test_base.py to keep behavior stable and prevent drift.

Motivation

  • Sync and async file-loading paths were nearly identical but implemented separately, which increases maintenance cost and regression risk.
  • Consolidating shared logic makes future fixes easier to apply consistently across both paths.

Scope

  • Package: llama-index-core only.
  • Out of scope: integration packages and unrelated refactors.

Behavior Compatibility

  • Public method signatures and defaults are unchanged.
  • Existing behavior is preserved:
    • load_file(..., raise_on_error=True) continues to raise Exception("Error loading file") from the original exception.
    • aload_file(..., raise_on_error=True) continues to re-raise the original exception.
    • ImportError propagation remains unchanged in both paths.

Test Plan

  • From repo root: uv sync
  • From repo root: uv run make lint
  • From llama-index-core: uv run -- pytest
  • Targeted check: uv run -- pytest tests/readers/file/test_base.py

Notes

  • This is a behavior-preserving refactor focused on deduplication and long-term maintainability.

Changed files

  • llama-index-core/llama_index/core/readers/file/base.py (modified, +83/-74)
  • llama-index-core/tests/readers/file/test_base.py (modified, +34/-0)
RAW_BUFFERClick to expand / collapse

Documentation Issue Description

Summary

SimpleDirectoryReader.load_file and SimpleDirectoryReader.aload_file in llama-index-core/llama_index/core/readers/file/base.py repeat the same branching flow (file readers vs raw read, metadata, filename_as_id, error handling). Both call sites include # TODO: make this less redundant (around lines 590 and 660). Consolidate the shared logic so fixes stay in one place and sync/async paths do not drift.

Motivation

Duplicated code tends to get updated in only one path. Today, sync vs async error handling differs: load_file wraps failures as Exception("Error loading file") from the original exception; aload_file re-raises the original. The refactor should preserve each public entry point’s behavior unless the PR explicitly documents a deliberate alignment and adds tests.

Expected outcome

Less bug surface in file loading. Simpler reviews and clearer tests for directory/file reading.

Scope (per CONTRIBUTING.md)

Package: llama-index-core only. Out of scope: new integration packages or unrelated refactors.

Suggested approach

Extract shared logic into private helper(s), parameterized by how content is loaded (load_data vs aload_data). Keep public method signatures and default behavior stable.

How to validate (from CONTRIBUTING quick start)

Repo root: uv sync (and uv run pre-commit install if you use hooks). Package dir: cd llama-index-core Tests: uv run -- pytest If you change style-sensitive code: from repo root, uv run make lint (and format as you normally do for this repo, e.g. uv run make format if that is your workflow).

Acceptance criteria

load_file and aload_file share one implementation of branching / metadata / ID logic. Behavior matches pre-refactor unless the PR states otherwise and includes tests for any change. uv run -- pytest passes in llama-index-core. uv run make lint passes from repo root when applicable.

References

llama-index-core/llama_index/core/readers/file/base.py — load_file, aload_file Optional one-liner for the issue list: “Deduplicate SimpleDirectoryReader sync/async file loading; keep behavior; tests + lint per CONTRIBUTING.” Is this issue right ?or not?

extent analysis

TL;DR

Extract shared logic from SimpleDirectoryReader.load_file and SimpleDirectoryReader.aload_file into private helper(s) to reduce code duplication and simplify maintenance.

Guidance

  • Identify the common logic between load_file and aload_file in llama-index-core/llama_index/core/readers/file/base.py and extract it into a private helper function.
  • Parameterize the helper function to accommodate the differences in how content is loaded (load_data vs aload_data).
  • Ensure the public method signatures and default behavior of load_file and aload_file remain stable after the refactor.
  • Validate the changes by running uv run -- pytest and uv run make lint (if applicable) to ensure tests pass and code style is maintained.

Example

# Example of extracted private helper function
def _load_file_helper(load_data_func, filename, metadata, filename_as_id):
    # Shared logic for loading file and handling metadata
    pass

# Updated load_file and aload_file methods
def load_file(self, filename, metadata, filename_as_id):
    return self._load_file_helper(self.load_data, filename, metadata, filename_as_id)

def aload_file(self, filename, metadata, filename_as_id):
    return self._load_file_helper(self aload_data, filename, metadata, filename_as_id)

Notes

The suggested approach focuses on reducing code duplication and simplifying maintenance. However, it is essential to ensure that the behavior of load_file and aload_file remains consistent with their pre-refactor behavior, unless explicitly documented and tested changes are made.

Recommendation

Apply the suggested approach of extracting shared logic into private helper(s) to reduce code duplication and simplify maintenance, while preserving the public method signatures and default behavior of load_file and aload_file.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING