LlamaIndex: New Integration: llama-index-readers-oxidize-pdf, a Rust-powered PDF reader with RAG-ready chunking [1 comment, 2 participants]

run-llama/llama_index#21437 (fetched 2026-04-22 07:43:36)


Overview

I've built and published llama-index-readers-oxidize-pdf, a PDF reader for LlamaIndex backed by oxidize-pdf — a pure-Rust PDF engine with first-class RAG primitives (semantic chunking, element partitioning, heading-aware context).

Why a New Reader?

Existing LlamaIndex PDF readers have tradeoffs that leave a gap for pure-text RAG pipelines:

  • PDFReader (from llama-index-readers-file) wraps pypdf — no semantic chunking, output is raw per-page text.
  • llama-index-readers-pdf-marker (marker) is GPU-accelerated and excellent for layout-rich documents, but heavy (~GB of ML models) for lightweight pipelines.
  • llama-index-readers-smart-pdf-loader requires a LlamaParse API key.

llama-index-readers-oxidize-pdf is a middle ground: it runs on CPU only, needs no ML models, is fast, and produces RAG-ready chunks (with heading context and element types) directly from the Rust core.

What It Provides

A single OxidizePdfReader(BaseReader) class with three modes:

| Mode | Output | Use case |
| --- | --- | --- |
| rag (default) | One Document per semantic chunk, with heading_context, element_types, page_numbers, token_estimate | Vector-store ingestion for RAG |
| pages | One Document per page (1-indexed) | Parity with PyPDFReader / LangChain-style pipelines |
| markdown | Single Document with the full PDF rendered to markdown | Pipelines that prefer markdown input |
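
As a rough illustration of how one of the three modes might be selected, the sketch below wraps the reader in a small helper. It assumes the constructor accepts a `mode` keyword (the usage snippet in this issue only shows that `mode="rag"` is the default). `load_pdf` and `VALID_MODES` are hypothetical names introduced here, not part of the package.

```python
VALID_MODES = ("rag", "pages", "markdown")  # mode names taken from the table above


def load_pdf(path, mode="rag"):
    """Load a PDF with OxidizePdfReader in the requested mode.

    Assumption: the constructor accepts a `mode` keyword argument.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    # Imported lazily so the helper can be defined without the package installed.
    from llama_index.readers.oxidize_pdf import OxidizePdfReader

    reader = OxidizePdfReader(mode=mode)
    return reader.load_data(path)
```

Under these assumptions, `load_pdf("paper.pdf", mode="pages")` would return one Document per page, and `mode="markdown"` a single Document. Validating the mode up front keeps a typo from surfacing later as a confusing ingestion result.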

Technical Details

  • Extends BaseReader from llama_index.core.readers.base
  • Returns standard llama_index.core.schema.Document objects
  • Namespace package llama_index.readers.oxidize_pdf (PEP 420), consistent with existing LlamaIndex readers
  • [tool.llamahub] metadata configured (import_path, class_authors)
  • Depends on oxidize-pdf>=0.4.2 (pure-Python wheel, no system deps)
  • 22 behavior tests (no smoke tests); CI matrix on Python 3.10–3.13

Installation & Usage

pip install llama-index-readers-oxidize-pdf

from llama_index.readers.oxidize_pdf import OxidizePdfReader

reader = OxidizePdfReader()  # mode="rag" by default
documents = reader.load_data("paper.pdf")

for doc in documents:
    print(doc.metadata["chunk_index"], doc.metadata["heading_context"])
    print(doc.text[:200])
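
The chunk metadata can be used before embedding. The following is a minimal sketch, not the package's own API: it prepends the heading context to each chunk and drops chunks above a token budget. The metadata types are assumptions (the issue names the keys `heading_context` and `token_estimate` but does not specify their types), and both helper functions are hypothetical. Stand-in objects replace real Documents so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def heading_prefixed_text(doc):
    """Prepend the chunk's heading context to its text.

    Assumption: `heading_context` is a string or a list of heading strings.
    """
    heading = doc.metadata.get("heading_context")
    if isinstance(heading, (list, tuple)):
        heading = " > ".join(str(h) for h in heading)
    if not heading:
        return doc.text
    return f"{heading}\n\n{doc.text}"


def within_token_budget(documents, max_tokens):
    """Keep chunks whose `token_estimate` does not exceed max_tokens.

    Assumption: `token_estimate` is an integer; chunks without it are kept.
    """
    return [
        d for d in documents
        if d.metadata.get("token_estimate", 0) <= max_tokens
    ]


# Stand-in objects shaped like Documents (only .text and .metadata are used).
docs = [
    SimpleNamespace(
        text="Rust is fast.",
        metadata={"heading_context": "Intro", "token_estimate": 4},
    ),
    SimpleNamespace(
        text="A long chunk...",
        metadata={"heading_context": ["Methods", "Setup"], "token_estimate": 900},
    ),
]
small = within_token_budget(docs, max_tokens=512)
```

With real output from `reader.load_data(...)`, the same helpers would apply as long as each Document exposes `.text` and `.metadata` as in the usage snippet above.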

Request

Please list this integration on llamahub.ai so users discovering PDF readers can find it alongside the existing ones. Happy to adjust the metadata or PR to the docs index if needed.

extent analysis

TL;DR

The issue asks for llama-index-readers-oxidize-pdf to be listed on LlamaHub. The author offers to adjust the metadata or submit a PR to the docs index if maintainers require it.

Guidance

  1. Review the metadata: Ensure the [tool.llamahub] metadata is correctly configured with import_path and class_authors.
  2. Prepare a PR: Create a pull request to the LlamaHub documentation index with the updated metadata and information about the llama-index-readers-oxidize-pdf integration.
  3. Test the integration: Verify that the OxidizePdfReader class works as expected in different modes (rag, pages, markdown) before submitting the PR.
  4. Check compatibility: Confirm that the integration is compatible with the required Python versions (3.10-3.13) and oxidize-pdf version (>=0.4.2).

Notes

The issue lacks information about specific errors or problems with the integration, so the guidance focuses on the request to list the integration on LlamaHub.

Recommendation

No workaround is needed, since the issue reports no defect. The primary request is the LlamaHub listing; if maintainers ask for changes, adjust the metadata or submit a PR to the docs index.
