LlamaIndex - New Integration: llama-index-readers-oxidize-pdf, a Rust-powered PDF reader with RAG-ready chunking [1 comment, 2 participants]

LlamaIndex • 2026-04-21 19:33:07

run-llama/llama_index#21437•Fetched 2026-04-22 07:43:36

Author

bzsanti

Participants

bzsanti

Gopesh111

I've built and published llama-index-readers-oxidize-pdf, a PDF reader for LlamaIndex backed by oxidize-pdf — a pure-Rust PDF engine with first-class RAG primitives (semantic chunking, element partitioning, heading-aware context).

PyPI: https://pypi.org/project/llama-index-readers-oxidize-pdf/
Source: https://github.com/bzsanti/oxidize-pdf-integrations/tree/main/llamaindex
Core library: https://github.com/bzsanti/oxidize-python (Rust + Python bridge)

Why a New Reader?

Existing LlamaIndex PDF readers have tradeoffs that leave a gap for pure-text RAG pipelines:

PDFReader (from llama-index-readers-file) wraps pypdf — no semantic chunking, output is raw per-page text.
llama-index-readers-pdf-marker (marker) is GPU-accelerated and excellent for layout-rich documents, but too heavy for lightweight pipelines (it pulls in gigabytes of ML models).
llama-index-readers-smart-pdf-loader requires a LlamaParse API key.

llama-index-readers-oxidize-pdf is a middle ground: CPU-only, no ML models, fast, and it produces RAG-ready chunks (with heading context and element types) directly from the Rust core.

What It Provides

A single OxidizePdfReader(BaseReader) class with three modes:

| Mode | Output | Use case |
| --- | --- | --- |
| `rag` (default) | One `Document` per semantic chunk, with `heading_context`, `element_types`, `page_numbers`, and `token_estimate` | Vector-store ingestion for RAG |
| `pages` | One `Document` per page (1-indexed) | Parity with `PDFReader` / LangChain-style pipelines |
| `markdown` | A single `Document` with the full PDF rendered to Markdown | Pipelines that prefer Markdown input |
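
A minimal sketch of choosing a mode at construction time. The `mode` keyword comes from the usage comment in this issue (`mode="rag"` by default). The `make_reader` helper, the `VALID_MODES` tuple, and the injectable `factory` argument are hypothetical names added for illustration, and the up-front validation is an assumption rather than documented behavior of the reader.

```python
from typing import Any, Callable, Optional

# Mode names as listed in the table above.
VALID_MODES = ("rag", "pages", "markdown")


def make_reader(mode: str = "rag", factory: Optional[Callable[..., Any]] = None) -> Any:
    """Build a reader for one of the three modes (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    if factory is None:
        # Imported lazily so the helper can be exercised without the package installed.
        from llama_index.readers.oxidize_pdf import OxidizePdfReader

        factory = OxidizePdfReader
    # Assumption: the reader accepts the mode as a keyword argument.
    return factory(mode=mode)
```

With the package installed, `make_reader("pages").load_data("paper.pdf")` would be expected to return one `Document` per page, going by the table.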

Technical Details

Extends BaseReader from llama_index.core.readers.base
Returns standard llama_index.core.schema.Document objects
Namespace package llama_index.readers.oxidize_pdf (PEP 420), consistent with existing LlamaIndex readers
[tool.llamahub] metadata configured (import_path, class_authors)
Depends on oxidize-pdf>=0.4.2 (pure-Python wheel, no system deps)
22 behavior tests (no smoke tests); CI matrix on Python 3.10–3.13

Installation & Usage

pip install llama-index-readers-oxidize-pdf

from llama_index.readers.oxidize_pdf import OxidizePdfReader

reader = OxidizePdfReader()  # mode="rag" by default
documents = reader.load_data("paper.pdf")

for doc in documents:
    print(doc.metadata["chunk_index"], doc.metadata["heading_context"])
    print(doc.text[:200])
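
The chunk metadata can be used before embedding. The sketch below drops oversized chunks and prefixes each remaining chunk with its heading. It relies only on the `heading_context` and `token_estimate` keys named in this issue; their value types (a string and an integer) are assumptions, `Chunk` is a stand-in for `Document` so the example runs without the package, and the sample data is made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    """Stand-in for llama_index.core.schema.Document (only .text and .metadata)."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def prepare_for_embedding(docs: List[Chunk], max_tokens: int) -> List[str]:
    """Drop oversized chunks and prefix each remaining chunk with its heading."""
    prepared = []
    for doc in docs:
        # Assumption: token_estimate is an int; a missing value counts as 0.
        if doc.metadata.get("token_estimate", 0) > max_tokens:
            continue
        # Assumption: heading_context is a plain string, possibly absent or empty.
        heading = doc.metadata.get("heading_context") or ""
        prepared.append(f"{heading}\n\n{doc.text}" if heading else doc.text)
    return prepared


# Made-up sample chunks, not real reader output.
chunks = [
    Chunk("Transformers use attention.", {"heading_context": "1 Introduction", "token_estimate": 5}),
    Chunk("A very long appendix...", {"heading_context": "Appendix", "token_estimate": 900}),
    Chunk("Untitled preamble.", {"token_estimate": 3}),
]
texts = prepare_for_embedding(chunks, max_tokens=512)
print(texts)
```

Because the function only touches `.text` and `.metadata`, the list returned by `reader.load_data(...)` could be passed in place of `chunks`, provided the metadata values have the assumed types.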

Request

Please list this integration on llamahub.ai so users discovering PDF readers can find it alongside the existing ones. Happy to adjust the metadata or open a PR against the docs index if needed.

Extent Analysis

TL;DR

To get the llama-index-readers-oxidize-pdf integration listed on LlamaHub, adjust the metadata and submit a PR to the docs index.

Guidance

Review the metadata: Ensure the [tool.llamahub] metadata is correctly configured with import_path and class_authors.
Prepare a PR: Create a pull request to the LlamaHub documentation index with the updated metadata and information about the llama-index-readers-oxidize-pdf integration.
Test the integration: Verify that the OxidizePdfReader class works as expected in different modes (rag, pages, markdown) before submitting the PR.
Check compatibility: Confirm that the integration is compatible with the required Python versions (3.10-3.13) and oxidize-pdf version (>=0.4.2).
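
The metadata review above can be sketched as a small script that parses `pyproject.toml` and reports missing `[tool.llamahub]` entries. The TOML fragment is illustrative only: its values are assumptions pieced together from this issue, not the package's real file, and `check_llamahub_metadata` is a hypothetical helper.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport, needed on Python 3.10

# Illustrative fragment; values are assumptions, not the package's real pyproject.toml.
SAMPLE_PYPROJECT = """
[project]
name = "llama-index-readers-oxidize-pdf"
requires-python = ">=3.10"
dependencies = ["oxidize-pdf>=0.4.2"]

[tool.llamahub]
import_path = "llama_index.readers.oxidize_pdf"

[tool.llamahub.class_authors]
OxidizePdfReader = "bzsanti"
"""


def check_llamahub_metadata(pyproject_text: str) -> list:
    """Return a list of problems found in the [tool.llamahub] table (empty if none)."""
    data = tomllib.loads(pyproject_text)
    hub = data.get("tool", {}).get("llamahub", {})
    problems = []
    if not hub.get("import_path"):
        problems.append("missing tool.llamahub.import_path")
    if not hub.get("class_authors"):
        problems.append("missing tool.llamahub.class_authors")
    return problems


print(check_llamahub_metadata(SAMPLE_PYPROJECT))
```

To check a real project, read the file's contents and pass them in, for example `check_llamahub_metadata(open("pyproject.toml", encoding="utf-8").read())`.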

Notes

The issue lacks information about specific errors or problems with the integration, so the guidance focuses on the request to list the integration on LlamaHub.

Recommendation

Review the metadata and submit a PR to the docs index to get the integration listed on LlamaHub, since listing is the primary request in the issue.
