llamaIndex - 💡(How to fix) Fix [Feature Request]: Retrieval quality metrics that account for precision-recall tradeoff on heterogeneous document corpora

llamaIndex 2026-05-18 17:45:08


Root Cause

Tables: require exact match retrieval; semantic similarity often fails because numbers don't embed well
Narrative prose: benefits from semantic chunking, but paragraph boundaries matter
Footnotes/cross-references: often contain the most precise answers, but are systematically under-retrieved because they're short and low-weight

Fix / Workaround

Workarounds today require manually filtering eval datasets by document type, running separate evaluators per type, and stitching results together — which is brittle and doesn't integrate with LlamaIndex's existing eval reporting infrastructure.


Feature Description

When building production RAG pipelines on heterogeneous document corpora (e.g., financial filings that mix structured tables, unstructured narrative text, and footnotes with cross-references), LlamaIndex's current retrieval evaluation tooling surfaces a sharp precision-recall tradeoff that is difficult to diagnose and optimize without building custom evaluation harnesses from scratch.

The core problem:

In a homogeneous corpus (e.g., plain markdown docs), optimizing top_k or chunk size is relatively straightforward: retrieving more chunks generally improves recall, and retrieving fewer generally improves precision. But in a heterogeneous corpus (10-K/10-Q filings, legal contracts, medical records), different document sections have fundamentally different retrieval difficulty:

Tables: require exact match retrieval; semantic similarity often fails because numbers don't embed well
Narrative prose: benefits from semantic chunking, but paragraph boundaries matter
Footnotes/cross-references: often contain the most precise answers, but are systematically under-retrieved because they're short and low-weight

The current RetrieverEvaluator measures MRR and hit_rate globally, which masks per-section-type retrieval quality. A pipeline with 95% hit rate on narrative prose and 40% hit rate on tables looks fine in aggregate but fails on the queries that matter most.
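To make the masking effect concrete, here is a small arithmetic sketch. The 95% and 40% hit rates come from the paragraph above; the query counts (100 prose queries, 20 table queries) are an assumption chosen only for illustration, and the real gap depends on the actual query mix.

```python
# Hypothetical eval set: hit rates match the text (95% prose, 40% tables),
# but the query counts are assumed for illustration only.
segments = {
    "prose": {"queries": 100, "hits": 95},  # 95% hit rate
    "table": {"queries": 20, "hits": 8},    # 40% hit rate
}

total_queries = sum(s["queries"] for s in segments.values())
total_hits = sum(s["hits"] for s in segments.values())

# This is what a single global hit_rate reports.
global_hit_rate = total_hits / total_queries

# This is what a per-segment breakdown would report.
per_segment = {name: s["hits"] / s["queries"] for name, s in segments.items()}

print(f"global hit_rate: {global_hit_rate:.1%}")
for name, rate in per_segment.items():
    print(f"{name} hit_rate: {rate:.1%}")
```

Under these assumed counts the global figure lands at roughly 86%, which looks healthy even though six in ten table queries miss.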

Proposed Feature:

A SegmentedRetrieverEvaluator (or a metadata-aware eval mode for RetrieverEvaluator) that:

Accepts a segment_fn: Callable[[TextNode], str] to categorize nodes by type (e.g., "table", "prose", "footnote")
Computes hit_rate, MRR, and precision@k per segment type
Outputs a breakdown table showing precision-recall per segment, so users can diagnose which part of their corpus is failing retrieval
Optionally surfaces a per-segment optimal top_k recommendation
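A minimal sketch of the per-segment metric computation described in the list above, written in plain Python rather than against the LlamaIndex API. QueryResult, segmented_metrics, and the sample ids are hypothetical names introduced for illustration; the sketch also assumes each eval query already carries the segment label of its expected node (for example, one produced earlier by a segment_fn).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryResult:
    """One evaluated query. Hypothetical record, not a LlamaIndex type."""
    expected_ids: List[str]   # ground-truth node ids for the query
    retrieved_ids: List[str]  # ranked node ids returned by the retriever
    segment: str              # segment label of the expected node


def hit(r: QueryResult) -> float:
    """1.0 if any expected node appears in the retrieved list, else 0.0."""
    return 1.0 if any(i in r.expected_ids for i in r.retrieved_ids) else 0.0


def reciprocal_rank(r: QueryResult) -> float:
    """1 / rank of the first expected node, or 0.0 if none was retrieved."""
    for rank, node_id in enumerate(r.retrieved_ids, start=1):
        if node_id in r.expected_ids:
            return 1.0 / rank
    return 0.0


def precision_at_k(r: QueryResult, k: int) -> float:
    """Fraction of the top-k slots occupied by expected nodes."""
    return sum(1 for i in r.retrieved_ids[:k] if i in r.expected_ids) / k


def segmented_metrics(results: List[QueryResult], k: int) -> Dict[str, Dict[str, float]]:
    """Group results by segment label, then average each metric per group."""
    groups: Dict[str, List[QueryResult]] = defaultdict(list)
    for r in results:
        groups[r.segment].append(r)

    report: Dict[str, Dict[str, float]] = {}
    for segment, rows in groups.items():
        n = len(rows)
        report[segment] = {
            "queries": n,
            "hit_rate": sum(hit(r) for r in rows) / n,
            "mrr": sum(reciprocal_rank(r) for r in rows) / n,
            f"precision@{k}": sum(precision_at_k(r, k) for r in rows) / n,
        }
    return report


# Made-up sample data, for illustration only.
results = [
    QueryResult(["t1"], ["p3", "p4", "t1"], "table"),
    QueryResult(["t2"], ["p1", "p2", "p5"], "table"),
    QueryResult(["p1"], ["p1", "p2", "t1"], "prose"),
    QueryResult(["p2"], ["p9", "p2", "p1"], "prose"),
]

report = segmented_metrics(results, k=2)
for segment, metrics in report.items():
    print(segment, metrics)
```

Averaging within each segment, rather than over the whole eval set, is what keeps a large prose population from drowning out a small table population.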

Reference implementation approach:

This could wrap RetrieverEvaluator with a grouping layer — tag each TextNode with a segment label at ingestion, then partition eval results by label before computing metrics. No changes needed to the retriever itself.
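The grouping layer described above might look roughly like the following sketch. It deliberately avoids importing LlamaIndex: Node is a stand-in for TextNode, and simple_segment_fn, tag_nodes, and partition_by_segment are hypothetical helpers, not existing library functions. The text-prefix heuristic is a toy assumption; a real pipeline would more plausibly derive labels from parser output at ingestion.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """Stand-in for a TextNode: an id, some text, and a metadata dict."""
    node_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def simple_segment_fn(node: Node) -> str:
    """Toy heuristic (an assumption): classify by how the text starts."""
    text = node.text.lstrip()
    if text.startswith("|"):
        return "table"
    if text.startswith("["):
        return "footnote"
    return "prose"


def tag_nodes(nodes: List[Node], segment_fn: Callable[[Node], str]) -> Dict[str, str]:
    """Write a segment label into each node's metadata; return id -> label."""
    for node in nodes:
        node.metadata["segment"] = segment_fn(node)
    return {node.node_id: node.metadata["segment"] for node in nodes}


EvalRow = Tuple[str, List[str]]  # (expected node id, ranked retrieved ids)


def partition_by_segment(rows: List[EvalRow], labels: Dict[str, str]) -> Dict[str, List[EvalRow]]:
    """Split eval rows by the segment label of the expected node."""
    groups: Dict[str, List[EvalRow]] = defaultdict(list)
    for expected_id, retrieved_ids in rows:
        groups[labels[expected_id]].append((expected_id, retrieved_ids))
    return dict(groups)


# Made-up nodes and eval rows, for illustration only.
nodes = [
    Node("n1", "| Revenue | 2023 | 2022 |"),
    Node("n2", "Management expects margins to improve next year."),
    Node("n3", "[1] Excludes one-time restructuring charges."),
]
labels = tag_nodes(nodes, simple_segment_fn)

rows = [("n1", ["n2", "n3"]), ("n2", ["n2", "n1"]), ("n3", ["n3", "n2"])]
groups = partition_by_segment(rows, labels)

# Metrics are then computed per group; the retriever itself is untouched.
hit_rate = {
    seg: sum(1 for exp, got in grp if exp in got) / len(grp)
    for seg, grp in groups.items()
}
print(hit_rate)
```

Partitioning on the expected node's label, rather than on the labels of the retrieved nodes, answers the question "how well do we retrieve tables when a table is the right answer", which is the diagnostic the proposal is after.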

Why this matters:

Heterogeneous corpora are the dominant use case in enterprise RAG (legal, finance, healthcare). Treating all node types equally during evaluation means teams are flying blind on exactly the segment types where retrieval failure has the highest business cost.

Reason

The current RetrieverEvaluator computes metrics globally across all retrieved nodes without any mechanism to group or segment by node metadata (e.g., node_type, source_section, document_id). Users building RAG on enterprise document types (financial filings, legal contracts, clinical notes) are forced to build custom eval harnesses outside of LlamaIndex to get segment-level visibility.

Value of Feature

Directly unblocks enterprise RAG users working on regulated-domain document types (finance, legal, healthcare) who need to understand retrieval quality at the segment level — not just global averages. This is one of the most common pain points for teams trying to push RAG into production on real enterprise document stores, and having it as a first-class feature in LlamaIndex's eval framework would significantly lower the barrier to building production-grade retrieval pipelines.
