LlamaIndex - [Feature Request]: Retrieval quality metrics that account for precision-recall tradeoff on heterogeneous document corpora


Feature Description

When building production RAG pipelines on heterogeneous document corpora (e.g., financial filings that mix structured tables, unstructured narrative text, and footnotes with cross-references), LlamaIndex's current retrieval evaluation tooling surfaces a sharp precision-recall tradeoff that is difficult to diagnose and optimize without building custom evaluation harnesses from scratch.

The core problem:

In a homogeneous corpus (e.g., plain markdown docs), optimizing top_k or chunk size is relatively straightforward: retrieving more chunks generally improves recall, and retrieving fewer generally improves precision. But in a heterogeneous corpus (10-K/10-Q filings, legal contracts, medical records), different document sections have fundamentally different retrieval difficulty:

  • Tables: require exact match retrieval; semantic similarity often fails because numbers don't embed well
  • Narrative prose: benefits from semantic chunking, but paragraph boundaries matter
  • Footnotes/cross-references: often contain the most precise answers, but are systematically under-retrieved because they're short and low-weight
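The top_k tradeoff described above for the homogeneous case can be made concrete with a small, library-free sketch. The node ids and the ground-truth set below are made up for illustration, and `precision_recall_at_k` is a hypothetical helper, not a LlamaIndex function.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for a single query.

    Precision divides by k, so it assumes the retriever returned at least k ids.
    """
    top = ranked_ids[:k]
    hits = sum(1 for node_id in top if node_id in relevant_ids)
    return hits / k, hits / len(relevant_ids)


# Hypothetical ranked retrieval result and ground truth for one query.
ranked = ["n3", "n7", "n1", "n9", "n4", "n2"]
relevant = {"n3", "n1", "n2"}

for k in (1, 3, 6):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

On this toy input, recall rises from 0.33 to 1.00 as k grows from 1 to 6 while precision falls from 1.00 to 0.50. In a heterogeneous corpus the same sweep would likely produce a different curve for each section type, which is the difficulty the list above describes.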

The current RetrieverEvaluator measures MRR and hit_rate globally, which masks per-section-type retrieval quality. A pipeline with 95% hit rate on narrative prose and 40% hit rate on tables looks fine in aggregate but fails on the queries that matter most.
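To see how the aggregate hides the table failure, here is the arithmetic under one assumed query mix. The 90/10 split between prose and table queries is an assumption for illustration only; the issue does not state a mix.

```python
# Per-segment hit rates from the example above; the "share" values are an
# assumed query mix (90% prose queries, 10% table queries), not measured data.
segments = {
    "prose": {"share": 0.90, "hit_rate": 0.95},
    "table": {"share": 0.10, "hit_rate": 0.40},
}

# A global hit rate is the share-weighted mean of the per-segment hit rates.
aggregate = sum(s["share"] * s["hit_rate"] for s in segments.values())
print(f"aggregate hit rate: {aggregate:.3f}")
```

Under this assumed mix the global figure is 0.9 × 0.95 + 0.1 × 0.40 = 0.895, which reads as healthy even though more than half of the table queries miss.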

Proposed Feature:

A SegmentedRetrieverEvaluator (or a metadata-aware eval mode for RetrieverEvaluator) that:

  1. Accepts a segment_fn: Callable[[TextNode], str] to categorize nodes by type (e.g., "table", "prose", "footnote")
  2. Computes hit_rate, MRR, and precision@k per segment type
  3. Outputs a breakdown table showing precision-recall per segment, so users can diagnose which part of their corpus is failing retrieval
  4. Optionally surfaces a per-segment optimal top_k recommendation
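A minimal sketch of what items 1 to 3 might compute, written without LlamaIndex so it runs on its own. `Node`, `QueryResult`, and `segmented_metrics` are hypothetical names; `Node` is only a stand-in for `TextNode`. The sketch assumes one ground-truth node per query, assigns each query to the segment of that node, and computes MRR over the top k results only.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Stand-in for a TextNode; only the fields this sketch needs."""
    node_id: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class QueryResult:
    expected: Node            # ground-truth node for the query
    retrieved_ids: List[str]  # ranked ids returned by the retriever


def segmented_metrics(results: List[QueryResult],
                      segment_fn: Callable[[Node], str],
                      k: int) -> Dict[str, Dict[str, float]]:
    # Partition queries by the segment label of their expected node.
    buckets = defaultdict(list)
    for res in results:
        buckets[segment_fn(res.expected)].append(res)

    report = {}
    for label, rows in buckets.items():
        hits, reciprocal_ranks = 0, 0.0
        for res in rows:
            top = res.retrieved_ids[:k]
            if res.expected.node_id in top:
                hits += 1
                reciprocal_ranks += 1.0 / (top.index(res.expected.node_id) + 1)
        n = len(rows)
        report[label] = {
            "n": n,
            "hit_rate": hits / n,
            "mrr": reciprocal_ranks / n,
            # With one relevant node per query, precision@k is hit_rate / k.
            "precision_at_k": hits / (n * k),
        }
    return report


# Hypothetical eval results: two prose queries and two table queries.
results = [
    QueryResult(Node("p1", {"segment": "prose"}), ["p1", "p2", "t1"]),
    QueryResult(Node("p2", {"segment": "prose"}), ["p3", "p2", "p1"]),
    QueryResult(Node("t1", {"segment": "table"}), ["p1", "p2", "p3"]),
    QueryResult(Node("t2", {"segment": "table"}), ["p1", "t2", "p3"]),
]
report = segmented_metrics(results, lambda n: n.metadata.get("segment", "unknown"), k=3)
for label, metrics in sorted(report.items()):
    print(label, metrics)
```

Grouping by the expected node's segment is one possible design choice; grouping by the segments of the retrieved nodes would answer a different question (what the retriever tends to return rather than what it fails to find).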

Reference implementation approach:

This could wrap RetrieverEvaluator with a grouping layer — tag each TextNode with a segment label at ingestion, then partition eval results by label before computing metrics. No changes needed to the retriever itself.
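The tagging step might look roughly like the following. This is a sketch under stated assumptions: `Node` is a stand-in for `TextNode` with just `text` and `metadata`, `guess_segment` and `tag_nodes` are hypothetical helpers, and the heuristics (pipe or tab density for tables, a leading `(1)` or `[1]` marker for footnotes) are illustrative and would need tuning for a real corpus.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Stand-in for a TextNode; only text and metadata are modeled."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Assumed footnote markers: "(1) ..." or "[1] ..." at the start of the text.
_FOOTNOTE = re.compile(r"^\s*(\(\d+\)|\[\d+\])\s")


def guess_segment(text: str) -> str:
    """Heuristic segment label; the thresholds are illustrative assumptions."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Treat the chunk as a table if at least half its lines look delimited.
    if lines and sum("|" in ln or "\t" in ln for ln in lines) / len(lines) >= 0.5:
        return "table"
    if _FOOTNOTE.match(text) and len(text) < 300:
        return "footnote"
    return "prose"


def tag_nodes(nodes: List[Node]) -> List[Node]:
    """Attach a 'segment' label at ingestion, keeping any label already set."""
    for node in nodes:
        node.metadata.setdefault("segment", guess_segment(node.text))
    return nodes


nodes = tag_nodes([
    Node("Revenue | 2023 | 2022\nTotal | 10 | 9"),
    Node("(1) Includes restructuring charges; see Note 7."),
    Node("Management believes that demand will remain stable."),
])
labels = [n.metadata["segment"] for n in nodes]
print(labels)
```

Once the label lives in node metadata, a `segment_fn` can simply read it back, so the partitioning step needs no knowledge of how the label was produced.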

Why this matters:

Heterogeneous corpora are the dominant use case in enterprise RAG (legal, finance, healthcare). Treating all node types equally during evaluation means teams are flying blind on exactly the segment types where retrieval failure has the highest business cost.

Reason

The current RetrieverEvaluator computes metrics globally across all retrieved nodes without any mechanism to group or segment by node metadata (e.g., node_type, source_section, document_id). Users building RAG on enterprise document types (financial filings, legal contracts, clinical notes) are forced to build custom eval harnesses outside of LlamaIndex to get segment-level visibility.

Workarounds today require manually filtering eval datasets by document type, running separate evaluators per type, and stitching results together — which is brittle and doesn't integrate with LlamaIndex's existing eval reporting infrastructure.

Value of Feature

Directly unblocks enterprise RAG users working on regulated-domain document types (finance, legal, healthcare) who need to understand retrieval quality at the segment level — not just global averages. This is one of the most common pain points for teams trying to push RAG into production on real enterprise document stores, and having it as a first-class feature in LlamaIndex's eval framework would significantly lower the barrier to building production-grade retrieval pipelines.
