llamaIndex - ✅(Solved) Fix [Feat]: Add a RAG failure mode checklist doc (symptoms to minimal fixes) [2 pull requests, 4 comments, 3 participants]

onestardao · 2026-02-14T08:03:36Z

# PR #20721: docs: add RAG Failure Mode Checklist

- Repository: run-llama/llama_index
- Author: ManasVardhan
- State: closed | merged: True
- Link: https://github.com/run-llama/llama_index/pull/20721

## Description (problem / solution / changelog)

## Summary

Adds a comprehensive **RAG Failure Mode Checklist** documentation page to help users diagnose and fix common RAG pipeline issues.

## Failure Modes Covered

1. **Retrieval Hallucination**: retriever returns superficially relevant but wrong chunks
2. **Wrong Chunk Selection (Poor Chunking)**: critical context split across chunks
3. **Index Fragmentation**: duplicate/outdated/conflicting documents in index
4. **Config Drift**: embedding model mismatch between index and query time
5. **Embedding Model Mismatch**: wrong model for the domain
6. **Context Window Overflow**: too many chunks stuffed into LLM prompt
7. **Missing Metadata Filtering**: retrieval not scoped to relevant subset
8. **Poor Query Understanding**: ambiguous or short queries
9. **LLM Synthesis Failures**: right chunks retrieved but bad answer generated

Each section includes symptoms and minimal fixes referencing LlamaIndex components. Also includes a quick diagnostic flowchart.

Closes #20702

## Changed files

- `docs/src/content/docs/framework/optimizing/rag_failure_mode_checklist.md` (added, +194/-0)

---

# PR #20760: docs: extend RAG Failure Mode Checklist with advanced failures

- Repository: run-llama/llama_index
- Author: onestardao
- State: closed | merged: True
- Link: https://github.com/run-llama/llama_index/pull/20760

## Description (problem / solution / changelog)

Follow-up to #20702 and #20721.

This PR keeps the existing RAG Failure Mode Checklist and extends it with a small set of system-level failure families that often show up in production, without changing any of the current recommendations.

Summary of changes

- Keep sections 1–9 as-is (single-query failures: retrieval, chunking, embeddings, query understanding, synthesis).
- Add section 10 “Embedding Metric Mismatch (Cosine Score ≠ True Meaning)” to cover cases where the distance metric or normalization does not match how meaning is distributed in the data.
- Add section 11 “Session and Cache Memory Breaks” for cross-session instability caused by stateless indices, cache keys, or environment changes.
- Add section 12 “Observability Gaps ("Black-Box Debugging")” to highlight that many issues cannot be fixed before basic traces and logs are in place.
- Add section 13 “Index Lifecycle and Deployment Ordering” to capture failures caused by empty or half-built indices, wrong snapshot routing, or deployment ordering bugs.
- Slightly update the introduction and the Quick Diagnostic Flowchart so they point to the new sections when issues appear only in production or after deploys.

All new content is written in a project-native way (no external dependencies or naming schemes) and is based on recurring failure patterns seen in real-world RAG deployments.

Happy to adjust wording, scope, or numbering if you would prefer a slimmer version or a separate “advanced” doc instead of extending this page.

# Description

This is a documentation-only change that expands the existing RAG Failure Mode Checklist with several additional failure families that commonly appear in production systems (embedding metric issues, cross-session instability, observability gaps, and index lifecycle / deployment ordering problems).

Related issues: #20702, #20721 (docs follow-up; does not close new issues).

## New Package?

Did I fill in the `tool.llamahub` section in the `pyproject.toml` and provide a detailed README.md for my new integration or package?

- [ ] Yes
- [x] No

## Version Bump?

Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package)

- [ ] Yes
- [x] No

## Type of Change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] This change requires a documentation update

## How Has This Been Tested?

This is a documentation-only change; no code paths were modified, so no additional tests were added.

- [ ] I added new unit tests to cover this change
- [ ] I believe this change is already covered by existing unit tests

## Suggested Checklist

- [x] I have performed a self-review of my own changes
- [ ] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [ ] I have added Google Colab support for the newly added notebooks
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass
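To make the "Embedding Metric Mismatch" family from PR #20760 concrete, here is a minimal sketch in plain Python. It is not taken from the checklist page and does not use any LlamaIndex API; the vectors, document names, and helper functions are hypothetical. It shows how a raw dot product and cosine similarity can rank the same two documents differently when embeddings are not normalized.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top1(query, docs, score):
    """Return the name of the highest-scoring document."""
    return max(docs, key=lambda name: score(query, docs[name]))


# Hypothetical toy embeddings (assumption: real embeddings have far more dimensions).
query = [1.0, 0.0]
docs = {
    "aligned_short": [0.9, 0.1],  # nearly the same direction as the query
    "offaxis_long": [5.0, 5.0],   # 45 degrees off, but a much larger norm
}

by_dot = top1(query, docs, dot)
by_cosine = top1(query, docs, cosine)

# After normalizing the stored vectors, dot product and cosine agree again.
normalized_docs = {name: normalize(vec) for name, vec in docs.items()}
by_dot_normalized = top1(query, normalized_docs, dot)
```

If a vector store is configured for one metric while the embedding model assumes another, a check like this on a handful of known query/document pairs can reveal the mismatch before any retriever settings are changed.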


GitHub stats

run-llama/llama_index#20702 • Fetched 2026-04-08 00:31:24


Timeline (top): commented ×4, referenced ×4, cross-referenced ×2, labeled ×2


Question Validation

I have searched both the documentation and discord for an answer.

Question

Hi maintainers, thanks for the project.

Problem Description

In production RAG apps, most failures are not “the model is dumb”; they come from a small set of repeatable failure modes: retrieval hallucination, wrong chunk selected, index fragmentation, bootstrap ordering races, config drift, etc. Users often cannot name the failure mode, so they patch randomly and lose time.

Desired Solution

Add a small doc page: “RAG failure mode checklist”. Structure:

- common symptoms users can observe
- what to inspect first (retrieval outputs, chunking, embeddings, store health, tracing)
- minimal structural fixes, not generic advice
- a simple taxonomy mapping so users can quickly classify before changing infra

I can draft a first version as a docs PR if the direction is accepted.

Alternatives Considered

Rely on scattered troubleshooting notes and GitHub issues. This works, but it is not systematic, and new users repeat the same mistakes.

Additional Context

I have a compact 16-mode taxonomy (No.1 to No.16) that I can adapt into a neutral checklist format with minimal references, so it stays useful even without adopting any external tooling. If you prefer, we can keep it fully project-native and only use it as an internal naming scheme.

https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md

extent analysis

RAG Failure Mode Checklist Solution

Fix Plan

Create a new doc page: Add a new Markdown file in the project's documentation directory, e.g., docs/rag_failure_modes.md. (The merged PR #20721 added the page as docs/src/content/docs/framework/optimizing/rag_failure_mode_checklist.md.)
Define the taxonomy: Use the 16-mode taxonomy offered in the issue as a starting point and adapt it into a neutral checklist format. The ProblemMap/README.md file linked in the issue can serve as a reference.
List common symptoms: Document common symptoms users can observe for each failure mode, e.g.:

### Retrieval Hallucination
- Answer is fluent but not supported by the source documents
- Retrieved chunks look superficially relevant but do not answer the query
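One way to inspect retrieval output for this failure mode is a crude lexical-overlap check between the query and each retrieved chunk. The sketch below is a heuristic under stated assumptions, not a LlamaIndex feature: the stopword list, the `min_overlap` threshold, and all function names are hypothetical. Semantic retrievers can legitimately return chunks with little word overlap, so a flagged chunk is a prompt for manual review rather than proof of an error.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption).
STOPWORDS = {"what", "is", "the", "for", "a", "an", "of", "in", "to", "are"}


def terms(text):
    """Lowercase alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_ratio(query, chunk):
    """Fraction of non-stopword query terms that also appear in the chunk."""
    query_terms = terms(query) - STOPWORDS
    if not query_terms:
        return 0.0
    return len(query_terms & terms(chunk)) / len(query_terms)


def flag_suspect_chunks(query, chunks, min_overlap=0.5):
    """Indices of retrieved chunks that share few terms with the query."""
    return [
        i for i, chunk in enumerate(chunks)
        if overlap_ratio(query, chunk) < min_overlap
    ]


query = "What is the refund window for orders?"
retrieved = [
    "The refund window is thirty days for all orders.",
    "Our window blinds ship in two business days.",
]

suspects = flag_suspect_chunks(query, retrieved)
```

Here the second chunk matches only the word "window", which is the kind of superficially relevant hit worth reading by hand before touching the index or the prompt.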

Specify inspection steps: Outline what to inspect first for each failure mode, e.g.:

### Wrong Chunk Selected
- Check the chunking configuration (chunk size and overlap)
- Verify that critical context is not split across chunk boundaries
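The effect of chunk overlap on split context can be sketched with a simple word-based chunker. This is an illustration only: `chunk_words` and `first_chunk_containing` are hypothetical helpers written for this example, and the word-level splitting is an assumption. LlamaIndex's own splitters expose comparable size and overlap settings, but they are not used here.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into chunks of `chunk_size` words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def first_chunk_containing(chunks, phrase):
    """Index of the first chunk that contains the phrase, or None."""
    for i, chunk in enumerate(chunks):
        if phrase in chunk:
            return i
    return None


text = "the refund window is thirty days from the delivery date for all orders"

no_overlap = chunk_words(text, chunk_size=5, overlap=0)
with_overlap = chunk_words(text, chunk_size=5, overlap=2)

# Without overlap the phrase is cut at a chunk boundary; with overlap it survives.
split_hit = first_chunk_containing(no_overlap, "thirty days")
overlap_hit = first_chunk_containing(with_overlap, "thirty days")
```

Running the same check with a few phrases that must stay intact in your own documents is a quick way to see whether the current chunk settings are cutting them apart.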

Provide minimal structural fixes: Document minimal structural fixes for each failure mode, e.g.:

### Index Fragmentation
- Remove duplicate, outdated, or conflicting documents from the index
- Monitor the index for duplicates and stale entries after each ingestion run
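PR #20721 describes index fragmentation as duplicate, outdated, or conflicting documents in the index. A minimal sketch of one inspection step, finding exact duplicates by hashing normalized content, is shown below. The document IDs, the normalization rule, and the function names are assumptions made for illustration. It catches only copies that are identical after lowercasing and whitespace collapsing, not near-duplicates or conflicting versions.

```python
import hashlib


def content_key(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs):
    """Group document IDs whose normalized content is identical.

    `docs` maps a document ID to its text; only groups with more than
    one ID are returned.
    """
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(content_key(text), []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical corpus (assumption): two copies of one FAQ entry plus one other page.
docs = {
    "faq-v1": "Refunds are accepted within 30 days.",
    "faq-copy": "refunds are  accepted within 30 days.",
    "shipping": "Orders ship within 2 business days.",
}

duplicates = find_duplicates(docs)
```

Running a pass like this before ingestion keeps duplicate chunks from crowding the top-k results. Outdated or conflicting versions usually need extra metadata, such as a version or timestamp field, to resolve.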

Add a simple taxonomy mapping: Create a simple taxonomy mapping to help users quickly classify failure modes, e.g.:

| Failure Mode | Taxonomy Number |
| --- | --- |
| Retrieval Hallucination | No. 1 |
| Wrong Chunk Selected | No. 3 |
| Index Fragmentation | No. 5 |
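A taxonomy mapping like this can also be kept as a small lookup structure. The sketch below uses the nine failure modes and one-line descriptions listed in PR #20721, numbered by that PR's section order rather than by the 16-mode taxonomy numbers in the table above. The keyword search and the `find_modes` name are hypothetical conveniences, not part of the merged doc.

```python
# (section number in PR #20721, failure mode, one-line description from the PR)
FAILURE_MODES = [
    (1, "Retrieval Hallucination", "retriever returns superficially relevant but wrong chunks"),
    (2, "Wrong Chunk Selection (Poor Chunking)", "critical context split across chunks"),
    (3, "Index Fragmentation", "duplicate/outdated/conflicting documents in index"),
    (4, "Config Drift", "embedding model mismatch between index and query time"),
    (5, "Embedding Model Mismatch", "wrong model for the domain"),
    (6, "Context Window Overflow", "too many chunks stuffed into LLM prompt"),
    (7, "Missing Metadata Filtering", "retrieval not scoped to relevant subset"),
    (8, "Poor Query Understanding", "ambiguous or short queries"),
    (9, "LLM Synthesis Failures", "right chunks retrieved but bad answer generated"),
]


def find_modes(keyword):
    """Return (number, name) pairs whose name or description mentions the keyword."""
    needle = keyword.lower()
    return [
        (number, name)
        for number, name, description in FAILURE_MODES
        if needle in name.lower() or needle in description.lower()
    ]


embedding_related = find_modes("embedding")
prompt_related = find_modes("prompt")
```

A plain substring match is enough to narrow a vague symptom down to a few candidate sections. Anything more elaborate would be a design decision for the maintainers.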

Review and refine: Review the checklist with the team and refine it as needed.

Verification

Verify that the new doc page is accessible and easily navigable.
Test the checklist by simulating different failure modes and verifying that users can correctly identify and address the issue.
Monitor user feedback and update the checklist as needed.

Extra Tips

Keep the checklist concise and focused on minimal structural fixes.
Use clear and concise language to avoid

