llamaIndex - ✅(Solved) Fix Header-Aware Deterministic Chunking & Post-RAG Verification Pipeline [2 pull requests, 3 comments, 2 participants]

DYNOSuprovo · 2026-03-29T18:35:46Z



GitHub stats

run-llama/llama_index#21213 • Fetched 2026-04-08 01:48:47

Author: DYNOSuprovo
Participants: DYNOSuprovo, shivam2407
Timeline (top): commented ×3, cross-referenced ×2, labeled ×2, mentioned ×2

PR fix notes

PR #21281: feat: add HeaderAwareMarkdownSplitter node parser

Repository: run-llama/llama_index
Author: shivam2407
State: open | merged: False
Link: https://github.com/run-llama/llama_index/pull/21281

Description (problem / solution / changelog)

Summary

Partial fix for #21213 — implements the HeaderAwareMarkdownSplitter component. The VerificationQueryEngine component is being handled separately by @DYNOSuprovo.

Problem

LlamaIndex has two markdown-related parsers with complementary gaps:

MarkdownNodeParser — respects headers but has no token size limit (a 50k-token section becomes one node)
SentenceSplitter — enforces token limits but has no markdown awareness (severs headers from their content)

Solution

HeaderAwareMarkdownSplitter fills the gap: it keeps each header grouped with its body text when the section fits within chunk_size, and splits oversized sections at paragraph/sentence/word boundaries with the header context prepended to every sub-chunk.

Usage

from llama_index.core.node_parser import HeaderAwareMarkdownSplitter

splitter = HeaderAwareMarkdownSplitter(chunk_size=512)
nodes = splitter.get_nodes_from_documents(documents)

# Each node has header_path metadata: "/Introduction/Setup/"
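
The header_path value mentioned in the comment above can be illustrated with a small stack-based sketch. This is not the code from PR #21281: the function name sections_with_header_path is hypothetical, and including the section's own title in the path is an assumption made so the output matches the "/Introduction/Setup/" example. Lines inside code fences are skipped so that "#" comments are not read as headers.

```python
import re
from typing import List, Tuple

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")  # ATX header: level and title
FENCE_RE = re.compile(r"^\s*(```|~~~)")           # opening or closing code fence


def sections_with_header_path(text: str) -> List[Tuple[str, str]]:
    """Return (header_path, section_text) pairs, one per header section."""
    stack: List[Tuple[int, str]] = []  # (level, title) of the enclosing headers
    out: List[Tuple[str, str]] = []
    path = "/"
    lines: List[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            out.append((path, body))

    for line in text.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence  # simplification: any fence marker toggles the state
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            level, title = len(match.group(1)), match.group(2)
            # Drop siblings and deeper headers before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
            path = "/" + "/".join(t for _, t in stack) + "/"
            lines = [line]
        else:
            lines.append(line)
    flush()
    return out
```

Because the function only reads the input text and keeps no external state, the same document always yields the same sections and paths, which is the determinism property the PR lists.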

Key features

Header hierarchy tracked in header_path metadata (same convention as MarkdownNodeParser)
Deterministic — same input always produces same splits (no embedding model needed)
Code fence handling — both backtick (```) and tilde (~~~) fences
Multi-level fallback — paragraph → sentence → word boundaries for oversized content
Pluggable sub_splitter parameter for custom split strategies (e.g. semantic splitting)

Changes

| File | Change |
|---|---|
| `node_parser/file/header_aware_markdown.py` | New `HeaderAwareMarkdownSplitter` class (~250 lines) |
| `node_parser/file/__init__.py` | Export registration |
| `node_parser/__init__.py` | Top-level export |
| `tests/node_parser/test_header_aware_markdown.py` | 15 tests |

Test plan

15 new tests pass covering:
- Basic single/multiple section splits
- Nested header path metadata
- Oversized section splitting (paragraph, sentence, word boundaries)
- Code blocks (backtick and tilde fences)
- Edge cases (empty doc, no headers, header with no body)
- Determinism
- Custom separator and custom sub_splitter
All 7 existing MarkdownNodeParser tests still pass

Changed files

llama-index-core/llama_index/core/node_parser/__init__.py (modified, +4/-0)
llama-index-core/llama_index/core/node_parser/file/__init__.py (modified, +4/-0)
llama-index-core/llama_index/core/node_parser/file/header_aware_markdown.py (added, +312/-0)
llama-index-core/tests/node_parser/test_header_aware_markdown.py (added, +189/-0)

PR #21302: feat: add VerificationQueryEngine component (#21213)

Repository: run-llama/llama_index
Author: DYNOSuprovo
State: open | merged: False
Link: https://github.com/run-llama/llama_index/pull/21302

Description (problem / solution / changelog)

Description

Working in parallel with @shivam2407 on Issue #21213.

This PR introduces the VerificationQueryEngine class. It acts as a native Post-RAG guardrail that wraps any existing BaseQueryEngine. After the underlying engine generates a draft response, this component intercepts it and forces a secondary LLM to act as an adversarial judge—verifying the generated claims strictly against the retrieved source_nodes. Output metadata is augmented with verification results, and strict mode can actively block hallucinated responses.
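
The wrap, judge, and block flow described above can be sketched without any LlamaIndex dependency. Everything here is illustrative rather than the code in PR #21302: DraftResponse, VerifyingEngine, the "verified" and "blocked_text" metadata keys, and the judge callable are hypothetical stand-ins for the real response object, engine, and secondary LLM.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class DraftResponse:
    """Stand-in for a query-engine response (hypothetical, not the LlamaIndex class)."""
    text: str
    source_texts: List[str]
    metadata: Dict[str, Any] = field(default_factory=dict)


# A judge receives (draft_text, source_texts) and returns True when the draft is grounded.
Judge = Callable[[str, List[str]], bool]


class VerifyingEngine:
    """Wraps a query function and audits its draft before returning it."""

    def __init__(
        self,
        query_fn: Callable[[str], DraftResponse],
        judge: Judge,
        strict: bool = False,
    ) -> None:
        self.query_fn = query_fn
        self.judge = judge
        self.strict = strict

    def query(self, question: str) -> DraftResponse:
        draft = self.query_fn(question)
        verified = self.judge(draft.text, draft.source_texts)
        draft.metadata["verified"] = verified
        if self.strict and not verified:
            # Strict mode: withhold the unverified draft instead of returning it.
            draft.metadata["blocked_text"] = draft.text
            draft.text = "Response withheld: it could not be verified against the retrieved sources."
        return draft
```

In non-strict mode the draft passes through unchanged and the caller decides what to do with the "verified" flag; strict mode replaces the text but keeps the original under "blocked_text" for inspection.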

Fixes #21213

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?
- [ ] Yes
- [x] No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
- [ ] Yes
- [x] No

Type of Change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update

How Has This Been Tested?

- [ ] I added new unit tests to cover this change
- [x] I believe this change is already covered by existing unit tests (Relies entirely on existing core abstractions / BaseQueryEngine).

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks.
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran uv run make format; uv run make lint to appease the lint gods

Changed files

llama-index-core/llama_index/core/query_engine/__init__.py (modified, +4/-0)
llama-index-core/llama_index/core/query_engine/verification_query_engine.py (added, +75/-0)


Feature Description

The Problem: Standard RAG pipelines often suffer from two major issues:

Context Fragmentation: Standard character/token-based splitters often cut paragraphs or sections mid-thought, separating a heading from its subsequent content.
Hallucination Risk: Standard query engines generate answers probabilistically based on retrieved context, without a deterministic step to ensure the final output is exclusively backed by cited quotes.

The Proposal: I have been building a custom extraction pipeline ("Pluto") and would love to contribute two concepts/modules upstream to LlamaIndex if the community finds them valuable:

HeaderAwareMarkdownSplitter (or NodeParser enhancement): A node parser that refuses to split text strictly by token limits if it means severing a Markdown Header (#, ##) from its immediate paragraph. It ensures that chunks remain logically grouped by author intent rather than arbitrary token counts.
VerificationQueryEngine (or Post-Process Node): A wrapper around existing query engines that adds a discrete, final "Audit" LLM call. After the initial response synthesizer generates a draft answer, this post-processor forces an LLM (ideally a larger model acting as a judge) to cross-reference every sentence of the draft against the raw source_nodes. It strips out or flags any sentence that lacks direct quote evidence, returning a "Confidence Score" alongside the final response.
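
The audit step proposed above (check each draft sentence against the sources, strip or flag unsupported ones, report a confidence score) can be sketched as below. The names audit_draft and quote_judge are hypothetical, and quote_judge is a toy verbatim-match check standing in for the judge LLM call the proposal describes.

```python
import re
from typing import Callable, List, Tuple

# A sentence judge receives (sentence, sources) and returns True if the sentence is supported.
SentenceJudge = Callable[[str, List[str]], bool]


def quote_judge(sentence: str, sources: List[str]) -> bool:
    """Toy judge: the sentence, minus end punctuation, must appear verbatim in a source."""
    needle = sentence.rstrip(".!?").lower()
    return any(needle in source.lower() for source in sources)


def audit_draft(
    draft: str, sources: List[str], judge: SentenceJudge = quote_judge
) -> Tuple[str, List[str], float]:
    """Return (cleaned_text, flagged_sentences, confidence_score)."""
    # Naive sentence split on end punctuation; a real pipeline would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()]
    kept: List[str] = []
    flagged: List[str] = []
    for sentence in sentences:
        (kept if judge(sentence, sources) else flagged).append(sentence)
    # Confidence is the share of supported sentences; an empty draft scores 0.0.
    confidence = len(kept) / len(sentences) if sentences else 0.0
    return " ".join(kept), flagged, confidence
```

Passing the judge as a callable keeps the bookkeeping testable without a model; swapping quote_judge for a function that prompts an LLM would not change the rest of the sketch.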

Describe the proposed solution

For the Splitter: I would implement a custom NodeParser that uses Regex/AST parsing to identify Markdown structure, ensuring headers and their immediate children stay within the same TextNode whenever possible.
For the Verifier: A new component in the query_engine workflow that takes the Response object, runs a strict Boolean/Scoring prompt against it using the source_nodes as grounding truth, and appends the verification metrics to the response metadata.

Alternatives considered

Currently, users have to implement these verification loops manually in their application layer (like I did in my own project). Baking this into the framework as an optional QueryEngine or NodeParser would make building verifiable, enterprise-grade RAG pipelines much easier out-of-the-box.

Additional context

I have already implemented a working version of this logic in my own pipeline using native Python and FastAPI. I am very willing to write the code, tests, and documentation to port this into the LlamaIndex ecosystem as a PR if the maintainers think this aligns with the project's roadmap.

Let me know if you are open to a PR for either of these components!

Reason

No response

Value of Feature

No response

extent analysis

Fix Plan

To address the issues of Context Fragmentation and Hallucination Risk, we will implement two new components: HeaderAwareMarkdownSplitter and VerificationQueryEngine.

HeaderAwareMarkdownSplitter

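Create a new class HeaderAwareMarkdownSplitter that inherits from the existing NodeParser interface. The sketch below subclasses TextSplitter, a NodeParser subclass whose split_text method is the override point.
Override split_text to identify Markdown headers with a regex and ensure they are not separated from their immediate paragraphs. Lines inside code fences are skipped so that "#" comments in code are not mistaken for headers.
A regex pass is enough for ATX headers. If a full AST is needed, use a parser that exposes a token stream; Python-Markdown's markdown.Markdown converts text to HTML and does not offer a token-returning parse call.

Example code:

import re
from typing import List

from llama_index.core.node_parser import TextSplitter

HEADER_RE = re.compile(r"^#{1,6}\s")     # ATX header: 1-6 '#' followed by whitespace
FENCE_RE = re.compile(r"^\s*(```|~~~)")  # opening or closing code fence


class HeaderAwareMarkdownSplitter(TextSplitter):
    # Sketch only: groups each header with its body but does not enforce chunk_size.
    def split_text(self, text: str) -> List[str]:
        sections: List[str] = []
        current: List[str] = []
        in_fence = False
        for line in text.splitlines():
            if FENCE_RE.match(line):
                in_fence = not in_fence  # simplification: any fence marker toggles
            # A header outside a code fence starts a new section.
            if not in_fence and HEADER_RE.match(line) and current:
                sections.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            sections.append("\n".join(current))
        return sections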

VerificationQueryEngine

Create a new class VerificationQueryEngine that wraps an existing query engine.
Add a new method verify that takes the response object and source nodes as input.
Use a larger language model to cross-reference the response against the source nodes and calculate a confidence score.

Example code:

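import re

class VerificationQueryEngine:
    # Sketch only. Assumes judge_llm exposes .complete(prompt) returning an object with .text.
    def __init__(self, query_engine, judge_llm):
        self.query_engine = query_engine
        self.judge_llm = judge_llm  # ideally a larger model than the one that drafted the answer

    def query(self, query_str):
        response = self.query_engine.query(query_str)
        return self.verify(response, response.source_nodes)

    def verify(self, response, source_nodes):
        context = "\n\n".join(n.node.get_content() for n in source_nodes)
        # Naive sentence split; a real implementation would use a proper tokenizer.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", str(response)) if s.strip()]
        supported = 0
        for sentence in sentences:
            prompt = (
                f"Sources:\n{context}\n\n"
                "Is the following sentence fully supported by the sources? "
                f"Answer yes or no.\nSentence: {sentence}"
            )
            verdict = self.judge_llm.complete(prompt).text.strip().lower()
            if verdict.startswith("yes"):
                supported += 1
        # Guard against an empty response to avoid division by zero.
        score = supported / len(sentences) if sentences else 0.0
        response.metadata = {**(response.metadata or {}), "confidence_score": score}
        return response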

Verification

To verify the fix, test the HeaderAwareMarkdownSplitter and VerificationQueryEngine components separately and together.

Test the HeaderAwareMarkdownSplitter with sample Markdown text to ensure it splits the text correctly.
Test the VerificationQueryEngine with sample responses and source nodes to ensure it calculates the confidence score correctly.
Integrate the two components into the RAG pipeline and test the end-to-end workflow.
