llamaIndex - 💡(How to fix) Fix [Bug]: Semantic bleed at section boundaries in VectorStoreIndex on 10‑K filings degrades retrieval precision [2 comments, 2 participants]

llamaIndex · 2026-04-06 17:56:05


GitHub stats

run-llama/llama_index#21318•Fetched 2026-04-08 03:01:10


Author

Ruthwik-Data

Participants

logan-markewich

Ruthwik-Data

Timeline (top)

commented ×2 · labeled ×2 · closed ×1

Error Message

There is no crash or exception; this is a retrieval‑quality issue. The index and query run successfully, but the retrieved nodes frequently come from neighboring sections that are semantically close in embedding space but incorrect for the user’s question.

Root Cause

This looks like “semantic bleed” around section boundaries: because adjacent sections reuse the same entities, dates, and financial terms, their embeddings are very similar. As a result, top‑k retrieval frequently surfaces neighboring‑but‑irrelevant chunks near section boundaries, which then causes the LLM to answer from the wrong part of the filing and degrades overall QA precision.
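To make the overlap concrete, here is a toy sketch. It uses a plain bag-of-words cosine similarity as a stand-in for a dense embedding model, which is an assumption made only for illustration; the sample sentences, the company name, and the helper names `bow` and `cosine` are invented and do not come from the issue.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented snippets: two adjacent sections that reuse entities, dates, and terms.
risk_factors = "Acme Corp may face litigation and regulatory claims in fiscal 2023 that could harm results."
legal_proceedings = "Acme Corp is party to litigation and regulatory claims filed in fiscal 2023 in Delaware."
dividends = "The board declared a quarterly dividend payable to shareholders of record."

sim_adjacent = cosine(bow(risk_factors), bow(legal_proceedings))
sim_unrelated = cosine(bow(risk_factors), bow(dividends))
print(f"adjacent sections: {sim_adjacent:.2f}, unrelated section: {sim_unrelated:.2f}")
```

Real embedding models capture more than token overlap, so the numbers here are not representative; the point is only that shared vocabulary pulls adjacent sections close together.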


Bug Description

In dense financial filings like 10‑Ks, adjacent sections share a lot of terminology even when they are semantically different (for example, Risk Factors vs Legal Proceedings, or MD&A vs footnotes). When I index full 10‑Ks with VectorStoreIndex and then run targeted QA queries, the retriever often returns chunks from neighboring sections instead of the section that actually answers the question.

Version

0.11.0

Steps to Reproduce

Ingest a complete SEC 10‑K filing as a single document and build a VectorStoreIndex over it using standard token-based chunking (for example, 512–1024 tokens with some overlap).

Preserve section headers / item numbers from the filing (for example, “Item 1A. Risk Factors”, “Item 3. Legal Proceedings”, “Item 7. Management’s Discussion and Analysis…”).

Run queries that clearly belong to a specific section, such as:

“How does the company describe its liquidity position in the most recent fiscal year?” (should mainly hit MD&A liquidity discussion)

“What legal proceedings does the company disclose as of the latest filing?” (should mainly hit the Legal Proceedings section)

Inspect the top‑k retrieved nodes and map each node back to its original section header.

You will often see that top‑k includes multiple nodes from immediately adjacent sections that share similar terminology but do not actually answer the query.
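The inspection step can be scripted. The sketch below assumes each retrieved chunk has already been reduced to a plain dict with a `section` key; that shape, the helper names, and the sample scores are hypothetical and were not part of the issue. With LlamaIndex you would first copy the section label out of each retrieved node's metadata.

```python
from collections import Counter

def section_distribution(retrieved):
    """Count how many retrieved chunks came from each section."""
    return Counter(chunk["section"] for chunk in retrieved)

def section_precision(retrieved, expected_section):
    """Fraction of retrieved chunks that belong to the expected section."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk in retrieved if chunk["section"] == expected_section)
    return hits / len(retrieved)

# Made-up top-5 result for the legal-proceedings query (illustrative data only).
top_k = [
    {"section": "Item 3. Legal Proceedings", "score": 0.83},
    {"section": "Item 1A. Risk Factors", "score": 0.82},
    {"section": "Item 1A. Risk Factors", "score": 0.81},
    {"section": "Item 3. Legal Proceedings", "score": 0.80},
    {"section": "Item 4. Mine Safety Disclosures", "score": 0.78},
]

distribution = section_distribution(top_k)
precision = section_precision(top_k, "Item 3. Legal Proceedings")
print(distribution.most_common())
print(f"section precision@{len(top_k)}: {precision:.2f}")
```

Tracking this per-query precision before and after any change to chunking or embeddings gives a simple way to tell whether the bleed is actually shrinking.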

Relevant Logs/Tracebacks

None. There is no crash or exception; the index and query run successfully.

extent analysis

TL;DR

Reduce semantic bleed between adjacent sections by chunking along section boundaries before building the VectorStoreIndex, or by trying a different embedding model.

Guidance

Consider a chunking strategy that respects section boundaries and headers, so that no chunk mixes text from two adjacent sections.
Experiment with a different embedding model to see whether it separates the sections more cleanly.
Evaluate attaching section-specific keywords or tags to each chunk to help the retriever distinguish between similar sections.
Inspect the top-k retrieved nodes and analyze their distribution across sections to measure how severe the bleed is.

Example

No specific code example is provided, as the issue is more related to the indexing and retrieval strategy rather than a specific code snippet.
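The issue itself includes no code, but the section-aware chunking idea from the guidance can be sketched in plain Python. This is a minimal sketch under stated assumptions, not LlamaIndex's own splitter: it assumes item headers sit at the start of a line in the form "Item 1A. Title", it counts words rather than tokens, and the names `split_by_section` and `chunk_sections` are hypothetical.

```python
import re

# Assumed header shape: "Item 3. Legal Proceedings" at the start of a line.
ITEM_HEADER = re.compile(r"^Item[ \t]+\d+[A-Z]?\.[ \t]+.*$", re.MULTILINE)

def split_by_section(text):
    """Return (header, body) pairs; text before the first header is 'Preamble'."""
    matches = list(ITEM_HEADER.finditer(text))
    if not matches:
        return [("Preamble", text.strip())] if text.strip() else []
    sections = []
    preamble = text[:matches[0].start()].strip()
    if preamble:
        sections.append(("Preamble", preamble))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group(0).strip(), text[match.end():end].strip()))
    return sections

def chunk_sections(text, max_words=200, overlap=20):
    """Chunk each section separately so no chunk spans a section boundary."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    step = max_words - overlap
    chunks = []
    for header, body in split_by_section(text):
        words = body.split()
        for start in range(0, len(words), step):
            # Every chunk carries its section label as metadata.
            chunks.append({"section": header, "text": " ".join(words[start:start + max_words])})
            if start + max_words >= len(words):
                break
    return chunks

sample = """Annual report of Acme Corp.
Item 1A. Risk Factors
Litigation could harm our results.
Item 3. Legal Proceedings
We are party to two lawsuits.
"""
chunks = chunk_sections(sample, max_words=50, overlap=5)
for chunk in chunks:
    print(chunk["section"], "->", chunk["text"])
```

Each resulting chunk could then be turned into a document or node with `section` stored as metadata before indexing. Real filings also mention item names inside running text and tables of contents, so the header pattern would likely need tightening for production use.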

Notes

The issue attributes the problem to semantic similarity between adjacent sections, but it does not establish which mitigation works best; that needs further experimentation.

Recommendation

Apply a workaround by implementing section-aware chunking or adjusting the embedding model to reduce semantic bleed between adjacent sections, as upgrading to a fixed version is not mentioned in the issue.
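A complementary workaround, in the spirit of the section tags mentioned in the guidance, is to guess the target section from the query and drop retrieved chunks from other sections. The sketch below is hypothetical throughout: the keyword table, the short section labels, and the function names are illustrative assumptions, and chunks are assumed to be plain dicts carrying a `section` key.

```python
# Hypothetical keyword table mapping section labels to trigger phrases.
SECTION_KEYWORDS = {
    "Item 3": ("legal proceedings", "lawsuit", "litigation"),
    "Item 7": ("liquidity", "capital resources", "results of operations"),
}

def guess_section(query):
    """Return the section whose keywords match the query most often, or None."""
    lowered = query.lower()
    best, best_hits = None, 0
    for section, keywords in SECTION_KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in lowered)
        if hits > best_hits:
            best, best_hits = section, hits
    return best

def filter_by_section(retrieved, query):
    """Keep only chunks from the guessed section; fall back to the full list."""
    target = guess_section(query)
    if target is None:
        return list(retrieved)
    kept = [chunk for chunk in retrieved if chunk["section"] == target]
    return kept or list(retrieved)

retrieved = [
    {"section": "Item 1A", "text": "Litigation could harm our results."},
    {"section": "Item 3", "text": "We are party to two lawsuits."},
]
query = "What legal proceedings does the company disclose as of the latest filing?"
filtered = filter_by_section(retrieved, query)
print([chunk["section"] for chunk in filtered])
```

Keyword routing is brittle, so the fallback returns the unfiltered list whenever no section can be guessed or the filter would leave nothing. LlamaIndex also supports metadata filters at retrieval time, which may be a better fit if section labels are stored on the nodes; check the documentation for your version.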


