LlamaIndex [Bug]: Semantic bleed at section boundaries in VectorStoreIndex on 10‑K filings degrades retrieval precision

GitHub stats
run-llama/llama_index#21318 (fetched 2026-04-08 03:01:10)
Comments: 2
Participants: 2
Timeline events: 5
Reactions: 0
Timeline (top): commented ×2, labeled ×2, closed ×1

Error Message

There is no crash or exception; this is a retrieval‑quality issue. The index and query run successfully, but the retrieved nodes frequently come from neighboring sections that are semantically close in embedding space but incorrect for the user’s question.

Root Cause

This looks like “semantic bleed” around section boundaries: because adjacent sections reuse the same entities, dates, and financial terms, their embeddings are very similar. As a result, top‑k retrieval frequently surfaces neighboring‑but‑irrelevant chunks near section boundaries, which then causes the LLM to answer from the wrong part of the filing and degrades overall QA precision.

Bug Description

In dense financial filings like 10‑Ks, adjacent sections share a lot of terminology even when they are semantically different (for example, Risk Factors vs Legal Proceedings, or MD&A vs footnotes). When I index full 10‑Ks with VectorStoreIndex and then run targeted QA queries, the retriever often returns chunks from neighboring sections instead of the section that actually answers the question.

This looks like “semantic bleed” around section boundaries: because adjacent sections reuse the same entities, dates, and financial terms, their embeddings are very similar. As a result, top‑k retrieval frequently surfaces neighboring‑but‑irrelevant chunks near section boundaries, which then causes the LLM to answer from the wrong part of the filing and degrades overall QA precision.

Version

0.11.0

Steps to Reproduce

1. Ingest a complete SEC 10‑K filing as a single document and build a VectorStoreIndex over it using standard token-based chunking (for example, 512–1024 tokens with some overlap).

2. Preserve section headers / item numbers from the filing (for example, “Item 1A. Risk Factors”, “Item 3. Legal Proceedings”, “Item 7. Management’s Discussion and Analysis…”).

3. Run queries that clearly belong to a specific section, such as:

  • “How does the company describe its liquidity position in the most recent fiscal year?” (should mainly hit the MD&A liquidity discussion)

  • “What legal proceedings does the company disclose as of the latest filing?” (should mainly hit the Legal Proceedings section)

4. Inspect the top‑k retrieved nodes and map each node back to its original section header.

You will often see that the top‑k includes multiple nodes from immediately adjacent sections that share similar terminology but do not actually answer the query.
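The mapping in the inspection step can be sketched without any retrieval library. The following is a minimal illustration, not code from the issue: it assumes item headers appear at the start of a line in the form “Item 1A. Risk Factors”, and that each retrieved chunk is available as a `(start, end)` character span in the original filing text. LlamaIndex text nodes carry `start_char_idx` / `end_char_idx` fields that could supply such spans, though they may be unset depending on the parser. The helper names `find_sections`, `section_for_offset`, and `label_chunks` are hypothetical.

```python
import re
from bisect import bisect_right

# Assumed header shape: "Item 1A. Risk Factors" at the start of a line.
ITEM_HEADER = re.compile(r"^Item[ \t]+\d+[A-Z]?\.[ \t]*.*$", re.MULTILINE)


def find_sections(filing_text):
    """Return (start_offset, header) pairs in document order."""
    return [(m.start(), m.group(0).strip()) for m in ITEM_HEADER.finditer(filing_text)]


def section_for_offset(sections, offset):
    """Return the header of the section that contains a character offset."""
    starts = [start for start, _ in sections]
    i = bisect_right(starts, offset) - 1
    return sections[i][1] if i >= 0 else "(front matter)"


def label_chunks(filing_text, chunk_spans):
    """For each (start, end) span, return (first_section, last_section, straddles)."""
    sections = find_sections(filing_text)
    labels = []
    for start, end in chunk_spans:
        first = section_for_offset(sections, start)
        last = section_for_offset(sections, max(start, end - 1))
        # A chunk that starts in one section and ends in another straddles a boundary.
        labels.append((first, last, first != last))
    return labels
```

Chunks flagged as straddling are the most likely carriers of bleed, since a single embedding then mixes text from two sections.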

Relevant Logs/Tracebacks

There is no crash or exception; this is a retrieval‑quality issue. The index and query run successfully, but the retrieved nodes frequently come from neighboring sections that are semantically close in embedding space but incorrect for the user’s question.

Extent Analysis

TL;DR

Reduce semantic bleed between adjacent sections by chunking along section boundaries (section-aware chunking) before building the VectorStoreIndex, or by trying an embedding model that separates those sections better.

Guidance

  • Consider a chunking strategy that respects section boundaries and headers, so that no chunk mixes text from two adjacent sections.
  • Experiment with a different embedding model, or different embedding settings, to better capture semantic differences between sections.
  • Evaluate tagging each chunk with section-specific keywords or metadata (for example, the item header) to help the retriever distinguish between similar sections.
  • Inspect the top‑k retrieved nodes and tally which sections they come from, to measure how much bleed actually occurs.
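One way to read the first and third suggestions together is to split the filing at item headers first and only then chunk inside each section, tagging every chunk with its header. The sketch below is an illustration under stated assumptions rather than LlamaIndex's own API: headers are assumed to look like “Item 7. …” at the start of a line, word windows stand in for token-based chunking, and `split_by_section`, `chunk_section`, and `section_aware_chunks` are hypothetical names.

```python
import re

# Assumed header shape: "Item 7. Management's Discussion..." at the start of a line.
ITEM_HEADER = re.compile(r"^Item[ \t]+\d+[A-Z]?\.[ \t]*.*$", re.MULTILINE)


def split_by_section(filing_text):
    """Yield (header, body) pairs; text before the first header is front matter."""
    matches = list(ITEM_HEADER.finditer(filing_text))
    if not matches or matches[0].start() > 0:
        end = matches[0].start() if matches else len(filing_text)
        yield "(front matter)", filing_text[:end]
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        yield m.group(0).strip(), filing_text[m.end():end]


def chunk_section(body, max_words=200, overlap=20):
    """Split one section into overlapping word windows (a stand-in for token chunking)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = body.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the section
    return chunks


def section_aware_chunks(filing_text, max_words=200, overlap=20):
    """Return chunk dicts with section metadata; no chunk crosses a section boundary."""
    out = []
    for header, body in split_by_section(filing_text):
        for i, chunk in enumerate(chunk_section(body, max_words, overlap)):
            out.append({
                # Prepending the header puts the section name into the embedded text.
                "text": f"{header}\n{chunk}",
                "metadata": {"section": header, "chunk_index": i},
            })
    return out
```

Each dict could then be turned into a node or document for indexing, with the `section` value stored as metadata. Whether prepending the header actually improves retrieval on a given corpus is something to measure, not assume; the metadata could also drive a filter at query time if the vector store in use supports filtering.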

Example

The issue itself does not include a code example, since the problem concerns the indexing and retrieval strategy rather than a specific code snippet.

Notes

The issue points to semantic similarity between adjacent sections as the cause, but more experimentation and analysis are needed to determine which approach best resolves it.
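To make that experimentation comparable across chunking strategies or embedding models, it may help to score each query by the share of top‑k results that come from the expected section. The following is a minimal sketch with hypothetical function names; it assumes each retrieved node has already been labeled with its section header, in ranked order.

```python
from collections import Counter


def section_precision_at_k(retrieved_sections, expected_section, k=5):
    """Fraction of the top-k retrieved chunks that come from the expected section."""
    top = retrieved_sections[:k]
    if not top:
        return 0.0
    return sum(1 for s in top if s == expected_section) / len(top)


def section_distribution(retrieved_sections, k=5):
    """Count how many of the top-k retrieved chunks come from each section."""
    return Counter(retrieved_sections[:k])
```

A low score combined with a distribution dominated by neighboring items would be consistent with the bleed described in the issue; a low score spread across distant sections would suggest a different problem.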

Recommendation

The issue does not mention a fixed version to upgrade to, so apply a workaround: implement section-aware chunking or try a different embedding model to reduce semantic bleed between adjacent sections.
