llamaIndex - ✅(Solved) Fix [Bug]: Sync recursive retrieval misses `ref_doc_id` in dedup key [1 pull requests, 3 comments, 3 participants]

GitHub stats
run-llama/llama_index#21033 (fetched 2026-04-08 00:47:57)
Comments: 3
Participants: 3
Timeline: 12
Reactions: 0
Timeline (top): commented ×3, cross-referenced ×3, labeled ×2, closed ×1

Fix Action

Fixed

PR fix notes

PR #21034: fix(core): align sync retrieval dedup key with async (hash + ref_doc_id)

Description (problem / solution / changelog)

Description

Aligns the sync dedup key in _handle_recursive_retrieval with the async path by using (node.hash, node.ref_doc_id), fixing a discrepancy introduced in #14383.

Fixes #21033

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Changed files

  • llama-index-core/llama_index/core/base/base_retriever.py (modified, +5/-4)
  • llama-index-core/tests/retrievers/test_composable_retriever.py (modified, +21/-1)


Bug Description

_handle_recursive_retrieval deduplicates nodes using node.hash alone, while its async counterpart _ahandle_recursive_retrieval uses (node.hash, node.ref_doc_id). This means two nodes with identical text and metadata but different source documents are incorrectly dropped in sync retrieval. The async behavior was intentionally updated in PR #14383 but the sync path was never updated to match.
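The difference between the two dedup keys can be illustrated without llama_index at all. The following is a minimal sketch, assuming a hypothetical `FakeNode` stand-in and a hypothetical `dedup` helper (neither is part of the library); it only shows why keying on the hash alone collapses nodes that share content but come from different source documents.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List


@dataclass(frozen=True)
class FakeNode:
    """Hypothetical stand-in for a retrieved node; not the llama_index class."""

    hash: str        # content hash (same text + metadata -> same hash)
    ref_doc_id: str  # id of the source document


def dedup(nodes: List[FakeNode], key: Callable[[FakeNode], Hashable]) -> List[FakeNode]:
    """Keep the first node seen for each key, preserving input order."""
    seen = set()
    kept = []
    for node in nodes:
        k = key(node)
        if k not in seen:
            seen.add(k)
            kept.append(node)
    return kept


# Two nodes with identical content but different source documents.
nodes = [FakeNode("h1", "doc-1"), FakeNode("h1", "doc-2")]

by_hash = dedup(nodes, key=lambda n: n.hash)                          # sync-style key
by_hash_and_doc = dedup(nodes, key=lambda n: (n.hash, n.ref_doc_id))  # async-style key

print(len(by_hash))          # 1: the doc-2 node is dropped
print(len(by_hash_and_doc))  # 2: both nodes survive
```

With the hash-only key the second node is treated as a duplicate of the first, which matches the single-node sync output reported in this issue.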

Version

0.14.17

Steps to Reproduce


import asyncio
from llama_index.core.base.base_retriever import BaseRetriever
from llama_index.core.schema import (
    NodeRelationship,
    NodeWithScore,
    QueryBundle,
    RelatedNodeInfo,
    TextNode,
)

def make_node(text: str, metadata: dict, source_doc_id: str) -> TextNode:
    node = TextNode(text=text, metadata=metadata)
    node.relationships[NodeRelationship.SOURCE] = RelatedNodeInfo(node_id=source_doc_id)
    return node

class DuplicateContentRetriever(BaseRetriever):
    def _retrieve(self, query_bundle: QueryBundle):
        node1 = make_node("shared content", {}, "doc-1")
        node2 = make_node("shared content", {}, "doc-2")
        return [
            NodeWithScore(node=node1, score=0.9),
            NodeWithScore(node=node2, score=0.8),
        ]

retriever = DuplicateContentRetriever()

sync_nodes = retriever.retrieve("test")
print(f"Sync: {len(sync_nodes)} node(s)")
for n in sync_nodes:
    print(f"  ref_doc_id={n.node.ref_doc_id} score={n.score}")

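# Top-level `await` only works in a notebook/REPL; wrap it so the script runs as-is.
async def main() -> None:
    async_nodes = await retriever.aretrieve("test")
    print(f"Async: {len(async_nodes)} node(s)")
    for n in async_nodes:
        print(f"  ref_doc_id={n.node.ref_doc_id} score={n.score}")

asyncio.run(main())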

Relevant Logs/Tracebacks

Sync: 1 node(s)
  ref_doc_id=doc-1 score=0.9
Async: 2 node(s)
  ref_doc_id=doc-1 score=0.9
  ref_doc_id=doc-2 score=0.8

extent analysis

Fix Plan

To fix the issue, we need to update the _handle_recursive_retrieval method to use (node.hash, node.ref_doc_id) for deduplication, matching the async counterpart.

Steps:

  • Update the _handle_recursive_retrieval method in the BaseRetriever class.
  • Use a tuple of (node.hash, node.ref_doc_id) for deduplication.

Example Code:

def _handle_recursive_retrieval(self, query_bundle: QueryBundle, nodes: List[NodeWithScore]) -> List[NodeWithScore]:
    # ...
    seen_nodes = set()
    result = []
    for node in nodes:
        # Use a tuple of (node.hash, node.ref_doc_id) for deduplication
        node_key = (node.node.hash, node.node.ref_doc_id)
        if node_key not in seen_nodes:
            seen_nodes.add(node_key)
            result.append(node)
    # ...
    return result

Verification

To verify the fix, run the provided test code and check that the sync retrieval returns both nodes with different ref_doc_id values.

Expected Output:

Sync: 2 node(s)
  ref_doc_id=doc-1 score=0.9
  ref_doc_id=doc-2 score=0.8
Async: 2 node(s)
  ref_doc_id=doc-1 score=0.9
  ref_doc_id=doc-2 score=0.8

Extra Tips

  • Make sure to update the BaseRetriever class to ensure the fix applies to all retriever implementations.
  • Consider adding test cases to cover this scenario and prevent regressions.
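One possible shape for such a regression test is a sync/async parity check. This is a sketch only: `StubNode` and `StubRetriever` are hypothetical stand-ins rather than llama_index classes, and this is not the test added in the PR. A real test would exercise an actual `BaseRetriever` subclass instead of the stub.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StubNode:
    """Hypothetical node with only the two fields used for deduplication."""

    hash: str
    ref_doc_id: str


class StubRetriever:
    """Hypothetical retriever whose sync and async paths share one dedup key."""

    def __init__(self, nodes: List[StubNode]) -> None:
        self._nodes = list(nodes)

    @staticmethod
    def _dedup(nodes: List[StubNode]) -> List[StubNode]:
        seen = set()
        kept = []
        for node in nodes:
            key = (node.hash, node.ref_doc_id)  # same key on both paths
            if key not in seen:
                seen.add(key)
                kept.append(node)
        return kept

    def retrieve(self, query: str) -> List[StubNode]:
        return self._dedup(self._nodes)

    async def aretrieve(self, query: str) -> List[StubNode]:
        return self._dedup(self._nodes)


def test_sync_and_async_keep_same_nodes() -> None:
    retriever = StubRetriever(
        [
            StubNode("h1", "doc-1"),
            StubNode("h1", "doc-2"),  # same content, different source: must be kept
            StubNode("h1", "doc-1"),  # true duplicate: must be dropped
        ]
    )
    sync_nodes = retriever.retrieve("test")
    async_nodes = asyncio.run(retriever.aretrieve("test"))
    assert sync_nodes == async_nodes
    assert [n.ref_doc_id for n in sync_nodes] == ["doc-1", "doc-2"]


test_sync_and_async_keep_same_nodes()
```

Asserting parity between the two paths, rather than only a node count, would also catch a future change that updates one path and not the other.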
