dify - 💡(How to fix) Fix Knowledge Base: Duplicate content chunks from different documents are collapsed during retrieval (content-only deduplication ignores metadata) [1 comments, 2 participants]

GitHub stats
langgenius/dify#35707 · Fetched 2026-04-30 06:45:20
Comments: 1 · Participants: 2 · Timeline: 5 · Reactions: 3
Timeline (top): assigned ×1, commented ×1, labeled ×1, mentioned ×1

Self Checks

  • I have read the Contributing Guide and Language Policy.
  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report, otherwise it will be closed.
  • [Chinese & Non-English Users] Please submit in English, otherwise the issue will be closed :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

1.13.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Create a Knowledge Base with Hybrid Search enabled.
  2. Upload two different documents (e.g. file_a.txt and file_b.txt) that share at least one section of identical text content but have different filenames/metadata.
  3. Index both documents successfully.
  4. Query the Knowledge Base with a phrase that matches the shared content.
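A minimal sketch for preparing the two files from step 2. The file names come from the steps above; the shared sentence, the surrounding text, and the `write_fixture_files` helper are made up for illustration. Whether the shared section ends up as an identical chunk in both documents depends on the chunking settings used at indexing time, so this only prepares the input.

```python
import tempfile
from pathlib import Path

# Hypothetical shared section; any text that appears verbatim in both files works.
SHARED_SECTION = "Refunds are processed within 14 days of the return being received."


def write_fixture_files(target_dir: Path) -> tuple[Path, Path]:
    """Write two documents that differ overall but share one identical section."""
    file_a = target_dir / "file_a.txt"
    file_b = target_dir / "file_b.txt"
    file_a.write_text(f"Policy for region A.\n\n{SHARED_SECTION}\n", encoding="utf-8")
    file_b.write_text(f"Policy for region B.\n\n{SHARED_SECTION}\n", encoding="utf-8")
    return file_a, file_b


fixture_dir = Path(tempfile.mkdtemp())
file_a, file_b = write_fixture_files(fixture_dir)
print(file_a, file_b)
```

Uploading both files and querying with the shared sentence should then exercise the behavior described below.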

✔️ Expected Behavior

Both documents are returned as separate retrieval results, each with its own source metadata (e.g. filename, document ID). The knowledge base correctly distinguishes two chunks that originate from different source documents, even if their text content is identical.

❌ Actual Behavior

Only one of the two documents is returned. The retrieval result for the second document is silently dropped. Any downstream citation or attribution that depends on the second document's metadata is lost.

Root Cause (for maintainers)

In api/core/rag/datasource/retrieval_service.py, the _deduplicate_documents method builds a deduplication key using page_content alone when the primary UUID path isn't taken:

# ~line 249
content_key = (doc.provider or "dify", doc.page_content)  # ← content only
if content_key not in chosen:
    chosen[content_key] = doc
    order.append(content_key)

This means two chunks with identical text but different doc_id UUIDs (originating from different source documents) are treated as duplicates, and only the first encountered is kept.

Additionally, api/libs/helper.py's generate_text_hash() hashes only page_content without including any metadata, so the same assumption propagates to embedding cache keying and thread group assignment in indexing_runner.py.
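To make the hashing point concrete, here is a small sketch of the difference between a content-only digest and one that also covers a document identifier. The actual body of generate_text_hash() is not shown in this report, so `content_hash` and `scoped_hash` below are hypothetical stand-ins, and SHA-256 is an assumed choice.

```python
import hashlib


def content_hash(text: str) -> str:
    """Content-only digest: identical text always yields the same value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def scoped_hash(text: str, doc_id: str) -> str:
    """Digest that also covers a document identifier (hypothetical variant)."""
    digest = hashlib.sha256()
    digest.update(doc_id.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


shared = "Shared paragraph."
collides = content_hash(shared) == content_hash(shared)
distinct = scoped_hash(shared, "uuid-a") != scoped_hash(shared, "uuid-b")
print(collides, distinct)
```

A content-only key may still be the right choice for an embedding cache, since identical text should produce the same embedding; scoping the hash by document would likely reduce cache hits there. The sketch only shows where the two approaches diverge.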

Suggested fix — prefer doc_id in the fallback branch when available:

else:
    available_id = (doc.metadata or {}).get("doc_id")
    content_key = (doc.provider or "dify", available_id if available_id else doc.page_content)
    if content_key not in chosen:
        chosen[content_key] = doc
        order.append(content_key)
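The effect of the change can be checked in isolation with a self-contained sketch. The `Doc` dataclass and the `dedupe` function are hypothetical stand-ins that model only the fallback branch quoted above, not the full _deduplicate_documents method; the `prefer_doc_id` flag exists purely to compare the two key strategies side by side.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved chunk; not Dify's Document class."""
    page_content: str
    provider: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def dedupe(docs, prefer_doc_id):
    """Keep the first document seen for each key, preserving input order."""
    chosen = {}
    order = []
    for doc in docs:
        available_id = (doc.metadata or {}).get("doc_id") if prefer_doc_id else None
        # Falls back to the text itself when no doc_id is available.
        content_key = (doc.provider or "dify", available_id or doc.page_content)
        if content_key not in chosen:
            chosen[content_key] = doc
            order.append(content_key)
    return [chosen[key] for key in order]


docs = [
    Doc("Shared paragraph.", metadata={"doc_id": "uuid-a", "source": "file_a.txt"}),
    Doc("Shared paragraph.", metadata={"doc_id": "uuid-b", "source": "file_b.txt"}),
]

before = dedupe(docs, prefer_doc_id=False)  # content-only key
after = dedupe(docs, prefer_doc_id=True)    # doc_id-first key
print([d.metadata["source"] for d in before])
print([d.metadata["source"] for d in after])
```

With the content-only key the second chunk is dropped; with the doc_id-first key both survive, while chunks that carry no doc_id still collapse on identical text as before.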

Environment

  • Dify version: 1.13.3 (Self Hosted, Docker)
  • Vector store backend: (e.g. Weaviate, Qdrant, pgvector)
  • Search mode: Hybrid Search

extent analysis

TL;DR

Update the _deduplicate_documents method in retrieval_service.py to prefer doc_id over page_content for deduplication key generation when available.

Guidance

  • Apply the suggested fix to the _deduplicate_documents method so that chunks with identical text but different metadata are no longer treated as duplicates.
  • Verify that doc_id is actually populated in the document metadata on the fallback path; if it is missing, the key still falls back to page_content and the collapse persists.
  • Re-run the reproduction steps against the updated method and confirm that both documents are returned as separate retrieval results.
  • Consider whether generate_text_hash() in helper.py should also include metadata such as doc_id, so that hashing and cache keying follow the same assumption as retrieval.

Example

else:
    available_id = (doc.metadata or {}).get("doc_id")
    content_key = (doc.provider or "dify", available_id if available_id else doc.page_content)
    if content_key not in chosen:
        chosen[content_key] = doc
        order.append(content_key)

Notes

The suggested fix assumes that the doc_id is available and correctly populated in the document metadata. If this is not the case, additional modifications may be necessary to ensure accurate deduplication.
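One possible shape for such an additional modification is a tiered key that degrades gracefully when doc_id is absent. This is a sketch under assumptions: `Doc` and `dedup_key` are hypothetical, and the `document_id` metadata field used as the second tier is assumed for illustration and is not confirmed by the report.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved chunk; not Dify's Document class."""
    page_content: str
    provider: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def dedup_key(doc):
    """Build the most specific key the available metadata allows."""
    metadata = doc.metadata or {}
    provider = doc.provider or "dify"
    if metadata.get("doc_id"):
        return (provider, "doc_id", metadata["doc_id"])
    if metadata.get("document_id"):
        # Assumed field: scopes identical text to its parent document.
        return (provider, "document+content", metadata["document_id"], doc.page_content)
    # Last resort: content-only, which matches the current behavior.
    return (provider, "content", doc.page_content)


a = Doc("Shared paragraph.", metadata={"document_id": "doc-a"})
b = Doc("Shared paragraph.", metadata={"document_id": "doc-b"})
c = Doc("Shared paragraph.")
print(dedup_key(a) != dedup_key(b), dedup_key(c))
```

Tagging each key with its tier name keeps a doc_id string from ever colliding with a chunk whose text happens to equal that string.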

Recommendation

Apply the suggested fix: update the _deduplicate_documents method to prefer doc_id over page_content as the deduplication key when doc_id is available. This should stop retrieval results from being silently dropped when chunks from different documents share identical text.
