dify - ✅(Solved) Fix [Bug] Weaviate vector database UUID generation causes identical child segments from different parent documents to overwrite each other [1 pull request, 1 comment, 2 participants]

GitHub stats
langgenius/dify#36137 · Fetched 2026-05-14 03:46:28
Comments: 1 · Participants: 2 · Timeline events: 3 (commented ×1, cross-referenced ×1, labeled ×1) · Reactions: 1

Root Cause

When two different documents contain a segment with exactly the same text (e.g., both have a child segment with the content "Hello"), the Weaviate vector database integration derives the UUID from the text content (page_content) alone. Identical content therefore yields an identical UUID, so when the second document is parsed and embedded, its segment overwrites the first document's vector node. The first document's segment then becomes unretrievable.
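The collision follows from how UUIDv5 works: it is a deterministic hash of a namespace and a name, so the same name always maps to the same UUID. A minimal sketch using only the Python standard library; the two segment variables are illustrative and not taken from Dify:

```python
import uuid

# Namespace constant from the issue's _get_uuids snippet. It is the
# standard URL namespace, also available as uuid.NAMESPACE_URL.
URL_NAMESPACE = uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")

# Hypothetical child segments from two different parent documents.
segment_from_doc_a = "Hello"
segment_from_doc_b = "Hello"

uuid_a = uuid.uuid5(URL_NAMESPACE, segment_from_doc_a)
uuid_b = uuid.uuid5(URL_NAMESPACE, segment_from_doc_b)

print(uuid_a == uuid_b)  # True: identical content yields an identical UUID
```

Nothing about the parent document enters the hash, so the two segments cannot be told apart by their IDs.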

Fix Action

Fixed

PR fix notes

PR #36140: fix(weaviate): include doc_id in UUID generation to prevent overwriting

Description (problem / solution / changelog)

Summary

Fixes #36137

This PR fixes a critical bug where identical text segments from different parent documents generate the exact same UUID and overwrite each other in the Weaviate vector database.

Currently, the _get_uuids method in the Weaviate plugin hashes only doc.page_content to generate the UUID5 value. This PR updates the UUID generation logic to include the doc_id from the document metadata in the hash input. Since doc_id is unique per segment, combining it with page_content ensures that identical text across different documents generates distinct UUIDs, preventing silent data loss in Weaviate. When a document carries no doc_id, the method falls back to hashing page_content alone.

Screenshots

Before: Identical text chunks generate the same UUID and overwrite existing data.
After: Identical text chunks from different documents generate distinct UUIDs and are stored correctly.

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint && make type-check (backend) and cd web && pnpm exec vp staged (frontend) to appease the lint gods

Changed files

  • api/providers/vdb/vdb-weaviate/src/dify_vdb_weaviate/weaviate_vector.py (modified, +3/-1)

Code Example

Before:

def _get_uuids(self, documents: list[Document]) -> list[str]:
    URL_NAMESPACE = _uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")
    uuids = []
    for doc in documents:
        # The UUID depends on the text alone, so identical text collides
        uuid_val = _uuid.uuid5(URL_NAMESPACE, doc.page_content)
        uuids.append(str(uuid_val))
    return uuids

After:

def _get_uuids(self, documents: list[Document]) -> list[str]:
    URL_NAMESPACE = _uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")
    uuids = []
    for doc in documents:
        doc_id = (doc.metadata or {}).get("doc_id", "")
        # Combine doc_id and page_content to avoid overwriting identical text blocks
        text_to_hash = f"{doc_id}_{doc.page_content}" if doc_id else doc.page_content
        uuid_val = _uuid.uuid5(URL_NAMESPACE, text_to_hash)
        uuids.append(str(uuid_val))
    return uuids
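To compare the two versions outside of Dify, the sketch below reimplements both as free functions. The Document dataclass is a stand-in with only the two fields the method reads, and the doc_id values are made up; neither is Dify's actual class or data.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

URL_NAMESPACE = uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")


@dataclass
class Document:
    """Stand-in for Dify's Document (assumption: only these fields matter here)."""
    page_content: str
    metadata: Optional[dict] = field(default_factory=dict)


def get_uuids_before(documents):
    # Original logic: hash the text content only.
    return [str(uuid.uuid5(URL_NAMESPACE, d.page_content)) for d in documents]


def get_uuids_after(documents):
    # Patched logic: prefix the text with doc_id when one is present.
    uuids = []
    for doc in documents:
        doc_id = (doc.metadata or {}).get("doc_id", "")
        text_to_hash = f"{doc_id}_{doc.page_content}" if doc_id else doc.page_content
        uuids.append(str(uuid.uuid5(URL_NAMESPACE, text_to_hash)))
    return uuids


# Two segments with the same text but different (hypothetical) doc_ids.
docs = [
    Document("Hello", {"doc_id": "segment-a"}),
    Document("Hello", {"doc_id": "segment-b"}),
]

before = get_uuids_before(docs)
after = get_uuids_after(docs)
print(len(set(before)))  # 1: both segments share one UUID
print(len(set(after)))   # 2: each segment gets its own UUID
```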

Self Checks

  • I have read the Contributing Guide and Language Policy.
  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report, otherwise it will be closed.
  • [Chinese & non-English users] Please submit in English, otherwise the issue will be closed :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

1.14.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Describe the bug

When two different documents contain exactly the same text in one of their segments (e.g., both have a child segment with the content "Hello"), the Weaviate vector database uses only the text content (page_content) to generate the UUID. Because the content is identical, the generated UUID is identical. As a result, when the second document is parsed and embedded, it completely overwrites the vector node of the first document. This causes the first document's segment to become unretrievable.

To Reproduce

  1. Upload Document A1 and configure it with parent-child segmentation. Suppose one of the child segments has the exact content: "Hello".
  2. Upload Document B2 and also configure it with parent-child segmentation. Ensure it also generates a child segment with the exact content: "Hello".
  3. Wait for the indexing process to finish.
  4. Perform a retrieval test using the keyword "Hello".
  5. Only Document B2 is retrieved (with a high score). Document A1 is missing.
  6. If you disable Document B2 in the knowledge base and test again, Document A1 is still not retrieved.
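The reproduction steps can be mimicked without a running Weaviate instance. The sketch below uses a plain dict as a hypothetical stand-in for the vector store and assumes, as the issue reports, that writing an object under an existing UUID replaces what was stored there. The upsert helper and the stored fields are illustrative only.

```python
import uuid

URL_NAMESPACE = uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")

# Hypothetical in-memory stand-in for the vector store: UUID -> properties.
store = {}


def upsert(document_id, text):
    # Content-only UUID, as in the original _get_uuids.
    object_id = str(uuid.uuid5(URL_NAMESPACE, text))
    store[object_id] = {"document_id": document_id, "text": text}


upsert("A1", "Hello")  # steps 1 and 3: Document A1 is indexed
upsert("B2", "Hello")  # steps 2 and 3: Document B2 is indexed

# Step 5: only one object survives, and it belongs to B2.
print(len(store))                                        # 1
print([obj["document_id"] for obj in store.values()])    # ['B2']

# Step 6: filtering out the disabled Document B2 leaves nothing for A1.
enabled = [obj for obj in store.values() if obj["document_id"] != "B2"]
print(enabled)                                           # []
```

This also shows why disabling Document B2 does not bring A1 back: A1's entry was replaced at write time, not hidden at query time.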

✔️ Expected Behavior

Identical text segments from different documents should be stored as separate nodes in the vector database and both should be retrievable based on their respective document contexts.

❌ Actual Behavior

Additional context / Root cause analysis

In api/providers/vdb/vdb-weaviate/src/dify_vdb_weaviate/weaviate_vector.py, the _get_uuids method generates deterministic UUIDs using _uuid.uuid5 purely based on doc.page_content.

    def _get_uuids(self, documents: list[Document]) -> list[str]:
        URL_NAMESPACE = _uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")
        uuids = []
        for doc in documents:
            uuid_val = _uuid.uuid5(URL_NAMESPACE, doc.page_content)
            uuids.append(str(uuid_val))
        return uuids

When Document B2's "Hello" is embedded, it calculates the exact same UUID as A1's "Hello" and overwrites A1's node metadata (including document_id).

Suggested fixes

Modify the UUID generation logic to include the unique doc_id in the hash calculation. This ensures that even if the text content is identical, the UUID will be unique per segment across different documents.

    def _get_uuids(self, documents: list[Document]) -> list[str]:
        URL_NAMESPACE = _uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")
        uuids = []
        for doc in documents:
            doc_id = (doc.metadata or {}).get("doc_id", "")
            # Combine doc_id and page_content to avoid overwriting identical text blocks
            text_to_hash = f"{doc_id}_{doc.page_content}" if doc_id else doc.page_content
            uuid_val = _uuid.uuid5(URL_NAMESPACE, text_to_hash)
            uuids.append(str(uuid_val))
        return uuids
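Two properties of the suggested logic can be checked in isolation: the same doc_id and text always produce the same UUID, and segments without a doc_id fall back to content-only hashing, so they can still collide. A small sketch under those assumptions; segment_uuid is a hypothetical helper that mirrors the loop body above, and the doc_id values are made up.

```python
import uuid

URL_NAMESPACE = uuid.UUID("6ba7b811-9dad-11d1-80b4-00c04fd430c8")


def segment_uuid(page_content, metadata):
    # Mirrors the loop body of the suggested _get_uuids (hypothetical helper).
    doc_id = (metadata or {}).get("doc_id", "")
    text_to_hash = f"{doc_id}_{page_content}" if doc_id else page_content
    return str(uuid.uuid5(URL_NAMESPACE, text_to_hash))


# Deterministic: re-indexing the same segment yields the same UUID.
first = segment_uuid("Hello", {"doc_id": "segment-a"})
second = segment_uuid("Hello", {"doc_id": "segment-a"})
print(first == second)  # True

# Fallback: with no doc_id, identical text still maps to a single UUID.
without_ids = {
    segment_uuid("Hello", None),
    segment_uuid("Hello", {}),
    segment_uuid("Hello", {"doc_id": ""}),
}
print(len(without_ids))  # 1
```

The fallback keeps UUIDs stable for documents that carry no doc_id, at the cost of leaving the original collision in place for them.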
