dify - ✅(Solved) Fix RAG mixed retrieval problem [1 pull requests, 1 comments, 2 participants]

dify · 2026-04-22 05:20:48

GitHub stats

langgenius/dify#35482 • Fetched 2026-04-23 07:45:34

Author

hutao86

Participants

dosubot[bot]

hutao86

Timeline (top)

commented ×1, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: fix(rag): decouple hybrid recall from top_k for stable merged ranking… (https://github.com/langgenius/dify/pull/35498)

PR fix notes

PR #35498: fix(rag): decouple hybrid recall from top_k for stable merged ranking…

Repository: langgenius/dify
Author: thomascolden585-svg
State: open | merged: False
Link: https://github.com/langgenius/dify/pull/35498

Description (problem / solution / changelog)

Summary

Fixes mixed (hybrid) RAG retrieval so that changing only Top K no longer changes which segments participate in score fusion. Previously, a different Top K could reorder the head of the list, so the first N results differed between a small Top K and a large Top K for the same query.

Problem

Hybrid search runs vector and full-text in parallel, deduplicates, then merges scores (weighted score or reranker). The final top_k was also used as the per-channel limit for each sub-retriever. A segment can sit below k in both channels but still get a high combined score. With a small k it never entered the candidate pool; with a larger k it did—so the merged top results were inconsistent with user expectations (reported in #35482).
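The effect described above can be reproduced with a toy model. The sketch below is not Dify's code: the scores are made up, `hybrid_search` and `top` are hypothetical helpers, and fusion is a plain equal-weight sum with no normalization or reranker. It only shows how using the final top_k as the per-channel limit keeps a segment that ranks 4th in both channels out of the candidate pool.

```python
def top(scores, k):
    """Return the ids of the k highest-scoring entries in a {id: score} dict."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def hybrid_search(vector, fulltext, top_k, per_channel_limit, weight=0.5):
    """Toy weighted fusion: each channel only contributes its own first hits."""
    v_hits = {i: vector[i] for i in top(vector, per_channel_limit)}
    f_hits = {i: fulltext[i] for i in top(fulltext, per_channel_limit)}
    fused = {
        i: weight * v_hits.get(i, 0.0) + (1 - weight) * f_hits.get(i, 0.0)
        for i in set(v_hits) | set(f_hits)
    }
    return top(fused, top_k)


# Made-up scores: D ranks 4th in both channels but fuses to 2nd overall.
VECTOR = {"A": 0.90, "B": 0.80, "C": 0.70, "D": 0.65, "E": 0.10, "F": 0.05}
FULLTEXT = {"A": 0.60, "E": 0.55, "F": 0.50, "D": 0.45, "B": 0.10, "C": 0.05}

# Old behavior: the per-channel limit is tied to the final top_k.
small = hybrid_search(VECTOR, FULLTEXT, top_k=3, per_channel_limit=3)
large = hybrid_search(VECTOR, FULLTEXT, top_k=8, per_channel_limit=8)
print(small)      # ['A', 'B', 'C']
print(large[:3])  # ['A', 'D', 'B']
```

With a limit of 3, D is never fetched by either channel, so it cannot be fused; with a limit of 8 it enters the pool and lands in second place, which matches the shape of the report in #35482.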

Solution

Introduce a dedicated per-channel recall for hybrid: min(200, max(50, final_top_k)).
Use it for embedding_search and full_text_index_search in hybrid mode.
Keep post-merge cut at the user’s top_k via DataPostProcessor.invoke(..., top_n=top_k).
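The recall formula in the first bullet is a simple clamp. A minimal sketch follows; the function and constant names are hypothetical and are not taken from the PR, only the expression min(200, max(50, final_top_k)) comes from the description.

```python
# Hypothetical names; only the bounds 50 and 200 come from the PR description.
HYBRID_RECALL_FLOOR = 50
HYBRID_RECALL_CEILING = 200


def hybrid_recall_limit(final_top_k: int) -> int:
    """Per-channel fetch size for hybrid mode: clamp final_top_k to [50, 200]."""
    return min(HYBRID_RECALL_CEILING, max(HYBRID_RECALL_FLOOR, final_top_k))


for k in (3, 8, 50, 120, 500):
    print(k, hybrid_recall_limit(k))
```

By this formula, every Top K up to 50 fetches the same 50 candidates per channel, so the fused pool no longer depends on Top K in that range. Between 50 and 200 the per-channel fetch still grows with Top K, and above 200 it is capped.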

Trade-offs

Slightly more work per hybrid query (larger per-channel fetch) in exchange for stable, correct fusion behavior.

Test plan

uv run pytest tests/unit_tests/core/rag/datasource/test_datasource_retrieval.py::test_hybrid_recall_top_k_for_merge_contract -q
uv run pytest tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py -k hybrid -q
Manual: same query, hybrid + weighted / rerank, Top K 3 vs 8 — first 3 should match (same ordering and segments).
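The manual check in the last step amounts to a prefix comparison between the two result lists. A small sketch, assuming results are represented as ordered lists of segment ids (the helper name `first_n_match` is hypothetical and not part of the PR's tests):

```python
def first_n_match(small_k_results, large_k_results):
    """True if the small-Top-K results are exactly the head of the large-Top-K results."""
    n = len(small_k_results)
    return list(large_k_results[:n]) == list(small_k_results)


# Ordering reported in #35482: Top K 3 gave A, B, C; Top K 8 began A, D, B, C.
print(first_n_match(["A", "B", "C"], ["A", "D", "B", "C"]))  # False
# Expected after the fix: the first 3 of the larger list match the smaller list.
print(first_n_match(["A", "D", "B"], ["A", "D", "B", "C"]))  # True
```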

Closes / fixes #35482

Changed files

api/core/rag/datasource/retrieval_service.py (modified, +33/-3)
api/tests/unit_tests/core/rag/datasource/test_datasource_retrieval.py (modified, +9/-0)

Self Checks

I have read the Contributing Guide and Language Policy.
This is only for bug report, if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report, otherwise it will be closed.
[Chinese & non-English users] Please submit in English, otherwise the issue will be closed :)
Please do not modify this template :) and fill in all the required fields.

Dify version

1.9.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

With the same text input:

1. With the mixed search and reach model selected, Top K set to 3, and no score threshold, the results were: content A scored 0.58, content B scored 0.47, and content C scored 0.45.
2. With the mixed search and reach model selected, Top K set to 8, and no score threshold, the results were: content A scored 0.58, content D scored 0.52, content B scored 0.47, content C scored 0.45, ...

My question: I changed only Top K, so why does the higher-scoring content D not appear in the top 3, yet appear in the top 8, ranked second?

✔️ Expected Behavior

Even if I change Top K, for example to 8, the first 3 results should be the same as the results returned with Top K set to 3.

❌ Actual Behavior

No response

extent analysis

TL;DR

The issue can be investigated by analyzing the scoring and ranking logic in the mixed search and reach model to understand why changing the TOPK value affects the results.

Guidance

Review the scoring algorithm to ensure it is correctly calculating scores for each content item.
Verify that the ranking logic is correctly sorting content items based on their scores.
Check if there are any threshold or filtering conditions that might be affecting the results when TOPK is set to 3 versus 8.
Investigate if the issue is related to the specific model or if it's a general problem with the ranking system.

Example

No code snippet can be provided without more information about the implementation details of the mixed search and reach model.

Notes

The issue might be related to the specific implementation of the mixed search and reach model, and more information about the model's logic and configuration would be necessary to provide a more accurate diagnosis.

Recommendation

Apply workaround: temporarily set TOPK to a higher value (e.g., 8) to ensure that all relevant content items are included in the results, while investigating the root cause of the issue.
