LlamaIndex - ✅ (Solved) Fix [Feature Request]: Semantic Duplication Check for potential duplicates generated in `generate_synthetic_queries_over_documents` [1 pull request, 1 comment, 2 participants]

GitHub stats
run-llama/llama_index#20809 (fetched 2026-04-08 00:30:48)
Comments: 1
Participants: 2
Timeline: 4
Reactions: 0
Timeline (top): labeled ×2, commented ×1, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #20834: feat(finetuning) : add semantic deduplication for synthetic queries generation

Description (problem / solution / changelog)

Description

Fixes : #20809

Summary

Adds optional semantic deduplication for synthetic query generation in cross-encoder training.

Problem

generate_synthetic_queries_over_documents() generates duplicate questions across chunks, inflating dataset size without adding semantic value.

Solution

Add an optional deduplication step. It is opt-in, so existing behavior is unchanged by default (backward compatible), and it uses FAISS for fast similarity search.

Dependencies

  • Added: faiss-cpu>=1.7.0
  • Existing: sentence-transformers>=2.3.0

Usage

questions = generate_synthetic_queries_over_documents(
    documents,
    enable_deduplication=True,  # Opt-in
    similarity_threshold=0.92,  # Configurable
    embedding_model_name="all-MiniLM-L6-v2",  # Configurable
)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Changed files

  • llama-index-finetuning/llama_index/finetuning/cross_encoders/dataset_gen.py (modified, +127/-3)
  • llama-index-finetuning/pyproject.toml (modified, +2/-1)


Feature Description

Semantic Deduplication

Current State: Absence of any Deduplication step

In llama_index/finetuning/cross_encoders/dataset_gen.py, generate_synthetic_queries_over_documents(...) generates synthetic questions per document chunk using an LLM and aggregates them into a flat list. There is no deduplication step before the final list is returned. The relevant code (see full code):

questions.extend(response_questions)
...
return questions

All generated questions are appended and returned as-is, without normalization or deduplication.

This can result in identical questions across multiple chunks and in near-duplicate questions with minor formatting differences. Further down the training pipeline, this leads to redundant training examples in downstream cross-encoder datasets and to an inflated dataset size without additional semantic coverage.

This issue becomes more noticeable when num_questions_per_chunk is large, and documents contain repetitive structure.

Proposed Improvement 1: Lightweight deduplication

Before returning questions, do a check for exact matches, preserving order, stripping whitespace, and removing empty strings:

deduped_questions = list(dict.fromkeys(q.strip() for q in questions if q.strip()))
return deduped_questions
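
To make the behavior of the one-liner concrete, here is a small self-contained sketch. The helper name `dedupe_exact` and the sample questions are illustrative only and are not part of dataset_gen.py. It shows that whitespace variants and empty strings are dropped, while casing differences and paraphrases survive, which is what motivates the semantic step.

```python
def dedupe_exact(questions):
    """Drop empty strings and exact duplicates (after stripping), keeping first-seen order."""
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(q.strip() for q in questions if q.strip()))


sample = [
    "What is FAISS?",
    "  What is FAISS?  ",  # same text with extra whitespace: removed
    "",                    # empty string: removed
    "what is faiss?",      # different casing: kept, exact matching cannot catch it
    "Explain FAISS.",      # paraphrase: kept
]
print(dedupe_exact(sample))
# ['What is FAISS?', 'what is faiss?', 'Explain FAISS.']
```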

Proposed Improvement 2: Semantic Deduplication

Main goal. We want near-duplicate detection (paraphrases, trivial rewrites, casing/punctuation changes, “What is X?” vs “Explain X.”)

We may use embedding-based near-duplicate removal -- encode each generated question into a vector embedding and compare it against embeddings of previously accepted questions. If the cosine similarity between a new question and any existing one exceeds a predefined threshold, the new question is treated as a semantic duplicate and discarded.
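
The accept-or-discard loop described above could be sketched as follows. This is a minimal illustration, not the code from the PR: `dedupe_semantic`, `cosine_similarity`, and `toy_embed` are hypothetical names, and `toy_embed` is a bag-of-words stand-in used only so the example runs without any model. A real setup would pass an embedding function backed by a sentence-level model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedupe_semantic(questions, embed, similarity_threshold=0.92):
    """Keep a question only if it is not too similar to any already accepted one."""
    kept, kept_vectors = [], []
    for question in questions:
        vector = embed(question)
        # Brute-force comparison against every previously accepted question.
        if any(cosine_similarity(vector, kv) > similarity_threshold for kv in kept_vectors):
            continue  # semantic duplicate: discard
        kept.append(question)
        kept_vectors.append(vector)
    return kept


# Toy stand-in for a real embedding model (assumption for illustration only).
VOCAB = ["what", "is", "faiss", "explain", "hnswlib"]


def toy_embed(text):
    words = text.lower().replace("?", "").replace(".", "").split()
    return [float(words.count(w)) for w in VOCAB]


print(dedupe_semantic(["What is FAISS?", "what is faiss", "Explain hnswlib."], toy_embed))
# ['What is FAISS?', 'Explain hnswlib.']
```

The loop is quadratic in the number of accepted questions, which is the cost a nearest-neighbor index is meant to reduce.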

To implement, we should first normalize the text (e.g., stripping whitespace, removing numbering or bullet markers, optionally lowercasing), then generate embeddings using a sentence-level model such as OpenAI embeddings or a local sentence-transformers model.
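
A normalization helper along these lines might look like the sketch below. The function name `normalize_question` and the exact marker pattern are assumptions; collapsing internal whitespace goes slightly beyond what is described above and is included only because LLM output often contains irregular spacing.

```python
import re

# Leading list markers such as "1.", "2)", "-", "*", or "•" followed by whitespace.
_LEADING_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def normalize_question(text, lowercase=True):
    """Strip whitespace and a leading numbering/bullet marker; optionally lowercase."""
    text = _LEADING_MARKER.sub("", text.strip())
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower() if lowercase else text


print(normalize_question("  1. What is   FAISS? "))
# what is faiss?
print(normalize_question("• What is X?", lowercase=False))
# What is X?
```

Normalizing before embedding means trivial variants map to the same string, so they can be removed by exact matching before any embeddings are computed.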

To ensure scalability, an Approximate Nearest Neighbor (ANN) index (e.g., FAISS or hnswlib) can be used to efficiently retrieve the most similar existing vectors rather than comparing against all prior questions.

Key tradeoffs to consider. This semantic deduplication approach requires computing and storing embeddings, as well as maintaining a vector index for similarity search.

Reason

No response

Value of Feature

Deduplication helps further down the training pipeline as follows:

  • Cross-encoder training quality improves with diverse queries
  • Prevents redundant positives in ranking datasets
  • Reduces training time and memory usage
  • Encourages better semantic coverage across chunks

extent analysis

Semantic Deduplication Fix Plan

Lightweight Deduplication

  1. Add deduplication step: In dataset_gen.py, replace the return questions line with the following code:
deduped_questions = list(dict.fromkeys(q.strip() for q in questions if q.strip()))
return deduped_questions

This will remove exact duplicates while preserving order.

Semantic Deduplication

  1. Install required libraries:
    • sentence-transformers for sentence embeddings
    • faiss or hnswlib for Approximate Nearest Neighbor (ANN) index
    • numpy and scipy for vector operations
  2. Normalize text and generate embeddings:
    • Import sentence-transformers and load a pre-trained model (e.g., sentence-transformers/all-MiniLM-L6-v2)
    • Define a function to normalize text (strip whitespace, remove numbering/bullet markers, lowercasing)
    • Generate embeddings for each question using the pre-trained model
  3. Create an ANN index:
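    • Initialize the index (e.g., faiss.IndexFlatIP over L2-normalized embeddings, so that the inner product equals cosine similarity; faiss.IndexFlatL2 returns squared L2 distances rather than similarities). Flat indexes perform exact search; FAISS also provides approximate indexes for larger datasets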
    • Add the initial set of question embeddings to the index
  4. Implement semantic deduplication:
    • For each new question, normalize the text and generate an embedding
    • Use the ANN index to find the most similar existing vector (e.g., index.search(embedding, k=1))
    • If the similarity exceeds a predefined threshold (e.g., 0.8), treat the new question as a semantic duplicate and discard it

Example Code Snippet

import sentence_transformers
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load pre-trained model and create a SentenceTransformer instance
model = SentenceTransformer('all-Mini
