langchain - ✅(Solved) Fix [Feature Request] Add SemanticSimilarityTextSplitter with Hierarchical Splitting to libs/text-splitters [1 pull request, 2 comments, 3 participants]

GitHub stats
langchain-ai/langchain#36272 (fetched 2026-04-08 01:36:21)
Comments: 2
Participants: 3
Timeline events: 8 (labeled ×3, commented ×2, closed ×1, cross-referenced ×1)
Reactions: 0

Fix Action

Fixed

PR fix notes

PR #36271: feat(text-splitters): add SemanticSimilarityTextSplitter with hierarchical size-constrained splitting

Description (problem / solution / changelog)

Resolves #36272

PR Description: feat(text-splitters): add SemanticSimilarityTextSplitter with hierarchical size-constrained splitting

Summary

This PR introduces the SemanticSimilarityTextSplitter, a high-impact text splitter that uses sentence embeddings and cosine similarity to identify natural topic boundaries. Unlike traditional character-based splitters, it maintains semantic coherence by placing breaks only at "topic valleys."
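
As a rough illustration of the similarity signal described above, the sketch below computes cosine similarity between adjacent sentence embeddings using only the standard library. The function names are hypothetical and are not taken from the PR; low values in the returned list correspond to candidate "topic valleys".

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adjacent_similarities(vectors):
    """Similarity of each sentence vector to the next one.

    Returns len(vectors) - 1 scores, so a single-sentence document
    yields an empty list and therefore no breakpoints.
    """
    return [
        cosine_similarity(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
```

Entry i of the result describes the gap between sentence i and sentence i + 1, which is the indexing the later sketches on this page would also assume.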

Implementation Details

  • Hierarchical Splitting: Implements a recursive fallback mechanism that respects chunk_size by identifying the deepest semantic valleys within oversized chunks. This ensures both topic-based splitting and size-constraint compliance.
  • Breakpoint Strategies: Supports three robust thresholding types for breakpoint detection:
      • percentile (Default): Splits based on similarity scores below a target percentile.
      • standard_deviation: Splits based on scores falling below a mean-minus-sigma threshold.
      • interquartile: Splits based on the interquartile range (IQR) for outliers.
  • Dynamic Safety Mechanism: Automatically adjusts thresholds to prevent over-segmentation in redundant documents or documents with highly uniform similarity distributions.
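
The three strategies above can be sketched roughly as follows. This is a minimal standard-library illustration, not the PR's code: the helper names are hypothetical, the default amounts (95.0, 3.0, 1.5) come from the parameter table in this PR description, and the exact cutoff formulas are assumptions. In particular, the percentile branch assumes that an amount of 95 places the cutoff at the 5th percentile of similarity, so that a higher amount yields fewer breaks.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile of `values` for q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def similarity_threshold(similarities, kind="percentile", amount=None):
    """Return a cutoff; gaps whose similarity is below it become breakpoints."""
    if kind == "percentile":
        # Assumption: cutoff is the (100 - amount)th percentile of similarity.
        amount = 95.0 if amount is None else amount
        return percentile(similarities, 100.0 - amount)
    if kind == "standard_deviation":
        # Mean minus `amount` standard deviations.
        amount = 3.0 if amount is None else amount
        mean = statistics.fmean(similarities)
        return mean - amount * statistics.pstdev(similarities)
    if kind == "interquartile":
        # Low-side outlier fence: Q1 minus `amount` times the IQR.
        amount = 1.5 if amount is None else amount
        q1, q3 = percentile(similarities, 25), percentile(similarities, 75)
        return q1 - amount * (q3 - q1)
    raise ValueError(f"unknown threshold type: {kind}")


def find_breakpoints(similarities, kind="percentile", amount=None):
    """Indices of gaps whose similarity falls strictly below the cutoff."""
    cutoff = similarity_threshold(similarities, kind, amount)
    return [i for i, score in enumerate(similarities) if score < cutoff]
```

Because the comparison is strict, a perfectly uniform similarity list produces no breakpoints in this sketch. That is one simple way to avoid over-segmenting redundant text; the PR's own safety mechanism may work differently.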

Why this is needed

Standard text splitters (like RecursiveCharacterTextSplitter) often break mid-concept, degrading performance in RAG pipelines. SemanticSimilarityTextSplitter ensures that each chunk represents a semantically distinct topic, significantly improving retrieval precision and LLM context quality.

Impact & Performance

  • Deterministic Mocking: Includes a comprehensive unit test suite in tests/unit_tests/test_semantic_splitter.py covering threshold types and hierarchical edge cases.
  • Compatibility: Verified for full compatibility with Python 3.8 through 3.12, passing all 160 unit tests in the libs/text-splitters monorepo.
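
For readers who want to reproduce deterministic tests without calling an embedding API, one possible approach is a keyword-based stand-in like the sketch below. The class is hypothetical and not taken from the PR's test suite; it only mirrors the embed_documents and embed_query method names of LangChain's Embeddings interface.

```python
class KeywordEmbeddings:
    """Deterministic fake: maps each text to the vector of the first matching keyword."""

    def __init__(self, topics):
        # topics: dict of lowercase keyword -> vector (all vectors the same length)
        self.topics = topics
        self.default = [0.0] * len(next(iter(topics.values())))

    def embed_query(self, text):
        lowered = text.lower()
        for keyword, vector in self.topics.items():
            if keyword in lowered:
                return list(vector)
        # Texts with no known keyword get an all-zero vector.
        return list(self.default)

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]
```

With topics such as {"cat": [1.0, 0.0], "stock": [0.0, 1.0]}, sentences about different subjects receive orthogonal vectors, so the expected breakpoints in a test document are known in advance.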

Proposed Solution Highlights

  • No breaking changes: Inherits from TextSplitter and maintains the established public API pattern.
  • Professional Standards: Fully documented with Google-style docstrings and type hints.

Usage Guide

The SemanticSimilarityTextSplitter can be configured to focus on either aggressive topic-based splitting or more relaxed semantic grouping.

from langchain_text_splitters.semantic import SemanticSimilarityTextSplitter
from langchain_openai import OpenAIEmbeddings

# Initialize the splitter with default settings
embeddings = OpenAIEmbeddings()
splitter = SemanticSimilarityTextSplitter(
    embeddings=embeddings, 
    breakpoint_threshold_type="percentile",  # "percentile", "standard_deviation", or "interquartile"
    breakpoint_threshold_amount=95.0,        # Higher percentile -> fewer, larger chunks
    chunk_size=4000,                         # Hierarchical split fallback starts here
)

# Generate semantically coherent chunks
chunks = splitter.split_text("Your long document text...")

Available Parameters:

Parameter | Default | Description
embeddings | (Required) | Any standard LangChain Embeddings model.
breakpoint_threshold_type | "percentile" | The strategy used to compute semantic break points.
breakpoint_threshold_amount | None | The numeric trigger for the chosen strategy (defaults: 95.0 for percentile, 3.0 for standard_deviation, 1.5 for interquartile).
sentence_split_regex | None | Custom regex pattern for initial sentence tokenization.
chunk_size | 4000 | The hard size limit (in characters). If a semantic chunk is larger, hierarchical refinement splits it at its next deepest valley.
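
To make the sentence_split_regex parameter more concrete, here is a small sketch of regex-based sentence tokenization. The pattern shown is a common sentence-boundary regex and is an assumption; the PR's actual default is not shown on this page, and the helper name is hypothetical.

```python
import re

# Assumption: split on whitespace that follows ., ? or ! (not the PR's confirmed default).
SENTENCE_BOUNDARY = r"(?<=[.?!])\s+"


def split_sentences(text, pattern=SENTENCE_BOUNDARY):
    """Split text on the given regex and drop empty fragments."""
    return [part.strip() for part in re.split(pattern, text) if part.strip()]
```

Passing a different pattern, for example r"\n+" to treat each line as a sentence, would presumably be the kind of customization this parameter allows.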

Changed files

  • libs/text-splitters/langchain_text_splitters/__init__.py (modified, +2/-0)
  • libs/text-splitters/langchain_text_splitters/semantic.py (added, +320/-0)
  • libs/text-splitters/tests/unit_tests/test_semantic_splitter.py (added, +280/-0)

Checked other resources

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.
  • This is not related to the langchain-community package.

Package (Required)

  • langchain
  • langchain-openai
  • langchain-anthropic
  • langchain-classic
  • langchain-core
  • langchain-model-profiles
  • langchain-tests
  • langchain-text-splitters
  • langchain-chroma
  • langchain-deepseek
  • langchain-exa
  • langchain-fireworks
  • langchain-groq
  • langchain-huggingface
  • langchain-mistralai
  • langchain-nomic
  • langchain-ollama
  • langchain-openrouter
  • langchain-perplexity
  • langchain-qdrant
  • langchain-xai
  • Other / not sure / general

Feature Description

I would like LangChain to support a robust, core implementation of semantic text splitting within libs/text-splitters.

This feature would allow users to split documents based on the cosine similarity of sentence embeddings, ensuring that chunks are semantically coherent and focused on a single topic. Crucially, it includes Hierarchical Refinement: if a semantically identified chunk exceeds the chunk_size limit, the splitter recursively finds the next most significant semantic shifts (the "deepest valleys") within that chunk to satisfy the size constraint without breaking mid-sentence.
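
The hierarchical refinement idea can be sketched as a short recursion. This is an illustrative assumption about how a "deepest valley" search might work, not the implementation from the fork; the function name and the convention that similarities[i] scores the gap between sentences[i] and sentences[i + 1] are hypothetical.

```python
def split_at_deepest_valley(sentences, similarities, chunk_size, sep=" "):
    """Recursively split at the lowest-similarity gap until chunks fit chunk_size.

    similarities[i] is the similarity between sentences[i] and sentences[i + 1],
    so len(similarities) == len(sentences) - 1.
    """
    text = sep.join(sentences)
    if len(text) <= chunk_size or len(sentences) == 1:
        # A single sentence is never cut, even if it exceeds the limit.
        return [text]
    # The deepest valley is the gap with the lowest similarity score.
    valley = min(range(len(similarities)), key=similarities.__getitem__)
    left = split_at_deepest_valley(
        sentences[: valley + 1], similarities[:valley], chunk_size, sep
    )
    right = split_at_deepest_valley(
        sentences[valley + 1 :], similarities[valley + 1 :], chunk_size, sep
    )
    return left + right
```

Note that this sketch lets a single oversized sentence through unchanged, which keeps the "never break mid-sentence" property at the cost of a strictly hard size limit; how the PR resolves that trade-off is not stated here.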

Use Case

I'm trying to build high-precision RAG (Retrieval-Augmented Generation) applications where context quality is paramount.

Currently, I have to work around this by using character-based splitters which often break mid-concept, or experimental semantic chunkers that do not strictly enforce chunk_size or handle redundant similarity distributions well.

This feature would help users generate "Topic-Pure" chunks that significantly improve retrieval precision and prevent LLMs from being confused by fragmented context.

Proposed Solution

I have implemented SemanticSimilarityTextSplitter which inherits from TextSplitter.

The API looks like:

from langchain_text_splitters.semantic import SemanticSimilarityTextSplitter
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
splitter = SemanticSimilarityTextSplitter(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # percentile, standard_deviation, interquartile
    breakpoint_threshold_amount=95.0,
    chunk_size=4000
)
chunks = splitter.split_text("Your document here...")

  • Flexible Thresholding: Supports percentile, standard_deviation, and interquartile strategies.
  • Hierarchical Fallback: Uses a "Deepest Valley" recursive search to keep oversized chunks as semantically intact as possible.
  • Robustness: Handles single-sentence documents and documents with zero-variance similarity distributions gracefully.

Alternatives Considered

I've tried using:

  1. RecursiveCharacterTextSplitter: It is fast but structurally blind to meaning, often splitting at non-semantic character counts.
  2. SemanticChunker (Experimental): It lacks strict chunk_size enforcement and sometimes fails on edge cases like redundant sentence similarity.

I also considered purely semantic splitting without size limits, but this fails when chunks become too large for modern LLM context windows or efficient embedding storage.

Additional Context

I have fully implemented this feature in my fork (krish1440/langchain).

The code strictly follows Google-style docstrings and monorepo guidelines. I have added a comprehensive test suite (test_semantic_splitter.py) with 28 passing unit tests verified on Python 3.8 through 3.12.

extent analysis

Fix Plan

To implement the semantic text splitting feature in LangChain, follow these steps:

  • Update the langchain-text-splitters package to include the SemanticSimilarityTextSplitter class.
  • Install the required dependencies, including langchain-openai for embeddings.
  • Use the SemanticSimilarityTextSplitter class to split text into semantically coherent chunks.

Example code:

from langchain_text_splitters.semantic import SemanticSimilarityTextSplitter
from langchain_openai import OpenAIEmbeddings

# Initialize embeddings and splitter
embeddings = OpenAIEmbeddings()
splitter = SemanticSimilarityTextSplitter(
    embeddings=embeddings, 
    breakpoint_threshold_type="percentile",  
    breakpoint_threshold_amount=95.0,
    chunk_size=4000
)

# Split text into chunks
chunks = splitter.split_text("Your document here...")

Verification

To verify that the fix worked, you can:

  • Run the comprehensive test suite (test_semantic_splitter.py) to ensure that the SemanticSimilarityTextSplitter class is working correctly.
  • Test the split_text method with different inputs and chunk sizes to ensure that it produces the expected output.

Extra Tips

  • Make sure to follow the Google-style docstrings and monorepo guidelines when implementing the feature.
  • Consider adding more unit tests to cover edge cases and ensure the robustness of the implementation.
  • You can adjust the breakpoint_threshold_type and breakpoint_threshold_amount parameters to fine-tune the splitting behavior for your specific use case.
