langchain - ✅(Solved) Fix [Feature Request] Add SemanticSimilarityTextSplitter with Hierarchical Splitting to libs/text-splitters [1 pull requests, 2 comments, 3 participants]

krish1440 · 2026-03-26T12:50:51Z



langchain-ai/langchain#36272 • Fetched 2026-04-08 01:36:21

Timeline (top): labeled ×3, commented ×2, closed ×1, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: feat(text-splitters): add SemanticSimilarityTextSplitter with hierarchical size-constrained splitting (https://github.com/langchain-ai/langchain/pull/36271)

PR fix notes

PR #36271: feat(text-splitters): add SemanticSimilarityTextSplitter with hierarchical size-constrained splitting

Repository: langchain-ai/langchain
Author: krish1440
State: closed | merged: False
Link: https://github.com/langchain-ai/langchain/pull/36271

Description (problem / solution / changelog)

Resolves #36272

PR Description: feat(text-splitters): add SemanticSimilarityTextSplitter with hierarchical size-constrained splitting

Summary

This PR introduces the SemanticSimilarityTextSplitter, a high-impact text splitter that uses sentence embeddings and cosine similarity to identify natural topic boundaries. Unlike traditional character-based splitters, it maintains semantic coherence by placing breaks only at "topic valleys."
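
To make the "topic valley" idea concrete, here is a minimal sketch of scoring adjacent sentences by cosine similarity and locating the lowest score. It is not the PR's code: the helper names (`split_sentences`, `cosine_similarity`, `adjacent_similarities`), the naive sentence regex, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    # Naive, illustrative splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_similarities(vectors):
    # sims[i] compares sentence i with sentence i + 1.
    return [
        cosine_similarity(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]


# Toy embeddings standing in for real sentence vectors (hypothetical values):
# the first two sentences share a topic, the last two share another.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
sims = adjacent_similarities(vectors)

# The deepest "valley" is the boundary with the lowest similarity.
valley = min(range(len(sims)), key=sims.__getitem__)
```

With these toy vectors the lowest similarity falls between the second and third sentence, which is where a topic-based splitter would be expected to place a break.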

Implementation Details

- Hierarchical Splitting: Implements a recursive fallback mechanism that respects chunk_size by identifying the deepest semantic valleys within oversized chunks. This ensures both topic-based splitting and size-constraint compliance.
- Breakpoint Strategies: Supports three robust thresholding types for breakpoint detection:
  - percentile (Default): Splits based on similarity scores below a target percentile.
  - standard_deviation: Splits based on scores falling below a mean-minus-sigma threshold.
  - interquartile: Splits based on the interquartile range (IQR) for outliers.
- Dynamic Safety Mechanism: Automatically adjusts thresholds to prevent over-segmentation in redundant documents or documents with highly uniform similarity distributions.
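
The three strategies can be sketched as threshold functions over a list of adjacent-sentence similarity scores. This is an illustration under stated assumptions, not the PR's implementation: the function names are hypothetical, the percentile strategy is assumed to cut at the (100 - amount)th percentile of similarities (so that a higher amount yields fewer breaks, as the usage comment in the PR suggests), and the IQR strategy is assumed to use the usual Q1 - k * IQR lower fence. The default amounts (95.0, 3.0, 1.5) are the ones listed in the PR's parameter table.

```python
import statistics


def percentile(values, q):
    # Linear-interpolation percentile for q in [0, 100].
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def similarity_threshold(sims, kind="percentile", amount=None):
    # Returns the similarity value below which a boundary counts as a break.
    if kind == "percentile":
        amount = 95.0 if amount is None else amount
        # Assumption: amount=95 keeps roughly the lowest 5% of scores as breaks.
        return percentile(sims, 100.0 - amount)
    if kind == "standard_deviation":
        amount = 3.0 if amount is None else amount
        return statistics.fmean(sims) - amount * statistics.pstdev(sims)
    if kind == "interquartile":
        amount = 1.5 if amount is None else amount
        q1, q3 = percentile(sims, 25.0), percentile(sims, 75.0)
        # Assumption: standard lower-fence outlier rule.
        return q1 - amount * (q3 - q1)
    raise ValueError(f"unknown threshold type: {kind!r}")


def breakpoints(sims, kind="percentile", amount=None):
    # Indices i where the boundary between sentence i and i + 1 is a break.
    threshold = similarity_threshold(sims, kind, amount)
    return [i for i, score in enumerate(sims) if score < threshold]


# Hypothetical scores with one clear valley at index 3.
sims = [0.9, 0.88, 0.91, 0.2, 0.89, 0.9, 0.87, 0.92, 0.9, 0.88]
percentile_breaks = breakpoints(sims, "percentile")
iqr_breaks = breakpoints(sims, "interquartile")
sigma_breaks = breakpoints(sims, "standard_deviation", amount=2.0)
```

On this small sample a single outlier among ten scores cannot fall three population standard deviations below the mean, so the sketch uses `amount=2.0` for the sigma strategy; this is one reason the threshold amount is worth tuning per corpus.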

Why this is needed

Standard text splitters (like RecursiveCharacterTextSplitter) often break mid-concept, degrading performance in RAG pipelines. SemanticSimilarityTextSplitter ensures that each chunk represents a semantically distinct topic, significantly improving retrieval precision and LLM context quality.

Impact & Performance

Deterministic Mocking: Includes a comprehensive unit test suite in tests/unit_tests/test_semantic_splitter.py covering threshold types and hierarchical edge cases.
Compatibility: Verified for full compatibility with Python 3.8 through 3.12, passing all 160 unit tests in the libs/text-splitters package of the monorepo.
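
A deterministic test double might look like the sketch below. The class name `KeywordEmbeddings` and its keyword-counting scheme are hypothetical and are not taken from the PR's test file; the class only mirrors the `embed_documents` / `embed_query` method pair of LangChain's `Embeddings` interface so that tests need no network access and always see the same vectors.

```python
class KeywordEmbeddings:
    """Hypothetical test double: one vector axis per keyword, value = keyword count."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def _embed(self, text):
        # Lowercase, split on whitespace, and drop trailing punctuation.
        words = [w.strip(".,!?;:") for w in text.lower().split()]
        return [float(sum(w == k for w in words)) for k in self.keywords]

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)


fake = KeywordEmbeddings(["cat", "stock"])
doc_vectors = fake.embed_documents(["The cat sat.", "Stock prices fell."])
```

Because each sentence maps onto a known axis, a test can assert exactly where the similarity valley should fall.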

Proposed Solution Highlights

No breaking changes: Inherits from TextSplitter and maintains the established public API pattern.
Professional Standards: Fully documented with Google-style docstrings and type hints.

Usage Guide

The SemanticSimilarityTextSplitter can be configured to focus on either aggressive topic-based splitting or more relaxed semantic grouping.

```python
from langchain_text_splitters.semantic import SemanticSimilarityTextSplitter
from langchain_openai import OpenAIEmbeddings

# Initialize the splitter with default settings
embeddings = OpenAIEmbeddings()
splitter = SemanticSimilarityTextSplitter(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # "percentile", "standard_deviation", or "interquartile"
    breakpoint_threshold_amount=95.0,        # Higher percentile -> fewer, larger chunks
    chunk_size=4000,                         # Hierarchical split fallback starts here
)

# Generate semantically coherent chunks
chunks = splitter.split_text("Your long document text...")
```

Available Parameters:

| Parameter | Default | Description |
| :--- | :--- | :--- |
| `embeddings` | (Required) | Any standard LangChain `Embeddings` model. |
| `breakpoint_threshold_type` | `"percentile"` | The strategy used to compute semantic break points. |
| `breakpoint_threshold_amount` | `None` | The numeric trigger for the chosen strategy. When left as `None`, the default is 95.0 for `percentile`, 3.0 for `standard_deviation`, and 1.5 for `interquartile`. |
| `sentence_split_regex` | `None` | Custom regex pattern for initial sentence tokenization. |
| `chunk_size` | `4000` | The hard size limit (in characters). If a semantic chunk is larger, hierarchical refinement splits it at its next deepest valley. |
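
The `chunk_size` fallback described in the last row can be sketched as a recursive split at the lowest-similarity boundary. This is an assumption-laden illustration rather than the PR's code: `hierarchical_split` is a hypothetical helper, sentences are joined with a single space, and a single sentence longer than `chunk_size` is returned unsplit.

```python
def hierarchical_split(sentences, sims, chunk_size):
    # sims[i] is the similarity between sentences[i] and sentences[i + 1].
    text = " ".join(sentences)
    if len(text) <= chunk_size or len(sentences) == 1:
        return [text]
    # Cut at the deepest valley inside this oversized chunk.
    cut = min(range(len(sims)), key=sims.__getitem__)
    left = hierarchical_split(sentences[: cut + 1], sims[:cut], chunk_size)
    right = hierarchical_split(sentences[cut + 1 :], sims[cut + 1 :], chunk_size)
    return left + right


# Hypothetical input: four short sentences with the weakest link in the middle.
sentences = ["aaaa.", "bbbb.", "cccc.", "dddd."]
sims = [0.9, 0.1, 0.8]
two_chunks = hierarchical_split(sentences, sims, chunk_size=11)
four_chunks = hierarchical_split(sentences, sims, chunk_size=5)
```

With a limit of 11 characters the text is cut once, at the 0.1 boundary; with a limit of 5 the recursion continues until every sentence stands alone.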

Changed files

libs/text-splitters/langchain_text_splitters/__init__.py (modified, +2/-0)
libs/text-splitters/langchain_text_splitters/semantic.py (added, +320/-0)
libs/text-splitters/tests/unit_tests/test_semantic_splitter.py (added, +280/-0)

extent analysis

Fix Plan

To implement the semantic text splitting feature in LangChain, follow these steps:

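1. Get a build of langchain-text-splitters that includes the SemanticSimilarityTextSplitter class. PR #36271 is listed above as closed with merged: False, so a released package may not contain it; in that case, apply the PR's changes (semantic.py and the __init__.py export) to a local checkout.
2. Install an embeddings provider, for example langchain-openai. Any standard LangChain Embeddings model can be used.
3. Use the SemanticSimilarityTextSplitter class to split text into semantically coherent chunks.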

Example code:

```python
from langchain_text_splitters.semantic import SemanticSimilarityTextSplitter
from langchain_openai import OpenAIEmbeddings

# Initialize embeddings and splitter
embeddings = OpenAIEmbeddings()
splitter = SemanticSimilarityTextSplitter(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95.0,
    chunk_size=4000,
)

# Split text into chunks
chunks = splitter.split_text("Your document here...")
```

Verification

To verify that the fix worked, you can:

Run the comprehensive test suite (test_semantic_splitter.py) to ensure that the SemanticSimilarityTextSplitter class is working correctly.
Test the split_text method with different inputs and chunk sizes to ensure that it produces the expected output.
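
One way to check split_text output across inputs and chunk sizes is a small helper that reports empty or oversized chunks. The helper below is a hypothetical sketch that works on any list of strings; it does not call the splitter itself, so the sample data is made up.

```python
def find_chunk_problems(chunks, chunk_size):
    # Returns human-readable descriptions of chunks that are empty or too long.
    problems = []
    for index, chunk in enumerate(chunks):
        if not chunk.strip():
            problems.append(f"chunk {index} is empty")
        elif len(chunk) > chunk_size:
            problems.append(
                f"chunk {index} has {len(chunk)} characters (limit {chunk_size})"
            )
    return problems


# Made-up output standing in for splitter.split_text(...).
sample_chunks = ["First topic sentence.", "", "x" * 50]
problems = find_chunk_problems(sample_chunks, chunk_size=40)
```

An empty `problems` list for every tested `chunk_size` would suggest that the size constraint holds for those inputs.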

Extra Tips

Make sure to follow the Google-style docstrings and monorepo guidelines when implementing the feature.
Consider adding more unit tests to cover edge cases and ensure the robustness of the implementation.
You can adjust the breakpoint_threshold_type and breakpoint_threshold_amount parameters to fine-tune the splitting behavior for your specific use case.
