langchain - feat(text-splitters): add SemanticCohesionSplitter with adaptive Max-Min cohesion threshold

Submission checklist

This is a feature request, not a bug report or usage question.
I added a clear and descriptive title that summarizes the feature request.
I used the GitHub search to find a similar feature request and didn't find it.
I checked the LangChain documentation and API reference to see if this feature already exists.
This is not related to the langchain-community package.

Package (Required)

Feature Description

Add SemanticCohesionSplitter to langchain-text-splitters. It is an embedding-based splitter that groups sentences using a per-chunk adaptive Max-Min cohesion threshold:

Each in-progress chunk maintains a cohesion bar equal to the minimum pairwise similarity among the sentences already inside the chunk.
A candidate next sentence joins the current chunk if its maximum similarity to any sentence already in the chunk meets that bar. Otherwise it opens a new chunk.
The bar starts at a configurable initial_threshold while the chunk has only one sentence, and adapts thereafter.

After grouping, post-processing enforces hard min_chunk_chars (forward-merge below the minimum) and max_chunk_chars (sliding-window fallback above the maximum), so callers get predictable size bounds suitable for vector-store ingestion limits.
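The two size-bound passes described above can be sketched roughly as follows. This is a minimal illustration and not the proposed implementation: the function name `enforce_bounds` is borrowed from the algorithm sketch in this issue, but the signature, the handling of a trailing short chunk, and the window stepping are all assumptions.

```python
def enforce_bounds(chunks, min_chars, max_chars, overlap=0):
    """Hypothetical post-processor: forward-merge short chunks, window long ones."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")

    # Pass 1: forward-merge. A chunk shorter than min_chars is carried into
    # the next chunk instead of being emitted on its own.
    merged, carry = [], ""
    for chunk in chunks:
        carry += chunk
        if len(carry) >= min_chars:
            merged.append(carry)
            carry = ""
    if carry:
        # Assumption: a trailing remainder has no successor, so it is
        # attached to the previous chunk (or emitted alone if there is none).
        if merged:
            merged[-1] += carry
        else:
            merged.append(carry)

    # Pass 2: sliding-window fallback for anything longer than max_chars.
    step = max_chars - overlap
    bounded = []
    for chunk in merged:
        if len(chunk) <= max_chars:
            bounded.append(chunk)
            continue
        start = 0
        while True:
            bounded.append(chunk[start:start + max_chars])
            if start + max_chars >= len(chunk):
                break
            start += step
    return bounded
```

In this sketch the upper bound always holds, while the last window of an oversized chunk can still be shorter than `min_chars`. How the real splitter resolves that conflict is not stated in this issue.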

Proposed public API:

from langchain_text_splitters import (
    SemanticCohesionSplitter,
    chinese_sentence_splitter,
    default_sentence_splitter,
)
from langchain_openai import OpenAIEmbeddings

splitter = SemanticCohesionSplitter(
    embeddings=OpenAIEmbeddings(),
    initial_threshold=0.6,
    min_chunk_chars=200,
    max_chunk_chars=1500,
    sentence_splitter=chinese_sentence_splitter,  # optional
)
chunks = splitter.split_text(text)
# Inside a coroutine; the async path uses aembed_documents.
chunks = await splitter.asplit_text(text)

Use Case

RAG ingestion pipelines need semantic chunks with predictable size bounds. The existing langchain_experimental.SemanticChunker (Greg Kamradt design) detects break points on the global distribution of adjacent-sentence distances. It produces high-quality semantic boundaries but provides no size guarantee, so downstream chunks routinely exceed vector-store payload caps or fall below a useful retrieval size.

Practical workflows where this matters:

Long-form mixed-language documents where paragraph length varies widely and a per-chunk cohesion bar produces more stable groupings than global distance percentiles.
Mixed technical content (prose + code + tables) where similarity distribution is bimodal and percentile-based break points misfire.
Pipelines that must respect a hard upper bound (vector store row size, model context window, downstream re-ranker token budget).

Proposed Solution

New module libs/text-splitters/langchain_text_splitters/semantic_cohesion.py exposing SemanticCohesionSplitter(TextSplitter). The sync path uses Embeddings.embed_documents; the async path uses aembed_documents. The sentence splitter is pluggable via a Callable[[str], list[str]]; the module ships with default_sentence_splitter (English-leaning regex; preserves separators so emitted chunks remain text.find-able for add_start_index metadata) and chinese_sentence_splitter (CJK terminator + closing-punctuation aware). The cosine similarity matrix is computed via NumPy when present and a pure-Python fallback otherwise, so the package picks up no new runtime dependencies.
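To illustrate why preserving separators matters for add_start_index, here is a small sketch of a separator-preserving sentence splitter. The regex and both function names are hypothetical and are not the proposed default_sentence_splitter. The pattern is deliberately naive about abbreviations and decimals.

```python
import re

# Hypothetical pattern: a run of non-terminators, any terminators that follow,
# then trailing whitespace, so every input character lands in some piece.
_SENTENCE = re.compile(r"[^.!?]+[.!?]*\s*|[.!?]+\s*")


def split_keep_separators(text):
    """Split into sentence-like pieces without dropping any characters."""
    return _SENTENCE.findall(text)


def with_start_index(text, pieces):
    """Locate each piece in the original text, scanning left to right."""
    located, cursor = [], 0
    for piece in pieces:
        start = text.find(piece, cursor)
        located.append((start, piece))
        cursor = start + len(piece)
    return located


text = "Hello world. How are you? Fine!"
pieces = split_keep_separators(text)
offsets = with_start_index(text, pieces)
```

Because the pieces concatenate back to the original string, any chunk built by joining consecutive pieces is an exact substring of the input, which is what makes a text.find lookup reliable.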

Algorithm sketch:

sentences = sentence_splitter(text)
V = embed_documents(sentences); L2-normalise each row
sim[i][j] = dot(V[i], V[j])
groups = [[0]]
for i in 1..N-1:
    cur = groups[-1]
    bar = initial_threshold if len(cur) == 1
          else min(sim[a][b] for a,b in pairs(cur))
    if max(sim[i][j] for j in cur) >= bar:
        cur.append(i)
    else:
        groups.append([i])
raw = ["".join(sentences[j] for j in g) for g in groups]
chunks = enforce_bounds(raw, min_chunk_chars, max_chunk_chars,
                        chunk_overlap, strip=strip_whitespace)

Implementation: ~280 lines of code plus ~330 lines of unit tests. The strip_whitespace option inherited from TextSplitter is honoured by the windowing post-processor. Existing test patterns (DeterministicEmbeddings style) keep the suite independent of network calls.
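As an illustration of the network-free testing pattern, a deterministic embedding stub might look like the following. The class `HashEmbeddings` is a hypothetical stand-in written for this example, not the test fixture used on the branch. It only mimics the embed_documents and embed_query method names of the Embeddings interface.

```python
import hashlib
import math


class HashEmbeddings:
    """Hypothetical test double: same text always maps to the same unit vector."""

    def __init__(self, size=8):
        if not 1 <= size <= 32:
            raise ValueError("size must be between 1 and 32 (SHA-256 digest length)")
        self.size = size

    def _embed(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Centre the bytes around zero so that components can be negative.
        raw = [digest[i] - 127.5 for i in range(self.size)]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)


emb = HashEmbeddings(size=8)
first = emb.embed_documents(["alpha", "beta"])
second = emb.embed_documents(["alpha", "beta"])
```

Hash-derived vectors carry no semantic signal, so a stub like this is useful for checking determinism, size bounds, and plumbing rather than grouping quality.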

Alternatives Considered

langchain_experimental.SemanticChunker: different algorithm (break-point detection on the global distribution of adjacent-sentence distances). Has no chunk-size guarantee, which is the main pain point in production RAG pipelines. The Max-Min rule operates per chunk and adds hard size bounds. Keep both — they solve adjacent problems.
RecursiveCharacterTextSplitter + re-ranker: works but loses semantic locality at the split step and requires an additional LLM pass to recover.
Subclassing SemanticChunker to bolt on size bounds: couples the new behaviour to the experimental package and inherits its dependence on the global distance distribution. Cleaner to ship a separate splitter.

Additional Context

Hard size bounds are the key differentiator from SemanticChunker.
The Max-Min rule is independent of the global distance distribution, which makes it more stable on heterogeneous corpora (mixed languages, mixed structure).
The API mirrors existing splitters (Embeddings injected via constructor, **kwargs forwarded to TextSplitter for chunk_overlap, add_start_index, strip_whitespace, etc.).
min_chunk_chars and max_chunk_chars are character counts on purpose — the inherited length_function is intentionally not used for the bounds, so behaviour stays predictable across tokenisers. This is documented in the class docstring.
A working implementation lives on a local branch and is ready to push once an accepted label is added here.
