langchain: feat(text-splitters): add SemanticCohesionSplitter with adaptive Max-Min cohesion threshold

Submission checklist

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.
  • This is not related to the langchain-community package.

Package (Required)

  • langchain-text-splitters

Feature Description

Add SemanticCohesionSplitter to langchain-text-splitters. It is an embedding-based splitter that groups sentences using a per-chunk adaptive Max-Min cohesion threshold:

  • Each in-progress chunk maintains a cohesion bar equal to the minimum pairwise similarity among the sentences already inside the chunk.
  • A candidate next sentence joins the current chunk if its maximum similarity to any sentence already in the chunk meets that bar. Otherwise it opens a new chunk.
  • The bar starts at a configurable initial_threshold while the chunk has only one sentence, and adapts thereafter.

After grouping, post-processing enforces hard min_chunk_chars (forward-merge below the minimum) and max_chunk_chars (sliding-window fallback above the maximum), so callers get predictable size bounds suitable for vector-store ingestion limits.
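
The two post-processing passes described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the function name `enforce_bounds` is borrowed from the algorithm sketch in this issue, but its signature, the handling of a trailing short chunk, and the window stepping are guesses made for the example.

```python
def enforce_bounds(chunks, min_chars, max_chars, overlap=0):
    """Hypothetical post-processor: merge short chunks forward, window long ones."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    # Pass 1: forward-merge. A chunk below min_chars is carried into the next one.
    merged = []
    carry = ""
    for chunk in chunks:
        carry += chunk
        if len(carry) >= min_chars:
            merged.append(carry)
            carry = ""
    if carry:
        # Assumption: a trailing remainder has no successor, so fold it backwards.
        if merged:
            merged[-1] += carry
        else:
            merged.append(carry)

    # Pass 2: sliding-window fallback for anything above max_chars.
    step = max_chars - overlap
    bounded = []
    for chunk in merged:
        if len(chunk) <= max_chars:
            bounded.append(chunk)
            continue
        start = 0
        while start < len(chunk):
            bounded.append(chunk[start:start + max_chars])
            if start + max_chars >= len(chunk):
                break
            start += step
    return bounded


print(enforce_bounds(["ab", "cdefgh", "i"], min_chars=3, max_chars=5))
# ['abcde', 'fghi']
```

In this sketch the upper bound always holds, while the lower bound is best-effort: the last window of an oversized chunk, or a text shorter than `min_chars` overall, can still come out short.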

Proposed public API:

from langchain_text_splitters import (
    SemanticCohesionSplitter,
    chinese_sentence_splitter,
    default_sentence_splitter,
)
from langchain_openai import OpenAIEmbeddings

splitter = SemanticCohesionSplitter(
    embeddings=OpenAIEmbeddings(),
    initial_threshold=0.6,
    min_chunk_chars=200,
    max_chunk_chars=1500,
    sentence_splitter=chinese_sentence_splitter,  # optional
)
chunks = splitter.split_text(text)  # sync path, uses embed_documents
chunks = await splitter.asplit_text(text)  # async path (call from a coroutine), uses aembed_documents

Use Case

RAG ingestion pipelines need semantic chunks with predictable size bounds. The existing SemanticChunker in langchain_experimental (based on Greg Kamradt's design) detects break points from the global distribution of adjacent-sentence distances. It produces high-quality semantic boundaries but gives no size guarantee, so chunks routinely exceed vector-store payload caps or fall below a useful retrieval size.
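
For contrast, the percentile idea can be sketched as below. This is a simplified illustration of break-point detection on a global distance distribution, not SemanticChunker's actual code; the helper names `percentile` and `percentile_breakpoints` and the 95th-percentile default are assumptions made for the example.

```python
def percentile(values, q):
    """Linearly interpolated percentile of a non-empty sequence, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_breakpoints(adjacent_distances, q=95.0):
    """Return indices i where the gap between sentence i and i+1 is a break."""
    threshold = percentile(adjacent_distances, q)
    return [i for i, d in enumerate(adjacent_distances) if d > threshold]


# Ten gaps between eleven sentences; only one gap stands out globally.
distances = [0.1, 0.12, 0.9, 0.11, 0.1, 0.13, 0.1, 0.12, 0.11, 0.1]
print(percentile_breakpoints(distances))
# [2]
```

With a single break after the third sentence, the second chunk holds the remaining eight sentences however long they are, which is the missing size guarantee described above.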

Practical workflows where this matters:

  • Long-form mixed-language documents where paragraph length varies widely and a per-chunk cohesion bar produces more stable groupings than global distance percentiles.
  • Mixed technical content (prose + code + tables) where similarity distribution is bimodal and percentile-based break points misfire.
  • Pipelines that must respect a hard upper bound (vector store row size, model context window, downstream re-ranker token budget).

Proposed Solution

New module libs/text-splitters/langchain_text_splitters/semantic_cohesion.py exposing SemanticCohesionSplitter(TextSplitter). The sync path uses Embeddings.embed_documents and the async path uses aembed_documents. The sentence splitter is pluggable as a Callable[[str], list[str]]. The module ships with two: default_sentence_splitter (an English-leaning regex that preserves separators, so emitted chunks can still be located with text.find for add_start_index metadata) and chinese_sentence_splitter (aware of CJK terminators and closing punctuation). The cosine similarity matrix is computed with NumPy when it is installed and with a pure-Python fallback otherwise, so the package picks up no new runtime dependencies.
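
Two of the building blocks named above, a separator-preserving sentence splitter and a similarity matrix with an optional NumPy path, might look something like this. Both functions are hypothetical stand-ins written for illustration (`simple_sentence_splitter` and `cosine_similarity_matrix` are not names from the proposal), and the regex is deliberately naive: it will split on abbreviations such as "e.g.".

```python
import math
import re

try:
    import numpy as np
except ImportError:  # NumPy is optional in this sketch
    np = None

# A sentence is any run ending in ., ! or ? plus its trailing whitespace,
# or a final run with no terminator. Nothing is dropped, so joining the
# pieces reproduces the input exactly.
_SENTENCE_RE = re.compile(r"[^.!?]*[.!?]+\s*|[^.!?]+$")


def simple_sentence_splitter(text):
    """Split text into sentences while keeping separators attached."""
    return [m.group(0) for m in _SENTENCE_RE.finditer(text)]


def cosine_similarity_matrix(vectors):
    """Return the N x N cosine similarity matrix as nested lists."""
    if not vectors:
        return []
    if np is not None:
        arr = np.asarray(vectors, dtype=float)
        norms = np.linalg.norm(arr, axis=1, keepdims=True)
        norms[norms == 0.0] = 1.0  # leave all-zero rows as zeros
        unit = arr / norms
        return (unit @ unit.T).tolist()
    # Pure-Python fallback: L2-normalise each row, then take dot products.
    unit = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        unit.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


text = "Hello world. How are you? Fine"
parts = simple_sentence_splitter(text)
print(parts)
# ['Hello world. ', 'How are you? ', 'Fine']
```

Because `"".join(parts) == text` holds by construction, each chunk built from consecutive sentences remains a substring of the original text and can be found with `text.find`.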

Algorithm sketch:

sentences = sentence_splitter(text)
V = embed_documents(sentences); L2-normalise each row
sim[i][j] = dot(V[i], V[j])
groups = [[0]]
for i in 1..N-1:
    cur = groups[-1]
    bar = initial_threshold if len(cur) == 1
          else min(sim[a][b] for a,b in pairs(cur))
    if max(sim[i][j] for j in cur) >= bar:
        cur.append(i)
    else:
        groups.append([i])
raw = ["".join(sentences[j] for j in g) for g in groups]
chunks = enforce_bounds(raw, min_chunk_chars, max_chunk_chars,
                        chunk_overlap, strip=strip_whitespace)
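
A runnable Python rendering of the grouping loop above, as a minimal sketch. It takes a precomputed similarity matrix instead of calling an embedding model, and the function name `max_min_groups` is an assumption for illustration, not part of the proposed API.

```python
from itertools import combinations


def max_min_groups(sim, initial_threshold=0.6):
    """Group sentence indices 0..N-1 given an N x N similarity matrix."""
    n = len(sim)
    if n == 0:
        return []
    groups = [[0]]
    for i in range(1, n):
        cur = groups[-1]
        if len(cur) == 1:
            # A single-sentence chunk has no pairs yet, so use the fixed bar.
            bar = initial_threshold
        else:
            # Cohesion bar: the weakest pairwise link already in the chunk.
            bar = min(sim[a][b] for a, b in combinations(cur, 2))
        if max(sim[i][j] for j in cur) >= bar:
            cur.append(i)
        else:
            groups.append([i])
    return groups


sim = [
    [1.0, 0.7, 0.65, 0.1],
    [0.7, 1.0, 0.8, 0.2],
    [0.65, 0.8, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
print(max_min_groups(sim))
# [[0, 1, 2], [3]]
```

Sentence 2 joins because its best link (0.8) clears the bar of 0.7 set by the pair (0, 1). The bar then drops to 0.65, the new weakest pair, and sentence 3 opens a new group because its best link is only 0.3. Recomputing the minimum over all pairs is quadratic in chunk size; it could be maintained incrementally, since appending sentence i only introduces the pairs (i, j).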

Implementation: ~280 lines of code plus ~330 lines of unit tests. The strip_whitespace option inherited from TextSplitter is honoured by the windowing post-processor. Existing test patterns (DeterministicEmbeddings style) keep the suite independent of network calls.
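
A network-free test double in that spirit could look like the sketch below. The class `LookupEmbeddings` is hypothetical and written only for illustration; a real test would likely subclass langchain_core.embeddings.Embeddings, which is skipped here to keep the example dependency-free.

```python
import hashlib


class LookupEmbeddings:
    """Hypothetical test double exposing embed_documents / embed_query."""

    def __init__(self, table=None, size=4):
        # table maps exact sentence text to a hand-picked vector, which lets
        # a test dictate the similarities the splitter will see.
        self.table = dict(table or {})
        self.size = size

    def _embed(self, text):
        if text in self.table:
            return list(self.table[text])
        # Unknown text: derive a stable vector from a SHA-256 digest,
        # one byte per dimension, rescaled to [-1, 1].
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[k % len(digest)] / 127.5 - 1.0 for k in range(self.size)]

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)


fake = LookupEmbeddings({"Cats purr.": [1.0, 0.0], "Dogs bark.": [0.0, 1.0]})
vectors = fake.embed_documents(["Cats purr.", "Dogs bark."])
```

Pinning vectors through the lookup table makes grouping outcomes predictable in a test, while the hash fallback keeps any unlisted sentence deterministic across runs.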

Alternatives Considered

  1. langchain_experimental.SemanticChunker: different algorithm (break-point detection on the global distribution of adjacent-sentence distances). Has no chunk-size guarantee, which is the main pain point in production RAG pipelines. The Max-Min rule operates per chunk and adds hard size bounds. Keep both — they solve adjacent problems.
  2. RecursiveCharacterTextSplitter + re-ranker: works but loses semantic locality at the split step and requires an additional LLM pass to recover.
  3. Subclassing SemanticChunker to bolt on size bounds: couples the new behaviour to the experimental package and inherits its dependence on the global distance distribution. Cleaner to ship a separate splitter.

Additional Context

  • Hard size bounds are the key differentiator from SemanticChunker.
  • The Max-Min rule is independent of the global distance distribution, which makes it more stable on heterogeneous corpora (mixed languages, mixed structure).
  • The API mirrors existing splitters (Embeddings injected via constructor, **kwargs forwarded to TextSplitter for chunk_overlap, add_start_index, strip_whitespace, etc.).
  • min_chunk_chars and max_chunk_chars are character counts on purpose — the inherited length_function is intentionally not used for the bounds, so behaviour stays predictable across tokenisers. This is documented in the class docstring.
  • A working implementation lives on a local branch and is ready to push once an accepted label is added here.
