langchain - ✅(Solved) Fix Review, improvement, and migration of SemanticChunker to a non-experimental langchain package [1 pull request, 1 comment, 2 participants]

GitHub stats
langchain-ai/langchain#35553 (fetched 2026-04-08 00:25:39)
Comments: 1 · Participants: 2 · Timeline: 7 · Reactions: 0
Timeline (top): labeled ×3, closed ×1, commented ×1, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #35668: chore: update dependencies and add numpy support

Description (problem / solution / changelog)

Summary

This PR migrates SemanticChunker into langchain-text-splitters and aligns its behavior for production use, including backward-compatible initialization, deterministic threshold validation, and exact source-span chunk assembly for reliable start_index metadata.

It also adds realistic stress and integration coverage for large HTML-like documents, long multi-turn transcripts, and high-volume batch splitting workloads.
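As a rough illustration of what "deterministic threshold validation" could look like, here is a minimal sketch. The function name `resolve_threshold` is hypothetical, and the set of threshold types and their default amounts are assumptions based on the experimental implementation, not taken from this PR's code.

```python
from typing import Optional

# Assumed threshold types and default amounts (mirroring the experimental
# SemanticChunker); the migrated module may differ.
DEFAULT_AMOUNTS = {
    "percentile": 95.0,
    "standard_deviation": 3.0,
    "interquartile": 1.5,
    "gradient": 95.0,
}


def resolve_threshold(threshold_type: str, amount: Optional[float] = None) -> float:
    """Return the threshold amount to use, or raise on an unknown type."""
    if threshold_type not in DEFAULT_AMOUNTS:
        valid = ", ".join(sorted(DEFAULT_AMOUNTS))
        raise ValueError(
            f"Unknown breakpoint_threshold_type {threshold_type!r}; "
            f"expected one of: {valid}"
        )
    # An explicit amount wins over the per-type default.
    return DEFAULT_AMOUNTS[threshold_type] if amount is None else float(amount)
```

Failing at construction time, rather than silently falling back to a default, keeps a misspelled type from changing chunk boundaries unnoticed.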

Fixes #35553

Scope of Changes

  • Exported SemanticChunker in langchain_text_splitters.__init__
  • Added langchain_text_splitters/semantic.py
  • Restored compatibility for legacy positional constructor usage
  • Enforced explicit validation for invalid breakpoint_threshold_type
  • Preserved precedence behavior when both number_of_chunks and breakpoint_threshold_amount are provided
  • Reworked chunk assembly to use exact character spans from the original input
  • Added unit and integration coverage for:
    • threshold behavior and edge cases
    • exact start_index correctness
    • minchunk merge behavior
    • stress scenarios at scale
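The "exact character spans" item above can be sketched as follows. This is an illustration only, not the PR's code: `sentence_spans` and `assemble_chunks` are hypothetical helpers, and the sentence regex is deliberately naive. The point is that each chunk is sliced out of the original string, so its start offset is exact even when the source contains irregular whitespace.

```python
import re
from typing import List, Set, Tuple


def sentence_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of sentence-like segments."""
    # A segment runs from a non-space character to ., ? or ! followed by
    # whitespace, or to the end of the input.
    pattern = r"\S.*?(?:[.?!](?=\s)|\Z)"
    return [m.span() for m in re.finditer(pattern, text, flags=re.S)]


def assemble_chunks(
    text: str, spans: List[Tuple[int, int]], break_after: Set[int]
) -> List[Tuple[int, str]]:
    """Group sentences into chunks, slicing each chunk from `text` itself.

    `break_after` holds sentence indices after which a chunk ends.
    Returns (start_index, chunk_text) pairs.
    """
    chunks = []
    first = 0
    for i in range(len(spans)):
        if i in break_after or i == len(spans) - 1:
            start, end = spans[first][0], spans[i][1]
            chunks.append((start, text[start:end]))
            first = i + 1
    return chunks
```

Rejoining sentences with a single space would collapse double spaces and newlines, which is how reported offsets drift away from the source text.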

Breaking Changes

None intended.

Dependency / Lockfile Notes

  • Added numpy to test group in libs/text-splitters/pyproject.toml
  • Updated libs/text-splitters/uv.lock accordingly

Validation

Executed in libs/text-splitters:

  • make format
  • make lint
  • make test (174 passed, 10 skipped)

Additional verification:

  • pytest tests/unit_tests/test_semantic_chunker.py -q (52 passed)
  • pytest tests/integration_tests/test_semantic_chunker.py -q (10 passed)
  • pytest tests/integration_tests -q (29 passed, 2 existing spaCy warnings)

Social Handles

Twitter: @SusmoySaswat

Changed files

  • libs/text-splitters/langchain_text_splitters/__init__.py (modified, +2/-0)
  • libs/text-splitters/langchain_text_splitters/semantic.py (added, +653/-0)
  • libs/text-splitters/pyproject.toml (modified, +3/-1)
  • libs/text-splitters/tests/integration_tests/test_semantic_chunker.py (added, +314/-0)
  • libs/text-splitters/tests/unit_tests/test_semantic_chunker.py (added, +513/-0)
  • libs/text-splitters/uv.lock (modified, +6/-0)

Checked other resources

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.
  • This is not related to the langchain-community package.

Package (Required)

  • langchain
  • langchain-openai
  • langchain-anthropic
  • langchain-classic
  • langchain-core
  • langchain-model-profiles
  • langchain-tests
  • langchain-text-splitters
  • langchain-chroma
  • langchain-deepseek
  • langchain-exa
  • langchain-fireworks
  • langchain-groq
  • langchain-huggingface
  • langchain-mistralai
  • langchain-nomic
  • langchain-ollama
  • langchain-openrouter
  • langchain-perplexity
  • langchain-qdrant
  • langchain-xai
  • Other / not sure / general

Feature Description

I believe it’s time we move the SemanticChunker from the experimental package into the core library. The main goal here is to provide a more sophisticated, meaning-driven standard for document splitting that finally moves beyond the limitations of simple character counts.

As it stands, the experimental implementation already does a fantastic job of using embedding similarity to find natural thematic breaks. This is absolutely critical for RAG systems, especially when we are parsing dense HTML pages or scientific articles where cutting a thought in half mid-sentence can completely ruin retrieval quality.
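To make the idea of "embedding similarity to find natural thematic breaks" concrete, here is a self-contained sketch. It is not the library's code: `toy_embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, and the percentile is computed with a simple nearest-rank rule.

```python
import math
from collections import Counter
from typing import List, Sequence


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(w.strip(".,!?").lower() for w in sentence.split())


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def find_breakpoints(sentences: Sequence[str], percentile: float = 95.0) -> List[int]:
    """Return indices i such that a chunk should end after sentences[i]."""
    vectors = [toy_embed(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    if not distances:
        return []
    # Nearest-rank percentile of the adjacent-sentence distances.
    ranked = sorted(distances)
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    threshold = ranked[rank - 1]
    return [i for i, d in enumerate(distances) if d >= threshold and d > 0]
```

With sentences about cats followed by sentences about rockets, the largest distance falls at the topic change, so that is where the split lands.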

This isn't just a theoretical improvement; there is solid research backing this up, such as the paper "Knowledge-aware semantic chunking for enhancing retrieval-augmented generation" recently published on ScienceDirect (2025). Their findings confirm that using semantic-driven segments significantly boosts accuracy and context integrity compared to traditional recursive methods. By officializing this method and perhaps refining how it handles edge cases like merging very small residual chunks instead of just dropping them, we can offer a much more robust and intelligent tool for the entire community. It’s a powerful feature that deserves to be a first-class citizen in the LangChain ecosystem.
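The suggestion to merge very small residual chunks rather than drop them could look roughly like this. The helper `merge_small_chunks` is hypothetical, and the `min_chunk_size` parameter name is an assumption borrowed from the experimental class; the sketch works on plain strings and joins with a space for brevity.

```python
from typing import List


def merge_small_chunks(chunks: List[str], min_chunk_size: int) -> List[str]:
    """Fold chunks shorter than min_chunk_size into a neighbour, never dropping text."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and len(merged[-1]) < min_chunk_size:
            # The previous chunk is still too small, so extend it.
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    # A short trailing chunk has no successor; attach it to the one before.
    if len(merged) > 1 and len(merged[-1]) < min_chunk_size:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged
```

A span-based implementation would merge character offsets instead of joining strings, so that start_index metadata stays exact.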

Use Case

To split documents into chunks by context before embedding them, so that the semantic flow of a dialogue or passage is preserved.

Proposed Solution

The same approach as the existing experimental implementation:

https://github.com/langchain-ai/langchain-experimental/blob/a6c66481ee56b38166b3daeea7eb767eac92146b/libs/experimental/langchain_experimental/text_splitter.py#L99

Areas for improvement should be reviewed as part of the migration.

Alternatives Considered

No response

Additional Context

No response

extent analysis

Problem Summary

Move SemanticChunker from experimental package to core library

Root Cause Analysis

The root cause is that the SemanticChunker is currently in the experimental package and needs to be moved to the core library for better integration and usage.

Fix Plan

Step 1: Refactor the SemanticChunker code

Update the code to be more robust and handle edge cases like merging small residual chunks.

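# Before
def split_text(text):
    # naive whitespace split: ignores sentence and topic boundaries
    return text.split()

# After
def split_text(text):
    # Placeholder for illustration only: groups words into sentence-like
    # chunks that end at a period. It does not use embeddings; the real
    # SemanticChunker breaks where adjacent-sentence embedding distance is high.
    chunks = []
    current_words = []
    for word in text.split():
        current_words.append(word)
        if word.endswith("."):
            # close the chunk after the sentence-final word, not before it
            chunks.append(" ".join(current_words))
            current_words = []
    if current_words:
        chunks.append(" ".join(current_words))
    return chunks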

Step 2: Update the core library to include the SemanticChunker

Add the refactored SemanticChunker code to the core library and update the documentation to reflect its new status.

# langchain_text_splitters/__init__.py
# Export location as described in the PR notes (PR #35668); the class itself
# lives in langchain_text_splitters/semantic.py.
from langchain_text_splitters.semantic import SemanticChunker

Step 3: Update the experimental package to use the new core library

Update the experimental package to use the new core library and remove the duplicate implementation.

# langchain_experimental/text_splitter.py
# Re-export instead of redefining, so both packages share one implementation.
# The import path assumes the layout described in the PR notes (PR #35668).
from langchain_text_splitters import SemanticChunker  # noqa: F401

Verification

Verify that the SemanticChunker is working correctly by testing it with various inputs and comparing the results with the expected output.
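One way to make "comparing the results with the expected output" concrete is to check invariants rather than fixed strings. The sketch below assumes chunks are available as (start_index, text) pairs; `check_chunks` is a hypothetical helper, not part of the library or the PR's test suite.

```python
from typing import List, Tuple


def check_chunks(text: str, chunks: List[Tuple[int, str]]) -> None:
    """Raise AssertionError unless every chunk is an exact, ordered slice of text."""
    previous_end = 0
    for start, chunk in chunks:
        assert chunk, "empty chunk"
        assert start >= previous_end, "chunks overlap or are out of order"
        assert text[start:start + len(chunk)] == chunk, (
            f"start_index {start} does not match the source text"
        )
        # Only whitespace may be skipped between consecutive chunks.
        assert text[previous_end:start].strip() == "", "non-whitespace text dropped"
        previous_end = start + len(chunk)
    assert text[previous_end:].strip() == "", "trailing text dropped"
```

Running this over the splitter's output for varied inputs (HTML-like text, transcripts, irregular whitespace) catches offset drift and silently dropped text without hard-coding expected chunks.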

Extra Tips

  • Make sure to update the documentation and tests to reflect the changes.
  • Consider adding more features to the SemanticChunker, such as
