langchain - ✅ (Solved) Fix Review, improvement, and migration of SemanticChunker to a non-experimental langchain package [1 pull request, 1 comment, 2 participants]

brocchirodrigo · 2026-03-04T13:58:23Z

### Checked other resources

- [x] This is a feature request, not a bug report or usage question.
- [x] I added a clear and descriptive title that summarizes the feature request.
- [x] I used the GitHub search to find a similar feature request and didn't find it.
- [x] I checked the LangChain documentation and API reference to see if this feature already exists.
- [x] This is not related to the langchain-community package.

### Package (Required)

- [ ] langchain
- [ ] langchain-openai
- [ ] langchain-anthropic
- [ ] langchain-classic
- [ ] langchain-core
- [ ] langchain-model-profiles
- [ ] langchain-tests
- [x] langchain-text-splitters
- [ ] langchain-chroma
- [ ] langchain-deepseek
- [ ] langchain-exa
- [ ] langchain-fireworks
- [ ] langchain-groq
- [ ] langchain-huggingface
- [ ] langchain-mistralai
- [ ] langchain-nomic
- [ ] langchain-ollama
- [ ] langchain-openrouter
- [ ] langchain-perplexity
- [ ] langchain-qdrant
- [ ] langchain-xai
- [ ] Other / not sure / general

### Feature Description

I believe it’s time we move the SemanticChunker from the experimental package into the core library. The main goal here is to provide a more sophisticated, meaning-driven standard for document splitting that finally moves beyond the limitations of simple character counts.

As it stands, the experimental implementation already does a fantastic job of using embedding similarity to find natural thematic breaks. This is critical for RAG systems, especially when parsing dense HTML pages or scientific articles, where cutting a thought in half mid-sentence can ruin retrieval quality.

This isn't just a theoretical improvement; there is research backing it up, such as the paper "Knowledge-aware semantic chunking for enhancing retrieval-augmented generation", recently published on ScienceDirect (2025). Its findings confirm that semantic-driven segments significantly boost accuracy and context integrity compared to traditional recursive methods.

By making this method official, and perhaps refining how it handles edge cases such as merging very small residual chunks instead of just dropping them, we can offer a much more robust and **intelligent** tool for the entire community. It’s a powerful feature that deserves to be a first-class citizen in the LangChain ecosys
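To make the idea of "embedding similarity to find natural thematic breaks" concrete, here is a minimal sketch. It is not the SemanticChunker implementation: `toy_embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, and the fixed `max_distance` cutoff is an assumption chosen for the example.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity; 1.0 means the vectors share no terms."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(text: str, max_distance: float = 0.9) -> list[str]:
    """Split into sentences, then start a new chunk wherever two adjacent
    sentences are farther apart than max_distance."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, nxt, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, nxt) > max_distance:
            # Topic shift detected: close the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


print(semantic_chunks("Cats purr. Cats sleep. Stocks fell. Stocks rallied."))
```

With a real embedding model the distances would reflect meaning rather than shared words, and the cutoff would typically be derived from the distribution of distances rather than fixed in advance.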

GitHub stats

langchain-ai/langchain#35553 • Fetched 2026-04-08 00:25:39

Author

brocchirodrigo

Participants

brocchirodrigo

ccurme

Timeline (top)

labeled ×3, closed ×1, commented ×1, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: chore: update dependencies and add numpy support (https://github.com/langchain-ai/langchain/pull/35668)

PR fix notes

PR #35668: chore: update dependencies and add numpy support

Repository: langchain-ai/langchain
Author: Saswatsusmoy
State: closed | merged: False
Link: https://github.com/langchain-ai/langchain/pull/35668

Description (problem / solution / changelog)

Summary

This PR migrates SemanticChunker into langchain-text-splitters and aligns its behavior for production use, including backward-compatible initialization, deterministic threshold validation, and exact source-span chunk assembly for reliable start_index metadata.

It also adds realistic stress and integration coverage for large HTML-like documents, long multi-turn transcripts, and high-volume batch splitting workloads.
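The phrase "exact source span chunk assembly" can be illustrated with a short sketch. This is not the code from the PR: `Chunk`, `sentence_spans`, and `assemble` are hypothetical names, and the sentence regex is an assumption. The point is that chunks are slices of the original string, so `start_index` always points at the chunk's true position.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_index: int


def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of sentences without copying or normalizing text."""
    spans, start = [], 0
    for match in re.finditer(r"(?<=[.?!])\s+", text):
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def assemble(text: str, spans: list[tuple[int, int]], breakpoints: set[int]) -> list[Chunk]:
    """Group sentence spans into chunks, ending a chunk after each sentence
    index in breakpoints. Every chunk is a slice of the original text."""
    chunks, group_start = [], 0
    for i in range(len(spans)):
        if i in breakpoints or i == len(spans) - 1:
            begin, end = spans[group_start][0], spans[i][1]
            chunks.append(Chunk(text[begin:end], begin))
            group_start = i + 1
    return chunks


source = "One.  Two.\nThree. Four."
for chunk in assemble(source, sentence_spans(source), breakpoints={1}):
    # Slicing the source at start_index reproduces the chunk exactly.
    assert source[chunk.start_index:chunk.start_index + len(chunk.text)] == chunk.text
```

Rejoining sentences with a single space would instead normalize the double space and the newline in `source`, and a later lookup of the chunk in the original text could fail or return the wrong offset.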

Fixes #35553

Scope of Changes

Exported SemanticChunker in langchain_text_splitters.__init__
Added langchain_text_splitters/semantic.py
Restored compatibility for legacy positional constructor usage
Enforced explicit validation for invalid breakpoint_threshold_type
Preserved precedence behavior when both number_of_chunks and breakpoint_threshold_amount are provided
Reworked chunk assembly to use exact character spans from the original input
Added unit and integration coverage for:
- threshold behavior and edge cases
- exact start_index correctness
- minchunk merge behavior
- stress scenarios at scale
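As a rough illustration of the validation and precedence items above, the sketch below rejects unknown threshold types and lets `number_of_chunks` win when both settings are given. The helper `resolve_threshold`, the set of allowed types, the default amounts, and the precedence direction are all assumptions based on the experimental implementation's documented behavior; the PR text itself does not spell them out.

```python
from typing import Optional

# Assumed threshold types and default amounts (not confirmed by the PR text).
DEFAULT_AMOUNTS = {
    "percentile": 95.0,
    "standard_deviation": 3.0,
    "interquartile": 1.5,
    "gradient": 95.0,
}


def resolve_threshold(
    threshold_type: str,
    amount: Optional[float] = None,
    number_of_chunks: Optional[int] = None,
) -> dict:
    """Validate the settings and report which one drives the split."""
    if threshold_type not in DEFAULT_AMOUNTS:
        allowed = ", ".join(sorted(DEFAULT_AMOUNTS))
        raise ValueError(
            f"Unknown breakpoint_threshold_type {threshold_type!r}; "
            f"expected one of: {allowed}"
        )
    if number_of_chunks is not None:
        if number_of_chunks < 1:
            raise ValueError("number_of_chunks must be at least 1")
        # Assumption: an explicit chunk count takes precedence over the amount.
        return {"mode": "number_of_chunks", "value": number_of_chunks}
    value = DEFAULT_AMOUNTS[threshold_type] if amount is None else float(amount)
    return {"mode": threshold_type, "value": value}


print(resolve_threshold("percentile"))
print(resolve_threshold("percentile", amount=80, number_of_chunks=4))
```

Failing fast on an invalid type keeps a typo such as "percentil" from silently falling through to some default behavior.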

Breaking Changes

None intended.

Dependency / Lockfile Notes

Added numpy to test group in libs/text-splitters/pyproject.toml
Updated libs/text-splitters/uv.lock accordingly

Validation

Executed in libs/text-splitters:

make format
make lint
make test (174 passed, 10 skipped)

Additional verification:

pytest tests/unit_tests/test_semantic_chunker.py -q (52 passed)
pytest tests/integration_tests/test_semantic_chunker.py -q (10 passed)
pytest tests/integration_tests -q (29 passed, 2 existing spaCy warnings)

Social Handles

Twitter: @SusmoySaswat

Changed files

libs/text-splitters/langchain_text_splitters/__init__.py (modified, +2/-0)
libs/text-splitters/langchain_text_splitters/semantic.py (added, +653/-0)
libs/text-splitters/pyproject.toml (modified, +3/-1)
libs/text-splitters/tests/integration_tests/test_semantic_chunker.py (added, +314/-0)
libs/text-splitters/tests/unit_tests/test_semantic_chunker.py (added, +513/-0)
libs/text-splitters/uv.lock (modified, +6/-0)

extent analysis

Problem Summary

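Move SemanticChunker from the experimental package into a non-experimental package (the issue targets langchain-text-splitters).

Root Cause Analysis

SemanticChunker currently ships only in the experimental package, so production users must depend on experimental code to get embedding-based splitting. The issue also notes that the existing implementation drops very small residual chunks instead of merging them.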

Fix Plan

Step 1: Refactor the SemanticChunker code

Update the code to be more robust and handle edge cases like merging small residual chunks.

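# Before
def split_text(text):
    # naive baseline: split on whitespace, ignoring sentence boundaries
    return text.split()

# After
def split_text(text):
    # illustrative placeholder: groups words into sentence-level chunks.
    # A real SemanticChunker compares embeddings of adjacent sentences instead.
    chunks = []
    current_words = []
    for word in text.split():
        current_words.append(word)
        if word.endswith("."):
            # close the sentence, keeping its final word inside the chunk
            chunks.append(" ".join(current_words))
            current_words = []
    if current_words:
        # keep any trailing text that does not end with a period
        chunks.append(" ".join(current_words))
    return chunks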
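The edge case named in Step 1, merging small residual chunks rather than dropping them, could look roughly like the following. This is a sketch under assumptions: `merge_small_chunks` and its `min_chunk_size` parameter (measured in characters) are hypothetical, and chunks are joined with a single space for simplicity.

```python
def merge_small_chunks(chunks: list[str], min_chunk_size: int) -> list[str]:
    """Fold chunks shorter than min_chunk_size into a neighbor instead of dropping them."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(merged[-1]) < min_chunk_size:
            # The previous chunk is still too small: extend it with this one.
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    # A short trailing chunk has no successor, so attach it to its predecessor.
    if len(merged) > 1 and len(merged[-1]) < min_chunk_size:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged


print(merge_small_chunks(["Hi.", "This is a long sentence.", "Ok."], min_chunk_size=10))
```

A span-based implementation would merge character offsets rather than strings, so that the merged chunk is still an exact slice of the source text.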

Step 2: Add SemanticChunker to the non-experimental package

Add the refactored SemanticChunker to langchain-text-splitters, export it from the package root, and update the documentation to reflect its new status.

# libs/text-splitters/langchain_text_splitters/__init__.py
# Export as proposed in PR #35668 (closed without merging);
# semantic.py is the module that PR adds.
from langchain_text_splitters.semantic import SemanticChunker

Step 3: Update the experimental package to use the new implementation

Have the experimental package re-export the new implementation and remove the duplicate code.

# langchain_experimental/text_splitter.py
# Re-export so existing imports from the experimental package keep working.
# Assumes SemanticChunker is exported from langchain_text_splitters as in Step 2.
from langchain_text_splitters import SemanticChunker

Verification

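Verify that SemanticChunker works correctly by running it on varied inputs (for example HTML-like documents and long multi-turn transcripts, as covered by the PR's tests) and comparing the resulting chunks and their start_index metadata against expected values.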
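One way to phrase such a check is as a property that holds for any input: every chunk appears in the source in order, and nothing but whitespace is lost between chunks. The sketch below assumes a hypothetical `split_sentences` splitter as the code under test; `check_chunks` is likewise an illustrative helper, not part of any LangChain package.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Hypothetical splitter under test: one chunk per sentence."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def check_chunks(text: str, chunks: list[str]) -> list[int]:
    """Verify chunks appear in order in the source and return their start offsets."""
    offsets, cursor = [], 0
    for chunk in chunks:
        assert chunk, "empty chunk"
        start = text.find(chunk, cursor)
        assert start != -1, f"chunk not found in order: {chunk!r}"
        # Only whitespace may be skipped between consecutive chunks.
        assert not text[cursor:start].strip(), "text was dropped between chunks"
        offsets.append(start)
        cursor = start + len(chunk)
    # Only whitespace may remain after the last chunk.
    assert not text[cursor:].strip(), "trailing text was dropped"
    return offsets


sample = "First point. Second point! Third?"
print(check_chunks(sample, split_sentences(sample)))
```

The same helper can be pointed at any splitter that returns substrings of its input, which makes it usable as a regression check when the implementation changes.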

Extra Tips

Make sure to update the documentation and tests to reflect the changes.
Consider adding more features to the SemanticChunker, such as
