langchain: feat(text-splitters): add PySBDTextSplitter

GitHub stats
langchain-ai/langchain#36999 · Fetched 2026-04-25 06:03:10
Comments: 0 · Participants: 1 · Timeline: 4 · Reactions: 0
Timeline (top): labeled ×3, issue_type_added ×1

Submission checklist

  • This is a feature request, not a bug report or usage question.
  • I added a clear and descriptive title that summarizes the feature request.
  • I used the GitHub search to find a similar feature request and didn't find it.
  • I checked the LangChain documentation and API reference to see if this feature already exists.
  • This is not related to the langchain-community package.

Package (Required)

  • langchain
  • langchain-openai
  • langchain-anthropic
  • langchain-classic
  • langchain-core
  • langchain-model-profiles
  • langchain-tests
  • langchain-text-splitters
  • langchain-chroma
  • langchain-deepseek
  • langchain-exa
  • langchain-fireworks
  • langchain-groq
  • langchain-huggingface
  • langchain-mistralai
  • langchain-nomic
  • langchain-ollama
  • langchain-openrouter
  • langchain-perplexity
  • langchain-qdrant
  • langchain-xai
  • Other / not sure / general

Feature Description

Add a PySBDTextSplitter class to langchain-text-splitters that uses the pysbd (Python Sentence Boundary Disambiguation) library for sentence-level text splitting.

pysbd is a rule-based sentence boundary detector that correctly handles edge cases such as abbreviations (Dr., U.S.A.), decimal numbers (3.14), and ellipses. It supports 22 languages.

Use Case

Current splitters like CharacterTextSplitter or RecursiveCharacterTextSplitter split on character count and can break sentences mid-way. SpacyTextSplitter and NLTKTextSplitter exist but require heavy dependencies.
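To make the problem concrete, here is a minimal sketch of two deliberately naive strategies. Both helper functions are hypothetical stand-ins written for this illustration; they are not LangChain's actual splitter logic. The first cuts purely by character count, the second splits on every period followed by a space.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text into consecutive slices of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def naive_period_split(text: str) -> list[str]:
    """Treat every '. ' as a sentence boundary (wrong for abbreviations)."""
    return text.split(". ")


text = "Dr. Smith paid 3.14 dollars. Then he left."

# Character-count chunking ignores sentence boundaries entirely,
# so a chunk can end in the middle of a sentence.
chunks = fixed_size_chunks(text, 20)
print(chunks)

# Splitting on '. ' cuts after the abbreviation "Dr." as if it ended a sentence.
pieces = naive_period_split(text)
print(pieces)
```

Neither approach knows that "Dr." is an abbreviation or where a sentence really ends, which is the gap a rule-based boundary detector is meant to fill.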

pysbd is lightweight, rule-based, and language-aware — making it a great alternative for users who want accurate sentence splitting without loading a full NLP model.

Proposed Solution

A new class, PySBDTextSplitter, extending the TextSplitter base class and following the existing SpacyTextSplitter pattern. It would accept a language parameter covering all 22 languages that pysbd supports. I have already written the implementation and tests and am ready to open a PR once approved.
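The issue does not include the author's implementation, so the following is only a sketch of the core idea under stated assumptions: segment the text into sentences, then pack whole sentences into chunks up to a size limit. The names `pysbd_segmenter` and `pack_sentences` are hypothetical and chosen for this illustration. The segmenter is passed in as a plain callable so the packing logic can be exercised without pysbd installed; `pysbd.Segmenter(language=..., clean=False).segment` is the pysbd call assumed for real use.

```python
from typing import Callable, List


def pysbd_segmenter(language: str = "en") -> Callable[[str], List[str]]:
    """Return a sentence segmenter backed by pysbd (imported lazily).

    Requires the third-party package: pip install pysbd
    """
    import pysbd

    return pysbd.Segmenter(language=language, clean=False).segment


def pack_sentences(
    text: str,
    segment: Callable[[str], List[str]],
    chunk_size: int = 200,
    separator: str = " ",
) -> List[str]:
    """Group whole sentences into chunks of at most `chunk_size` characters.

    A single sentence longer than `chunk_size` is kept intact as its own
    chunk rather than being cut mid-sentence.
    """
    chunks: List[str] = []
    current = ""
    for sentence in (s.strip() for s in segment(text)):
        if not sentence:
            continue
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Stub segmenter standing in for pysbd, so the example runs without it.
def stub_segment(_text: str) -> List[str]:
    return ["Dr. Smith arrived. ", "He paid 3.14 dollars. ", "Then he left."]


result = pack_sentences("ignored by the stub", stub_segment, chunk_size=45)
print(result)
```

With pysbd installed, `pack_sentences(text, pysbd_segmenter("en"))` would use real sentence boundaries. In a TextSplitter subclass, logic along these lines would presumably sit inside `split_text`, though the author's actual design may differ.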

Alternatives Considered

  • SpacyTextSplitter: heavier dependency, model download required.
  • NLTKTextSplitter: less accurate on abbreviations and edge cases.

Additional Context

No response

extent analysis

TL;DR

Implement the proposed PySBDTextSplitter class in the langchain-text-splitters package to enable accurate sentence-level text splitting using the pysbd library.

Guidance

  • Review the proposed implementation and tests for the PySBDTextSplitter class to ensure it meets the requirements and is compatible with the existing TextSplitter base class.
  • Verify that the implementation supports all 22 languages provided by the pysbd library.
  • Consider adding documentation and examples for the new PySBDTextSplitter class to facilitate its usage.
  • Evaluate the performance and accuracy of the PySBDTextSplitter class in comparison to existing splitters like SpacyTextSplitter and NLTKTextSplitter.

Example

The issue includes no code snippet; the author states that the implementation and tests are already written and ready for review.

Notes

The proposed solution seems to address the need for accurate sentence splitting without relying on heavy NLP models. However, it is essential to review and test the implementation thoroughly to ensure its quality and compatibility.

Recommendation

Implement the proposed PySBDTextSplitter class. This is a new feature rather than a workaround, and it offers a lightweight, language-aware option for sentence-level text splitting.
