LlamaIndex - ✅ (Solved) [Feature Request]: Enhance Semantic Splitting with Language-Aware Chunking [1 pull request, 1 comment, 1 participant]

GitHub stats
run-llama/llama_index#20775
Fetched 2026-04-08 00:31:04
Comments
1
Participants
1
Timeline
5
Reactions
0
Timeline (top)
labeled ×2, commented ×1, cross-referenced ×1, referenced ×1

Fix Action

Fixed

PR fix notes

PR #20787: Enhance Semantic Splitting with Language-Aware Chunking

Description (problem / solution / changelog)

Description

The current SemanticSplitterNodeParser does not handle different languages well. It relies on a default sentence tokenizer (NLTK Punkt) and a fixed buffer_size, applying the same buffer size to all text, which breaks down with:

  • Non-Latin scripts (CJK languages or RTL scripts)
  • Multilingual documents
  • Mixed-language content

Fixes #20775

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes

Type of Change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Changed files

  • llama-index-core/llama_index/core/node_parser/text/language_utils.py (added, +219/-0)
  • llama-index-core/llama_index/core/node_parser/text/semantic_splitter.py (modified, +224/-14)
  • llama-index-core/tests/test_language_utils.py (added, +273/-0)

Feature Description

The current SemanticSplitterNodeParser does not handle different languages well. It relies on a default sentence tokenizer (NLTK Punkt) and a fixed buffer_size, applying the same buffer size to all text, which breaks down with:

  • Non-Latin scripts (CJK languages or RTL Scripts)
  • Multilingual documents
  • Mixed-language content

Files to modify:

  • llama-index-core/llama_index/core/node_parser/text/semantic_splitter.py

I am ready to submit a PR for this enhancement upon approval.

Reason

  • Fixed buffer_size doesn't adapt to language density
  • The default sentence tokenizer fails on text it was not designed for (e.g. non-Latin scripts)
  • No language detection in the chunking process
  • Embeddings aren't chosen based on detected language

Value of Feature

Add language-aware chunking that:

  1. Detects language automatically
  2. Uses script-specific tokenizers
  3. Adapts buffer sizes based on language density
  4. Supports multilingual documents with balanced chunks per language

Proposed Feature: Add an optional language_aware mode that:

  • Uses a lightweight language detection library (e.g. fasttext) to identify the dominant script.
  • Switches the sentence_splitter dynamically (e.g., using spaCy for specific languages or regex-based character splitting for CJK).
  • Allows for buffer_size scaling based on the detected language density.

Unit Tests: Add a comprehensive test suite covering multilingual and mixed-script documents.
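
As a rough illustration of the "identify the dominant script" step, here is a minimal sketch that uses only the standard library instead of fasttext. The helper names `char_script` and `dominant_script` and the coarse labels `latin`, `cjk`, `rtl` are assumptions made for this example; they are not taken from the PR.

```python
import unicodedata
from collections import Counter


def char_script(ch: str) -> str:
    """Return a coarse script label for one character (hypothetical helper)."""
    if not ch.isalpha():
        return "other"
    # Unicode character names start with the script name, e.g. "CJK UNIFIED IDEOGRAPH-4E2D".
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL")):
        return "cjk"
    if name.startswith(("ARABIC", "HEBREW")):
        return "rtl"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def dominant_script(text: str) -> str:
    """Return the most frequent script label, or "unknown" if no letters match."""
    counts = Counter(char_script(ch) for ch in text)
    counts.pop("other", None)  # ignore digits, punctuation, unclassified scripts
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]
```

A model-based detector would distinguish languages that share a script (English vs. French, for instance); this sketch only separates scripts, which is the minimum needed to pick a tokenizer family.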

extent analysis

Okay, let's tackle this problem. The user wants to modify the SemanticSplitterNodeParser to handle different languages better. The main issues are fixed buffer sizes and using a default tokenizer that doesn't work for all languages.

First, I need to detect the language. Fasttext is a good choice here. So, I'll add a language detection step using fasttext. Then, depending on the detected language, switch the sentence tokenizer. For example, use SpaCy for languages it supports and a regex-based splitter for CJK.
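
A regex-based splitter for CJK text could look roughly like the sketch below. The function name `split_cjk_sentences` and the exact punctuation set are assumptions for illustration; a production splitter would likely need more rules (ellipses, more quote styles, and so on).

```python
import re
from typing import List

# Zero-width split point: right after a sentence-ending mark, unless the next
# character is another ending mark or a closing quote/bracket.
_CJK_SENTENCE_BOUNDARY = re.compile(r"(?<=[。！？!?])(?![。！？!?」』”])")


def split_cjk_sentences(text: str) -> List[str]:
    """Split text into sentences on CJK and ASCII terminal punctuation."""
    parts = _CJK_SENTENCE_BOUNDARY.split(text)
    # Drop empty pieces (e.g. the one produced after the final punctuation mark).
    return [part.strip() for part in parts if part.strip()]
```

Because it maps a string to a list of strings, a function of this shape is the kind of callable that could stand in for the sentence_splitter mentioned in the proposal, though the exact parameter contract should be checked against the library.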

Next, adjust the buffer size based on the language. Maybe set different buffer sizes for different scripts. For instance, CJK might need a smaller buffer because characters are denser.
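
One way to express per-script buffer sizing is a lookup table of multipliers applied to the user's base buffer_size. The table `SCRIPT_BUFFER_SCALE`, its values, and the function `scaled_buffer_size` are hypothetical; the numbers are placeholders for illustration, not tuned or taken from the PR.

```python
import math

# Hypothetical multipliers per coarse script label; values are illustrative only.
SCRIPT_BUFFER_SCALE = {
    "latin": 1.0,
    "cjk": 0.5,  # assumption: denser text, so fewer neighbouring sentences
    "rtl": 1.0,
}


def scaled_buffer_size(base_buffer_size: int, script: str) -> int:
    """Scale the base buffer size for a script, never going below 1."""
    scale = SCRIPT_BUFFER_SCALE.get(script, 1.0)  # unknown scripts keep the base size
    return max(1, math.ceil(base_buffer_size * scale))
```

Clamping to a minimum of 1 keeps a small base value usable after scaling.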

I should also handle multilingual documents by splitting them into chunks per language. Maybe process each chunk with the appropriate tokenizer and buffer size.
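
A sketch of the per-language grouping idea: label each sentence with a coarse script, then merge consecutive sentences that share a label so each run can be handled with its own tokenizer and buffer size. `script_of` and `group_by_script` are hypothetical names, and script detection here is a simple Unicode-name heuristic rather than a language-ID model.

```python
import unicodedata
from collections import Counter
from itertools import groupby
from typing import List, Tuple

# First word of the Unicode character name -> coarse script label (assumed mapping).
_NAME_PREFIX_TO_SCRIPT = {
    "CJK": "cjk", "HIRAGANA": "cjk", "KATAKANA": "cjk", "HANGUL": "cjk",
    "ARABIC": "rtl", "HEBREW": "rtl",
    "LATIN": "latin",
}


def script_of(text: str) -> str:
    """Return the most common script among the letters of text, or "unknown"."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            prefix = unicodedata.name(ch, "").split(" ")[0]
            if prefix in _NAME_PREFIX_TO_SCRIPT:
                counts[_NAME_PREFIX_TO_SCRIPT[prefix]] += 1
    return counts.most_common(1)[0][0] if counts else "unknown"


def group_by_script(sentences: List[str]) -> List[Tuple[str, List[str]]]:
    """Merge consecutive sentences with the same script into (script, sentences) runs."""
    return [(script, list(run)) for script, run in groupby(sentences, key=script_of)]
```

Grouping only consecutive sentences preserves document order, so a document that switches back to an earlier language yields a separate run rather than being merged with the first one.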

Testing is important. Need to write unit tests for various languages and mixed-content scenarios. Also, make sure the code is optional, so users can enable the language-aware mode if needed.

Wait, what about dependencies? Adding fasttext and SpaCy might require users to install them. Should mention that in the documentation. Also, handle cases where the language detection isn't confident enough. Maybe default to the original tokenizer if detection fails.
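
The optional-dependency and low-confidence concerns could be handled along these lines. This is a sketch under assumptions: `load_fasttext_predictor` and `detect_language` are hypothetical helpers, the 0.8 threshold is arbitrary, and the model path must point at a fastText language-identification model the user has downloaded separately. A return value of None means "fall back to the original tokenizer".

```python
from typing import Callable, Optional, Tuple

try:
    import fasttext  # optional dependency, only needed for model-based detection
except ImportError:
    fasttext = None

Prediction = Tuple[str, float]  # (language code, confidence between 0 and 1)
Predictor = Callable[[str], Prediction]


def load_fasttext_predictor(model_path: str) -> Optional[Predictor]:
    """Return a fastText-backed predictor, or None if fasttext is not installed."""
    if fasttext is None:
        return None
    model = fasttext.load_model(model_path)

    def predict(text: str) -> Prediction:
        # fastText expects single-line input and returns labels like "__label__en".
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict


def detect_language(
    text: str, predictor: Optional[Predictor], min_confidence: float = 0.8
) -> Optional[str]:
    """Return a language code, or None when detection is unavailable or unsure."""
    if predictor is None or not text.strip():
        return None
    lang, confidence = predictor(text)
    return lang if confidence >= min_confidence else None
```

Passing the predictor in as an argument keeps `detect_language` testable with a stub and lets callers swap in a different detection library without touching the fallback logic.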

Let me outline the steps: modify the class to include language detection, switch tokenizers, adjust buffer sizes, and test. Provide code examples for each part. Use a try-except block for importing fasttext to keep it optional. Maybe use a factory pattern for tokenizers based on detected language.
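
The factory idea can be sketched as a small registry that maps a detected script to a sentence-splitting callable and falls back to a default. All names here (`SPLITTER_REGISTRY`, `get_sentence_splitter`, `split_latin`, `split_cjk`) are hypothetical, and the two regex splitters are deliberately simplistic stand-ins for NLTK, spaCy, or other real tokenizers.

```python
import re
from typing import Callable, Dict, List

SentenceSplitter = Callable[[str], List[str]]


def split_latin(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_cjk(text: str) -> List[str]:
    """Naive splitter: break right after CJK full stops and similar marks."""
    return [s.strip() for s in re.split(r"(?<=[。！？])", text) if s.strip()]


# Registry of splitters keyed by coarse script label; extend as needed.
SPLITTER_REGISTRY: Dict[str, SentenceSplitter] = {
    "latin": split_latin,
    "cjk": split_cjk,
}


def get_sentence_splitter(
    script: str, default: SentenceSplitter = split_latin
) -> SentenceSplitter:
    """Return the splitter registered for script, or the default if none exists."""
    return SPLITTER_REGISTRY.get(script, default)
```

For example, `get_sentence_splitter("cjk")("今日は晴れ。明日は雨。")` dispatches to the CJK splitter, while an unrecognized label such as `"unknown"` falls back to the default.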

Need to make sure the code is efficient and doesn't add too much overhead. Also, consider performance implications of running language detection on large texts. Maybe process in chunks or use a streaming approach if possible.

Examples in code would help. Show how to integrate fasttext, switch tokenizers, and adjust buffer sizes. Also, mention the necessary imports and any configuration changes. Finally, verify the fix by testing with
