llamaIndex - ✅(Solved) Fix [Feature Request]: Enhance Semantic Splitting with Language-Aware Chunking [1 pull requests, 1 comments, 1 participants]

Feature Description

The current SemanticSplitterNodeParser does not handle different languages well. It relies on a default sentence tokenizer (NLTK Punkt) and a fixed buffer_size, and it applies the same buffer size to all text, which breaks down with:

Non-Latin scripts (CJK languages or RTL scripts)
Multilingual documents
Mixed-language content
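
To make the failure mode concrete, here is a minimal sketch using only the standard library. The `split_latin` function is a deliberately naive stand-in for a Latin-oriented tokenizer (it is not NLTK Punkt itself), and `split_cjk` is a hypothetical regex splitter that recognises full-width sentence terminators.

```python
import re
from typing import List

# Naive splitter: a sentence ends at ., ! or ? followed by whitespace.
LATIN_END = re.compile(r"(?<=[.!?])\s+")

# Hypothetical CJK splitter: a sentence ends right after a full-width terminator.
CJK_END = re.compile(r"(?<=[。！？])")


def split_latin(text: str) -> List[str]:
    return [s for s in LATIN_END.split(text) if s]


def split_cjk(text: str) -> List[str]:
    return [s for s in CJK_END.split(text) if s]


english = "Chunking is hard. Languages differ."
chinese = "分块很难。语言各不相同。"

print(split_latin(english))  # two sentences
print(split_latin(chinese))  # one unsplit string: no Latin terminators or spaces
print(split_cjk(chinese))    # two sentences
```

With a single unsplit "sentence", a fixed buffer_size has nothing to group, so every downstream similarity comparison operates on the whole passage.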

Files to modify:

llama-index-core/llama_index/core/node_parser/text/semantic_splitter.py

I am ready to submit a PR for this enhancement upon approval.

Reason

Fixed buffer_size doesn't adapt to language density
The default sentence tokenizer fails on non-Latin scripts
No language detection in the chunking process
Embeddings aren't chosen based on detected language

Value of Feature

Add language-aware chunking that:

Detects language automatically
Uses script-specific tokenizers
Adapts buffer sizes based on language density
Supports multilingual documents with balanced chunks per language

Proposed Feature: Add an optional language_aware mode that:

Uses a lightweight language detection library (e.g. fasttext) to identify the dominant script.
Switches the sentence_splitter dynamically (e.g., using SpaCy for specific languages or regex-based character splitting for CJK).
Allows for buffer_size scaling based on the detected language density.

Unit Tests: Add a comprehensive test suite covering multilingual and mixed-script documents.
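
As a rough illustration of the "identify the dominant script" step, the sketch below uses the standard-library `unicodedata` module as a dependency-free stand-in for a detection library. The bucket names (`latin`, `cjk`, `rtl`) and both function names are assumptions made for this example, not part of the proposal or of LlamaIndex.

```python
import unicodedata
from collections import Counter


def char_script(ch: str) -> str:
    """Coarse script bucket derived from the Unicode character name (heuristic)."""
    if not ch.isalpha():
        return "other"
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL")):
        return "cjk"
    if name.startswith(("ARABIC", "HEBREW")):
        return "rtl"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def dominant_script(text: str) -> str:
    """Return the most frequent script bucket, defaulting to 'latin'."""
    counts = Counter(char_script(ch) for ch in text)
    counts.pop("other", None)  # ignore digits, punctuation, unknown scripts
    if not counts:
        return "latin"
    return counts.most_common(1)[0][0]
```

A script-level heuristic like this cannot tell Chinese from Japanese kanji or English from French; a trained language identifier would be needed for that.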

Okay, let's tackle this problem. The user wants to modify the SemanticSplitterNodeParser to handle different languages better. The main issues are fixed buffer sizes and using a default tokenizer that doesn't work for all languages.

First, I need to detect the language. Fasttext is a good choice here. So, I'll add a language detection step using fasttext. Then, depending on the detected language, switch the sentence tokenizer. For example, use SpaCy for languages it supports and a regex-based splitter for CJK.
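
One way the tokenizer switch could look is a small registry keyed by language code. This is a sketch under assumptions: the codes, the registry name `SPLITTERS`, and the helper `regex_splitter` are all hypothetical, and a SpaCy-backed splitter would be registered the same way as any other callable taking a string and returning a list of sentences.

```python
import re
from typing import Callable, Dict, List

Splitter = Callable[[str], List[str]]


def regex_splitter(pattern: str) -> Splitter:
    """Build a splitter that cuts wherever `pattern` matches."""
    compiled = re.compile(pattern)

    def split(text: str) -> List[str]:
        return [s.strip() for s in compiled.split(text) if s.strip()]

    return split


# Fallback for languages without a dedicated entry.
DEFAULT_SPLITTER = regex_splitter(r"(?<=[.!?])\s+")

# Hypothetical registry; entries here are illustrative, not exhaustive.
SPLITTERS: Dict[str, Splitter] = {
    "zh": regex_splitter(r"(?<=[。！？])"),
    "ja": regex_splitter(r"(?<=[。！？])"),
}


def get_splitter(lang: str) -> Splitter:
    """Pick the splitter for a detected language code."""
    return SPLITTERS.get(lang, DEFAULT_SPLITTER)
```

Keeping the registry as plain callables means the chosen splitter could be handed to any component that accepts a sentence-splitting function.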

Next, adjust the buffer size based on the language. Maybe set different buffer sizes for different scripts. For instance, CJK might need a smaller buffer because characters are denser.
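
A buffer adjustment along those lines might be sketched as a per-script multiplier. The scale values below are placeholders chosen for illustration only; nothing in the issue specifies them, and real values would need to be tuned empirically.

```python
# Illustrative multipliers only; not taken from the issue or from LlamaIndex.
BUFFER_SCALE = {"latin": 1.0, "cjk": 0.5, "rtl": 1.0}


def scaled_buffer_size(base: int, script: str) -> int:
    """Scale the base buffer for a script, never dropping below 1."""
    scale = BUFFER_SCALE.get(script, 1.0)  # unknown scripts keep the base size
    return max(1, int(base * scale))
```

The floor of 1 matters because a small base value multiplied by 0.5 would otherwise truncate to zero.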

I should also handle multilingual documents by splitting them into chunks per language. Maybe process each chunk with the appropriate tokenizer and buffer size.
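
Splitting a mixed document into per-script runs could be sketched as follows. This is a simplified assumption-laden example: it only distinguishes CJK from everything else, and it attaches script-neutral characters (spaces, digits, punctuation) to whichever run is currently open.

```python
import unicodedata
from typing import List, Tuple


def script_of(ch: str) -> str:
    """Return 'cjk', 'latin', or '' for script-neutral characters."""
    if not ch.isalpha():
        return ""
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL")):
        return "cjk"
    return "latin"  # simplification: every other letter is treated as 'latin'


def script_runs(text: str) -> List[Tuple[str, str]]:
    """Split text into consecutive (script, segment) runs."""
    runs: List[Tuple[str, str]] = []
    current, buf = "", []
    for ch in text:
        s = script_of(ch)
        if s and current and s != current:
            # Script changed: close the open run before starting a new one.
            runs.append((current, "".join(buf)))
            buf = []
        if s:
            current = s
        buf.append(ch)
    if buf:
        runs.append((current or "latin", "".join(buf)))
    return runs
```

Each run can then be passed to the tokenizer and buffer size chosen for its script.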

Testing is important. Need to write unit tests for various languages and mixed-content scenarios. Also, make sure the code is optional, so users can enable the language-aware mode if needed.

Wait, what about dependencies? Adding fasttext and SpaCy might require users to install them. Should mention that in the documentation. Also, handle cases where the language detection isn't confident enough. Maybe default to the original tokenizer if detection fails.
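
The low-confidence fallback could be sketched like this. The `predict` argument is assumed to behave like a fastText model's `predict` method, returning a pair of label and probability sequences with labels of the form `__label__xx`; the threshold of 0.6 is an arbitrary placeholder.

```python
from typing import Callable, Optional, Sequence, Tuple

Prediction = Tuple[Sequence[str], Sequence[float]]


def detect_language(
    text: str,
    predict: Callable[[str], Prediction],
    min_confidence: float = 0.6,  # placeholder threshold, tune per use case
) -> Optional[str]:
    """Return a language code, or None when detection is not confident."""
    # fastText predicts one line at a time, so newlines are flattened first.
    labels, probs = predict(text.replace("\n", " "))
    if len(labels) == 0 or probs[0] < min_confidence:
        return None
    return labels[0].replace("__label__", "")
```

A caller would treat `None` as the signal to keep the original tokenizer and buffer size.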

Let me outline the steps: modify the class to include language detection, switch tokenizers, adjust buffer sizes, and test. Provide code examples for each part. Use a try-except block for importing fasttext to keep it optional. Maybe use a factory pattern for tokenizers based on detected language.
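
The optional-import guard might look like the sketch below. The function name `resolve_language_aware` is hypothetical; the idea is only that a requested language-aware mode degrades to the existing behaviour when fasttext is not installed.

```python
import logging

logger = logging.getLogger(__name__)

try:
    import fasttext  # optional dependency, only needed for language-aware mode
except ImportError:
    fasttext = None


def resolve_language_aware(requested: bool) -> bool:
    """Enable language-aware mode only if it was requested and fasttext imports."""
    if requested and fasttext is None:
        logger.warning(
            "language_aware requested but fasttext is not installed; "
            "falling back to the default splitter"
        )
        return False
    return requested
```

Logging a warning rather than raising keeps existing pipelines working when the extra dependency is absent.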

Need to make sure the code is efficient and doesn't add too much overhead. Also, consider performance implications of running language detection on large texts. Maybe process in chunks or use a streaming approach if possible.
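
One possible way to bound detection cost on large texts is to detect on a few evenly spaced windows instead of the full document. The window size and count below are illustrative assumptions, and sampling like this trades some accuracy on mixed-language documents for speed.

```python
def sample_for_detection(text: str, window: int = 200, max_windows: int = 5) -> str:
    """Return at most `max_windows` evenly spaced slices of `text`, joined by spaces."""
    if len(text) <= window * max_windows:
        return text  # short enough to detect on directly
    if max_windows < 2:
        return text[:window]
    # Spread window start positions from the beginning to the end of the text.
    step = (len(text) - window) // (max_windows - 1)
    return " ".join(
        text[i * step : i * step + window] for i in range(max_windows)
    )
```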

Examples in code would help. Show how to integrate fasttext, switch tokenizers, and adjust buffer sizes. Also, mention the necessary imports and any configuration changes. Finally, verify the fix by testing with

Fix Action

Fixed

PR fix notes

PR #20787: Enhance Semantic Splitting with Language-Aware Chunking
