langchain - Bug: `RecursiveJsonSplitter` silently violates `max_chunk_size` for nested JSON (path overhead ignored in size estimate)

Q: Expected behavior

All output chunks produced by `split_json()` and `split_text()` must have a serialized size ≤ `max_chunk_size`. This is the core contract of the class.

langchain, 2026-05-28 10:47:03

Error Message

Chunk 0: 59 bytes  (max_chunk_size=50)
AssertionError: CONTRACT VIOLATED: chunk 0 is 59 bytes > 50

Root Cause

In _json_split() (json.py line 98):

size = self._json_size({key: value})   # ← only measures {key: value} in isolation
remaining = self.max_chunk_size - chunk_size

if size < remaining:
    self._set_nested_dict(chunks[-1], new_path, value)  # ← inserts at full depth!

`size` measures the serialized length of `{key: value}` in isolation. However, `_set_nested_dict` inserts the value at the full `new_path` depth inside the chunk, so the overhead of every ancestor key in `current_path` is ignored by the size check.

Concrete trace for the reproducer (max_chunk_size=50):

| Variable | Value | Meaning |
|---|---|---|
| `chunk_size` | 2 | empty chunk `{}` |
| `size` | 47 | `_json_size({"level2": {"level3": "..."}})`, path overhead excluded |
| `remaining` | 48 | `50 - 2` |
| Decision | `47 < 48` → MERGE | wrong: the insertion produces a 59-byte chunk |
| Actual result | 59 bytes in chunk | `{"level1": {"level2": {"level3": "..."}}}` |

The 12 bytes added by wrapping the value in `{"level1": ...}` were never counted.
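The numbers in the trace can be checked without LangChain installed. The sketch below is only an illustration: `json_size` is a stand-in that assumes size is measured as the length of the default `json.dumps` output, which is what the trace values imply. It is not the library's own method.

```python
import json


def json_size(obj) -> int:
    # Stand-in for the splitter's size measure (assumption: default json.dumps length).
    return len(json.dumps(obj))


inner = {"level2": {"level3": "this is a test string"}}
full = {"level1": inner}

isolated = json_size(inner)     # what the size check compares against the budget
actual = json_size(full)        # what actually lands in the chunk
remaining = 50 - json_size({})  # budget left in an empty chunk

print(isolated, remaining, actual, actual - isolated)  # prints: 47 48 59 12
```

The check `47 < 48` passes, yet the chunk ends up at 59 bytes, 12 more than the estimate.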

Code Example

from langchain_text_splitters import RecursiveJsonSplitter
import json

splitter = RecursiveJsonSplitter(max_chunk_size=50)

data = {
    "level1": {
        "level2": {
            "level3": "this is a test string"
        }
    }
}

chunks = splitter.split_json(data)

for i, chunk in enumerate(chunks):
    size = len(json.dumps(chunk))
    print(f"Chunk {i}: {size} bytes  (max_chunk_size=50)")
    assert size <= 50, f"CONTRACT VIOLATED: chunk {i} is {size} bytes > 50"

Submission checklist

This is a bug, not a usage question.
I added a clear and descriptive title that summarizes this issue.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain.
I posted a self-contained, minimal, reproducible example.

Package

langchain-text-splitters

Actual behavior

Chunks silently exceed max_chunk_size for nested JSON structures. No error or warning is raised. The violation grows with nesting depth.
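To give a rough sense of how the uncounted overhead scales with depth, the following sketch measures how many serialized bytes the ancestor wrappers alone contribute. The key names (`level1`, `level2`, ...), the leaf value, and both helper functions are hypothetical and chosen for illustration. The sketch does not call the splitter; it shows only the size of the path that the estimate leaves out.

```python
import json


def nest(value, depth: int):
    """Wrap value in `depth` single-key dicts: {"level1": {"level2": ... value}}."""
    for i in range(depth, 0, -1):
        value = {f"level{i}": value}
    return value


def path_overhead(depth: int) -> int:
    """Bytes added by the ancestor keys alone, relative to the bare leaf."""
    leaf = {"leaf": "x"}
    return len(json.dumps(nest(leaf, depth))) - len(json.dumps(leaf))


for depth in (1, 2, 4, 8):
    print(depth, path_overhead(depth))
# prints:
# 1 12
# 2 24
# 4 48
# 8 96
```

With these key names, each ancestor level adds 12 bytes; longer keys add proportionally more.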

Impact

This breaks the core use case of RecursiveJsonSplitter in RAG pipelines, where developers set max_chunk_size to stay within an LLM's token budget. Because the guarantee is silently violated:

- Chunks fed to an LLM may exceed the expected token limit, causing unexpected context-window overflow errors.
- The violation is silent and scales with nesting depth: deeply nested API responses (e.g. OpenAPI schemas, Kubernetes configs) can produce chunks significantly larger than max_chunk_size.
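Until the size estimate is fixed, one possible stopgap is to check chunks after splitting so that a violation is at least visible. The helper below is a hypothetical sketch (the name `oversized_chunks` is not part of LangChain). It applies the same `len(json.dumps(chunk))` measure used in the reproducer's assertion.

```python
import json


def oversized_chunks(chunks, max_chunk_size):
    """Return (index, size) for every chunk whose JSON serialization exceeds the limit."""
    sizes = (len(json.dumps(chunk)) for chunk in chunks)
    return [(i, size) for i, size in enumerate(sizes) if size > max_chunk_size]


# Example input: the 59-byte chunk from the reproducer plus one small chunk.
chunks = [
    {"level1": {"level2": {"level3": "this is a test string"}}},
    {"a": 1},
]
print(oversized_chunks(chunks, 50))  # prints: [(0, 59)]
```

In a pipeline, the result could be logged or turned into an exception before the chunks reach the model.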

Suggested fix

Replace the isolated _json_size({key: value}) estimate with the size of the item wrapped in its full path, i.e. the size that inserting it at new_path into an empty chunk would produce:

# Current (broken):
size = self._json_size({key: value})

# Fixed: measure the actual overhead including the full path
test_chunk: dict = {}
self._set_nested_dict(test_chunk, new_path, value)
size = self._json_size(test_chunk)

This correctly accounts for path overhead at every recursion level.
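The effect of the suggested measurement on the merge decision can be sketched in isolation. In the snippet below, `set_nested` is a hypothetical stand-in for `_set_nested_dict` (assumed to create missing ancestors and then assign the value), and `path_aware_size` is an illustrative name, not an existing method. Only the size check is modeled, not the rest of `_json_split`.

```python
import json


def set_nested(target: dict, path: list, value) -> None:
    """Hypothetical stand-in for _set_nested_dict: create ancestors, then assign."""
    node = target
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value


def path_aware_size(path: list, value) -> int:
    """Serialized size of value once it is wrapped in every key of its path."""
    probe: dict = {}
    set_nested(probe, path, value)
    return len(json.dumps(probe))


value = {"level3": "this is a test string"}
path = ["level1", "level2"]
remaining = 50 - len(json.dumps({}))  # 48, as in the trace

isolated = len(json.dumps({path[-1]: value}))  # current estimate
aware = path_aware_size(path, value)           # suggested estimate

print(isolated < remaining, aware < remaining)  # prints: True False
```

With the path-aware estimate the item is no longer merged wholesale at this level. Whether the final chunks then fit depends on the remaining recursion, which this sketch does not model.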

System Info

langchain-text-splitters latest (master), Python 3.11+.

Discovered via architectural analysis with Hokmah (TransitionGraph + IdeaGraph on 499 commits).
