langchain: Bug: `RecursiveJsonSplitter` silently violates `max_chunk_size` for nested JSON (path overhead ignored in size estimate)

Submission checklist

  • This is a bug, not a usage question.
  • I added a clear and descriptive title that summarizes this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain.
  • I posted a self-contained, minimal, reproducible example.

Package

langchain-text-splitters

Reproduction

from langchain_text_splitters import RecursiveJsonSplitter
import json

splitter = RecursiveJsonSplitter(max_chunk_size=50)

data = {
    "level1": {
        "level2": {
            "level3": "this is a test string"
        }
    }
}

chunks = splitter.split_json(data)

for i, chunk in enumerate(chunks):
    size = len(json.dumps(chunk))
    print(f"Chunk {i}: {size} bytes  (max_chunk_size=50)")
    assert size <= 50, f"CONTRACT VIOLATED: chunk {i} is {size} bytes > 50"

Output:

Chunk 0: 59 bytes  (max_chunk_size=50)
AssertionError: CONTRACT VIOLATED: chunk 0 is 59 bytes > 50

Expected behavior

All output chunks produced by split_json() and split_text() must have a serialized size ≤ max_chunk_size. This is the core contract of the class.

Actual behavior

Chunks silently exceed max_chunk_size for nested JSON structures. No error or warning is raised. The violation grows with nesting depth.
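
To make the depth dependence concrete, here is a minimal standalone sketch. It does not call LangChain; it assumes size is measured as `len(json.dumps(...))`, as in the reproducer's assertion, and `wrap` and `path_overhead` are hypothetical helper names introduced only for this illustration.

```python
import json


def wrap(value, keys):
    """Nest value under keys, outermost key first (hypothetical helper)."""
    for key in reversed(keys):
        value = {key: value}
    return value


def path_overhead(keys, value):
    """Serialized size added by the ancestor keys on top of the bare value."""
    return len(json.dumps(wrap(value, keys))) - len(json.dumps(value))


leaf = {"leaf": "x"}
for depth in range(1, 5):
    keys = [f"level{i}" for i in range(1, depth + 1)]
    # Each "levelN" ancestor contributes 12 characters: {"levelN": ...}
    print(depth, path_overhead(keys, leaf))
# prints 12, 24, 36, 48 for depths 1 to 4
```

With six-character ancestor keys, each extra nesting level adds 12 characters that an isolated `{key: value}` measurement does not see.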

Root cause

In _json_split() (json.py line 98):

size = self._json_size({key: value})   # ← only measures {key: value} in isolation
remaining = self.max_chunk_size - chunk_size

if size < remaining:
    self._set_nested_dict(chunks[-1], new_path, value)  # ← inserts at full depth!

size measures the byte-length of {key: value} in isolation. But _set_nested_dict inserts the value at the full new_path depth inside the chunk — meaning the overhead of all ancestor keys in current_path is completely ignored.

Concrete trace for the reproducer above (max_chunk_size=50):

| Variable | Value | Meaning |
| --- | --- | --- |
| chunk_size | 2 | empty chunk {} |
| size | 47 | _json_size({"level2": {"level3": "..."}}), path overhead excluded |
| remaining | 48 | 50 - 2 |
| Decision | 47 < 48 | MERGE (wrong: the chunk is 59 bytes after insertion) |
| Actual result | 59 bytes in chunk | {"level1": {"level2": {"level3": "..."}}} |

The 12 bytes added by the {"level1": ...} wrapper (59 - 47) were never counted.
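
The trace can be reproduced without the library. This is a sketch under stated assumptions: `json_size` and `set_nested_dict` are standalone stand-ins named after the private methods quoted above, not the library's own code, and the variables follow the snippet in this section.

```python
import json


def json_size(data):
    """Stand-in for _json_size: length of the serialized JSON."""
    return len(json.dumps(data))


def set_nested_dict(d, path, value):
    """Stand-in for _set_nested_dict: insert value at path, creating ancestors."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value


max_chunk_size = 50
key = "level2"
value = {"level3": "this is a test string"}
new_path = ["level1", "level2"]

chunk = {}
chunk_size = json_size(chunk)            # 2
size = json_size({key: value})           # 47, ancestors not included
remaining = max_chunk_size - chunk_size  # 48
merged = size < remaining                # True, so the item is merged

set_nested_dict(chunk, new_path, value)
actual = json_size(chunk)                # 59, ancestors included
print(chunk_size, size, remaining, merged, actual)
```

The gap between `size` (47) and `actual` (59) is the 12 characters of the `{"level1": ...}` wrapper.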

Impact

This breaks the core use case of RecursiveJsonSplitter in RAG pipelines: developers set max_chunk_size to stay within an LLM's token budget. Because the guarantee is silently violated:

  • Chunks fed to an LLM may exceed the expected token limit → unexpected context window overflow errors
  • The violation is silent and scales with nesting depth — deeply nested API responses (e.g. OpenAPI schemas, Kubernetes configs) can produce chunks significantly larger than max_chunk_size
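
Until the estimate is corrected, a pipeline can at least make the violation visible. The following is a minimal sketch of a post-split check; `find_oversized` is a hypothetical helper, not part of langchain-text-splitters, and it assumes chunks are plain dicts as returned by split_json().

```python
import json


def find_oversized(chunks, max_chunk_size):
    """Return (index, size) for every chunk whose serialized size exceeds the limit."""
    oversized = []
    for i, chunk in enumerate(chunks):
        size = len(json.dumps(chunk))
        if size > max_chunk_size:
            oversized.append((i, size))
    return oversized


# Example input: the chunk from the reproducer plus a small one.
chunks = [
    {"level1": {"level2": {"level3": "this is a test string"}}},
    {"a": 1},
]
print(find_oversized(chunks, 50))  # [(0, 59)]
```

A caller could log the returned pairs or fall back to another splitting strategy for those chunks, instead of discovering the overflow at the LLM call.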

Suggested fix

Replace the isolated _json_size({key: value}) estimate with the size of the item as it is actually serialized once inserted at new_path, ancestor keys included:

# Current (broken):
size = self._json_size({key: value})

# Fixed: measure the actual overhead including the full path
test_chunk: dict = {}
self._set_nested_dict(test_chunk, new_path, value)
size = self._json_size(test_chunk)

This correctly accounts for path overhead at every recursion level.
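
As an illustration of how the merge decision changes, here is a standalone sketch of the suggested measurement. The helpers are stand-ins named after the methods quoted above, and `full_path_size` and `should_merge` are hypothetical names; none of this is the library's implementation.

```python
import json


def json_size(data):
    """Stand-in for _json_size: length of the serialized JSON."""
    return len(json.dumps(data))


def set_nested_dict(d, path, value):
    """Stand-in for _set_nested_dict: insert value at path, creating ancestors."""
    for key in path[:-1]:
        d = d.setdefault(key, {})
    d[path[-1]] = value


def full_path_size(path, value):
    """Size of value serialized together with all of its ancestor keys."""
    test_chunk = {}
    set_nested_dict(test_chunk, path, value)
    return json_size(test_chunk)


def should_merge(chunk, path, value, max_chunk_size):
    """Merge only if the full-path size fits in the space left in chunk."""
    remaining = max_chunk_size - json_size(chunk)
    return full_path_size(path, value) < remaining


value = {"level3": "this is a test string"}
path = ["level1", "level2"]
print(full_path_size(path, value))         # 59
print(should_merge({}, path, value, 50))   # False: 59 is not < 48
print(should_merge({}, path, value, 100))  # True: 59 < 98
```

One caveat about the reproducer itself: the single leaf serialized with its full path is already 59 characters, so no path-preserving chunk can fit under 50. A corrected estimate stops the wrong merge decision, but the splitter would still need a policy (for example an error or a warning) for items that cannot fit at all.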

System Info

langchain-text-splitters latest (master), Python 3.11+.

Discovered via architectural analysis with Hokmah (TransitionGraph + IdeaGraph on 499 commits).
