litellm - ✅(Solved) Fix [Bug]: O(n²) json.loads retry in handle_accumulated_json_chunk blocks event loop, kills liveness probes [1 pull requests, 1 participants]

6matt · 2026-04-21T17:01:10Z



BerriAI/litellm#26181•Fetched 2026-04-22 07:45:57


Fix Action

Fixed

Fixed by PR: perf(streaming): eliminate O(n²) json.loads in accumulated-JSON paths (Vertex, Anthropic, SageMaker) (https://github.com/BerriAI/litellm/pull/26187)

PR fix notes

PR #26187: perf(streaming): eliminate O(n²) json.loads in accumulated-JSON paths (Vertex, Anthropic, SageMaker)

Repository: BerriAI/litellm
Author: Anai-Guo
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/26187

Description (problem / solution / changelog)

Problem

Closes #26181.

When Vertex AI (Gemini), Anthropic, or SageMaker streaming responses arrive as fragmented SSE chunks, handle_accumulated_json_chunk / _handle_accumulated_json_chunk concatenates every chunk into a string (self.accumulated_json += chunk) and calls json.loads() on the entire accumulated string after each new chunk.

This is O(n²) in total work: every chunk re-parses the full buffer from the beginning. For large responses (long code generation, big tool-call payloads — several MB), json.loads — a single CPython C call that holds the GIL — blocks for seconds at a time. In the default single-process uvicorn deployment this freezes the event loop, causing liveness probe failures and kubelet pod restarts.

Fix

Two changes per affected path:

1. List-based accumulation: replace self.accumulated_json += chunk (O(n) copy per call → O(n²) total) with self.accumulated_json_chunks.append(chunk) (O(1) per call). The join ("".join(chunks)) is deferred to parse time.
2. Completeness heuristic: skip json.loads entirely unless the last non-whitespace character of the new chunk is } or ]. A well-formed JSON object/array always ends with one of these. An incomplete fragment usually does not, though it can (for example, when a nested object closes exactly at a chunk boundary); in that case json.loads still raises JSONDecodeError and accumulation continues. This eliminates the vast majority of unnecessary json.loads calls (and their GIL holds) at the cost of one extra str.rstrip() check.

Together, these changes reduce total CPU work from O(n²) to O(n) for the common case where JSON fragments arrive one at a time.
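The two changes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the class name ChunkAccumulator, the handle_chunk method, and the parse_attempts counter are hypothetical, and only accumulated_json_chunks mirrors a name used in the PR description.

```python
import json
from typing import Any, List, Optional


class ChunkAccumulator:
    """Hypothetical stand-in for the accumulation logic; not LiteLLM's actual class."""

    def __init__(self) -> None:
        self.accumulated_json_chunks: List[str] = []
        self.parse_attempts = 0  # instrumentation for this demo only

    def handle_chunk(self, chunk: str) -> Optional[Any]:
        # O(1) append instead of an O(n) string concatenation.
        self.accumulated_json_chunks.append(chunk)

        # Cheap gate: only attempt a parse if the chunk ends with } or ].
        if not chunk.rstrip().endswith(("}", "]")):
            return None

        self.parse_attempts += 1
        try:
            data = json.loads("".join(self.accumulated_json_chunks))
        except json.JSONDecodeError:
            # The chunk ended with a closer but the document is still
            # incomplete (a nested object just closed): keep buffering.
            return None
        self.accumulated_json_chunks = []
        return data


acc = ChunkAccumulator()
fragments = ['{"candidates": [{"text": "hel', 'lo"}', '], "usage": {"tokens": 3}', "}"]
results = [acc.handle_chunk(f) for f in fragments]
```

In this run the first fragment is buffered without any parse, the second and third trigger a parse that fails because only a nested value has closed, and the last one yields the parsed object. The gate removes parse attempts but does not guarantee that every attempt succeeds.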

Files changed

File	Pattern fixed
`litellm/llms/sagemaker/common_utils.py`	`iter_bytes` + `aiter_bytes` local `accumulated_json`
`litellm/llms/anthropic/chat/handler.py`	`self.accumulated_json` instance variable + `_handle_accumulated_json_chunk`
`litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py`	`self.accumulated_json` instance variable + `handle_accumulated_json_chunk`

Behaviour unchanged

Complete JSON chunks (last char } or ]) parse on the first attempt — same latency as before.
Incomplete chunks are buffered — same accumulation semantics as before.
All existing error paths (json.JSONDecodeError, UnicodeDecodeError, StopIteration/StopAsyncIteration tail handlers) are preserved.
OpenAI paths are unaffected (they already do single-chunk parsing).

🤖 Generated with Claude Code

Changed files

litellm/llms/anthropic/chat/handler.py (modified, +16/-14)
litellm/llms/sagemaker/common_utils.py (modified, +24/-27)
litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py (modified, +10/-10)


What happened?

When a Vertex AI (Gemini), Anthropic, or SageMaker streaming response arrives as fragmented SSE chunks, handle_accumulated_json_chunk concatenates every chunk into self.accumulated_json and calls json.loads() on the entire accumulated string after each chunk. If the JSON is incomplete, JSONDecodeError is caught and the next chunk triggers another full parse of the now-larger string.

This is O(n²) in total work for n bytes of streaming data. For large responses (long code generation, large tool-call payloads), the accumulated string grows to megabytes and json.loads — a single CPython C call that holds the GIL — blocks for seconds at a time.

In a uvicorn single-process async deployment (the default proxy configuration), this freezes the entire asyncio event loop. The liveness probe handler never gets a turn to run, the probe times out, and kubelet restarts the pod.

We confirmed this with 7 consecutive py-spy thread dumps spanning ~70 seconds, all showing the same state:

MainThread (active+gil): stuck in json.loads → json.decoder.decode → json.decoder.raw_decode, called from handle_accumulated_json_chunk
All 15 other threads: idle, waiting for work

Affected code paths

Provider	File	Method/Pattern
Vertex AI (Gemini)	`litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py` ~L3307	`handle_accumulated_json_chunk`
Anthropic	`litellm/llms/anthropic/chat/handler.py` ~L1088	`_handle_accumulated_json_chunk`
SageMaker	`litellm/llms/sagemaker/common_utils.py` ~L75 (sync) / ~L127 (async)	`accumulated_json` local variable in `iter_bytes` / `aiter_bytes`

All three are copy-paste of the same pattern:

self.accumulated_json += message          # unbounded string growth
try:
    _data = json.loads(self.accumulated_json)  # re-parse entire blob every chunk
    self.accumulated_json = ""
    return self.chunk_parser(chunk=_data)
except json.JSONDecodeError:
    return None  # keep accumulating → next chunk re-parses even more

OpenAI is not affected — its iterator does a single json.loads per chunk with no accumulation.
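To make the quadratic growth concrete, the following sketch counts how many characters would be handed to json.loads under the original pattern versus a trailing } / ] gate. It is an illustration only: count_parsed_chars is a hypothetical helper, the 100-character fragment size is an arbitrary assumption, and character counts are used as a rough proxy for parse work rather than a timing measurement.

```python
import json


def count_parsed_chars(fragments):
    """Return (naive, gated) totals of characters that would be passed to json.loads."""
    naive_total = 0
    gated_total = 0
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        # Original pattern: every chunk triggers a parse of the whole buffer.
        naive_total += len(buffer)
        # Gated pattern: parse only when the new chunk ends with } or ].
        if fragment.rstrip().endswith(("}", "]")):
            gated_total += len(buffer)
    return naive_total, gated_total


# A ~10 KB JSON document split into 100-character fragments.
payload = json.dumps({"text": "x" * 10_000})
fragments = [payload[i:i + 100] for i in range(0, len(payload), 100)]
naive, gated = count_parsed_chars(fragments)
print(f"naive: {naive} chars, gated: {gated} chars, ratio: {naive / gated:.0f}x")
```

Doubling the payload size roughly quadruples the naive total while the gated total only doubles, which is the O(n²) versus O(n) difference described above.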

Why it's dangerous

Unbounded accumulation — no cap on accumulated_json size
O(n²) total CPU — every chunk re-parses the entire buffer, not just the new bytes
GIL-blocking — json.loads is a single C call; it cannot yield to the event loop
Event loop starvation — in the default single-process uvicorn deployment, a multi-second json.loads call prevents all HTTP handlers (including health/liveness probes) from running
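The starvation effect can be reproduced in isolation. The sketch below is an assumption-laden demo rather than LiteLLM code: blocking_parse uses time.sleep as a stand-in for a long synchronous parse, and heartbeat plays the role of a liveness handler by recording how late each of its ticks runs.

```python
import asyncio
import time


def blocking_parse(seconds: float) -> None:
    # Stand-in for a long synchronous json.loads call (assumption for this demo).
    time.sleep(seconds)


async def heartbeat(gaps: list, interval: float = 0.01) -> None:
    # Plays the role of a liveness handler: records the time between ticks.
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def main() -> float:
    gaps: list = []
    task = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0.05)  # let a few normal ticks happen
    blocking_parse(0.3)        # runs on the event loop thread; no other task can run
    await asyncio.sleep(0.05)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(gaps)


max_gap = asyncio.run(main())
print(f"longest heartbeat gap: {max_gap:.3f}s")
```

The heartbeat is scheduled every 10 ms, yet its longest gap is at least as long as the blocking call. A liveness probe with a timeout shorter than that gap would fail in the same way.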

Steps to Reproduce

1. Deploy LiteLLM proxy with a Vertex AI Gemini model (or Anthropic/SageMaker).
2. Send a streaming request that produces a large response (e.g., long code generation, or a response with large tool-call JSON payloads, several MB in total).
3. Ensure the SSE transport fragments the JSON across multiple chunks (this happens naturally at network boundaries; more likely with larger responses).
4. Observe: the proxy becomes unresponsive for 10+ seconds during the streaming response; liveness probes fail; under Kubernetes, the pod gets restarted.

To confirm via profiling:

py-spy dump --pid <proxy_pid>

You'll see MainThread stuck in json.loads called from handle_accumulated_json_chunk.

Relevant log output

# 7 consecutive py-spy thread dumps (70 seconds), all identical:

Thread 1 (active+gil): "MainThread"
  json.loads (json/__init__.py)
  json.decoder.JSONDecoder.decode (json/decoder.py)
  json.decoder.JSONDecoder.raw_decode (json/decoder.py)
  handle_accumulated_json_chunk (vertex_and_google_ai_studio_gemini.py:3071)
  ...

# All 15 other threads: idle (waiting for work in ThreadPoolExecutor/AnyIO)

Related issues

#13505 — Streaming CPU is 4-5x non-streaming (this bug is a significant contributor)
#20268 — Sync next() blocking event loop during streaming (same symptom class)
#24788 — Sync call blocking event loop → pod restarts (same symptom class)
#16562 (closed) — Fix that expanded handle_accumulated_json_chunk usage, making this O(n²) path trigger more frequently

Possible fix directions

1. Collect chunks in a list, join only once: append to list[str], only "".join() + json.loads() when a heuristic suggests completeness (e.g., balanced braces, or trailing } / ])
2. Incremental JSON parser: use a library like ijson to parse as bytes arrive
3. Offload to a thread: run the parse in asyncio.to_thread() so it doesn't block the event loop (addresses the liveness probe issue but not the O(n²) CPU waste)
4. Cap accumulated buffer size: fail loudly rather than silently accumulating unbounded data

Options 1+3 combined would address both the CPU waste and the event loop starvation.
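A rough sketch of how options 1, 3, and 4 might fit together is shown below. Everything named here is hypothetical (OffloadingAccumulator, BufferTooLarge, MAX_BUFFER_BYTES and its value); only asyncio.to_thread (Python 3.9+) and json.loads are real library calls. Because, as the issue notes, each json.loads call holds the GIL while it runs, offloading is probably not a complete remedy on its own; it mainly gives the event loop a chance to run between parses.

```python
import asyncio
import json
from typing import Any, List

MAX_BUFFER_BYTES = 1_000_000  # hypothetical cap; pick a limit that fits the deployment


class BufferTooLarge(Exception):
    """Raised when buffered fragments exceed the cap (hypothetical error type)."""


class OffloadingAccumulator:
    def __init__(self, max_bytes: int = MAX_BUFFER_BYTES) -> None:
        self.chunks: List[str] = []
        self.size = 0
        self.max_bytes = max_bytes

    async def handle_chunk(self, chunk: str) -> Any:
        self.chunks.append(chunk)
        self.size += len(chunk)
        if self.size > self.max_bytes:
            # Option 4: fail loudly instead of accumulating without bound.
            raise BufferTooLarge(f"buffered {self.size} chars, cap is {self.max_bytes}")
        # Option 1: skip the parse unless the chunk looks like an ending.
        if not chunk.rstrip().endswith(("}", "]")):
            return None
        text = "".join(self.chunks)
        try:
            # Option 3: run the parse in a worker thread.
            data = await asyncio.to_thread(json.loads, text)
        except json.JSONDecodeError:
            return None
        self.chunks, self.size = [], 0
        return data


async def demo() -> list:
    acc = OffloadingAccumulator()
    results = []
    for fragment in ['{"a": [1, 2', ', 3]', "}"]:
        results.append(await acc.handle_chunk(fragment))
    return results


results = asyncio.run(demo())
```

The cap is measured in characters of the decoded string here for simplicity; a byte-based limit would need to measure the encoded length instead.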

Component: SDK (litellm Python package) + Proxy (the proxy deployment is where liveness probe failures manifest)

LiteLLM version: v1.83.7 (also confirmed on litellm_internal_staging at v1.83.9-nightly)

extent analysis

TL;DR

Collect chunks in a list and join only once when a heuristic suggests completeness to avoid O(n²) total CPU waste and GIL-blocking.

Guidance

Identify a suitable heuristic to determine JSON completeness, such as balanced braces or trailing } / ], to decide when to join and parse the accumulated chunks.
Consider using a library like ijson for incremental JSON parsing to further optimize the parsing process.
Offloading the parsing to a thread using asyncio.to_thread() can help prevent event loop starvation but may not address the underlying CPU waste issue.
Implementing a cap on the accumulated buffer size can prevent unbounded growth but may require additional error handling.
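As one possible shape for the balanced-braces heuristic mentioned in the guidance, the sketch below tracks nesting depth incrementally so that each call scans only the new chunk. BraceBalanceTracker is a hypothetical helper, not part of LiteLLM or the PR; it checks structure only and assumes the top-level value is an object or array.

```python
import json


class BraceBalanceTracker:
    """Tracks JSON nesting depth across chunks; structure only, not validity."""

    def __init__(self) -> None:
        self.depth = 0
        self.in_string = False
        self.escaped = False
        self.seen_open = False

    def feed(self, chunk: str) -> bool:
        """Scan only the new chunk; return True once the top-level value has closed."""
        for ch in chunk:
            if self.in_string:
                if self.escaped:
                    self.escaped = False      # character after a backslash is literal
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
            elif ch == '"':
                self.in_string = True
            elif ch in "{[":
                self.depth += 1
                self.seen_open = True
            elif ch in "}]":
                self.depth -= 1
        return self.seen_open and self.depth == 0 and not self.in_string


tracker = BraceBalanceTracker()
fragments = ['{"msg": "a } inside', ' a string", "items": [{"n": 1}', "]}"]
states = [tracker.feed(f) for f in fragments]
parsed = json.loads("".join(fragments)) if states[-1] else None
```

Compared with the trailing } / ] check, this ignores braces inside string values and does not fire when only a nested object closes (the second fragment here), at the cost of a character-by-character scan of each chunk in Python.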

Example

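import json


class JSONParser:
    """Buffers streamed fragments and parses only when the JSON looks complete."""

    def __init__(self):
        self.chunks = []    # fragments received since the last successful parse
        self.result = None  # most recently parsed value, if any

    def add_chunk(self, chunk):
        self.chunks.append(chunk)
        if self.is_complete(chunk):
            self.parse_json()

    def is_complete(self, chunk):
        # Cheap heuristic: a complete JSON object/array ends with `}` or `]`.
        # A fragment can also end this way, so parse_json must tolerate failure.
        return chunk.rstrip().endswith(("}", "]"))

    def parse_json(self):
        try:
            self.result = json.loads("".join(self.chunks))
        except json.JSONDecodeError:
            # Still incomplete: keep the buffer and wait for more chunks.
            return
        self.chunks = []  # reset the buffer after a successful parse


# Usage
parser = JSONParser()
parser.add_chunk('{"key": "val')  # buffered, no parse attempted
parser.add_chunk('ue"}')          # ends with `}`, so the buffer is joined and parsed
print(parser.result)              # {'key': 'value'}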

Notes

The provided example is a simplified illustration and may require modifications to fit the specific use case. The choice of heuristic for determining JSON completeness will depend on the specific requirements and constraints of the application.

Recommendation

Collect chunks in a list and join and parse only when a heuristic suggests completeness; this removes most of the redundant parsing and the CPU waste that comes with it. To also cover event loop starvation, combine it with offloading the parse to a thread, which is the options 1+3 pairing the issue proposes.
