litellm - ✅(Solved) Fix [Bug]: O(n²) json.loads retry in handle_accumulated_json_chunk blocks event loop, kills liveness probes [1 pull requests, 1 participants]

GitHub stats
BerriAI/litellm#26181 · Fetched 2026-04-22 07:45:57
Comments: 0 · Participants: 1 · Timeline: 4 · Reactions: 0
Timeline (top): cross-referenced ×3, labeled ×1

Fix Action

Fixed

PR fix notes

PR #26187: perf(streaming): eliminate O(n²) json.loads in accumulated-JSON paths (Vertex, Anthropic, SageMaker)

Description (problem / solution / changelog)

Problem

Closes #26181.

When Vertex AI (Gemini), Anthropic, or SageMaker streaming responses arrive as fragmented SSE chunks, handle_accumulated_json_chunk / _handle_accumulated_json_chunk concatenates every chunk into a string (self.accumulated_json += chunk) and calls json.loads() on the entire accumulated string after each new chunk.

This is O(n²) in total work: every chunk re-parses the full buffer from the beginning. For large responses (long code generation, big tool-call payloads — several MB), json.loads — a single CPython C call that holds the GIL — blocks for seconds at a time. In the default single-process uvicorn deployment this freezes the event loop, causing liveness probe failures and kubelet pod restarts.

Fix

Two changes per affected path:

  1. List-based accumulation: replace self.accumulated_json += chunk (O(n) copy per call → O(n²) total) with self.accumulated_json_chunks.append(chunk) (O(1) per call). The join ("".join(chunks)) is deferred to parse time.

  2. Completeness heuristic: skip json.loads entirely unless the last non-whitespace character of the new chunk is } or ]. A well-formed JSON object/array always ends with one of these, and most incomplete fragments do not. A fragment that happens to end in } or ] (for example, one that closes a nested object) is still passed to json.loads and falls back to the existing json.JSONDecodeError path. This eliminates the vast majority of unnecessary json.loads calls (and their GIL holds) at the cost of one extra str.rstrip() check.

Together, these changes reduce total CPU work from O(n²) to O(n) for the common case where JSON fragments arrive one at a time.
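
To make the O(n²) versus O(n) claim concrete, the sketch below counts how many characters are handed to json.loads under each strategy. It is a simulation under assumed conditions (one JSON document split into fixed-size fragments). The names split_fixed, chars_parsed_naive and chars_parsed_heuristic are hypothetical and do not appear in LiteLLM.

```python
import json


def split_fixed(text, size):
    """Split text into fragments of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chars_parsed_naive(fragments):
    """Old pattern: concatenate and re-parse the whole buffer on every fragment."""
    buffer, total = "", 0
    for fragment in fragments:
        buffer += fragment
        total += len(buffer)  # the whole buffer is handed to json.loads again
        try:
            json.loads(buffer)
            buffer = ""
        except json.JSONDecodeError:
            pass  # incomplete: keep accumulating
    return total


def chars_parsed_heuristic(fragments):
    """New pattern: append to a list, parse only when a fragment ends in } or ]."""
    chunks, total = [], 0
    for fragment in fragments:
        chunks.append(fragment)
        if not fragment.rstrip().endswith(("}", "]")):
            continue  # cannot be complete yet, so skip json.loads
        joined = "".join(chunks)
        total += len(joined)
        try:
            json.loads(joined)
            chunks = []
        except json.JSONDecodeError:
            pass  # ended in } or ] but still incomplete: keep buffering
    return total


payload = json.dumps({"text": "x" * 100_000})
fragments = split_fixed(payload, 1_000)
naive = chars_parsed_naive(fragments)
heuristic = chars_parsed_heuristic(fragments)
print(f"naive: {naive} chars parsed, heuristic: {heuristic} chars parsed")
```

With this payload the heuristic path parses the document exactly once, while the naive path re-parses a growing buffer on every fragment. The gap widens as the fragment count grows.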

Files changed

File | Pattern fixed
litellm/llms/sagemaker/common_utils.py | iter_bytes + aiter_bytes local accumulated_json
litellm/llms/anthropic/chat/handler.py | self.accumulated_json instance variable + _handle_accumulated_json_chunk
litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py | self.accumulated_json instance variable + handle_accumulated_json_chunk

Behaviour unchanged

  • Complete JSON chunks (last char } or ]) parse on the first attempt — same latency as before.
  • Incomplete chunks are buffered — same accumulation semantics as before.
  • All existing error paths (json.JSONDecodeError, UnicodeDecodeError, StopIteration/StopAsyncIteration tail handlers) are preserved.
  • OpenAI paths are unaffected (they already do single-chunk parsing).

🤖 Generated with Claude Code

Changed files

  • litellm/llms/anthropic/chat/handler.py (modified, +16/-14)
  • litellm/llms/sagemaker/common_utils.py (modified, +24/-27)
  • litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py (modified, +10/-10)


What happened?

When a Vertex AI (Gemini), Anthropic, or SageMaker streaming response arrives as fragmented SSE chunks, handle_accumulated_json_chunk concatenates every chunk into self.accumulated_json and calls json.loads() on the entire accumulated string after each chunk. If the JSON is incomplete, JSONDecodeError is caught and the next chunk triggers another full parse of the now-larger string.

This is O(n²) in total work for n bytes of streaming data. For large responses (long code generation, large tool-call payloads), the accumulated string grows to megabytes and json.loads — a single CPython C call that holds the GIL — blocks for seconds at a time.

In a uvicorn single-process async deployment (the default proxy configuration), this freezes the entire asyncio event loop. The liveness probe handler never gets a turn to run, the probe times out, and kubelet restarts the pod.
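
The starvation effect can be reproduced without LiteLLM. In the sketch below, time.sleep stands in for a long synchronous json.loads call and a heartbeat coroutine stands in for the liveness probe handler. Both are illustrative assumptions, not proxy code, and the names heartbeat and measure_max_gap are hypothetical.

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.01):
    """Record a timestamp every `interval` seconds, like a handler getting a turn."""
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(interval)


async def measure_max_gap(block_seconds):
    """Return the longest pause between heartbeat ticks around a blocking call."""
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)   # let the heartbeat run normally
    time.sleep(block_seconds)   # synchronous call: the loop cannot switch tasks
    await asyncio.sleep(0.05)   # let the heartbeat resume
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


max_gap = asyncio.run(measure_max_gap(0.2))
print(f"longest heartbeat gap: {max_gap:.3f}s")
```

The heartbeat normally ticks every 10 ms, but its longest gap is at least as long as the blocking call. A probe handler scheduled on the same loop would be delayed in the same way.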

We confirmed this with 7 consecutive py-spy thread dumps spanning ~70 seconds, all showing the same state:

  • MainThread (active+gil): stuck in json.loads → json.decoder.decode → json.decoder.raw_decode, called from handle_accumulated_json_chunk
  • All 15 other threads: idle, waiting for work

Affected code paths

Provider | File | Method/Pattern
Vertex AI (Gemini) | litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py ~L3307 | handle_accumulated_json_chunk
Anthropic | litellm/llms/anthropic/chat/handler.py ~L1088 | _handle_accumulated_json_chunk
SageMaker | litellm/llms/sagemaker/common_utils.py ~L75 (sync) / ~L127 (async) | accumulated_json local variable in iter_bytes / aiter_bytes

All three are copy-pasted instances of the same pattern:

self.accumulated_json += message          # unbounded string growth
try:
    _data = json.loads(self.accumulated_json)  # re-parse entire blob every chunk
    self.accumulated_json = ""
    return self.chunk_parser(chunk=_data)
except json.JSONDecodeError:
    return None  # keep accumulating → next chunk re-parses even more

OpenAI is not affected — its iterator does a single json.loads per chunk with no accumulation.

Why it's dangerous

  1. Unbounded accumulation: no cap on accumulated_json size
  2. O(n²) total CPU: every chunk re-parses the entire buffer, not just the new bytes
  3. GIL-blocking: json.loads is a single C call; it cannot yield to the event loop
  4. Event loop starvation: in the default single-process uvicorn deployment, a multi-second json.loads call prevents all HTTP handlers (including health/liveness probes) from running

Steps to Reproduce

  1. Deploy LiteLLM proxy with a Vertex AI Gemini model (or Anthropic/SageMaker)
  2. Send a streaming request that produces a large response (e.g., long code generation, or a response with large tool-call JSON payloads — several MB total)
  3. Ensure the SSE transport fragments the JSON across multiple chunks (this happens naturally at network boundaries; more likely with larger responses)
  4. Observe: the proxy becomes unresponsive for 10+ seconds during the streaming response; liveness probes fail; under Kubernetes, the pod gets restarted

To confirm via profiling:

py-spy dump --pid <proxy_pid>

You'll see MainThread stuck in json.loads called from handle_accumulated_json_chunk.

Relevant log output

# 7 consecutive py-spy thread dumps (70 seconds), all identical:

Thread 1 (active+gil): "MainThread"
  json.loads (json/__init__.py)
  json.decoder.JSONDecoder.decode (json/decoder.py)
  json.decoder.JSONDecoder.raw_decode (json/decoder.py)
  handle_accumulated_json_chunk (vertex_and_google_ai_studio_gemini.py:3071)
  ...

# All 15 other threads: idle (waiting for work in ThreadPoolExecutor/AnyIO)

Related issues

  • #13505 — Streaming CPU is 4-5x non-streaming (this bug is a significant contributor)
  • #20268 — Sync next() blocking event loop during streaming (same symptom class)
  • #24788 — Sync call blocking event loop → pod restarts (same symptom class)
  • #16562 (closed) — Fix that expanded handle_accumulated_json_chunk usage, making this O(n²) path trigger more frequently

Possible fix directions

  1. Collect chunks in a list, join only once — append to list[str], only "".join() + json.loads() when a heuristic suggests completeness (e.g., balanced braces, or trailing } / ])
  2. Incremental JSON parser — use a library like ijson to parse as bytes arrive
  3. Offload to a thread — run the parse in asyncio.to_thread() so it doesn't block the event loop (addresses the liveness probe issue but not the O(n²) CPU waste)
  4. Cap accumulated buffer size — fail loudly rather than silently accumulating unbounded data

Options 1+3 combined would address both the CPU waste and the event loop starvation.
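
Option 3 could look roughly like the sketch below, which uses the standard-library asyncio.to_thread() (Python 3.9+). The helper name parse_off_loop is hypothetical. Because the issue describes json.loads as a single C call that holds the GIL, how much this relieves the event loop in practice should be measured rather than assumed.

```python
import asyncio
import json


async def parse_off_loop(chunks):
    """Join buffered fragments and parse them in a worker thread."""
    joined = "".join(chunks)
    try:
        return await asyncio.to_thread(json.loads, joined)
    except json.JSONDecodeError:
        return None  # incomplete: the caller keeps buffering


async def main():
    complete = await parse_off_loop(['{"a": [1, 2', ', 3]}'])
    partial = await parse_off_loop(['{"a": [1, 2'])
    return complete, partial


complete, partial = asyncio.run(main())
print(complete, partial)
```

The exception raised inside the worker thread propagates through the awaited call, so the existing JSONDecodeError handling can stay where it is.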


Component: SDK (litellm Python package) + Proxy (the proxy deployment is where liveness probe failures manifest)

LiteLLM version: v1.83.7 (also confirmed on litellm_internal_staging at v1.83.9-nightly)

extent analysis

TL;DR

Collect chunks in a list and join only once when a heuristic suggests completeness to avoid O(n²) total CPU waste and GIL-blocking.

Guidance

  • Identify a suitable heuristic to determine JSON completeness, such as balanced braces or trailing } / ], to decide when to join and parse the accumulated chunks.
  • Consider using a library like ijson for incremental JSON parsing to further optimize the parsing process.
  • Offloading the parsing to a thread using asyncio.to_thread() can help prevent event loop starvation but may not address the underlying CPU waste issue.
  • Implementing a cap on the accumulated buffer size can prevent unbounded growth but may require additional error handling.
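
A buffer cap along the lines of the last point might be sketched as follows. The class name BoundedChunkBuffer, the exception AccumulatedJSONTooLarge, and the 1 MiB default are all hypothetical choices for illustration; none of them come from LiteLLM.

```python
class AccumulatedJSONTooLarge(Exception):
    """Raised when buffered fragments would exceed the configured limit."""


class BoundedChunkBuffer:
    def __init__(self, max_chars=1_048_576):
        self.max_chars = max_chars
        self.chunks = []
        self.size = 0  # running total, so the limit check is O(1) per chunk

    def append(self, chunk):
        new_size = self.size + len(chunk)
        if new_size > self.max_chars:
            raise AccumulatedJSONTooLarge(
                f"buffer would reach {new_size} chars (limit {self.max_chars})"
            )
        self.chunks.append(chunk)
        self.size = new_size

    def take(self):
        """Return the joined buffer and reset it."""
        joined = "".join(self.chunks)
        self.chunks, self.size = [], 0
        return joined


buffer = BoundedChunkBuffer(max_chars=10)
buffer.append("12345")
try:
    buffer.append("678901")  # 11 chars in total: exceeds the limit
    overflowed = False
except AccumulatedJSONTooLarge:
    overflowed = True
print(overflowed, buffer.take())
```

Raising a dedicated exception makes the failure visible to the caller, which then has to decide whether to abort the stream or surface an error to the client.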

Example

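import json


class JSONParser:
    """Buffers streamed fragments and parses only when the JSON looks complete."""

    def __init__(self):
        self.chunks = []    # fragments received since the last successful parse
        self.result = None  # most recently parsed value, if any

    def add_chunk(self, chunk):
        self.chunks.append(chunk)  # O(1); no string copy per chunk
        if self.is_complete():
            self.parse_json()

    def is_complete(self):
        # Heuristic: a complete JSON object/array ends with `}` or `]`.
        # A fragment can also end this way, so parse_json must tolerate failure.
        return self.chunks[-1].rstrip().endswith(("}", "]"))

    def parse_json(self):
        try:
            self.result = json.loads("".join(self.chunks))
            self.chunks = []  # reset the buffer after a successful parse
        except json.JSONDecodeError:
            # Still incomplete: keep the buffer and wait for more chunks
            pass


# Usage
parser = JSONParser()
parser.add_chunk('{"key": "val')
parser.add_chunk('ue"}')  # completes the JSON
print(parser.result)  # {'key': 'value'}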

Notes

The provided example is a simplified illustration and may require modifications to fit the specific use case. The choice of heuristic for determining JSON completeness will depend on the specific requirements and constraints of the application.

Recommendation

Collect chunks in a list, then join and parse only when a heuristic suggests completeness; this removes most of the redundant parsing and the CPU waste that comes with it. A single large json.loads call can still block the event loop while it runs, so the issue proposes combining this approach with offloading the parse to a thread via asyncio.to_thread().
