litellm - [Bug]: Prisma disconnect() blocks asyncio event loop via synchronous process.wait(), causing liveness probe failures

BerriAI/litellm#26191 (fetched 2026-04-22 07:45:50)

Error Message

Logs preceding crash show DB connectivity loss:

Prisma DB reconnect failed (2 consecutive). reason=engine_process_death error=Could not connect to the query engine
httpx.ConnectError: All connection attempts failed
Giving up get_data(...) after 3 tries (httpx.ConnectError: All connection attempts failed)

Liveness probe fails with timeout (server completely unresponsive, not returning errors):

Liveness probe failed: context deadline exceeded

Pod killed by kubelet:

Last State: Terminated, Reason: Error, Exit Code: 137

Root Cause

The asyncio.wait_for() wrapper around the disconnect call at utils.py:4134 cannot help because process.wait() is synchronous — it never yields back to the event loop, so the timeout never fires.
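The effect can be reproduced in isolation. The sketch below is not LiteLLM or prisma-client-py code: blocking_disconnect is a hypothetical stand-in that uses time.sleep() in place of the synchronous process.wait(). It shows that a 0.05 s timeout does not interrupt a 0.3 s blocking call made from inside a coroutine.

```python
import asyncio
import time
from typing import Tuple


async def blocking_disconnect(block_for: float) -> None:
    # Hypothetical stand-in for a disconnect() that ends in a synchronous
    # wait: time.sleep() never yields control back to the event loop.
    time.sleep(block_for)


async def measure(timeout: float, block_for: float) -> Tuple[bool, float]:
    start = time.monotonic()
    timed_out = False
    try:
        await asyncio.wait_for(blocking_disconnect(block_for), timeout=timeout)
    except asyncio.TimeoutError:
        timed_out = True
    return timed_out, time.monotonic() - start


timed_out, elapsed = asyncio.run(measure(timeout=0.05, block_for=0.3))
print(f"timed_out={timed_out} elapsed={elapsed:.2f}s")
```

The call runs for the full blocking duration and no TimeoutError is raised, because the timeout can only take effect when the wrapped code reaches an await that suspends.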


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When the database becomes unreachable, the proxy's DB reconnect logic calls await self.db.disconnect() which ultimately invokes prisma-client-py's synchronous process.wait() (a blocking waitpid syscall) on the query engine subprocess. This freezes the entire asyncio event loop for 30-120+ seconds while the Rust query engine waits for TCP close operations to time out.

During this time, no coroutines can run, including the /health/liveliness endpoint. In Kubernetes, this causes liveness probe failures and pod restarts (exit code 137 / SIGKILL).

The asyncio.wait_for() wrapper around the disconnect call at utils.py:4134 cannot help because process.wait() is synchronous — it never yields back to the event loop, so the timeout never fires.

Two code paths trigger this:

  1. _do_direct_reconnect (litellm/proxy/utils.py:4120-4134) — called when the engine process is alive but the DB is unreachable. Calls await self.db.disconnect() which hits the blocking process.wait().

  2. recreate_prisma_client (litellm/proxy/db/prisma_client.py:217-236) — called from the heavy-reconnect path. Also calls await self._original_prisma.disconnect() as a first step before creating a new client.

Expected behavior: Reconnection should not block the event loop. Since the old Prisma client is being discarded anyway, the reconnect paths should kill the engine subprocess directly (SIGTERM → short grace → SIGKILL) and create a fresh client, avoiding the blocking close() path entirely.
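The SIGTERM → short grace → SIGKILL sequence can be written without any blocking wait by polling the child with Popen.poll(), which returns immediately, and sleeping with asyncio.sleep() between polls. The following is a minimal sketch of that idea, not LiteLLM's _kill_engine_process helper (its implementation is not shown in the issue); terminate_then_kill, grace and step are hypothetical names, and the child here is a sleeping Python process standing in for the query engine.

```python
import asyncio
import subprocess
import sys
import time
from typing import Optional


async def terminate_then_kill(
    proc: subprocess.Popen, grace: float = 0.5, step: float = 0.05
) -> Optional[int]:
    """Send SIGTERM, wait up to `grace` seconds cooperatively, then SIGKILL."""

    async def wait_exit(limit: float) -> bool:
        loop = asyncio.get_running_loop()
        deadline = loop.time() + limit
        # poll() is non-blocking; asyncio.sleep() yields to other coroutines.
        while proc.poll() is None:
            if loop.time() >= deadline:
                return False
            await asyncio.sleep(step)
        return True

    if proc.poll() is None:
        proc.terminate()
        if not await wait_exit(grace):
            proc.kill()
            await wait_exit(grace)  # give the OS a moment so the child is reaped
    return proc.returncode


async def main():
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    start = time.monotonic()
    code = await terminate_then_kill(child)
    return code, time.monotonic() - start


code, elapsed = asyncio.run(main())
print(f"exit code={code} after {elapsed:.2f}s")
```

Because every wait in this sketch is an await, other coroutines such as a health endpoint handler keep running while the child shuts down.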

Steps to Reproduce

  1. Run a LiteLLM proxy connected to a PostgreSQL database
  2. Make the database unreachable (e.g., network partition, firewall rule, or kill the DB process)
  3. Wait for DB operations to start failing — the reconnect logic triggers _do_direct_reconnect
  4. Observe that the /health/liveliness endpoint becomes unresponsive for 30-120+ seconds
  5. In Kubernetes, the liveness probe fails and the kubelet kills the pod (exit code 137)

The issue is in prisma-client-py's generated QueryEngine.close() → self.process.wait(timeout=None), which is a synchronous blocking call. Even disconnect(timeout=timedelta(seconds=10)) would block the event loop synchronously for up to 10 seconds, which is long enough to fail a typical liveness probe.
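To see why a timeout on the wait itself does not help, the sketch below calls subprocess.Popen.wait(timeout=...) from a coroutine while a heartbeat task records when the event loop lets it run. This is an illustration with a sleeping Python child standing in for the query engine, not the prisma-client-py code; heartbeat and max_gap_during_sync_wait are hypothetical names.

```python
import asyncio
import subprocess
import sys
import time
from typing import List


async def heartbeat(stamps: List[float], interval: float = 0.01) -> None:
    # Records a timestamp each time the event loop lets this task run.
    while True:
        stamps.append(time.monotonic())
        await asyncio.sleep(interval)


async def max_gap_during_sync_wait(wait_timeout: float) -> float:
    stamps: List[float] = []
    beat = asyncio.create_task(heartbeat(stamps))
    await asyncio.sleep(0.05)  # let the heartbeat start ticking

    # Child that outlives the wait, standing in for a slow-to-exit engine.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        child.wait(timeout=wait_timeout)  # synchronous: the loop is frozen here
    except subprocess.TimeoutExpired:
        pass
    finally:
        child.kill()
        child.wait()  # reap the child; returns promptly after kill()

    await asyncio.sleep(0.05)  # let the heartbeat resume
    beat.cancel()
    # Largest interval between two consecutive heartbeat ticks.
    return max(b - a for a, b in zip(stamps, stamps[1:]))


max_gap = asyncio.run(max_gap_during_sync_wait(wait_timeout=0.3))
print(f"largest heartbeat gap: {max_gap:.2f}s")
```

The heartbeat is scheduled every 10 ms, yet its largest gap is at least as long as the wait timeout: the timeout bounds how long the call blocks, but the loop is still frozen for that whole period.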

Relevant log output

# Logs preceding crash show DB connectivity loss:
Prisma DB reconnect failed (2 consecutive). reason=engine_process_death error=Could not connect to the query engine
httpx.ConnectError: All connection attempts failed
Giving up get_data(...) after 3 tries (httpx.ConnectError: All connection attempts failed)

# Liveness probe fails with timeout (server completely unresponsive, not returning errors):
Liveness probe failed: context deadline exceeded

# Pod killed by kubelet:
Last State: Terminated, Reason: Error, Exit Code: 137

Suggested Fix

In both _do_direct_reconnect and recreate_prisma_client, replace await self.db.disconnect() with the existing _kill_engine_process helper (SIGTERM → 0.5s grace → SIGKILL) followed by creating a fresh Prisma client. This avoids the blocking close() path entirely.

For the PrismaClient.disconnect() wrapper used during graceful shutdown, offload the blocking call to a thread executor:

loop = asyncio.get_running_loop()
await asyncio.wait_for(
    loop.run_in_executor(None, self.db._original_prisma._engine.close),
    timeout=10,
)
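The behavior of this pattern can be checked with a self-contained sketch. Here blocking_close is a hypothetical stand-in (time.sleep()) for the synchronous engine close, and a heartbeat task counts how often the loop gets to run while the close is in progress.

```python
import asyncio
import time
from typing import List, Tuple


def blocking_close(block_for: float) -> None:
    # Hypothetical stand-in for a synchronous engine close().
    time.sleep(block_for)


async def heartbeat(counter: List[int], interval: float = 0.02) -> None:
    # Counts how many times the event loop schedules this task.
    while True:
        counter[0] += 1
        await asyncio.sleep(interval)


async def close_off_loop(timeout: float, block_for: float) -> Tuple[bool, int]:
    counter = [0]
    beat = asyncio.create_task(heartbeat(counter))
    loop = asyncio.get_running_loop()
    timed_out = False
    try:
        # The blocking call runs in the default thread pool, so the loop
        # stays free and the timeout can actually fire.
        await asyncio.wait_for(
            loop.run_in_executor(None, blocking_close, block_for),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        timed_out = True
    beat.cancel()
    return timed_out, counter[0]


timed_out, ticks = asyncio.run(close_off_loop(timeout=0.2, block_for=0.5))
print(f"timed_out={timed_out} heartbeat_ticks={ticks}")
```

In this setup the timeout fires and the heartbeat keeps ticking. One caveat: the timeout only stops the coroutine from waiting. The worker thread cannot be interrupted and keeps running until the blocking call returns.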

Related issues

This is the same symptom class (synchronous calls blocking the asyncio event loop, causing liveness failures) as:

  • #24788 — sync convert_url_to_base64() blocks event loop
  • #26181 — O(n²) json.loads retry blocks event loop
  • #20268 — sync next() on boto3 iterators blocks event loop

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.7 (prisma-client-py 0.11.0)


extent analysis

TL;DR

Replace await self.db.disconnect() with _kill_engine_process followed by creating a fresh Prisma client to avoid the blocking close() path.

Guidance

  • Identify the two code paths that trigger the issue: _do_direct_reconnect and recreate_prisma_client, and replace the await self.db.disconnect() call with the suggested fix.
  • Use the existing _kill_engine_process helper to kill the engine subprocess directly, followed by creating a fresh Prisma client.
  • For the PrismaClient.disconnect() wrapper, offload the blocking call to a thread executor using loop.run_in_executor to avoid blocking the event loop.
  • Verify the fix by testing the reconnect logic and checking the liveness probe response.

Example

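# Pseudocode: replace await self.db.disconnect() with the kill-and-recreate path.
# _kill_engine_process is the helper named in the issue (its signature is not shown there);
# create_fresh_prisma_client is a placeholder name, not a LiteLLM function.
await _kill_engine_process()
self.db = create_fresh_prisma_client()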

# Offload blocking call to thread executor:
loop = asyncio.get_running_loop()
await asyncio.wait_for(
    loop.run_in_executor(None, self.db._original_prisma._engine.close),
    timeout=10,
)

Notes

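This fix assumes that the _kill_engine_process helper is correctly implemented and that creating a fresh Prisma client does not introduce any new issues. Additionally, loop.run_in_executor moves the blocking wait to a worker thread rather than removing it: if the timeout fires, asyncio.wait_for stops waiting, but the thread stays occupied until close() returns. Executor thread usage during repeated shutdown or reconnect attempts should therefore be monitored.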

Recommendation

Apply the suggested fix to replace await self.db.disconnect() with _kill_engine_process followed by creating a fresh Prisma client, as it directly addresses the root cause of the issue and avoids the blocking close() path.
