litellm - 💡(How to fix) Fix [Bug]: Prisma disconnect() blocks asyncio event loop via synchronous process.wait(), causing liveness probe failures [1 participants]

litellm • 2026-04-21 20:26:54


BerriAI/litellm#26191•Fetched 2026-04-22 07:45:50


Author

6matt

Participants

6matt

Timeline (top)

cross-referenced ×1, labeled ×1

Error Message

Logs preceding crash show DB connectivity loss:

Prisma DB reconnect failed (2 consecutive). reason=engine_process_death error=Could not connect to the query engine
httpx.ConnectError: All connection attempts failed
Giving up get_data(...) after 3 tries (httpx.ConnectError: All connection attempts failed)

Liveness probe fails with timeout (server completely unresponsive, not returning errors):

Liveness probe failed: context deadline exceeded

Pod killed by kubelet:

Last State: Terminated, Reason: Error, Exit Code: 137

Root Cause

The asyncio.wait_for() wrapper around the disconnect call at utils.py:4134 cannot help because process.wait() is synchronous — it never yields back to the event loop, so the timeout never fires.
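The effect can be reproduced in isolation. The sketch below is not LiteLLM or prisma-client-py code: blocking_disconnect is a hypothetical stand-in that uses time.sleep in place of the synchronous process.wait(). It measures how long a wait_for-wrapped call takes when the wrapped coroutine never yields, and the result is the full blocking duration rather than the much shorter timeout.

```python
import asyncio
import time


async def blocking_disconnect(seconds: float) -> str:
    # Hypothetical stand-in for a disconnect() that ends in a synchronous
    # process.wait(): time.sleep never yields to the event loop.
    time.sleep(seconds)
    return "disconnected"


async def timed_disconnect(block_for: float, timeout: float) -> float:
    """Return how long the wait_for-wrapped call actually took."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(blocking_disconnect(block_for), timeout=timeout)
    except asyncio.TimeoutError:
        # Whether or not a timeout is reported, the elapsed time is the point.
        pass
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = asyncio.run(timed_disconnect(block_for=0.3, timeout=0.05))
    print(f"timeout=0.05s, actual wait={elapsed:.2f}s")
```

The timeout is only checked when the event loop gets control back, which a synchronous call never allows until it has already finished.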



Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When the database becomes unreachable, the proxy's DB reconnect logic calls await self.db.disconnect() which ultimately invokes prisma-client-py's synchronous process.wait() (a blocking waitpid syscall) on the query engine subprocess. This freezes the entire asyncio event loop for 30-120+ seconds while the Rust query engine waits for TCP close operations to time out.

During this time, no coroutines can run, including the /health/liveliness endpoint. In Kubernetes, this causes liveness probe failures and pod restarts (exit code 137 / SIGKILL).

Two code paths trigger this:

_do_direct_reconnect (litellm/proxy/utils.py:4120-4134) — called when the engine process is alive but the DB is unreachable. Calls await self.db.disconnect() which hits the blocking process.wait().
recreate_prisma_client (litellm/proxy/db/prisma_client.py:217-236) — called from the heavy-reconnect path. Also calls await self._original_prisma.disconnect() as a first step before creating a new client.

Expected behavior: Reconnection should not block the event loop. Since the old Prisma client is being discarded anyway, the reconnect paths should kill the engine subprocess directly (SIGTERM → short grace → SIGKILL) and create a fresh client, avoiding the blocking close() path entirely.
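The SIGTERM → short grace → SIGKILL sequence described above can be done without blocking the event loop. The following is a minimal sketch, assuming the engine is available as a subprocess.Popen object; it is not the implementation of LiteLLM's _kill_engine_process, and the function name terminate_without_blocking is hypothetical. It relies on Popen.poll(), which returns immediately, and sleeps with asyncio.sleep between checks so other coroutines keep running.

```python
import asyncio
import subprocess
import sys
from typing import Optional


async def terminate_without_blocking(
    proc: subprocess.Popen,
    grace: float = 0.5,
    poll_interval: float = 0.05,
) -> Optional[int]:
    """Stop a child process without a blocking wait() on the event loop."""
    if proc.poll() is not None:
        return proc.returncode  # already exited

    proc.terminate()  # SIGTERM on POSIX
    loop = asyncio.get_running_loop()
    deadline = loop.time() + grace
    # poll() is non-blocking; asyncio.sleep yields between checks.
    while proc.poll() is None and loop.time() < deadline:
        await asyncio.sleep(poll_interval)

    if proc.poll() is None:
        proc.kill()  # SIGKILL on POSIX
        while proc.poll() is None:
            await asyncio.sleep(poll_interval)
    return proc.returncode


if __name__ == "__main__":
    # A long-running child stands in for the query engine.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print("exit status:", asyncio.run(terminate_without_blocking(child)))
```

Polling trades a few milliseconds of latency for never holding the loop; the grace period of 0.5 seconds mirrors the value quoted for the existing helper.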

Steps to Reproduce

1. Run a LiteLLM proxy connected to a PostgreSQL database.
2. Make the database unreachable (e.g., network partition, firewall rule, or kill the DB process).
3. Wait for DB operations to start failing; the reconnect logic then triggers _do_direct_reconnect.
4. Observe that the /health/liveliness endpoint becomes unresponsive for 30-120+ seconds.
5. In Kubernetes, the liveness probe fails and the kubelet kills the pod (exit code 137).

The issue is in prisma-client-py's generated QueryEngine.close() → self.process.wait(timeout=None), which is a synchronous blocking call. Even disconnect(timeout=timedelta(seconds=10)) would block the event loop for up to 10 seconds synchronously — long enough to fail a typical liveness probe.
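When reproducing, one way to confirm that the event loop itself is stalled (rather than only the health handler) is to measure loop lag. The sketch below is illustrative and not part of LiteLLM: measure_max_lag and blocking_disconnect are hypothetical names, and time.sleep stands in for the blocking process.wait(). A monitor task sleeps for a short interval and records how late it wakes up.

```python
import asyncio
import time


async def measure_max_lag(duration: float, interval: float = 0.01) -> float:
    """Return the worst observed delay beyond `interval` over `duration` seconds."""
    loop = asyncio.get_running_loop()
    max_lag = 0.0
    end = loop.time() + duration
    while loop.time() < end:
        start = loop.time()
        await asyncio.sleep(interval)
        # Anything beyond `interval` is time the loop could not schedule us.
        max_lag = max(max_lag, loop.time() - start - interval)
    return max_lag


async def blocking_disconnect(seconds: float) -> None:
    time.sleep(seconds)  # hypothetical stand-in for the synchronous wait


async def demo() -> float:
    monitor = asyncio.create_task(measure_max_lag(duration=0.5))
    await asyncio.sleep(0.05)       # let the monitor start ticking
    await blocking_disconnect(0.3)  # freezes every task, including the monitor
    return await monitor


if __name__ == "__main__":
    print(f"max event loop lag: {asyncio.run(demo()):.2f}s")
```

A lag close to the blocking duration indicates that nothing else, including a liveness handler, could have run during that window.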

Relevant log output

# Logs preceding crash show DB connectivity loss:
Prisma DB reconnect failed (2 consecutive). reason=engine_process_death error=Could not connect to the query engine
httpx.ConnectError: All connection attempts failed
Giving up get_data(...) after 3 tries (httpx.ConnectError: All connection attempts failed)

# Liveness probe fails with timeout (server completely unresponsive, not returning errors):
Liveness probe failed: context deadline exceeded

# Pod killed by kubelet:
Last State: Terminated, Reason: Error, Exit Code: 137

Suggested Fix

In both _do_direct_reconnect and recreate_prisma_client, replace await self.db.disconnect() with the existing _kill_engine_process helper (SIGTERM → 0.5s grace → SIGKILL) followed by creating a fresh Prisma client. This avoids the blocking close() path entirely.

For the PrismaClient.disconnect() wrapper used during graceful shutdown, offload the blocking call to a thread executor:

loop = asyncio.get_running_loop()
await asyncio.wait_for(
    loop.run_in_executor(None, self.db._original_prisma._engine.close),
    timeout=10,
)
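A self-contained version of the same pattern is sketched below to show its effect on responsiveness. It does not touch Prisma: blocking_close is a hypothetical stand-in for the engine's synchronous close(), and the heartbeat task plays the role of the liveness endpoint. While the blocking call runs in the default thread pool, the heartbeat keeps ticking.

```python
import asyncio
import time
from typing import Tuple


def blocking_close(seconds: float) -> str:
    # Hypothetical stand-in for a synchronous engine close().
    time.sleep(seconds)
    return "closed"


async def close_off_loop(seconds: float, timeout: float) -> str:
    loop = asyncio.get_running_loop()
    # The blocking call runs in a worker thread, so wait_for can time out.
    return await asyncio.wait_for(
        loop.run_in_executor(None, blocking_close, seconds),
        timeout=timeout,
    )


async def demo() -> Tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    result = await close_off_loop(0.2, timeout=1.0)
    hb.cancel()
    return result, ticks


if __name__ == "__main__":
    result, ticks = asyncio.run(demo())
    print(result, "heartbeat ticks during close:", ticks)
```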

Related issues

This is the same symptom class (synchronous calls blocking the asyncio event loop, causing liveness failures) as:

#24788 — sync convert_url_to_base64() blocks event loop
#26181 — O(n²) json.loads retry blocks event loop
#20268 — sync next() on boto3 iterators blocks event loop

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.7 (prisma-client-py 0.11.0)

Twitter / LinkedIn details

No response

extent analysis

TL;DR

Replace await self.db.disconnect() with _kill_engine_process followed by creating a fresh Prisma client to avoid the blocking close() path.

Guidance

Identify the two code paths that trigger the issue: _do_direct_reconnect and recreate_prisma_client, and replace the await self.db.disconnect() call with the suggested fix.
Use the existing _kill_engine_process helper to kill the engine subprocess directly, followed by creating a fresh Prisma client.
For the PrismaClient.disconnect() wrapper, offload the blocking call to a thread executor using loop.run_in_executor to avoid blocking the event loop.
Verify the fix by testing the reconnect logic and checking the liveness probe response.

Example

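# Sketch only, not runnable as-is. _kill_engine_process is the helper named in
# the issue; its signature (and whether it must be awaited) is not shown there.
# create_fresh_prisma_client is a placeholder name, not a real LiteLLM function.

# In the reconnect paths, replace await self.db.disconnect() with:
await _kill_engine_process()
self.db = create_fresh_prisma_client()

# In the graceful-shutdown wrapper, offload the blocking call to a thread executor:
import asyncio

loop = asyncio.get_running_loop()
await asyncio.wait_for(
    loop.run_in_executor(None, self.db._original_prisma._engine.close),
    timeout=10,
)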

Notes

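This fix assumes that the _kill_engine_process helper is correctly implemented and that creating a fresh Prisma client does not introduce any new issues. Additionally, loop.run_in_executor only moves the blocking call off the event loop: if asyncio.wait_for times out, the worker thread keeps running until close() returns, so a stuck close can occupy a thread from the default executor for the full 30-120+ seconds. This should be monitored, particularly during shutdown.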
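The executor caveat can be illustrated with a small sketch. It is not LiteLLM code: slow_close is a hypothetical stand-in for a close() that takes longer than the timeout. The awaiting coroutine is released when the timeout expires, but the worker thread is still running at that moment.

```python
import asyncio
import threading
import time
from typing import Tuple

finished = threading.Event()


def slow_close() -> None:
    # Hypothetical stand-in for a close() that outlasts the timeout.
    time.sleep(0.3)
    finished.set()


async def demo() -> Tuple[bool, bool]:
    loop = asyncio.get_running_loop()
    timed_out = False
    try:
        await asyncio.wait_for(loop.run_in_executor(None, slow_close), timeout=0.05)
    except asyncio.TimeoutError:
        timed_out = True
    # A running thread cannot be cancelled; it is still inside slow_close().
    still_running = not finished.is_set()
    return timed_out, still_running


if __name__ == "__main__":
    timed_out, still_running = asyncio.run(demo())
    print("timed out:", timed_out, "| thread still running:", still_running)
```

This is one reason the issue prefers killing the engine process directly in the reconnect paths and reserves the executor approach for graceful shutdown.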

Recommendation

Apply the suggested fix to replace await self.db.disconnect() with _kill_engine_process followed by creating a fresh Prisma client, as it directly addresses the root cause of the issue and avoids the blocking close() path.
