litellm - 💡(How to fix) Fix [Bug]: PrismaWrapper.__getattr__ deadlocks event loop for 30s on RDS IAM token expiry, causing liveness probe failures [1 participants]

GitHub stats
BerriAI/litellm#26192 · Fetched 2026-04-22 07:45:48
Comments: 0 · Participants: 1 · Timeline: 1 (labeled ×1) · Reactions: 0

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When RDS IAM token auth is enabled (--iam_token_db_auth), PrismaWrapper.__getattr__ contains a synchronous future.result(timeout=30) call that deadlocks the asyncio event loop for exactly 30 seconds every time the IAM token expires and the background refresh task has failed.

The mechanism:

PrismaWrapper.__getattr__ is invoked synchronously on the event loop thread whenever any Prisma model attribute is accessed (e.g. prisma_client.db.litellm_teamtable). When the IAM token is expired, the current code at prisma_client.py:328-333 does:

# On the event loop thread:
future = asyncio.run_coroutine_threadsafe(
    self.recreate_prisma_client(new_db_url), loop  # scheduled on THIS loop
)
future.result(timeout=30)  # blocks THIS loop's thread

run_coroutine_threadsafe schedules recreate_prisma_client onto the same event loop that is now blocked by future.result(). The coroutine can never execute because the loop is frozen waiting for it. This is a deadlock. It always times out after exactly 30 seconds.
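
The deadlock can be reproduced in isolation. The following is a minimal sketch, not the LiteLLM code: reconnect is a hypothetical stand-in for recreate_prisma_client, and the timeout is shortened to 0.2 seconds so the script finishes quickly.

```python
import asyncio
import concurrent.futures
import time
from typing import Tuple


async def reconnect() -> str:
    # Hypothetical stand-in for recreate_prisma_client; does no real work.
    return "reconnected"


def blocking_getattr(loop: asyncio.AbstractEventLoop, timeout: float) -> str:
    # Synchronous code running ON the loop thread, as __getattr__ does.
    future = asyncio.run_coroutine_threadsafe(reconnect(), loop)
    try:
        # Blocks the only thread that could ever run reconnect().
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return "timed out"


async def main() -> Tuple[str, float]:
    loop = asyncio.get_running_loop()
    start = time.monotonic()
    outcome = blocking_getattr(loop, timeout=0.2)
    return outcome, time.monotonic() - start


outcome, elapsed = asyncio.run(main())
print(outcome, round(elapsed, 2))
```

Even though reconnect() would return immediately, the call always runs into the timeout, because the coroutine is queued on the loop that is waiting for it.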

The cascade:

After the 30-second timeout, future.result() raises TimeoutError. The token is still expired (the refresh never ran). The very next coroutine that accesses prisma_client.db.<anything> triggers __getattr__ again → another 30-second deadlock. This repeats:

t=0s     __getattr__ deadlock starts (coroutine A accesses prisma_client.db.*)
t=30s    TimeoutError, event loop unblocks momentarily
t=30.00Xs __getattr__ deadlock starts (coroutine B accesses prisma_client.db.*)
t=60s    TimeoutError, event loop unblocks momentarily
...repeats indefinitely

During each 30-second freeze, no coroutines can run, including the /health/liveliness handler. With a typical K8s liveness probe (5s timeout, 10s period, failure threshold of 5), the pod is restarted after about 50 seconds (5 consecutive failures × 10s period).
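
The starvation effect can be shown with a small sketch. It assumes nothing about the proxy itself: liveliness is a hypothetical stand-in for the health handler, and time.sleep(0.3) plays the role of the blocking future.result(timeout=30).

```python
import asyncio
import time


async def liveliness() -> float:
    # Hypothetical stand-in for the /health/liveliness handler.
    return time.monotonic()


async def main() -> float:
    start = time.monotonic()
    probe = asyncio.ensure_future(liveliness())  # queued on the loop
    time.sleep(0.3)  # synchronous block on the loop thread
    answered_at = await probe  # the handler only runs once the loop is free
    return answered_at - start


probe_delay = asyncio.run(main())
print(f"probe answered after {probe_delay:.2f}s")
```

The handler is trivial, yet it cannot answer until the blocking call returns. With a 30-second block and a 5-second probe timeout, every probe that arrives during the freeze fails.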

Why __getattr__ runs on the event loop thread:

Every Prisma DB query in the proxy accesses model attributes through PrismaWrapper.__getattr__:

  • prisma_client.db.litellm_teamtable.find_unique(...) → __getattr__("litellm_teamtable")
  • prisma_client.db.connect() → __getattr__("connect")
  • prisma_client.db.query_raw(...) → __getattr__("query_raw")

These are called from user_api_key_auth (every request), Prometheus budget metrics (LoggingWorker), DB health watchdog, spend tracking, etc. — all async functions on the event loop.

Root Cause

This was introduced as a fix for #16220 (RDS IAM auth connection failures after 15 minutes). The previous code used fire-and-forget (asyncio.ensure_future), which meant the returned attribute referenced the old (disconnected) Prisma client. The fix changed it to future.result(timeout=30) to wait for the reconnection — but this creates a deadlock when called from the event loop thread (which is always the case in the proxy).

Steps to Reproduce

  1. Run a LiteLLM proxy with --iam_token_db_auth connected to AWS RDS
  2. Either wait 15+ minutes for the IAM token to expire, or simulate by injecting an expired token into DATABASE_URL
  3. Ensure the background token refresh task (_token_refresh_loop) has failed (e.g., AWS STS is temporarily unreachable)
  4. Any DB operation triggers __getattr__ → 30-second event loop freeze
  5. /health/liveliness becomes unresponsive → K8s kills the pod
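
For step 2, it can help to confirm that an injected token really counts as expired. The sketch below assumes the token follows the usual AWS presigned-URL layout with X-Amz-Date and X-Amz-Expires query parameters. token_is_expired is a hypothetical helper, not LiteLLM's is_token_expired, and the hostname and user are made up. It takes the raw token, so a token embedded as the password in DATABASE_URL would need to be URL-decoded first.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def token_is_expired(token: str, now: datetime) -> bool:
    """Return True if the presigned token's lifetime has elapsed at `now`."""
    # The token looks like "host:port/?Action=connect&...", so prefix "//"
    # to let urlsplit treat the first part as the network location.
    params = parse_qs(urlsplit("//" + token).query)
    issued = datetime.strptime(
        params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(params["X-Amz-Expires"][0]))
    return now >= issued + lifetime


# Made-up token with a 900-second (15-minute) lifetime.
token = (
    "db.example.com:5432/?Action=connect&DBUser=app"
    "&X-Amz-Date=20240101T000000Z&X-Amz-Expires=900"
)
just_before = datetime(2024, 1, 1, 0, 14, 59, tzinfo=timezone.utc)
just_after = datetime(2024, 1, 1, 0, 15, 0, tzinfo=timezone.utc)
print(token_is_expired(token, just_before), token_is_expired(token, just_after))
```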

Relevant log output

# This WARNING line appears just before the 30-second freeze:
RDS IAM token expired in __getattr__ - proactive refresh may have failed. Triggering synchronous fallback refresh...

# After 30 seconds, the deadlock times out:
Failed to refresh token synchronously: <TimeoutError or CancelledError>

# These repeat back-to-back until the pod is killed:
# (no other log output during the 30s freezes — the event loop is blocked)

# Kubernetes:
Liveness probe failed: context deadline exceeded

Suggested Fix

Replace the blocking future.result() in __getattr__ with a non-blocking approach:

  1. Call get_rds_iam_token() synchronously to update DATABASE_URL (prevents thundering herd — subsequent __getattr__ calls see a non-expired URL)
  2. Schedule _safe_refresh_token() as a background task via asyncio.ensure_future() (uses _reconnection_lock to serialize concurrent attempts)
  3. Return the stale attribute — the caller's Prisma query will fail with a connection error, which all callers already handle via try/except
  4. The next call after the background refresh completes succeeds

def __getattr__(self, name: str):
    original_attr = getattr(self._original_prisma, name)

    if self.iam_token_db_auth:
        db_url = os.getenv("DATABASE_URL")
        if self.is_token_expired(db_url):
            verbose_proxy_logger.warning(
                "RDS IAM token expired in __getattr__ - scheduling background refresh..."
            )
            # Update DATABASE_URL immediately so concurrent __getattr__
            # calls see a non-expired URL and don't pile on.
            self.get_rds_iam_token()
            try:
                asyncio.get_running_loop()
                # On the event loop — MUST NOT block.
                asyncio.ensure_future(self._safe_refresh_token())
            except RuntimeError:
                # No running event loop — safe to block.
                new_db_url = os.getenv("DATABASE_URL", "")
                if new_db_url:
                    asyncio.run(self.recreate_prisma_client(new_db_url))
                    original_attr = getattr(self._original_prisma, name)

    return original_attr

The trade-off: one failed DB call while the background refresh runs. This is strictly better than a 30-second event loop deadlock that cascades into pod restarts.
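
The serialization through _reconnection_lock can be sketched as follows. This is an illustration under assumptions, not the actual implementation: Refresher is a hypothetical class, the reconnect is replaced by a short sleep, and only the names _safe_refresh_token and _reconnection_lock are taken from the issue.

```python
import asyncio


class Refresher:
    """Hypothetical stand-in for the refresh-related parts of PrismaWrapper."""

    def __init__(self) -> None:
        self._reconnection_lock = asyncio.Lock()
        self._tasks = set()  # keep references so tasks are not garbage-collected
        self.expired = True
        self.reconnects = 0

    async def _safe_refresh_token(self) -> None:
        async with self._reconnection_lock:
            if not self.expired:
                return  # another task already reconnected
            await asyncio.sleep(0.01)  # placeholder for the real reconnect
            self.reconnects += 1
            self.expired = False

    def get_attribute(self) -> str:
        # Synchronous, like __getattr__: schedule the refresh and return at once.
        if self.expired:
            task = asyncio.ensure_future(self._safe_refresh_token())
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)
        return "stale attribute"


async def main():
    refresher = Refresher()  # created inside the running loop
    returned = [refresher.get_attribute() for _ in range(3)]
    await asyncio.gather(*list(refresher._tasks))
    return returned, refresher.reconnects


returned, reconnects = asyncio.run(main())
print(returned, reconnects)
```

Three accesses schedule three background tasks, but the lock plus the re-check inside it means only one of them reconnects. Holding a reference to each task follows the asyncio documentation's advice for fire-and-forget tasks.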

Related Issues

  • #16220 — the original RDS IAM auth bug that introduced the future.result() code
  • #26191 — Prisma.disconnect() blocks event loop via synchronous process.wait() (same symptom class, different root cause)
  • #24788 — sync convert_url_to_base64() blocks event loop
  • #26181 — O(n²) json.loads retry blocks event loop

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.83.7

Twitter / LinkedIn details

No response

Extent analysis

TL;DR

Replace the blocking future.result() in __getattr__ with a non-blocking approach to prevent the 30-second event loop deadlock.

Guidance

  • Locate PrismaWrapper.__getattr__ and replace the synchronous future.result(timeout=30) call with a non-blocking one: schedule _safe_refresh_token() as a background task via asyncio.ensure_future().
  • Call get_rds_iam_token() first so that DATABASE_URL is updated immediately and concurrent __getattr__ calls do not pile on.
  • Return the stale attribute and let the caller's Prisma query fail with a connection error, which callers handle via try/except.
  • Verify that the event loop is no longer blocked: after the "RDS IAM token expired in __getattr__" warning appears in the logs, subsequent DB operations should not trigger a 30-second freeze.
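
One way to check the last point is to measure event loop lag directly. The sketch below is a generic technique, not something from the issue: max_loop_lag is a hypothetical helper that records how much asyncio.sleep overshoots, and time.sleep(0.2) simulates a blocking call on the loop thread.

```python
import asyncio
import time


async def max_loop_lag(interval: float, rounds: int) -> float:
    """Return the worst overshoot of asyncio.sleep(interval) over several rounds."""
    worst = 0.0
    for _ in range(rounds):
        started = time.monotonic()
        await asyncio.sleep(interval)
        worst = max(worst, time.monotonic() - started - interval)
    return worst


async def main() -> float:
    monitor = asyncio.ensure_future(max_loop_lag(interval=0.05, rounds=3))
    await asyncio.sleep(0.01)  # let the monitor start its first sleep
    time.sleep(0.2)  # simulated blocking call on the loop thread
    return await monitor


worst_lag = asyncio.run(main())
print(f"worst loop lag: {worst_lag:.2f}s")
```

A healthy loop shows a lag of a few milliseconds, while a blocked loop shows a lag close to the length of the block. As an alternative, asyncio's debug mode (asyncio.run(..., debug=True)) logs callbacks that run longer than loop.slow_callback_duration, which defaults to 0.1 seconds.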

Example

The suggested fix provides an example of how to modify the __getattr__ method to use a non-blocking approach:

def __getattr__(self, name: str):
    # ...
    if self.iam_token_db_auth:
        # ...
        asyncio.ensure_future(self._safe_refresh_token())
    # ...

Notes

This fix introduces a trade-off where one failed DB call may occur while the background refresh runs, but this is considered strictly better than the 30-second event loop deadlock that cascades into pod restarts.

Recommendation

Apply the suggested fix to replace the blocking future.result() call with a non-blocking approach, as this will prevent the 30-second event loop deadlock and improve the overall stability of the system.
