litellm - ✅(Solved) Fix [Bug]: shared_aiohttp_session has no auto-recovery — once closed, stays closed for pod lifetime [1 pull requests, 1 participants]

GitHub stats
BerriAI/litellm#23806 · Fetched 2026-04-08 00:49:02
Comments: 0 · Participants: 1 · Timeline: 5 · Reactions: 0
Timeline (top): labeled ×2, closed ×1, cross-referenced ×1, referenced ×1

Error Message

Once the shared aiohttp session closes (due to network interruption, idle timeout, or any transient error), it stays closed for the entire pod lifetime.

Fix Action

Fixed

PR fix notes

PR #23808: fix: auto-recover shared aiohttp session when closed

Description (problem / solution / changelog)

Summary

Fixes #23806 — shared aiohttp session has no auto-recovery: once closed, stays closed for pod lifetime.

Problem

The shared aiohttp.ClientSession created at proxy startup has no recovery logic. When the session closes (due to network interruption, idle timeout, or Redis failover side effects), it stays closed permanently. All subsequent requests fall back to creating a new HTTPS connection per request, silently losing connection pooling for the entire pod lifetime.

Root Cause

add_shared_session_to_data() checks shared_aiohttp_session.closed but only logs and continues — it never recreates the session:

if shared_aiohttp_session is not None and not shared_aiohttp_session.closed:
    data["shared_session"] = shared_aiohttp_session  # reuse
else:
    # falls back permanently — session is never recreated
    verbose_proxy_logger.info("No shared session available")

Fix

  • Make add_shared_session_to_data() async
  • When the session is found closed, call _initialize_shared_aiohttp_session() to recreate it
  • Update the global shared_aiohttp_session so subsequent requests also benefit
  • Log a warning when recreating (vs. the silent info log before)
  • Handle recreation failure gracefully (continue without session reuse)
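
The bullets above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from the PR: `StubSession` stands in for `aiohttp.ClientSession` (only `closed` and `close()` are modelled), and `SharedSessionHolder` is a hypothetical container playing the role of the module-level global.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger("session_recovery_sketch")


class StubSession:
    """Hypothetical stand-in for aiohttp.ClientSession (only `closed` and `close()`)."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class SharedSessionHolder:
    """Hypothetical holder playing the role of the module-level global."""

    def __init__(self, factory: Callable[[], Awaitable[StubSession]]) -> None:
        self._factory = factory
        self.session: Optional[StubSession] = None

    async def attach(self, data: dict) -> None:
        session = self.session
        if session is None:
            return  # no shared session was ever configured: no-op
        if session.closed:
            logger.warning("Shared session is closed, recreating it")
            try:
                session = await self._factory()
            except Exception:
                # Graceful fallback: the request proceeds without session reuse.
                logger.exception("Could not recreate the shared session")
                return
            self.session = session  # later requests reuse the new session
        data["shared_session"] = session


async def make_stub_session() -> StubSession:
    return StubSession()


async def demo() -> dict:
    holder = SharedSessionHolder(make_stub_session)
    holder.session = await make_stub_session()
    await holder.session.close()  # simulate the session dying
    data: dict = {}
    await holder.attach(data)
    return data


if __name__ == "__main__":
    recovered = asyncio.run(demo())
    print("reattached:", not recovered["shared_session"].closed)
```

One design point the sketch leaves open: if many requests find the session closed at the same moment, each of them may call the factory. Whether the PR guards against that is not stated in the description.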

Tests

Added 4 test cases in tests/test_litellm/proxy/test_aiohttp_session_recovery.py:

  • Open session → attached as-is
  • Closed session → recreated and attached
  • Closed session + recreation failure → graceful fallback
  • No session (None) → no-op
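
For readers who want to see the shape of those four cases, here is a hedged sketch using only the standard library. It does not reproduce the PR's test file: `attach_with_recovery` is a hypothetical function under test, `state.session` stands in for the global, and `SimpleNamespace` objects stand in for real sessions.

```python
import unittest
from types import SimpleNamespace
from unittest.mock import AsyncMock


async def attach_with_recovery(state, data, recreate):
    """Hypothetical function under test; `state.session` stands in for the global."""
    if state.session is None:
        return
    if state.session.closed:
        try:
            state.session = await recreate()
        except Exception:
            return  # graceful fallback: no session attached
    data["shared_session"] = state.session


class SessionRecoveryTests(unittest.IsolatedAsyncioTestCase):
    async def test_open_session_attached_as_is(self):
        session = SimpleNamespace(closed=False)
        state, data, recreate = SimpleNamespace(session=session), {}, AsyncMock()
        await attach_with_recovery(state, data, recreate)
        self.assertIs(data["shared_session"], session)
        recreate.assert_not_awaited()

    async def test_closed_session_recreated_and_attached(self):
        fresh = SimpleNamespace(closed=False)
        state = SimpleNamespace(session=SimpleNamespace(closed=True))
        data, recreate = {}, AsyncMock(return_value=fresh)
        await attach_with_recovery(state, data, recreate)
        self.assertIs(data["shared_session"], fresh)
        self.assertIs(state.session, fresh)

    async def test_recreation_failure_falls_back(self):
        state = SimpleNamespace(session=SimpleNamespace(closed=True))
        data, recreate = {}, AsyncMock(side_effect=RuntimeError("boom"))
        await attach_with_recovery(state, data, recreate)
        self.assertNotIn("shared_session", data)

    async def test_no_session_is_noop(self):
        state, data, recreate = SimpleNamespace(session=None), {}, AsyncMock()
        await attach_with_recovery(state, data, recreate)
        self.assertEqual(data, {})
        recreate.assert_not_awaited()
```

Saved to a file, the cases can be run with `python -m unittest <file>`.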

Changed files

  • litellm/proxy/route_llm_request.py (modified, +78/-8)
  • tests/test_litellm/proxy/test_aiohttp_session_recovery.py (added, +182/-0)


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

The shared aiohttp.ClientSession created at proxy startup (_initialize_shared_aiohttp_session) has no auto-recovery logic. Once the session closes (due to network interruption, idle timeout, or any transient error), it stays closed for the entire pod lifetime. All subsequent requests fall back to creating a new HTTPS connection per request, silently losing the benefit of connection pooling.

Root cause

In route_llm_request.py, the session check is a one-way gate:

if shared_aiohttp_session is not None and not shared_aiohttp_session.closed:
    data["shared_session"] = shared_aiohttp_session  # reuse pooled connection
else:
    logger.info("SESSION REUSE: No shared session available for this request")
    # ↑ falls back permanently — session is never recreated

Once shared_aiohttp_session.closed == True, no code path recreates it.
The proxy_server creates the session once at startup and never again:

# proxy_server.py — created once, never recreated
shared_aiohttp_session = await _initialize_shared_aiohttp_session()
# limit=300 total connections, limit_per_host=50, keepalive_timeout=120s

Evidence (pod logs)

Full session lifecycle visible on a 3-hour-old pod (Vertex AI backend):

✓ SESSION REUSE: Created shared aiohttp session (ID: 139414346342688, limit=300, limit_per_host=50)
✓ SESSION REUSE: Attached shared aiohttp session to request (ID: 139414346342688)
✓ SESSION REUSE: Attached shared aiohttp session to request (ID: 139414346342688)
✗ SESSION REUSE: No shared session available for this request   ← session died, never recovered
✗ SESSION REUSE: No shared session available for this request

Expected behavior

When the session is found closed, the proxy should recreate it:

if shared_aiohttp_session is None or shared_aiohttp_session.closed:
    logger.warning("SESSION REUSE: Session closed, recreating...")
    shared_aiohttp_session = await _initialize_shared_aiohttp_session()
data["shared_session"] = shared_aiohttp_session

Environment

- LiteLLM version: 1.81.12
- Backend: Vertex AI
- Deployment: Kubernetes (pod-based), multiple replicas

Are you an ML Ops team?

Yes



## Steps to Reproduce

### Actual scenario (production trigger)

These are the exact conditions that caused the session to close in staging
(GKE, LiteLLM v1.81.12, Vertex AI backend):

1. Run LiteLLM proxy with `shared_aiohttp_session` enabled (default).
   Confirm session is created at startup:
       SESSION REUSE: Created shared aiohttp session (ID: ..., limit=300, limit_per_host=50)
2. Allow a Redis Sentinel failover to occur (or simulate Redis unavailability).
   This causes `async_increment()` / `update_spend` to block on every write.
3. Observe background jobs (`update_spend` every 12s, `ProxyConfig.add_deployment`
   every 30s) begin to overlap and stall:
       Execution of job skipped: maximum number of running instances reached (1)
4. With the async event loop saturated by stalled coroutines, the aiohttp
   `TCPConnector` fails to service keepalive pings within `keepalive_timeout=120s`.
   The connector closes, setting `shared_aiohttp_session.closed = True`.
5. All subsequent requests permanently fall back to per-request HTTPS connections:
       SESSION REUSE: No shared session available for this request  ← repeats forever
6. Pod CPU climbs to ~980m and stays there. `/health/readiness` begins timing out.
   **The only recovery is a pod restart**: the session is never recreated in-process.
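
Steps 2 to 4 rest on one mechanism: a call that blocks the event loop delays every other coroutine scheduled on it. The sketch below illustrates only that mechanism with the standard library. It does not model Redis, APScheduler, or aiohttp, and `blocking_job` is a hypothetical stand-in for a stalled write.

```python
import asyncio
import time


async def heartbeat(interval: float, lags: list) -> None:
    """Record how late each wake-up is relative to the requested interval."""
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        lags.append(time.monotonic() - start - interval)


async def blocking_job(seconds: float) -> None:
    # time.sleep stands in for a call that blocks the event loop
    # instead of awaiting (hypothetical example, not LiteLLM code).
    time.sleep(seconds)


async def measure_max_lag(block_seconds: float = 0.3) -> float:
    lags: list = []
    task = asyncio.create_task(heartbeat(0.01, lags))
    await asyncio.sleep(0.05)          # let the heartbeat run freely first
    await blocking_job(block_seconds)  # nothing else runs during this call
    await asyncio.sleep(0.05)          # give the heartbeat a chance to wake up
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(lags)


if __name__ == "__main__":
    print(f"worst heartbeat lag: {asyncio.run(measure_max_lag()):.3f}s")
```

The worst lag comes out close to the blocking duration, which is the kind of starvation the report describes for background jobs and connection upkeep.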

---

### Minimal reproduction (for contributors — no Redis required)

This reproduces the no-recovery behavior directly, without needing the full
Redis/event-loop cascade:

```python
import asyncio
import litellm.proxy.proxy_server as proxy_server
from litellm.proxy.route_llm_request import add_shared_session_to_data

async def main():
    # 1. Simulate startup
    proxy_server.shared_aiohttp_session = await proxy_server._initialize_shared_aiohttp_session()
    print("Session open:", not proxy_server.shared_aiohttp_session.closed)  # True

    # 2. Force-close the session (simulates event loop pressure / keepalive timeout)
    await proxy_server.shared_aiohttp_session.close()
    print("Session closed:", proxy_server.shared_aiohttp_session.closed)    # True

    # 3. Simulate an incoming request; on the buggy version the session stays dead, no recovery
    data = {}
    add_shared_session_to_data(data)  # currently sync
    print("shared_session in data:", "shared_session" in data)              # False (bug)
    print("Session still closed:", proxy_server.shared_aiohttp_session.closed)  # True (bug)

asyncio.run(main())
```

Expected after fix:

    shared_session in data: True       ← session was auto-recreated
    Session still closed: False

Actual (current behavior):

    SESSION REUSE: No shared session available for this request   ← logged every request
    shared_session in data: False      ← falls back to per-request HTTPS forever
    Session still closed: True         ← never recovered

What part of LiteLLM is this about?

No response

What LiteLLM version are you on?

v1.81.12

Twitter / LinkedIn details

No response

extent analysis

Fix Plan

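To fix the issue, modify add_shared_session_to_data in route_llm_request.py so that it recreates shared_aiohttp_session when it is closed:

  • Check whether the shared session is closed and recreate it if necessary. The sketch below assumes the session is stored as proxy_server.shared_aiohttp_session, as in the reproduction script, and rebinds it on that module. A `global` statement inside route_llm_request.py would only rebind a name in that module and leave the proxy_server copy closed:
async def add_shared_session_to_data(data):
    # Imported inside the function to avoid a circular import at module load.
    import litellm.proxy.proxy_server as proxy_server
    from litellm._logging import verbose_proxy_logger

    session = proxy_server.shared_aiohttp_session
    if session is None or session.closed:
        verbose_proxy_logger.warning("SESSION REUSE: Session closed, recreating...")
        session = await proxy_server._initialize_shared_aiohttp_session()
        proxy_server.shared_aiohttp_session = session  # visible to later requests
    data["shared_session"] = session
  • Make sure add_shared_session_to_data is an async function so that it can await the recreation, and await it at every call site.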

Verification

To verify the fix, run the minimal reproduction from the issue with the call to add_shared_session_to_data awaited:

import asyncio
import litellm.proxy.proxy_server as proxy_server
from litellm.proxy.route_llm_request import add_shared_session_to_data

async def main():
    # 1. Simulate startup
    proxy_server.shared_aiohttp_session = await proxy_server._initialize_shared_aiohttp_session()
    print("Session open:", not proxy_server.shared_aiohttp_session.closed)  # True

    # 2. Force-close the session (simulates event loop pressure / keepalive timeout)
    await proxy_server.shared_aiohttp_session.close()
    print("Session closed:", proxy_server.shared_aiohttp_session.closed)    # True

    # 3. Simulate an incoming request; after the fix the session should be recreated
    data = {}
    await add_shared_session_to_data(data)  # modified to be async
    print("shared_session in data:", "shared_session" in data)              # True (fixed)
    print("Session still closed:", proxy_server.shared_aiohttp_session.closed)  # False (fixed)

asyncio.run(main())

The expected output should be:

Session open: True
Session closed: True
shared_session in data: True
Session still closed: False

Extra Tips

  • Make sure to test the fix in a production-like environment to ensure it works as expected.
  • Consider adding logging and monitoring to detect when the session is recreated to ensure it's working correctly.
  • Review the keepalive_timeout setting to ensure it's set to a reasonable value to prevent the session from closing too frequently.
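
As one way to act on the monitoring tip, the sketch below counts recreations and uses an `asyncio.Lock` so that a burst of concurrent requests triggers a single recreation. Everything here is hypothetical (`CountingRecoverer`, `StubSession`, the `recreations` counter); none of it is confirmed to exist in LiteLLM or in the PR.

```python
import asyncio
import logging

logger = logging.getLogger("session_metrics_sketch")


class StubSession:
    """Hypothetical stand-in for a session object with a `closed` flag."""

    def __init__(self) -> None:
        self.closed = False


class CountingRecoverer:
    """Hypothetical helper: recreates a closed session once and counts recreations."""

    def __init__(self, factory) -> None:
        self._factory = factory
        self._lock = asyncio.Lock()
        self.session = None
        self.recreations = 0

    async def get(self):
        session = self.session
        if session is None or not session.closed:
            return session  # fast path: nothing to recover
        async with self._lock:
            # Re-check under the lock: another caller may have recovered already.
            if self.session.closed:
                self.session = await self._factory()
                self.recreations += 1
                logger.warning("shared session recreated (total=%d)", self.recreations)
            return self.session


async def demo(concurrency: int = 20):
    async def factory():
        await asyncio.sleep(0)  # yield, so concurrent callers overlap
        return StubSession()

    recoverer = CountingRecoverer(factory)
    dead = StubSession()
    dead.closed = True
    recoverer.session = dead
    sessions = await asyncio.gather(*(recoverer.get() for _ in range(concurrency)))
    return recoverer.recreations, len({id(s) for s in sessions})


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A counter like `recreations` could be exported to whatever metrics system the deployment already uses. A value that keeps climbing would suggest the underlying cause of the closures still needs attention.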
