litellm - ✅ (Solved) Fix [Bug]: Sync convert_url_to_base64() blocks asyncio event loop, causing pod health check failures and mass restarts [1 pull request, 1 comment, 1 participant]

GitHub stats
BerriAI/litellm#24788 · Fetched 2026-04-08 01:53:59
Comments: 1 · Participants: 1 · Timeline events: 5 (labeled ×2, commented ×1, cross-referenced ×1, referenced ×1) · Reactions: 0

Root Cause

File: litellm/litellm_core_utils/prompt_templates/image_handling.py:98-121

convert_url_to_base64() uses litellm.module_level_client (sync httpx) with no explicit connect timeout. It also retries 3 times on failure, meaning a single unreachable URL can block the event loop for ~6 minutes (3 retries × ~2 min TCP timeout).

The sync version is called from multiple async code paths:

| File | Line | Context |
| --- | --- | --- |
| factory.py | 171 | convert_to_ollama_image() |
| factory.py | 892 | convert_to_anthropic_image_obj() |
| factory.py | 937 | create_anthropic_image_param() (Bedrock/http URLs) |
| gpt_transformation.py | 236 | _handle_pdf_url() (sync path) |
| gemini/chat/transformation.py | 145 | _process_gemini_image() |
| vertex_ai/ocr/transformation.py | 134 | _url_to_vertex_image() |
| azure_ai/ocr/transformation.py | 120 | _url_to_azure_image() |

While async variants (async_convert_url_to_base64) exist and are used in some paths (e.g., gpt_transformation.py:247, vertex_ai/ocr/transformation.py:157), many code paths still use the sync version directly within async request handlers.
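The effect of a sync call inside an async handler can be reproduced without LiteLLM or a network. The sketch below is only an illustration: blocking_fetch and max_heartbeat_gap are hypothetical names, time.sleep stands in for the hanging TCP connect, and the heartbeat task plays the role of the /health endpoint.

```python
import asyncio
import time


def blocking_fetch(seconds: float) -> str:
    """Stand-in for a sync HTTP call that hangs on TCP connect."""
    time.sleep(seconds)
    return "done"


async def max_heartbeat_gap(use_thread: bool, block_for: float = 0.5) -> float:
    """Run a heartbeat next to one 'request' and return the longest gap between ticks."""
    gaps = []
    stop = asyncio.Event()

    async def heartbeat() -> None:
        last = time.monotonic()
        while not stop.is_set():
            await asyncio.sleep(0.01)
            now = time.monotonic()
            gaps.append(now - last)
            last = now

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0.05)  # let the heartbeat start ticking
    if use_thread:
        # Offloaded to a worker thread: the loop keeps serving the heartbeat
        await asyncio.to_thread(blocking_fetch, block_for)
    else:
        # Called directly: nothing else on the loop runs until this returns
        blocking_fetch(block_for)
    await asyncio.sleep(0.05)
    stop.set()
    await hb
    return max(gaps)


blocked_gap = asyncio.run(max_heartbeat_gap(use_thread=False))
threaded_gap = asyncio.run(max_heartbeat_gap(use_thread=True))
print(f"direct call: longest heartbeat gap {blocked_gap:.2f}s")
print(f"to_thread:   longest heartbeat gap {threaded_gap:.2f}s")
```

With the direct call the heartbeat stalls for at least the full blocking duration, which is what a health probe would experience. With asyncio.to_thread (Python 3.9+) the heartbeat keeps ticking while the worker thread waits.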

PR fix notes

PR #24885: fix: cap connect timeout in image URL fetching to prevent event loop blocking

Description (problem / solution / changelog)

Summary

Fixes #24788 — sync convert_url_to_base64() blocks the asyncio event loop when an image URL points to an unreachable host, causing cascading health check failures and pod restarts in Kubernetes deployments.

Root Cause

When litellm.request_timeout is set to a large value (e.g., 600s for LLM calls), the module-level HTTP client inherits that same timeout for TCP connect. The sync convert_url_to_base64() then blocks for up to request_timeout × 3 retries (potentially 30 minutes) on unreachable hosts. Since it's called from async code paths (Anthropic/Bedrock, Gemini, Ollama, Vertex AI, Azure AI transformations), this blocks the entire event loop.
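The worst-case figures quoted here and in the issue follow from one multiplication. A minimal sketch, assuming every attempt runs to its full connect timeout and attempts run back to back (worst_case_block_seconds is a hypothetical helper, not LiteLLM code):

```python
def worst_case_block_seconds(per_attempt_timeout_s: float, attempts: int = 3) -> float:
    """Upper bound on how long a sync fetch holds the event loop on an unreachable host."""
    return per_attempt_timeout_s * attempts


inherited = worst_case_block_seconds(600)   # connect timeout inherited from request_timeout=600s
os_default = worst_case_block_seconds(120)  # ~2 min TCP connect timeout from the issue report
capped = worst_case_block_seconds(5)        # connect timeout capped at 5s

for label, seconds in [("inherited", inherited), ("os_default", os_default), ("capped", capped)]:
    print(f"{label}: {seconds:.0f}s ({seconds / 60:.2f} min)")
```

The capped figure covers only the unreachable-host case, where each attempt fails at the connect stage.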

Changes

1. litellm/litellm_core_utils/prompt_templates/image_handling.py

  • Both convert_url_to_base64() and async_convert_url_to_base64() now pass an explicit httpx.Timeout(timeout=30, connect=5) to every .get() call
  • Connect timeout is always capped at 5 seconds regardless of litellm.request_timeout
  • Overall timeout per attempt is capped at 30 seconds
  • Failed attempts now log a warning instead of silently swallowing exceptions (helps debugging unreachable URL issues)
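The behaviour listed above can be sketched without httpx. Everything here is illustrative and may differ from the PR's code: fetch_with_capped_timeout, IMAGE_FETCH_TIMEOUT, and the get callable are hypothetical names, the local ImageFetchError is a stand-in, and a plain dict takes the place of httpx.Timeout(timeout=30, connect=5).

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger("image_fetch_sketch")

# Stand-in for httpx.Timeout(timeout=30, connect=5); a dict keeps the sketch stdlib-only.
IMAGE_FETCH_TIMEOUT: Dict[str, float] = {"connect": 5.0, "total": 30.0}


class ImageFetchError(Exception):
    """Local stand-in for the error raised once all attempts have failed."""


def fetch_with_capped_timeout(get: Callable[..., bytes], url: str, attempts: int = 3) -> bytes:
    """Call get(url, timeout=...) up to `attempts` times, logging each failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # The same capped timeout is passed explicitly on every attempt
            return get(url, timeout=IMAGE_FETCH_TIMEOUT)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            logger.warning("attempt %d/%d for %s failed: %r", attempt, attempts, url, exc)
    raise ImageFetchError(f"could not fetch {url} after {attempts} attempts") from last_error


# Usage with a fake client that always times out on connect
calls = []


def always_times_out(url: str, timeout: Dict[str, float]) -> bytes:
    calls.append(timeout)
    raise TimeoutError("connect timed out")


try:
    fetch_with_capped_timeout(always_times_out, "http://10.0.0.1/image.png")
    outcome = "fetched"
except ImageFetchError:
    outcome = "failed"
```

Passing the timeout on each call, rather than relying on the client default, is what keeps the cap independent of litellm.request_timeout.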

2. litellm/llms/custom_httpx/http_handler.py

  • HTTPHandler.get() and AsyncHTTPHandler.get() accept an optional timeout parameter
  • When provided, it's forwarded as a per-request timeout override to the underlying httpx client
  • No behavior change when timeout is not passed (backward compatible)
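A backward-compatible per-request override usually comes down to forwarding the argument only when the caller supplied one. The sketch below assumes nothing about the real HTTPHandler internals: SketchHandler and make_fake_client are hypothetical, and the fake client's default of 600 seconds stands in for a client-level timeout.

```python
from typing import Any, Callable, List, Optional, Tuple


def make_fake_client(default_timeout: float) -> Tuple[Callable[..., bytes], List[Tuple[str, float]]]:
    """Build a fake 'send' function with a client-level default timeout, plus its call log."""
    seen: List[Tuple[str, float]] = []

    def send(url: str, timeout: float = default_timeout) -> bytes:
        seen.append((url, timeout))
        return b"ok"

    return send, seen


class SketchHandler:
    """Hypothetical handler that wraps a client and accepts an optional per-request timeout."""

    def __init__(self, send: Callable[..., Any]) -> None:
        self._send = send

    def get(self, url: str, timeout: Optional[float] = None) -> Any:
        kwargs = {}
        if timeout is not None:
            # Forward only when an override was requested, so existing callers
            # keep the client-level default unchanged
            kwargs["timeout"] = timeout
        return self._send(url, **kwargs)


send, seen = make_fake_client(default_timeout=600.0)
handler = SketchHandler(send)
handler.get("http://example.com/a.png")                # uses the client default
handler.get("http://example.com/a.png", timeout=30.0)  # per-request override
```

After the two calls, seen holds the default timeout for the first request and the override for the second.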

3. Tests

5 new tests in tests/test_litellm/litellm_core_utils/test_image_handling.py:

  • test_image_fetch_timeout_has_capped_connect — verifies timeout structure
  • test_convert_url_to_base64_passes_capped_timeout — verifies sync path passes timeout
  • test_async_convert_url_to_base64_passes_capped_timeout — verifies async path passes timeout
  • test_unreachable_host_fails_after_3_retries — simulates ConnectTimeout, verifies 3 retries then ImageFetchError
  • Updated all existing test clients to accept timeout kwarg

All 12 tests pass. No existing tests broken.

Changed files

  • litellm/litellm_core_utils/prompt_templates/image_handling.py (modified, +41/-8)
  • litellm/llms/custom_httpx/http_handler.py (modified, +21/-8)
  • tests/test_litellm/litellm_core_utils/test_image_handling.py (modified, +112/-4)


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

In a GKE Autopilot production deployment (16 replicas), only 1 pod remained available. The root cause is that convert_url_to_base64() uses a synchronous httpx client (litellm.module_level_client) to fetch image/file URLs. When the request contains an unreachable internal URL (e.g., http://10.254.3.71/uploads/...), the TCP connect hangs for ~2 minutes, blocking the entire asyncio event loop. This prevents FastAPI from responding to health checks, nginx records 499, kubelet marks the pod as unhealthy, and restarts it — creating a cascading failure across all replicas.

This is the same class of bug as #20268 (sync next() blocking event loop), but occurring in the image/file URL handling path.

Steps to Reproduce

  1. Deploy LiteLLM proxy with multiple replicas behind a load balancer with health checks
  2. Send a chat completion request containing an image URL pointing to an unreachable internal IP (e.g., http://10.x.x.x/image.png) to a provider that triggers convert_url_to_base64 (Anthropic on Vertex/Bedrock, Gemini, Ollama, etc.)
  3. Observe:
    • The request blocks the event loop for ~2 minutes per attempt (up to ~6 min total)
    • During this time, the pod cannot respond to health checks
    • Health check failures trigger pod restarts
    • Under sustained traffic with such URLs, most/all pods become unhealthy

Suggested Fixes

1. Add connect timeout (quick fix / stop the bleeding)

Set a short connect timeout (e.g., 5s) on module_level_client in _lazy_imports.py:433:

sync_client = HTTPHandler(timeout=httpx.Timeout(timeout, connect=5.0))

2. Replace sync calls with async versions (proper fix)

Replace convert_url_to_base64() with async_convert_url_to_base64() in all async code paths. For call sites where the function signature is sync, use asyncio.to_thread() as an interim measure:

base64_data = await asyncio.to_thread(convert_url_to_base64, url)

3. Reject private/unreachable URLs at the entry point (defense in depth)

Add an optional check in convert_url_to_base64() to reject RFC 1918 private IPs (10.x.x.x, 172.16-31.x.x, 192.168.x.x) or make it configurable.
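A minimal sketch of such a check, limited to the three RFC 1918 ranges named above. The helper name is_rfc1918_url is hypothetical, and the check only looks at IP literals: a DNS name that resolves to a private address passes, because resolution is out of scope here.

```python
import ipaddress
from urllib.parse import urlsplit

# The three RFC 1918 private IPv4 ranges
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918_url(url: str) -> bool:
    """Return True when the URL's host is an IP literal inside an RFC 1918 range."""
    host = urlsplit(url).hostname  # strips scheme, userinfo, port, and path
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # host is a DNS name, not an IP literal
    return any(addr in net for net in RFC1918_NETWORKS)


for candidate in [
    "http://10.254.3.71/uploads/x.png",
    "http://172.16.0.1:8080/a.png",
    "http://172.32.0.1/a.png",
    "https://example.com/image.png",
]:
    print(candidate, is_rfc1918_url(candidate))
```

An explicit network list matches only the ranges the issue names, whereas ipaddress's is_private attribute also covers loopback, link-local, and other reserved ranges. Which of the two is appropriate depends on how strict the deployment wants to be.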

Related Issues

  • #20268 — Same class of bug: sync next() blocking event loop in streaming handler
  • #24193 — ImageFetchError masked by misleading APIConnectionError in the same code path
  • #19921 — Performance regression in 1.81.x (possibly related to event loop blocking)

Relevant log output

# nginx access log showing 499 (client closed connection = health check timeout)
10.x.x.x - - [30/Mar/2026:xx:xx:xx +0000] "GET /health HTTP/1.1" 499 0

# kubelet events
Warning  Unhealthy  pod/litellm-xxx  Liveness probe failed: HTTP probe failed with statuscode: 499
Warning  Unhealthy  pod/litellm-xxx  Readiness probe failed: HTTP probe failed with statuscode: 499
Normal   Killing    pod/litellm-xxx  Container litellm failed liveness probe, will be restarted

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

v1.65.3+

extent analysis

Fix Plan

To address the issue of synchronous HTTP requests blocking the asyncio event loop, we will implement the following steps:

  1. Add connect timeout: Set a short connect timeout (e.g., 5s) on module_level_client to prevent long hangs.
  2. Replace sync calls with async versions: Use async_convert_url_to_base64() instead of convert_url_to_base64() in async code paths.
  3. Reject private/unreachable URLs: Add an optional check to reject RFC 1918 private IPs.

Step-by-Step Solution

Add Connect Timeout

# In _lazy_imports.py:433
sync_client = HTTPHandler(timeout=httpx.Timeout(timeout, connect=5.0))

Replace Sync Calls with Async Versions

# Replace convert_url_to_base64() with async_convert_url_to_base64()
# In factory.py:171
base64_data = await async_convert_url_to_base64(url)

# In factory.py:892
base64_data = await async_convert_url_to_base64(url)

# In factory.py:937
base64_data = await async_convert_url_to_base64(url)

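# In gpt_transformation.py:236 (sync path): offload the sync call to a worker thread.
# The await must live in the nearest async caller, since a sync function cannot await.
base64_data = await asyncio.to_thread(convert_url_to_base64, url)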

# In gemini/chat/transformation.py:145
base64_data = await async_convert_url_to_base64(url)

# In vertex_ai/ocr/transformation.py:134
base64_data = await async_convert_url_to_base64(url)

# In azure_ai/ocr/transformation.py:120
base64_data = await async_convert_url_to_base64(url)

Reject Private/Unreachable URLs

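# In litellm/litellm_core_utils/prompt_templates/image_handling.py:98-121
import ipaddress
from urllib.parse import urlsplit

def convert_url_to_base64(url):
    # Extract the host without scheme, userinfo, port, or path
    host = urlsplit(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # Host is a DNS name, not an IP literal; not checked here
    # Raise outside the try block so the rejection is not swallowed by the except clause
    if ip is not None and ip.is_private:
        raise ValueError(f"Refusing to fetch private IP address: {host}")
    # Rest of the function remains the same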

Verification

To verify the fix, deploy the updated code and test with the following scenarios:

  • Send a chat completion request with an image URL pointing to an unreachable internal IP.
  • Observe that the request does not block for an extended period.
  • Check the nginx access logs and confirm that /health requests no longer return 499 (client closed connection).
  • Verify that the pod remains healthy and responsive to health checks.

Extra Tips

  • Monitor the application for any regressions or performance issues after applying the fix.
  • Consider adding additional logging or metrics to track the number of rejected private IPs or unreachable URLs.
  • Review the codebase for any other synchronous requests that may be blocking the asyncio event loop.
