litellm - ✅(Solved) Fix [Bug]: Sync convert_url_to_base64() blocks asyncio event loop, causing pod health check failures and mass restarts [1 pull requests, 1 comments, 1 participants]

ruettiger · 2026-03-30T13:04:42Z




BerriAI/litellm#24788 • Fetched 2026-04-08 01:53:59


Root Cause

File: litellm/litellm_core_utils/prompt_templates/image_handling.py:98-121

convert_url_to_base64() uses litellm.module_level_client (sync httpx) with no explicit connect timeout. It also retries 3 times on failure, meaning a single unreachable URL can block the event loop for ~6 minutes (3 retries × ~2 min TCP timeout).

The sync version is called from multiple async code paths:

| File | Line | Context |
|------|------|---------|
| `factory.py` | 171 | `convert_to_ollama_image()` |
| `factory.py` | 892 | `convert_to_anthropic_image_obj()` |
| `factory.py` | 937 | `create_anthropic_image_param()` (Bedrock/http URLs) |
| `gpt_transformation.py` | 236 | `_handle_pdf_url()` (sync path) |
| `gemini/chat/transformation.py` | 145 | `_process_gemini_image()` |
| `vertex_ai/ocr/transformation.py` | 134 | `_url_to_vertex_image()` |
| `azure_ai/ocr/transformation.py` | 120 | `_url_to_azure_image()` |

While async variants (async_convert_url_to_base64) exist and are used in some paths (e.g., gpt_transformation.py:247, vertex_ai/ocr/transformation.py:157), many code paths still use the sync version directly within async request handlers.
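To see why a synchronous fetch inside an async handler starves health checks, here is a minimal sketch. It is not LiteLLM code: `slow_sync_fetch`, `handler`, and `heartbeat` are hypothetical names, and `time.sleep` stands in for the blocking TCP connect.

```python
import asyncio
import time


def slow_sync_fetch(delay: float) -> str:
    """Stand-in for a blocking HTTP fetch that hangs on TCP connect."""
    time.sleep(delay)  # blocks the calling thread, like a sync client call
    return "data"


async def handler(delay: float) -> str:
    # Calling sync code directly from a coroutine blocks the whole event loop.
    return slow_sync_fetch(delay)


async def heartbeat(ticks: list, interval: float) -> None:
    """Stand-in for a health check endpoint that must keep responding."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def count_heartbeats_during_request(delay: float = 0.3, interval: float = 0.02) -> int:
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks, interval))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    before = len(ticks)
    await handler(delay)  # the loop is frozen for `delay` seconds here
    during = len(ticks) - before
    hb.cancel()
    return during


if __name__ == "__main__":
    # No heartbeat tick is recorded while the handler runs.
    print(asyncio.run(count_heartbeats_during_request()))
```

The heartbeat task is scheduled the whole time, but it never gets a turn because `handler` never yields control back to the loop. A `/health` route in the same process would stall the same way.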

PR fix notes

PR #24885: fix: cap connect timeout in image URL fetching to prevent event loop blocking

Repository: BerriAI/litellm
Author: voidborne-d
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/24885

Description (problem / solution / changelog)

Summary

Fixes #24788 — sync convert_url_to_base64() blocks the asyncio event loop when an image URL points to an unreachable host, causing cascading health check failures and pod restarts in Kubernetes deployments.

Root Cause

When litellm.request_timeout is set to a large value (e.g., 600s for LLM calls), the module-level HTTP client inherits that same timeout for TCP connect. The sync convert_url_to_base64() then blocks for up to request_timeout × 3 retries (potentially 30 minutes) on unreachable hosts. Since it's called from async code paths (Anthropic/Bedrock, Gemini, Ollama, Vertex AI, Azure AI transformations), this blocks the entire event loop.
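The worst-case figures quoted in the issue and in this PR follow from simple multiplication. The helper below is only an illustration of that arithmetic; `worst_case_block_seconds` is a hypothetical name, and it assumes retries run back to back with no backoff.

```python
def worst_case_block_seconds(per_attempt_timeout: float, attempts: int = 3) -> float:
    """Upper bound on how long sequential retries can block, ignoring backoff."""
    return per_attempt_timeout * attempts


if __name__ == "__main__":
    # ~2 min OS-level TCP connect timeout, as reported in the issue
    print(worst_case_block_seconds(120))  # 360 seconds, about 6 minutes
    # request_timeout of 600s inherited as the connect timeout
    print(worst_case_block_seconds(600))  # 1800 seconds, 30 minutes
    # connect timeout capped at 5s
    print(worst_case_block_seconds(5))    # 15 seconds
```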

Changes

1. `litellm/litellm_core_utils/prompt_templates/image_handling.py`

- Both `convert_url_to_base64()` and `async_convert_url_to_base64()` now pass an explicit `httpx.Timeout(timeout=30, connect=5)` to every `.get()` call
- Connect timeout is always capped at 5 seconds regardless of `litellm.request_timeout`
- Overall timeout per attempt is capped at 30 seconds
- Failed attempts now log a `warning` instead of silently swallowing exceptions (helps debugging unreachable URL issues)
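The behavior described above (a fixed, capped timeout on every attempt, a warning per failure, and an error after the last attempt) can be sketched roughly as follows. This is not the PR's code: `fetch_with_capped_timeout` and `client_get` are hypothetical names, the `ImageFetchError` class is a local stand-in, and a plain dict stands in for `httpx.Timeout(timeout=30, connect=5)` so the sketch runs without third-party packages.

```python
import logging
from typing import Callable

logger = logging.getLogger("image_fetch_sketch")

# Stand-in for httpx.Timeout(timeout=30, connect=5); assumption for illustration.
IMAGE_FETCH_TIMEOUT = {"timeout": 30.0, "connect": 5.0}
MAX_ATTEMPTS = 3


class ImageFetchError(Exception):
    """Local stand-in for the error raised once all attempts fail."""


def fetch_with_capped_timeout(url: str, client_get: Callable[..., bytes]) -> bytes:
    """Call `client_get` up to MAX_ATTEMPTS times, always passing the capped timeout."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return client_get(url, timeout=IMAGE_FETCH_TIMEOUT)
        except Exception as exc:
            # Log instead of silently swallowing, so unreachable URLs are visible.
            logger.warning(
                "attempt %d/%d for %s failed: %r", attempt, MAX_ATTEMPTS, url, exc
            )
    raise ImageFetchError(f"could not fetch {url} after {MAX_ATTEMPTS} attempts")
```

Because the timeout is a module-level constant rather than something derived from the global request timeout, a large value tuned for LLM calls cannot leak into image fetching.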

2. `litellm/llms/custom_httpx/http_handler.py`

- `HTTPHandler.get()` and `AsyncHTTPHandler.get()` accept an optional `timeout` parameter
- When provided, it's forwarded as a per-request timeout override to the underlying httpx client
- No behavior change when `timeout` is not passed (backward compatible)
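A backward-compatible optional override of this kind usually forwards the argument only when the caller supplies it. The sketch below illustrates the pattern with hypothetical classes (`RecordingClient`, `HandlerSketch`); it is not the real `HTTPHandler`, and the real signature may differ.

```python
from typing import Any, Optional


class RecordingClient:
    """Hypothetical underlying client that records the kwargs it receives."""

    def __init__(self) -> None:
        self.last_kwargs: dict = {}

    def get(self, url: str, **kwargs: Any) -> str:
        self.last_kwargs = kwargs
        return "ok"


class HandlerSketch:
    """Illustrative wrapper; not the real HTTPHandler."""

    def __init__(self, client: RecordingClient) -> None:
        self.client = client

    def get(self, url: str, timeout: Optional[Any] = None) -> str:
        kwargs: dict = {}
        if timeout is not None:
            # Only override when the caller asked for it, so existing callers
            # keep the client-level default.
            kwargs["timeout"] = timeout
        return self.client.get(url, **kwargs)
```

Leaving the keyword out entirely when `timeout` is `None` matters: passing `timeout=None` through to a client could be read as "no timeout" rather than "use the default".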

3. Tests

5 new tests in `tests/test_litellm/litellm_core_utils/test_image_handling.py`:

- `test_image_fetch_timeout_has_capped_connect`: verifies timeout structure
- `test_convert_url_to_base64_passes_capped_timeout`: verifies sync path passes timeout
- `test_async_convert_url_to_base64_passes_capped_timeout`: verifies async path passes timeout
- `test_unreachable_host_fails_after_3_retries`: simulates ConnectTimeout, verifies 3 retries then ImageFetchError
- Updated all existing test clients to accept `timeout` kwarg

All 12 tests pass. No existing tests broken.

Changed files

- `litellm/litellm_core_utils/prompt_templates/image_handling.py` (modified, +41/-8)
- `litellm/llms/custom_httpx/http_handler.py` (modified, +21/-8)
- `tests/test_litellm/litellm_core_utils/test_image_handling.py` (modified, +112/-4)


Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

In a GKE Autopilot production deployment (16 replicas), only 1 pod remained available. The root cause is that convert_url_to_base64() uses a synchronous httpx client (litellm.module_level_client) to fetch image/file URLs. When the request contains an unreachable internal URL (e.g., http://10.254.3.71/uploads/...), the TCP connect hangs for ~2 minutes, blocking the entire asyncio event loop. This prevents FastAPI from responding to health checks, nginx records 499, kubelet marks the pod as unhealthy, and restarts it — creating a cascading failure across all replicas.

This is the same class of bug as #20268 (sync next() blocking event loop), but occurring in the image/file URL handling path.

Steps to Reproduce

1. Deploy LiteLLM proxy with multiple replicas behind a load balancer with health checks
2. Send a chat completion request containing an image URL pointing to an unreachable internal IP (e.g., http://10.x.x.x/image.png) to a provider that triggers convert_url_to_base64 (Anthropic on Vertex/Bedrock, Gemini, Ollama, etc.)
3. Observe:
   - The request thread blocks for ~2 minutes per retry (up to ~6 min total)
   - During this time, the pod cannot respond to health checks
   - Health check failures trigger pod restarts
   - Under sustained traffic with such URLs, most/all pods become unhealthy

Suggested Fixes

1. Add connect timeout (quick fix / stop the bleeding)

Set a short connect_timeout (e.g., 5s) on module_level_client in _lazy_imports.py:433:

sync_client = HTTPHandler(timeout=httpx.Timeout(timeout, connect=5.0))

2. Replace sync calls with async versions (proper fix)

Replace convert_url_to_base64() with async_convert_url_to_base64() in all async code paths. For call sites where the function signature is sync, use asyncio.to_thread() as an interim measure:

base64_data = await asyncio.to_thread(convert_url_to_base64, url)
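As a rough illustration of the interim measure, the sketch below offloads a blocking function with `asyncio.to_thread()` (Python 3.9+) and counts how often a concurrent task gets to run in the meantime. `blocking_fetch` is a hypothetical stand-in that sleeps instead of doing network I/O; it is not `convert_url_to_base64`.

```python
import asyncio
import time


def blocking_fetch(url: str, delay: float = 0.3) -> str:
    """Hypothetical stand-in for a sync URL fetch; sleeps instead of doing I/O."""
    time.sleep(delay)
    return f"fetched:{url}"


async def fetch_without_blocking_loop(url: str) -> str:
    # Run the sync function in a worker thread so the event loop keeps serving
    # other coroutines (for example health checks) while it waits.
    return await asyncio.to_thread(blocking_fetch, url)


async def demo() -> tuple:
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.02)

    hb = asyncio.create_task(heartbeat())
    result = await fetch_without_blocking_loop("http://example.invalid/image.png")
    hb.cancel()
    return result, ticks


if __name__ == "__main__":
    result, ticks = asyncio.run(demo())
    print(result, "heartbeats while waiting:", ticks)
```

The worker thread is still occupied until the fetch returns, and the default thread pool is bounded, so this approach complements a short connect timeout rather than replacing it.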

3. Reject private/unreachable URLs at the entry point (defense in depth)

Add an optional check in convert_url_to_base64() to reject RFC 1918 private IPs (10.x.x.x, 172.16-31.x.x, 192.168.x.x) or make it configurable.

Related Issues

#20268 — Same class of bug: sync next() blocking event loop in streaming handler
#24193 — ImageFetchError masked by misleading APIConnectionError in the same code path
#19921 — Performance regression in 1.81.x (possibly related to event loop blocking)

Relevant log output

# nginx access log showing 499 (client closed connection = health check timeout)
10.x.x.x - - [30/Mar/2026:xx:xx:xx +0000] "GET /health HTTP/1.1" 499 0

# kubelet events
Warning  Unhealthy  pod/litellm-xxx  Liveness probe failed: HTTP probe failed with statuscode: 499
Warning  Unhealthy  pod/litellm-xxx  Readiness probe failed: HTTP probe failed with statuscode: 499
Normal   Killing    pod/litellm-xxx  Container litellm failed liveness probe, will be restarted

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.65.3+


extent analysis

Fix Plan

To address the issue of synchronous HTTP requests blocking the asyncio event loop, we will implement the following steps:

1. Add connect timeout: Set a short connect_timeout on module_level_client to prevent long hangs.
2. Replace sync calls with async versions: Use async_convert_url_to_base64() instead of convert_url_to_base64() in async code paths.
3. Reject private/unreachable URLs: Add an optional check to reject RFC 1918 private IPs.

Step-by-Step Solution

Add Connect Timeout

# In _lazy_imports.py:433
sync_client = HTTPHandler(timeout=httpx.Timeout(timeout, connect=5.0))

Replace Sync Calls with Async Versions

# Replace convert_url_to_base64() with async_convert_url_to_base64()
# In factory.py:171
base64_data = await async_convert_url_to_base64(url)

# In factory.py:892
base64_data = await async_convert_url_to_base64(url)

# In factory.py:937
base64_data = await async_convert_url_to_base64(url)

# In gpt_transformation.py:236 (sync path): `await` is only valid inside an
# `async def`, so offload the sync helper from its async caller
base64_data = await asyncio.to_thread(convert_url_to_base64, url)

# In gemini/chat/transformation.py:145
base64_data = await async_convert_url_to_base64(url)

# In vertex_ai/ocr/transformation.py:134
base64_data = await async_convert_url_to_base64(url)

# In azure_ai/ocr/transformation.py:120
base64_data = await async_convert_url_to_base64(url)

Reject Private/Unreachable URLs

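# In litellm/litellm_core_utils/prompt_templates/image_handling.py:98-121
import ipaddress
from urllib.parse import urlparse

def convert_url_to_base64(url):
    # Extract the host with urlparse so ports and userinfo are handled
    host = urlparse(url).hostname or ""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Host is a DNS name, not an IP literal; this check does not resolve it
        ip = None
    # Raise outside the try block so the rejection is not swallowed.
    # Note: is_private also covers loopback and link-local, not only RFC 1918.
    if ip is not None and ip.is_private:
        raise ValueError(f"Refusing to fetch private IP address: {host}")
    # Rest of the function remains the same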

Verification

To verify the fix, deploy the updated code and test with the following scenarios:

1. Send a chat completion request with an image URL pointing to an unreachable internal IP.
2. Observe that the request does not block for an extended period.
3. Check that the nginx access logs no longer record 499 status codes (client closed connection) for /health.
4. Verify that the pod remains healthy and responsive to health checks.

Extra Tips

Monitor the application for any regressions or performance issues after applying the fix.
Consider adding additional logging or metrics to track the number of rejected private IPs or unreachable URLs.
Review the codebase for any other synchronous requests that may be blocking the asyncio event loop.
