litellm: Vertex AI Gemini streaming silently drops per-request timeout, falling back to litellm.request_timeout (6000s)

BerriAI/litellm#26939, fetched 2026-05-01 05:34:19
Summary

When calling litellm.completion(..., stream=True) against a Vertex AI / Google AI Studio Gemini model, the per-request timeout argument is silently discarded. Streams that stall before the first token hang for the full litellm.request_timeout (default 6000s = 100 minutes) instead of failing at the configured timeout. The non-streaming branch of the same handler propagates timeout correctly; only the sync streaming branch is broken.

Version

  • litellm == 1.83.0 (also reproduces on main as of filing)
  • Python 3.10
  • Provider: vertex_ai and gemini (both route through the same handler)

Repro

import litellm
import time

# Set a small request_timeout so the bug doesn't take 100 minutes to surface.
# In practice, the bug means the per-call timeout is ignored regardless.
litellm.request_timeout = 30

start = time.monotonic()
try:
    response = litellm.completion(
        model="vertex_ai/gemini-2.5-flash",
        messages=[{"role": "user", "content": "hello"}],
        stream=True,
        timeout=2,  # expect to fail in ~2s on a stalled stream
    )
    for chunk in response:
        pass
except Exception as e:
    print(f"failed after {time.monotonic() - start:.1f}s with: {e!s}")

When the upstream stream stalls (we observe this regularly under load against vertex_ai/gemini-2.5-flash), the call hangs for ~30s (the litellm.request_timeout), not ~2s (the per-call timeout). The exception text is also misleading:

litellm.Timeout: Connection timed out after None seconds.

Note the literal None, despite a numeric timeout=2 having been passed.

Root cause

In litellm/llms/vertex_ai/gemini/vertex_and_google_ai_studio_gemini.py, VertexLLM.completion() has two branches. The non-streaming branch correctly bakes the request timeout into the httpx client:

# non-streaming branch (~line 2971)
if client is None or isinstance(client, AsyncHTTPHandler):
    _params = {}
    if timeout is not None:
        if isinstance(timeout, float) or isinstance(timeout, int):
            timeout = httpx.Timeout(timeout)
        _params["timeout"] = timeout
    client = _get_httpx_client(params=_params)

The streaming branch (~line 2944) constructs the make_sync_call partial without any timeout argument:



# SYNC STREAMING CALL
if stream is True:
    request_data_str = json.dumps(data)
    streaming_response = CustomStreamWrapper(
        completion_stream=None,
        make_call=partial(
            make_sync_call,
            gemini_client=(
                client
                if client is not None and isinstance(client, HTTPHandler)
                else None
            ),
            api_base=url,
            data=request_data_str,
            model=model,
            messages=messages,
            logging_obj=logging_obj,
            headers=headers,
        ),
        ...
    )
    return streaming_response

make_sync_call itself does not accept a timeout parameter, and its client.post(...) invocation does not pass one:

def make_sync_call(
    client,
    gemini_client,
    api_base,
    headers,
    data,
    model,
    messages,
    logging_obj,
):
    if gemini_client is not None:
        client = gemini_client
    if client is None:
        client = HTTPHandler()  # uses _DEFAULT_TIMEOUT = httpx.Timeout(5.0, connect=5.0)
    response = client.post(
        api_base,
        headers=headers,
        data=data,
        stream=True,
        logging_obj=logging_obj,
    )

Net result: the user-supplied timeout is dropped on the floor. Whatever httpx.Client instance is reused or created by make_sync_call is what governs the streaming read timeout.
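The mechanism can be reproduced without litellm or a network. The sketch below is a toy model, not litellm's code: FakeClient, make_sync_call_buggy, make_sync_call_fixed, and effective_timeouts are hypothetical stand-ins that only report which timeout would govern a request when the partial does or does not carry the per-request value.

```python
from functools import partial
from typing import Optional


class FakeClient:
    """Hypothetical stand-in for an HTTP client with a client-level default timeout."""

    def __init__(self, default_timeout: float) -> None:
        self.default_timeout = default_timeout

    def post(self, url: str, timeout: Optional[float] = None) -> float:
        # Return the timeout that would actually govern this request.
        return self.default_timeout if timeout is None else timeout


def make_sync_call_buggy(client: FakeClient, api_base: str) -> float:
    # No timeout parameter exists, so the client-level default always wins.
    return client.post(api_base)


def make_sync_call_fixed(
    client: FakeClient, api_base: str, timeout: Optional[float] = None
) -> float:
    # The per-request timeout is accepted and forwarded to post().
    return client.post(api_base, timeout=timeout)


def effective_timeouts(per_request: float, client_default: float) -> tuple:
    """Return (timeout used by the buggy path, timeout used by the fixed path)."""
    client = FakeClient(default_timeout=client_default)
    buggy = partial(make_sync_call_buggy, api_base="https://example.invalid")
    fixed = partial(
        make_sync_call_fixed, api_base="https://example.invalid", timeout=per_request
    )
    return buggy(client), fixed(client)
```

With a per-request value of 2 and a client default of 30, the buggy path reports the client default while the fixed path reports the caller's value, which matches the ~30s versus ~2s behavior in the repro.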

Why "Connection timed out after None seconds"

In litellm/llms/custom_httpx/http_handler.py:1031, the sync HTTPHandler.post() reports timeouts using its local timeout parameter:

except httpx.TimeoutException:
    raise litellm.Timeout(
        message=f"Connection timed out after {timeout} seconds.",
        model="default-model-name",
        llm_provider="litellm-httpx-handler",
    )

Because make_sync_call calls client.post(...) without a timeout argument, the local timeout is None here even though the actual httpx.Client instance has a real timeout configured. The exception message thus reports None. (The async sibling at line 488 was patched to log Timeout passed={timeout}, time taken={time_delta}; the sync handler was missed.)
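A message in the style of the async handler stays useful even when no per-call timeout was passed. The helper below is a hypothetical sketch, not litellm's implementation: timeout_message and its client_default parameter are assumptions, and the "effective timeout" field is an illustrative addition that the issue does not propose.

```python
import time
from typing import Optional


def timeout_message(
    passed: Optional[float], client_default: Optional[float], started: float
) -> str:
    """Build a timeout message that reports both the passed and the effective timeout.

    `started` is a time.monotonic() reading taken just before the request.
    """
    time_delta = round(time.monotonic() - started, 3)
    # Fall back to the client-level timeout when no per-call value was given.
    effective = passed if passed is not None else client_default
    return (
        f"Connection timed out. Timeout passed={passed}, "
        f"effective timeout={effective}, time taken={time_delta} seconds"
    )
```

Reporting the elapsed time alongside the passed value makes a dropped timeout visible in logs: a message showing Timeout passed=None with a time taken near 30 seconds points directly at the client-level setting.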

Expected behavior

The streaming branch should mirror the non-streaming branch: when timeout is provided to completion(), propagate it down to make_sync_call so that client.post(..., timeout=timeout) enforces the user's configured timeout.

Suggested fix

Two small changes:

1. Make make_sync_call accept and forward timeout:

def make_sync_call(
    client,
    gemini_client,
    api_base,
    headers,
    data,
    model,
    messages,
    logging_obj,
    timeout: Optional[Union[float, httpx.Timeout]] = None,  # NEW
):
    if gemini_client is not None:
        client = gemini_client
    if client is None:
        client = HTTPHandler(timeout=timeout)  # NEW: bake into fresh client

    response = client.post(
        api_base,
        headers=headers,
        data=data,
        stream=True,
        timeout=timeout,  # NEW
        logging_obj=logging_obj,
    )
    ...

2. Pass timeout through the streaming partial(...) in VertexLLM.completion():

make_call=partial(
    make_sync_call,
    gemini_client=(
        client
        if client is not None and isinstance(client, HTTPHandler)
        else None
    ),
    api_base=url,
    data=request_data_str,
    model=model,
    messages=messages,
    logging_obj=logging_obj,
    headers=headers,
    timeout=timeout,  # NEW
),

The async streaming sibling (async_streaming at ~line 2575) already receives timeout via its signature; only the sync streaming path is broken.

Impact

For any sync streaming Gemini call where the upstream stalls before the first chunk:

  • The user's timeout is ignored.
  • The call hangs for the full litellm.request_timeout (default 6000s).
  • The error message reports Connection timed out after None seconds, hiding the actual configured timeout.

In production we observed individual calls hanging 6,008–6,017 seconds, exact matches for litellm.request_timeout=6000 plus connect overhead.

Workaround

Until this is fixed, users of streaming Gemini must rely on application-level timeout managers (e.g., a thread that raises TimeoutError into the main thread) or explicitly construct a short-timeout HTTPHandler and pass it as the client kwarg.
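One possible shape for an application-level timeout manager is sketched below using only the standard library. It is a variant of the idea above, not the exact mechanism the issue describes: instead of raising into the main thread, it consumes the stream in a daemon worker thread and enforces a per-chunk deadline through a queue. The name iter_with_chunk_timeout is hypothetical. The stalled worker is abandoned rather than cancelled, so the underlying connection stays open until the client-level timeout eventually fires.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_with_chunk_timeout(
    make_stream: Callable[[], Iterable[T]], timeout: float
) -> Iterator[T]:
    """Yield chunks from make_stream(); raise TimeoutError if a chunk takes too long."""
    items = queue.Queue()  # holds ("chunk", value), ("error", exc) or ("done", None)

    def pump() -> None:
        # Runs in a daemon thread so a stalled stream cannot block interpreter exit.
        try:
            for chunk in make_stream():
                items.put(("chunk", chunk))
            items.put(("done", None))
        except BaseException as exc:  # forward any failure to the consumer
            items.put(("error", exc))

    threading.Thread(target=pump, daemon=True).start()
    while True:
        try:
            kind, value = items.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError(f"no chunk received within {timeout} seconds") from None
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value
```

Assuming the repro's arguments, the wrapper would be used as iter_with_chunk_timeout(lambda: litellm.completion(model=..., messages=..., stream=True), timeout=2), so that both stream creation and the wait for each chunk are covered by the deadline.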

extent analysis

TL;DR

To fix the issue, modify the make_sync_call function to accept and forward the timeout parameter, and pass this parameter through the streaming partial in VertexLLM.completion().

Guidance

  • Modify the make_sync_call function signature to include timeout: Optional[Union[float, httpx.Timeout]] = None.
  • Update the make_sync_call function to forward the timeout parameter to client.post().
  • Pass the timeout parameter through the streaming partial in VertexLLM.completion().
  • Verify the fix by testing the streaming branch with a configured timeout and checking that it fails within the expected time frame.
  • Consider implementing a workaround, such as using an application-level timeout manager, until the fix is applied.
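The verification step can be automated with a small timing helper. This is a minimal sketch using only the standard library; assert_times_out_within and its slack parameter are hypothetical names, and the helper accepts any exception as the failure signal, in the same way the repro script does.

```python
import time
from typing import Callable


def assert_times_out_within(
    call: Callable[[], object], timeout: float, slack: float = 1.0
) -> float:
    """Run `call`, require that it raises within timeout + slack seconds.

    Returns the elapsed time in seconds so callers can log it.
    """
    start = time.monotonic()
    try:
        call()
    except Exception:
        elapsed = time.monotonic() - start
    else:
        elapsed = time.monotonic() - start
        raise AssertionError(f"call succeeded after {elapsed:.1f}s; expected a timeout")
    if elapsed > timeout + slack:
        raise AssertionError(
            f"call failed after {elapsed:.1f}s; expected at most {timeout + slack:.1f}s"
        )
    return elapsed
```

To check the fix, wrap the repro's streaming loop in a function and pass it with timeout=2 against an endpoint known to stall. Before the fix the helper should fail at roughly litellm.request_timeout, and after the fix it should pass at roughly 2 seconds.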

Example

def make_sync_call(
    client,
    gemini_client,
    api_base,
    headers,
    data,
    model,
    messages,
    logging_obj,
    timeout: Optional[Union[float, httpx.Timeout]] = None,
):
    # ...
    response = client.post(
        api_base,
        headers=headers,
        data=data,
        stream=True,
        timeout=timeout,
        logging_obj=logging_obj,
    )
    # ...

Notes

The suggested fix only addresses the sync streaming path, as the async streaming sibling already receives the timeout parameter. The fix should be applied to the litellm library to ensure correct propagation of the timeout parameter.

Recommendation

Apply the suggested fix to the make_sync_call function and VertexLLM.completion() method to ensure correct handling of timeouts in the streaming branch. This will allow users to configure timeouts for streaming Gemini calls and prevent calls from hanging for the full litellm.request_timeout.
