litellm - 💡(How to fix) Fix [Bug]: websearch_interception silently truncates streaming response on /v1/messages — follow-up call always uses stream=False

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

When using websearch_interception with a search provider (e.g. Tavily) via Claude Code (which uses the /v1/messages endpoint) with stream=True, the tool call executes successfully, but the final LLM output is silently truncated mid-response with no error logs. Actual behavior: The response cuts off mid-generation. No error, no non-200 status, no log entry — it simply stops. 5. Observe that the litellm_web_search tool call completes successfully, but the final LLM response is cut off mid-generation with no error.

Root Cause

The bug lives in two cooperating places inside handler.py.

Step 1 — stream is silently killed by the pre-hook:

Lines 364–369:

if kwargs.get("stream"):
    kwargs["stream"] = False
    kwargs["_websearch_interception_converted_stream"] = True

The flag _websearch_interception_converted_stream is set so that the streaming response can be reconstructed later.

Step 2 — _prepare_followup_kwargs strips the flag:

Lines 707–721:

return {
    k: v
    for k, v in kwargs.items()
    if not k.startswith("_websearch_interception") and k not in _internal_keys
}

Any key prefixed with _websearch_interception — including _websearch_interception_converted_stream — is stripped from kwargs before the follow-up call.

Step 3 — _execute_agentic_loop never passes stream to the follow-up call:

Lines 758–764:

return await anthropic_messages.acreate(
    max_tokens=max_tokens,
    messages=request_patch.messages,
    model=request_patch.model or model,
    **optional_params,
    **request_patch.kwargs,
)

stream is not passed here. In anthropic_messages, original_stream is computed as (line 193–195):

original_stream = stream or kwargs.get("_websearch_interception_converted_stream", False)

Since stream defaults to False and the flag has been stripped by Step 2, original_stream evaluates to False. The follow-up call always returns a non-streaming response, which is handed back to Claude Code expecting SSE chunks — causing the response to appear truncated.

Fix Action

Fix / Workaround

return await anthropic_messages.acreate(
    max_tokens=max_tokens,
    messages=request_patch.messages,
    model=request_patch.model or model,
    **optional_params,
    **request_patch.kwargs,
)

return await anthropic_messages.acreate( max_tokens=max_tokens, messages=request_patch.messages, model=request_patch.model or model, stream=original_stream, # <-- add this **optional_params, **request_patch.kwargs, )


v1.83.14-stable.patch.3

Code Example

if kwargs.get("stream"):
    kwargs["stream"] = False
    kwargs["_websearch_interception_converted_stream"] = True

---

return {
    k: v
    for k, v in kwargs.items()
    if not k.startswith("_websearch_interception") and k not in _internal_keys
}

---

return await anthropic_messages.acreate(
    max_tokens=max_tokens,
    messages=request_patch.messages,
    model=request_patch.model or model,
    **optional_params,
    **request_patch.kwargs,
)

---

original_stream = stream or kwargs.get("_websearch_interception_converted_stream", False)

---

original_stream = kwargs.get("_websearch_interception_converted_stream", False) or stream

return await anthropic_messages.acreate(
    max_tokens=max_tokens,
    messages=request_patch.messages,
    model=request_patch.model or model,
    stream=original_stream,   # <-- add this
    **optional_params,
    **request_patch.kwargs,
)
RAW_BUFFERClick to expand / collapse

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

When using websearch_interception with a search provider (e.g. Tavily) via Claude Code (which uses the /v1/messages endpoint) with stream=True, the tool call executes successfully, but the final LLM output is silently truncated mid-response with no error logs.

Expected behavior: The full LLM response is streamed back to Claude Code after the search results are injected.

Actual behavior: The response cuts off mid-generation. No error, no non-200 status, no log entry — it simply stops.

Root Cause

The bug lives in two cooperating places inside handler.py.

Step 1 — stream is silently killed by the pre-hook:

Lines 364–369:

if kwargs.get("stream"):
    kwargs["stream"] = False
    kwargs["_websearch_interception_converted_stream"] = True

The flag _websearch_interception_converted_stream is set so that the streaming response can be reconstructed later.

Step 2 — _prepare_followup_kwargs strips the flag:

Lines 707–721:

return {
    k: v
    for k, v in kwargs.items()
    if not k.startswith("_websearch_interception") and k not in _internal_keys
}

Any key prefixed with _websearch_interception — including _websearch_interception_converted_stream — is stripped from kwargs before the follow-up call.

Step 3 — _execute_agentic_loop never passes stream to the follow-up call:

Lines 758–764:

return await anthropic_messages.acreate(
    max_tokens=max_tokens,
    messages=request_patch.messages,
    model=request_patch.model or model,
    **optional_params,
    **request_patch.kwargs,
)

stream is not passed here. In anthropic_messages, original_stream is computed as (line 193–195):

original_stream = stream or kwargs.get("_websearch_interception_converted_stream", False)

Since stream defaults to False and the flag has been stripped by Step 2, original_stream evaluates to False. The follow-up call always returns a non-streaming response, which is handed back to Claude Code expecting SSE chunks — causing the response to appear truncated.

Suggested Fix

Pass the original stream intent explicitly to the follow-up anthropic_messages.acreate() call in _execute_agentic_loop:

original_stream = kwargs.get("_websearch_interception_converted_stream", False) or stream

return await anthropic_messages.acreate(
    max_tokens=max_tokens,
    messages=request_patch.messages,
    model=request_patch.model or model,
    stream=original_stream,   # <-- add this
    **optional_params,
    **request_patch.kwargs,
)

This ensures the follow-up call respects the original streaming intent from the client, before _prepare_followup_kwargs strips the flag.

Steps to Reproduce

  1. Configure LiteLLM proxy with websearch_interception and a search tool.
  2. Connect Claude Code to the LiteLLM proxy via ANTHROPIC_BASE_URL.
  3. Send a normal conversation message (no web search triggered) — confirm streaming works fine.
  4. Send a message that triggers a web search (e.g. "What are the latest AI news today?").
  5. Observe that the litellm_web_search tool call completes successfully, but the final LLM response is cut off mid-generation with no error.

Relevant log output

No response

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.83.14-stable.patch.3

Twitter / LinkedIn details

No response

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING