litellm - [Bug]: Mid-stream fallback request includes assistant prefill block, breaks for fallback targets that don't support `prefix=True` (Claude Sonnet 4.6 / Opus 4.7)

What Happened?

When a streaming request fails mid-stream (e.g. httpx.ReadTimeout, provider 5xx, network drop), Router.stream_with_fallbacks constructs the fallback request with an extra assistant message carrying prefix=True and the partially generated text — asking the fallback model to "continue from where the previous model left off."

This continuation prompt is unconditionally appended regardless of whether the fallback target supports assistant prefill. For models that do not support it — notably claude-sonnet-4-6 / claude-opus-4-7 on both Anthropic and Vertex AI — the fallback request is rejected with:

HTTP 400: "This model does not support assistant message prefill. The conversation must end with a user message."

Net effect: mid-stream fallback is effectively unusable for these models. On every mid-stream failure, the fallback target rejects the malformed request, the original error is swallowed, and the user sees a 400 about prefill instead of the real upstream failure (timeout, overload, etc.).

Expected behavior

One of:

  1. Capability-aware: if the fallback target deployment doesn't support assistant prefill, fall back with the original messages (no continuation block).
  2. Configurable: expose a router_settings.disable_mid_stream_continuation: bool (or per-deployment flag) so users can opt out globally / per model group. This is the ask in #18229.
  3. Catchable: at minimum, ensure MidStreamFallbackError propagates a reliable signal so external retry layers can re-issue with original messages without inspecting error message strings.
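
Option 3 can be illustrated with a minimal sketch of an external retry layer. Everything here is hypothetical: `MidStreamSignal` is a stand-in for whatever typed error the router would raise, and `send` is a placeholder for the caller's own request function, not a LiteLLM API. The sketch only shows that a typed signal lets the caller re-issue the original messages without parsing error strings.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]


class MidStreamSignal(Exception):
    """Hypothetical typed signal raised when a stream fails part-way through."""

    def __init__(self, generated_content: str) -> None:
        super().__init__("stream failed mid-response")
        self.generated_content = generated_content


def complete_with_retry(
    send: Callable[[List[Message]], str],
    messages: List[Message],
    max_retries: int = 1,
) -> str:
    """Call `send`; on the typed signal, re-issue the ORIGINAL messages.

    The partial output carried by the exception is deliberately discarded,
    so the retried conversation still ends with a user message.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(list(messages))
        except MidStreamSignal:
            # Decision is based on the exception type, not on its message text.
            if attempt == max_retries:
                raise
    raise RuntimeError("unreachable")
```

A retry layer like this only works if the signal is raised consistently for every mid-stream failure, which is the guarantee option 3 asks for.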

Why the existing escape hatches don't help

  • _handle_stream_fallback_error re-raises without the prefill fallback when the original exception carries a 4xx status_code (other than 429). In our case, the exception that reaches CustomStreamWrapper.__next__ is httpx.ReadTimeout, which has no status_code attribute, so the escape hatch never fires. This holds even though the exception we originally injected via stopit.async_raise was our own wrapper (a TimeoutError subclass with status_code = 408): httpx's map_httpcore_exceptions strips our type before litellm sees it.
  • disable_fallbacks=True is ignored mid-stream (#19077).
  • CustomLogger.async_post_call_failure_hook can transform the surfaced error but cannot prevent the malformed fallback request from being sent in the first place.
  • CustomLogger.async_pre_call_hook is proxy-path-only (#8842) and would require fragile inspection of messages to detect "this is a fallback retry, strip the prefill."
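
The first bullet can be made concrete with a small sketch of the status-code rule as described above. This is an illustration of the stated behavior, not a copy of LiteLLM's code: `skips_prefill_fallback` is a hypothetical name, and the two exception classes are stand-ins for `httpx.ReadTimeout` and our wrapper exception.

```python
from typing import Optional


def skips_prefill_fallback(exc: BaseException) -> bool:
    """Sketch of the described rule: a 4xx status (except 429) means re-raise."""
    status: Optional[int] = getattr(exc, "status_code", None)
    return status is not None and 400 <= status < 500 and status != 429


class ReadTimeoutStandIn(Exception):
    """Stand-in for httpx.ReadTimeout, which has no status_code attribute."""


class WrapperTimeout(TimeoutError):
    """Stand-in for the user-level wrapper exception with status_code = 408."""

    status_code = 408


print(skips_prefill_fallback(WrapperTimeout()))      # True
print(skips_prefill_fallback(ReadTimeoutStandIn()))  # False
```

The wrapper would take the escape hatch, but the exception that actually arrives has no status code, so the prefill branch runs.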

There is no documented hook that fires between mid-stream failure detection and fallback request construction. Per the docs and current source layout, the only working workaround is to monkey-patch MidStreamFallbackError.__init__ to force is_pre_first_chunk=True and generated_content="", which causes the fallback path to unconditionally take the no-prefill branch. We're shipping this in production, but it's clearly the wrong shape.
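
For reference, the monkey-patch looks roughly like the sketch below. To keep it self-contained, `MidStreamFallbackError` here is a stand-in class; the real constructor signature in LiteLLM may differ, and the sketch assumes the exception stores `generated_content` and `is_pre_first_chunk` as instance attributes under those names. `force_no_prefill` is a hypothetical helper name.

```python
import functools


class MidStreamFallbackError(Exception):
    """Stand-in for litellm's exception; the real signature may differ."""

    def __init__(
        self,
        message: str,
        generated_content: str = "",
        is_pre_first_chunk: bool = False,
    ) -> None:
        super().__init__(message)
        self.generated_content = generated_content
        self.is_pre_first_chunk = is_pre_first_chunk


def force_no_prefill(exc_cls: type) -> None:
    """Wrap exc_cls.__init__ so every instance looks like a pre-first-chunk failure."""
    original_init = exc_cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Overwrite after construction so positional and keyword callers
        # are both covered (assumes these attribute names).
        self.generated_content = ""
        self.is_pre_first_chunk = True

    exc_cls.__init__ = patched_init


force_no_prefill(MidStreamFallbackError)
err = MidStreamFallbackError("read timeout", generated_content="partial text")
print(err.is_pre_first_chunk, repr(err.generated_content))  # True ''
```

Against the real library, the same helper would be applied once at startup to the actual exception class. The patch discards the partial output for every deployment, including ones that could have continued from it.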

Steps to Reproduce

import asyncio

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "claude-primary",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "stream_timeout": 1,  # force mid-stream timeout
            },
        },
        {
            "model_name": "claude-fallback",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
            },
        },
    ],
    fallbacks=[{"claude-primary": ["claude-fallback"]}],
)


async def main() -> None:
    # A long streamed response, so the 1-second stream timeout hits mid-stream.
    resp = await router.acompletion(
        model="claude-primary",
        messages=[
            {"role": "user", "content": "Write a 2000-word essay about transformers."}
        ],
        stream=True,
    )

    async for chunk in resp:
        print(chunk)
    # Expected: stream resumes on claude-fallback, or surfaces the original timeout.
    # Actual:   HTTP 400 "This model does not support assistant message prefill."


asyncio.run(main())

The same reproduces against Vertex Anthropic models. It does not reproduce when the fallback target supports prefill (e.g. older Claude 3.x variants), because the malformed request is accepted. Even then, the forced assistant prefix can still corrupt the output and steer the model around its normal response behavior, which is also undesirable.

Relevant Log Output

litellm.exceptions.BadRequestError: litellm.BadRequestError: AnthropicException
  - {"type":"error","error":{"type":"invalid_request_error",
     "message":"This model does not support assistant message prefill.
                The conversation must end with a user message."}}
  Model: claude-sonnet-4-6
  API Base: https://api.anthropic.com
  messages: [
    {"role": "user", "content": "..."},
    {"role": "assistant", "prefix": true, "content": "<partial streamed tokens>"}
  ]

Component

LiteLLM Python SDK (Router)

LiteLLM Version

v1.83.0 (also reproduced on internal builds tracking main)

Related Issues

  • #18229 — Allow disabling or customizing mid-stream fallback continuation prompt (open, no fix yet)
  • #19077 — disable_fallbacks ignored during mid-stream fallback in streaming responses
  • #25492 — upstream BadRequestError gets wrapped as 503 MidStreamFallbackError
  • #22296 — streaming fallback broken for 429 / missing for sync streaming
  • #8842 — Router async completion doesn't trigger CustomLogger callbacks

Proposed Fix

Smallest viable surface area:

# litellm/router.py (Router.__init__ kwargs)
disable_mid_stream_continuation: bool = False  # global opt-out

# or per model-group via deployment metadata
{
    "model_name": "claude-fallback",
    "litellm_params": { ... },
    "model_info": {"supports_assistant_prefill": False},
}

In _handle_stream_fallback_error, gate the prefill branch on not disable_mid_stream_continuation and target_deployment.supports_assistant_prefill. When disabled, fall through to the existing "pre-first-chunk" branch that re-issues with original messages.
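
A minimal sketch of that gate, written as a standalone function so the branching is easy to see. `build_fallback_messages` and its parameters are hypothetical names for illustration. The real `_handle_stream_fallback_error` is structured differently, and the shape of the continuation message is taken from the log output above.

```python
from typing import Dict, List

Message = Dict[str, object]


def build_fallback_messages(
    original_messages: List[Message],
    generated_content: str,
    is_pre_first_chunk: bool,
    disable_mid_stream_continuation: bool,
    supports_assistant_prefill: bool,
) -> List[Message]:
    """Return the messages to send to the fallback deployment."""
    use_prefill = (
        not is_pre_first_chunk
        and bool(generated_content)
        and not disable_mid_stream_continuation
        and supports_assistant_prefill
    )
    if not use_prefill:
        # Same as the existing pre-first-chunk branch: original messages only,
        # so the conversation still ends with a user message.
        return list(original_messages)
    continuation: Message = {
        "role": "assistant",
        "content": generated_content,
        "prefix": True,
    }
    return [*original_messages, continuation]
```

With `supports_assistant_prefill=False`, or with the global flag set, the result is the original list, so targets like claude-sonnet-4-6 would never receive the prefill block.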

Happy to send a PR if the maintainers can confirm shape preference (global flag vs per-deployment capability flag vs both).
