litellm - 💡(How to fix) Fix [Bug]: Anthropic /v1/messages → hosted_vllm silently drops assistant-message prefill (continue_final_message never reaches vLLM)



What happened

When a request comes in on /v1/messages (Anthropic shape) whose last message is role:"assistant" — i.e. the Anthropic-spec prefill semantic — and is routed to a hosted_vllm/* model, vLLM never receives continue_final_message: true. The model treats the prefill as a completed prior turn and starts a fresh response instead of continuing.

This silently breaks the Anthropic prefill contract for the entire self-hosted ecosystem behind LiteLLM. Claude Code (and any other Anthropic-SDK client) cannot use prefill against a vLLM backend through LiteLLM, even though it works fine when calling vLLM directly via /v1/chat/completions with continue_final_message: true.

Verified on LiteLLM v1.82.6, vLLM serving qwen3.6-35b-a3b.

Repro

A — broken (Anthropic shape, current behavior)

curl -sS -X POST "https://<proxy>/v1/messages" \
  -H "x-api-key: $KEY" -H "anthropic-version: 2023-06-01" -H "content-type: application/json" \
  -d '{"model":"qwen3-coder","max_tokens":50,"system":"You are precise.",
       "messages":[{"role":"user","content":"Count to three, just the numbers comma-separated."},
                   {"role":"assistant","content":"1, 2,"}]}'

Response: content[0].text == "1, 2, 3" (a fresh turn; the prefill is discarded), output_tokens=8.

B — works (OpenAI shape, same backend)

curl -sS -X POST "https://<proxy>/v1/chat/completions" \
  -H "Authorization: Bearer $KEY" -H "content-type: application/json" \
  -d '{"model":"qwen3-coder","max_tokens":50,
       "messages":[{"role":"system","content":"You are precise."},
                   {"role":"user","content":"Count to three, just the numbers comma-separated."},
                   {"role":"assistant","content":"1, 2,"}],
       "continue_final_message":true,"add_generation_prompt":false}'

Response content: ' 3' (a proper continuation), output_tokens=3.

Also tried (all 400/no-op via /v1/messages)

  • {"role":"assistant","content":"1, 2,","prefix":true} (the unified-API marker from #4881)
  • "continue_final_message":true at top level
  • "extra_body":{"continue_final_message":true,"add_generation_prompt":false}

None reach vLLM.
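The difference between repro A and repro B can be checked mechanically. The sketch below is illustrative only: `build_anthropic_body`, `build_openai_body`, and `prefill_was_discarded` are hypothetical helper names, and the heuristic assumes that a response which restates the prefill means the backend started a fresh turn.

```python
PREFILL = "1, 2,"
PROMPT = "Count to three, just the numbers comma-separated."


def build_anthropic_body(model: str = "qwen3-coder") -> dict:
    """Body for repro A (/v1/messages): the trailing assistant turn is the prefill."""
    return {
        "model": model,
        "max_tokens": 50,
        "system": "You are precise.",
        "messages": [
            {"role": "user", "content": PROMPT},
            {"role": "assistant", "content": PREFILL},
        ],
    }


def build_openai_body(model: str = "qwen3-coder") -> dict:
    """Body for repro B (/v1/chat/completions): same turns plus vLLM's explicit flags."""
    return {
        "model": model,
        "max_tokens": 50,
        "messages": [
            {"role": "system", "content": "You are precise."},
            {"role": "user", "content": PROMPT},
            {"role": "assistant", "content": PREFILL},
        ],
        "continue_final_message": True,
        "add_generation_prompt": False,
    }


def prefill_was_discarded(prefill: str, returned_text: str) -> bool:
    """Heuristic: a continuation returns only new text, so a restated prefill means a fresh turn."""
    return returned_text.strip().startswith(prefill.strip())
```

Applied to the two responses reported above, the check flags "1, 2, 3" as a discarded prefill and ' 3' as a continuation.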

Root cause (traced)

I traced the call path inside the running litellm container with a debug callback:

  1. Pre-call hooks fire correctly. async_pre_call_hook(call_type='anthropic_messages') receives the data dict with messages[-1].role == 'assistant'. Mutations to data propagate forward.
  2. litellm.acompletion receives the mutated kwargs — confirmed continue_final_message=True AND extra_body.continue_final_message=True are present:
    acompletion: model=qwen3.6-35b-a3b last_role=assistant
    has_cfm=True eb_has_cfm=True
    keys=[..., 'continue_final_message', 'extra_body', ...]
  3. Lost between acompletion and the openai SDK boundary. The openai-compat path at litellm.llms.openai.openai.OpenAIChatCompletion.make_openai_chat_completion_request does not forward extra_body into the openai SDK call's data dict for hosted_vllm. By the time vLLM gets the HTTP body, continue_final_message is gone.
  4. Provider config gap. HostedVLLMChatConfig.get_supported_openai_params doesn't include continue_final_message or add_generation_prompt, so under drop_params: true they get stripped by map_openai_params even when they DO reach that layer. prefix:true handling exists ONLY in litellm/llms/anthropic/chat/transformation.py::get_prefix_prompt — there is no equivalent on the hosted_vllm side. continue_final_message appears zero times anywhere in the LiteLLM source tree.
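The effect described in step 4 can be modeled with a toy filter. This is a simplified stand-in, not LiteLLM's actual map_openai_params implementation: `filter_params_sketch` and the parameter lists are hypothetical, and the only assumption is that parameters missing from the supported list are dropped when drop_params is enabled.

```python
from typing import Any, Dict, List


def filter_params_sketch(
    requested: Dict[str, Any], supported: List[str], drop_params: bool
) -> Dict[str, Any]:
    """Keep supported params; drop or reject the rest depending on drop_params."""
    kept: Dict[str, Any] = {}
    for key, value in requested.items():
        if key in supported:
            kept[key] = value
        elif not drop_params:
            # Without drop_params, an unsupported param is an error, not a silent drop.
            raise ValueError(f"unsupported param: {key}")
    return kept


requested = {"max_tokens": 50, "continue_final_message": True}

# Hypothetical supported list that lacks the vLLM continuation flags.
before = filter_params_sketch(requested, ["max_tokens"], drop_params=True)

# The same list after whitelisting the flag.
after = filter_params_sketch(
    requested, ["max_tokens", "continue_final_message"], drop_params=True
)
```

In this model `before` silently loses `continue_final_message` while `after` keeps it, which matches the symptom reported above.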

Related

  • #4881 (closed 2024-08-10) — defines the unified prefill API with prefix:true on the trailing assistant message. Implemented for Anthropic only. The "unified" part of the unified API isn't actually unified.
  • #27967 (open 2026-05-14) — LiteLLM's own internal Router uses prefix:true for mid-stream fallback recovery, and complains that it's silently broken on providers that don't honor it natively. Same gap from the other direction.

Proposed fix

Two small changes that make Anthropic-prefill portable to any openai-compat self-hosted backend, with vLLM as the first concrete recipient:

1. litellm/llms/hosted_vllm/chat/transformation.py

Add an override that detects the prefill pattern and injects vLLM's continue_final_message / add_generation_prompt into extra_body:

def transform_request(self, model, messages, optional_params, litellm_params, headers):
    # Translate LiteLLM's prefill marker into vLLM's native continuation flags.
    # This fires only when the trailing assistant message carries `prefix: true`:
    # either the caller set it explicitly (the unified marker from #4881), or the
    # anthropic_messages adapter auto-stamped it (see step 2 below).
    last = messages[-1] if messages else None
    if isinstance(last, dict) and last.get("role") == "assistant" and last.get("prefix") is True:
        # `or {}` also covers an explicit extra_body=None.
        eb = optional_params.get("extra_body") or {}
        # setdefault keeps any value the caller already supplied.
        eb.setdefault("continue_final_message", True)
        eb.setdefault("add_generation_prompt", False)
        optional_params["extra_body"] = eb
    return super().transform_request(model, messages, optional_params, litellm_params, headers)

Also extend get_supported_openai_params to whitelist continue_final_message and add_generation_prompt so they survive map_openai_params under drop_params: true.
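A minimal sketch of that whitelist extension, under stated assumptions: `ParentConfigStandIn` is a hypothetical stand-in for whatever class HostedVLLMChatConfig inherits from, its returned list is made up for illustration, and the `(self, model)` signature is assumed rather than taken from the source.

```python
from typing import List


class ParentConfigStandIn:
    """Hypothetical parent; the real list of supported params will differ."""

    def get_supported_openai_params(self, model: str) -> List[str]:
        return ["temperature", "max_tokens", "stream"]


class HostedVLLMConfigSketch(ParentConfigStandIn):
    # vLLM-specific chat-template flags that should survive parameter filtering.
    VLLM_PREFILL_PARAMS = ("continue_final_message", "add_generation_prompt")

    def get_supported_openai_params(self, model: str) -> List[str]:
        # Copy the parent's list so the parent's state is never mutated.
        params = list(super().get_supported_openai_params(model))
        for name in self.VLLM_PREFILL_PARAMS:
            if name not in params:
                params.append(name)
        return params


supported = HostedVLLMConfigSketch().get_supported_openai_params("qwen3-coder")
```

Appending to a copy of the parent's list keeps the change additive, so parameters the parent already supports are unaffected.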

2. litellm/llms/anthropic/experimental_pass_through/adapters/handler.py

In _prepare_completion_kwargs (or earlier in async_anthropic_messages_handler), when the request is being routed to a non-Anthropic provider and the trailing message is role:"assistant", auto-stamp prefix:true so step 1's translator fires:

# Anthropic prefill spec is part of the /v1/messages contract — make it
# portable by tagging trailing-assistant messages with prefix:true before
# they reach the non-Anthropic provider config.
if messages and isinstance(messages[-1], dict) and messages[-1].get("role") == "assistant":
    messages[-1].setdefault("prefix", True)
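Wrapped as a standalone helper, the auto-stamp might look like the sketch below. `stamp_prefill` is a hypothetical name, and passing the provider as a `custom_llm_provider` string is an assumption about what the handler has available at that point.

```python
from typing import Any, Dict, List


def stamp_prefill(
    messages: List[Dict[str, Any]], custom_llm_provider: str
) -> List[Dict[str, Any]]:
    """Tag a trailing assistant message with prefix=True for non-Anthropic providers."""
    if custom_llm_provider == "anthropic":
        # Anthropic handles prefill natively; leave the messages alone.
        return messages
    if messages and isinstance(messages[-1], dict) and messages[-1].get("role") == "assistant":
        # setdefault respects an explicit prefix=False from the caller.
        messages[-1].setdefault("prefix", True)
    return messages


stamped = stamp_prefill(
    [
        {"role": "user", "content": "Count to three."},
        {"role": "assistant", "content": "1, 2,"},
    ],
    custom_llm_provider="hosted_vllm",
)
```

Using `setdefault` rather than plain assignment leaves room for a caller to opt out by sending `"prefix": false` explicitly.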

3. Tests

tests/local_testing/test_hosted_vllm.py (or wherever the existing hosted_vllm tests live): verify a request with a trailing assistant message produces an outgoing body with extra_body.continue_final_message == True. Use the existing httpx-mock pattern in the test suite.

Workaround for impacted users (until merged)

Pending a fix, I'm running a user-space CustomLogger callback that monkey-patches OpenAIChatCompletion.make_openai_chat_completion_request at import time to inject extra_body.continue_final_message when data["messages"][-1].role == "assistant". ~100 lines, works on v1.82.6, but obviously brittle to internal API changes. Happy to share if useful as a reference for the PR.
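For readers who want the general shape of such a patch, here is a minimal sketch against a stand-in class. It is not the author's ~100-line callback. `ChatCompletionStandIn` and `with_prefill_flags` are hypothetical, and the sketch assumes the patched method is async and receives the request payload as a `data` keyword argument; check both against the LiteLLM version in use.

```python
import asyncio
import functools


def with_prefill_flags(original):
    """Wrap an async request method so a trailing assistant message gets vLLM's flags."""

    @functools.wraps(original)
    async def wrapper(self, *args, **kwargs):
        data = kwargs.get("data")
        if isinstance(data, dict):
            messages = data.get("messages") or []
            last = messages[-1] if messages else None
            if isinstance(last, dict) and last.get("role") == "assistant":
                extra_body = data.get("extra_body") or {}
                # setdefault keeps anything the caller already set.
                extra_body.setdefault("continue_final_message", True)
                extra_body.setdefault("add_generation_prompt", False)
                data["extra_body"] = extra_body
        return await original(self, *args, **kwargs)

    return wrapper


class ChatCompletionStandIn:
    """Stand-in for the real client class; returns the payload it would send."""

    async def make_openai_chat_completion_request(self, *, data, timeout=None):
        return data


# Apply the patch once, at import time.
ChatCompletionStandIn.make_openai_chat_completion_request = with_prefill_flags(
    ChatCompletionStandIn.make_openai_chat_completion_request
)

sent = asyncio.run(
    ChatCompletionStandIn().make_openai_chat_completion_request(
        data={"messages": [{"role": "assistant", "content": "1, 2,"}]}
    )
)
```

As noted above, a patch like this depends on internal method names and call conventions, so it should be pinned to a specific LiteLLM version.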

Why this is worth fixing

The Anthropic prefill semantic is foundational for tool-use and structured-output workflows. The current behavior means every Anthropic-SDK client (Claude Code, Anthropic's Python SDK, etc.) silently loses prefill capability the moment LiteLLM is in front of vLLM — even though vLLM itself supports it perfectly. That's a substantial coverage gap in LiteLLM's value proposition as the Anthropic-compatible gateway for the self-hosted LLM ecosystem.

Happy to open a draft PR — wanted to surface the analysis first.
