litellm - 💡(How to fix) Fix [Bug]: "Response with id '{response_id}' not found" in /ui/chat with vllm [2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
BerriAI/litellm#26147Fetched 2026-04-22 07:46:22
View on GitHub
Comments
2
Participants
2
Timeline
6
Reactions
0
Author
Timeline (top)
labeled ×3commented ×2renamed ×1

Error Message

Error occurred while generating model response. Please try again. Error: Error: 404 litellm.BadRequestError: Hosted_vllmException - {"error":{"message":"Response with id 'resp_998bca5d44e22037' not found.","type":"invalid_request_error","param":"response_id","code":404}}. Received Model Group=Qwen/Qwen3.6-35B-A3B-FP8 Available Model Group Fallbacks=None

Code Example

Error occurred while generating model response. Please try again.
Error: Error: 404 litellm.BadRequestError: Hosted_vllmException -
{"error":{"message":"Response with id 'resp_998bca5d44e22037' not found.","type":"invalid_request_error","param":"response_id","code":404}}.
Received Model Group=Qwen/Qwen3.6-35B-A3B-FP8
Available Model Group Fallbacks=None

---

Qwen/Qwen3.6-35B-A3B-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144

---

vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144

---

Error occurred while generating model response. Please try again.
Error: Error: 404 litellm.BadRequestError: Hosted_vllmException -
{"error":{"message":"Response with id 'resp_998bca5d44e22037' not found.","type":"invalid_request_error","param":"response_id","code":404}}.
Received Model Group=Qwen/Qwen3.6-35B-A3B-FP8
Available Model Group Fallbacks=None

---
RAW_BUFFERClick to expand / collapse

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

Issue summary

When using a hosted_vllm model backed by vLLM, continuous chat works correctly in the LiteLLM Playground, but fails in /ui/chat starting from the second user message.

The first message succeeds, but on the second turn /ui/chat returns this error:

Error occurred while generating model response. Please try again.
Error: Error: 404 litellm.BadRequestError: Hosted_vllmException -
{"error":{"message":"Response with id 'resp_998bca5d44e22037' not found.","type":"invalid_request_error","param":"response_id","code":404}}.
Received Model Group=Qwen/Qwen3.6-35B-A3B-FP8
Available Model Group Fallbacks=None

Observed behavior

  • LiteLLM Playground: multi-turn chat works
  • /ui/chat: first message works, second and later messages fail with response_id not found

vLLM startup parameters

Qwen/Qwen3.6-35B-A3B-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144

Suspected issue It looks like /ui/chat may be using the Responses API flow and attempting to reuse a response_id, but the backend vLLM response cannot be found on the next turn. This only seems to happen in /ui/chat, not in the Playground.

litellm Playground (without problem) <img width="1670" height="903" alt="Image" src="https://github.com/user-attachments/assets/dd0a5466-0991-445e-9f4a-5e4acfb65d93" />

litellm ui chat

<img width="1244" height="410" alt="Image" src="https://github.com/user-attachments/assets/000a0446-d74a-44dc-bac3-d8d1b8421f9a" />

Steps to Reproduce

  1. Start a vLLM server with the following model and parameters:
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --reasoning-parser qwen3 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --gpu-memory-utilization 0.95 \
  --max-model-len 262144
  1. Configure this model in LiteLLM as a hosted_vllm backend.
  2. Open LiteLLM /ui/chat
  3. Select the Qwen/Qwen3.6-35B-A3B-FP8 model.
  4. Send the first user message. Result: the first response is generated successfully.
  5. Send a second follow-up message in the same chat session.
  6. Observe that the request fails with:
Error occurred while generating model response. Please try again.
Error: Error: 404 litellm.BadRequestError: Hosted_vllmException -
{"error":{"message":"Response with id 'resp_998bca5d44e22037' not found.","type":"invalid_request_error","param":"response_id","code":404}}.
Received Model Group=Qwen/Qwen3.6-35B-A3B-FP8
Available Model Group Fallbacks=None

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

v1.82.6

Twitter / LinkedIn details

No response

extent analysis

TL;DR

The issue can likely be resolved by modifying the /ui/chat implementation to not reuse response_ids or by ensuring that the backend vLLM stores and retrieves responses correctly.

Guidance

  • Investigate the Responses API flow in /ui/chat to determine why it's attempting to reuse a response_id that cannot be found by the backend vLLM.
  • Verify that the vLLM backend is correctly storing and retrieving responses for multi-turn conversations.
  • Check the LiteLLM configuration to ensure that the hosted_vllm backend is properly set up to handle continuous chat sessions.
  • Consider adding logging or debugging statements to the /ui/chat code to track the response_id generation and usage.

Example

No code snippet is provided as the issue does not contain sufficient information to create a specific example.

Notes

The issue seems to be specific to the /ui/chat implementation and the interaction with the hosted_vllm backend. The fact that the LiteLLM Playground works correctly suggests that the issue is not with the vLLM model itself, but rather with how /ui/chat is using it.

Recommendation

Apply a workaround to modify the /ui/chat implementation to not reuse response_ids or ensure correct response storage and retrieval by the backend vLLM, as the root cause of the issue appears to be related to this aspect of the code.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING