[Bug]: `previous_response_id` drops function_call/function_call_output from stored context in Responses API

vllm · 2026-05-20 20:11:23

Root Cause

The bug is in vllm/entrypoints/openai/responses/utils.py:construct_input_messages() (lines 101–112). When reconstructing conversation history from a stored response, the function only iterates over ResponseOutputMessage items from prev_response_output, skipping ResponseFunctionToolCall items entirely. This means assistant tool calls and their corresponding function_call_output inputs are never included in the reconstructed message list.
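To illustrate the root cause, here is a minimal sketch of a history reconstruction that keeps tool calls. It is not the vLLM source: `OutputMessage`, `FunctionToolCall`, and `rebuild_history` are hypothetical stand-ins, and the chat-style message layout is an assumption about what the downstream chat template expects.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical stand-ins for the stored output item types; the real
# classes carry more fields than shown here.
@dataclass
class OutputMessage:
    text: str


@dataclass
class FunctionToolCall:
    call_id: str
    name: str
    arguments: str


OutputItem = Union[OutputMessage, FunctionToolCall]


def rebuild_history(prev_msgs: list[dict],
                    prev_output: list[OutputItem],
                    new_input: list[dict]) -> list[dict]:
    """Rebuild messages from prior messages, prior output, and new input."""
    messages = list(prev_msgs)
    for item in prev_output:
        if isinstance(item, OutputMessage):
            messages.append({"role": "assistant", "content": item.text})
        elif isinstance(item, FunctionToolCall):
            # The branch the root cause describes as missing: without it,
            # the assistant's tool call never reaches the next prompt.
            messages.append({
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": item.call_id,
                    "type": "function",
                    "function": {"name": item.name,
                                 "arguments": item.arguments},
                }],
            })
    for item in new_input:
        if item.get("type") == "function_call_output":
            # Client-submitted tool result, mapped to a tool-role message.
            messages.append({"role": "tool",
                             "tool_call_id": item["call_id"],
                             "content": item["output"]})
        else:
            messages.append(item)
    return messages


history = rebuild_history(
    [{"role": "user", "content": "What is the weather in San Francisco?"}],
    [FunctionToolCall("call-1", "get_weather", '{"city": "San Francisco"}')],
    [{"type": "function_call_output", "call_id": "call-1",
      "output": '{"temp": "65F", "condition": "foggy"}'}],
)
print([m["role"] for m in history])  # ['user', 'assistant', 'tool']
```

A real fix would also need to persist the reconstructed messages so that a third turn still sees the tool exchange from the first.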

Fix / Workaround

Workaround: Maintain msg_history client-side and pass the full array as input on every request instead of using previous_response_id.

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-1084-aws-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 2: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 3: NVIDIA RTX PRO 6000 Blackwell Server Edition

Nvidia driver version        : 570.133.07
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
CPU(s):                               96
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8559C
Hypervisor vendor:                    KVM
Virtualization type:                  full
NUMA node(s):                         1

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] torch==2.11.0+cu130
[pip3] transformers==5.8.1
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
vLLM Version                 : 0.21.0
vLLM Build Flags:
  CUDA Archs: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
  GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X   PIX  NODE  NODE  0-95      0        N/A
GPU1  PIX   X   NODE  NODE  0-95      0        N/A
GPU2  NODE  NODE   X   PIX  0-95      0        N/A
GPU3  NODE  NODE  PIX   X   0-95      0        N/A

</details>

🐛 Describe the bug

When using previous_response_id to chain multi-turn conversations that include tool calls, the server-side stored context only carries forward text messages — function_call and function_call_output messages from earlier turns are dropped. The model in follow-up turns has no memory of what tools were called or what data they returned.

Server start command:

docker run -d \
  --name vllm-gemma4 \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e VLLM_ENABLE_RESPONSES_API_STORE=1 \
  vllm/vllm-openai:v0.21.0 \
  --model google/gemma-4-26B-A4B-it \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 14336 \
  --tool-call-parser functiongemma \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4

Reproduction script:

import requests, json

BASE = "http://localhost:8000/v1"
MODEL = "google/gemma-4-26B-A4B-it"

tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

# Turn 1: model calls a tool
r1 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": tools,
    "tool_choice": "required",
    "store": True,
})
d1 = r1.json()
resp_id_1 = d1["id"]
fc = [item for item in d1["output"] if item["type"] == "function_call"][0]
call_id = fc["call_id"]
print(f"Turn 1: tool call {fc['name']}({fc['arguments']}), call_id={call_id}")

# Turn 2: Submit tool result via previous_response_id
r2 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [
        {"type": "function_call_output", "call_id": call_id,
         "output": json.dumps({"temp": "65F", "condition": "foggy"})},
    ],
    "previous_response_id": resp_id_1,
    "store": True,
})
d2 = r2.json()
resp_id_2 = d2["id"]
text2 = ""
for item in d2.get("output", []):
    if item.get("content"):
        for c in item["content"]:
            text2 += c.get("text", "")
print(f"Turn 2: model response = {text2[:200]}")

# Turn 3: Follow-up — should recall tool result
r3 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [{"role": "user", "content": "What temperature did you find and what was the condition?"}],
    "previous_response_id": resp_id_2,
    "store": True,
})
d3 = r3.json()
text3 = ""
for item in d3.get("output", []):
    if item.get("content"):
        for c in item["content"]:
            text3 += c.get("text", "")
print(f"Turn 3: {text3[:300]}")

if "65" in text3 or "foggy" in text3:
    print("PASS: Model recalls tool results")
else:
    print("FAIL: Model does not recall tool results from prior turns")

Observed output:

Turn 1: tool call get_weather({"city": "San Francisco"}), call_id=chatcmpl-tool-adcc1c53fcdf531c
Turn 2: model response = I do not have access to real-time weather data or a search engine...
Turn 3: I did not find any temperature or weather conditions because I do not have access to real-time information...
FAIL: Model does not recall tool results from prior turns

Expected behavior: When previous_response_id chains onto a response that involved tool calls, the server should reconstruct the full conversation, including the model-emitted function_call item and the client-submitted function_call_output, so the model retains awareness of tool results.

Actual behavior: Only text-based messages are retained. Tool call/output pairs are dropped from the stored context.
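To make the expected context concrete, the sketch below lists the items the server would ideally see at the start of Turn 3 and checks that every function_call has a matching function_call_output. The helper `unmatched_tool_items` and the placeholder assistant text are illustrative assumptions, not part of vLLM or of the original report.

```python
def unmatched_tool_items(items: list[dict]) -> dict[str, set[str]]:
    """Report call_ids that appear on only one side of a tool exchange."""
    calls = {i["call_id"] for i in items if i.get("type") == "function_call"}
    outputs = {i["call_id"] for i in items
               if i.get("type") == "function_call_output"}
    return {
        "calls_without_output": calls - outputs,
        "outputs_without_call": outputs - calls,
    }


CALL_ID = "chatcmpl-tool-adcc1c53fcdf531c"  # call_id from the observed output

# Expected context for Turn 3 (the assistant text is a placeholder).
expected_context = [
    {"role": "user", "content": "What is the weather in San Francisco?"},
    {"type": "function_call", "call_id": CALL_ID, "name": "get_weather",
     "arguments": '{"city": "San Francisco"}'},
    {"type": "function_call_output", "call_id": CALL_ID,
     "output": '{"temp": "65F", "condition": "foggy"}'},
    {"role": "assistant", "content": "(assistant answer from Turn 2)"},
    {"role": "user",
     "content": "What temperature did you find and what was the condition?"},
]

# Context as described under "Actual behavior": tool items dropped.
actual_context = [i for i in expected_context if "type" not in i]

print(unmatched_tool_items(expected_context))
print(len(expected_context) - len(actual_context), "tool items missing")
```

The same check can be run against any reconstructed item list to spot a function_call_output whose originating call has been lost.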

Impact: This makes previous_response_id unusable for multi-turn agentic workflows. Clients must manually maintain full message history (including all function_call and function_call_output entries) and re-send it each turn, negating the value of server-side state management and causing linear input token growth.

Workaround: Maintain msg_history client-side and pass the full array as input on every request instead of using previous_response_id.
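A minimal sketch of this workaround follows. `ClientSideHistory` is a hypothetical helper, and `post` is an injected callable so the sketch runs without a server. It assumes the server accepts its own output items (function_call, message) when they are sent back as input, which is the documented pattern for the OpenAI Responses API but should be verified against your vLLM version; reasoning items in particular may need filtering.

```python
from typing import Callable


class ClientSideHistory:
    """Keep the full item history locally and resend it on every turn."""

    def __init__(self, post: Callable[[dict], dict],
                 model: str, tools: list[dict]):
        self.post = post          # sends a payload, returns the JSON body
        self.model = model
        self.tools = tools
        self.msg_history: list[dict] = []

    def send(self, new_items: list[dict]) -> dict:
        # Append the new user message or function_call_output first.
        self.msg_history.extend(new_items)
        payload = {
            "model": self.model,
            "input": list(self.msg_history),  # full history, no previous_response_id
            "tools": self.tools,
        }
        response = self.post(payload)
        # Remember the model's output items so later turns include them.
        self.msg_history.extend(response.get("output", []))
        return response
```

With the reproduction script's setup, `post` could be `lambda p: requests.post(f"{BASE}/responses", json=p).json()`; each turn then calls `send(...)` with only the new items, and the helper supplies the rest.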

Root Cause Analysis

Related Issues / PRs

#33017 — Responses API tool calling support
#33089 — Tool call parsing in Responses API
#37697 — previous_response_id state management
#26934 — Multi-turn conversation context
#42189 — Response message merging fix (partially addresses this in v0.21.1rc0+)

Affected Components

vllm/entrypoints/openai/responses/utils.py — construct_input_messages()
vllm/entrypoints/openai/responses/response_store.py — response storage
Responses API (/v1/responses) — tool calling + previous_response_id
