vllm - ✅(Solved) Fix [Bug]: openai v1/responses api instructions from prior response leak through previous_response_id [2 pull requests, 1 comments, 2 participants]

lukezTT · 2026-03-20T15:43:19Z

[vllm] When using the Responses API with previous response id , the instructions from the prior response are carried over into the new response, even when the… When using the Responses API with `previous_response_id`, the `instructions` from the prior response are carried over into the new response, even when the follow-up request provides different (or no) instructions. Per the [OpenAI Responses API spec](https://platform.openai.com/docs/api-reference/responses/create): > "When using along with previous_response_id, the instructions from a previous response will not be carried over to the next response." # PR #2433: tests for vllm server on openAI /v1/responses endpoint - Repository: tenstorrent/tt-inference-server - Author: lukezTT - State: open | merged: False - Link: https://github.com/tenstorrent/tt-inference-server/pull/2433 ## Description (problem / solution / changelog) ### Description This is a smoke screen test for making sure vllm server correctly accepts and handles each parameter defined in the openai v1/responses endpoint. The following parameters are tested: - background - include - input - instructions - max_output_tokens - max_tool_calls - metadata - model - parallel_tool_calls - previous_response_id - prompt - reasoning - service_tier - store - stream - temperature - text - tools - top_p - truncation - user ### Flags Note the following parameters have ongoing issues: - top_logprobs is not supported at all with this endpoint https://github.com/vllm-project/vllm/issues/34417 - tool_choice = "none" or "required" is not supported https://github.com/vllm-project/vllm/issues/33966 - vLLM does not strip prior instructions when using previous_response_id https://github.com/vllm-project/vllm/issues/37697 - for test_prompt it looks like vLLM doesn't support this in /v1/responses - gpt-oss does not support parallel tool calls https://huggingface.co/openai/gpt-oss-120b/discussions/151 ### Reproduce Docker command: docker run -d --name vllm-server --runtime nvidia --gpus all \ -v /home/lzhang/models:/root/.cache/huggingface \ --env "HUGGING_FACE_HUB_TOKEN= " \ --env VLLM_ENABLE_RESPONSES_API_STORE=1 \ --env VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS=1 \ -p 8000:8000 --ipc=host \ vllm/vllm-openai:latest \ --model openai/gpt-oss-20b \ --served-model-name openai/gpt-oss-20b \ --gpu-memory-utilization 0.95 \ --dtype bfloat16 \ --tensor-parallel-size 1 \ --tool-call-parser openai \ --enable-auto-tool-choice ## Changed files - `tests/run_tests.py` (modified, +2/-0) - `tests/server_tests/conftest.py` (modified, +33/-13) - `tests/server_tests/test_cases/test_vllm_chat_completion.py` (renamed, +0/-0) - `tests/server_tests/test_cases/test_vllm_responses.py` (added, +730/-0) - `tests/test_config.py` (modified, +24/-6) --- # PR #37727: [Bugfix] Fix Responses API instructions leaking through previous_response_id - Repository: vllm-project/vllm - Author: he-yufeng - State: open | merged: False - Link: https://github.com/vllm-project/vllm/pull/37727 ## Description (problem / solution / changelog) Fixes #37697 ## What's the problem When using `/v1/responses` with `previous_response_id`, the `instructions` from the prior response carry over into the new response. Per the OpenAI spec, instructions should NOT carry over: > "When using along with previous_response_id, the instructions from a previous response will not be carried over to the next response." ## Root cause `construct_input_messages()` in `responses/utils.py` prepends `request_instructions` as a system message, then the full messages list (including that system message) gets stored in `msg_store`. When the next request references `previous_response_id`, those stored messages — old system message included — are retrieved and extended into the new conversation. The new request also adds its own instructions, so you end up with both old and new system messages. ## Fix Filter out system messages when pulling `prev_msg` from the store in `construct_input_messages()`. One-line change: `messages.extend(prev_msg)` becomes `messages.extend(m for m in prev_msg if m.get("role") != "system")`. This ensures each request only uses its own `instructions`, regardless of what the previous response had. Works correctly for all cases: new instructions provided, no instructions provided, or no previous response at all. ## Test plan - Added 4 unit tests in `tests/entrypoints/openai/responses/test_responses_utils.py` covering: - Old system message stripped when new instructions provided - Old system message stripped when no instructions provided - Non-system messages (user/assistant) preserved correctly - Baseline: no previous messages works as before ## Changed files - `tests/entrypoints/openai/responses/test_responses_utils.py` (modified, +69/-0) - `vllm/entrypoints/openai/responses/utils.py` (modified, +4/-2) ## Fixed - Fixed by PR: tests for vllm server on openAI /v1/responses endpoint (https://github.com/tenstorrent/tt-inference-server/pull/2433) - Fixed by PR: [Bugfix

vllm2026-03-20 15:43:19

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#37697•Fetched 2026-04-08 01:08:50

View on GitHub

Comments

Participants

Timeline

Reactions

Author

lukezTT

Participants

he-yufeng

lukezTT

Timeline (top)

cross-referenced ×2commented ×1labeled ×1

When using the Responses API with previous_response_id, the instructions from the prior response are carried over into the new response, even when the follow-up request provides different (or no) instructions.

Per the OpenAI Responses API spec:

"When using along with previous_response_id, the instructions from a previous response will not be carried over to the next response."

Root Cause

Per the OpenAI Responses API spec:

"When using along with previous_response_id, the instructions from a previous response will not be carried over to the next response."

Fix Action

Fixed

Fixed by PR: tests for vllm server on openAI /v1/responses endpoint (https://github.com/tenstorrent/tt-inference-server/pull/2433)
Fixed by PR: [Bugfix] Fix Responses API instructions leaking through previous_response_id (https://github.com/vllm-project/vllm/pull/37727)

PR fix notes

PR #2433: tests for vllm server on openAI /v1/responses endpoint

Repository: tenstorrent/tt-inference-server
Author: lukezTT
State: open | merged: False
Link: https://github.com/tenstorrent/tt-inference-server/pull/2433

Description (problem / solution / changelog)

Description

This is a smoke screen test for making sure vllm server correctly accepts and handles each parameter defined in the openai v1/responses endpoint. The following parameters are tested:

background
include
input
instructions
max_output_tokens
max_tool_calls
metadata
model
parallel_tool_calls
previous_response_id
prompt
reasoning
service_tier
store
stream
temperature
text
tools
top_p
truncation
user

Flags

Note the following parameters have ongoing issues:

top_logprobs is not supported at all with this endpoint https://github.com/vllm-project/vllm/issues/34417
tool_choice = "none" or "required" is not supported https://github.com/vllm-project/vllm/issues/33966
vLLM does not strip prior instructions when using previous_response_id https://github.com/vllm-project/vllm/issues/37697
for test_prompt it looks like vLLM doesn't support this in /v1/responses
gpt-oss does not support parallel tool calls https://huggingface.co/openai/gpt-oss-120b/discussions/151

Reproduce

Docker command: docker run -d --name vllm-server --runtime nvidia --gpus all
-v /home/lzhang/models:/root/.cache/huggingface
--env "HUGGING_FACE_HUB_TOKEN=<hf_token>"
--env VLLM_ENABLE_RESPONSES_API_STORE=1
--env VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS=1
-p 8000:8000 --ipc=host
vllm/vllm-openai:latest
--model openai/gpt-oss-20b
--served-model-name openai/gpt-oss-20b
--gpu-memory-utilization 0.95
--dtype bfloat16
--tensor-parallel-size 1
--tool-call-parser openai
--enable-auto-tool-choice

Changed files

tests/run_tests.py (modified, +2/-0)
tests/server_tests/conftest.py (modified, +33/-13)
tests/server_tests/test_cases/test_vllm_chat_completion.py (renamed, +0/-0)
tests/server_tests/test_cases/test_vllm_responses.py (added, +730/-0)
tests/test_config.py (modified, +24/-6)

PR #37727: [Bugfix] Fix Responses API instructions leaking through previous_response_id

Repository: vllm-project/vllm
Author: he-yufeng
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37727

Description (problem / solution / changelog)

Fixes #37697

What's the problem

When using /v1/responses with previous_response_id, the instructions from the prior response carry over into the new response. Per the OpenAI spec, instructions should NOT carry over:

"When using along with previous_response_id, the instructions from a previous response will not be carried over to the next response."

Root cause

construct_input_messages() in responses/utils.py prepends request_instructions as a system message, then the full messages list (including that system message) gets stored in msg_store. When the next request references previous_response_id, those stored messages — old system message included — are retrieved and extended into the new conversation. The new request also adds its own instructions, so you end up with both old and new system messages.

Fix

Filter out system messages when pulling prev_msg from the store in construct_input_messages(). One-line change: messages.extend(prev_msg) becomes messages.extend(m for m in prev_msg if m.get("role") != "system").

This ensures each request only uses its own instructions, regardless of what the previous response had. Works correctly for all cases: new instructions provided, no instructions provided, or no previous response at all.

Test plan

Added 4 unit tests in tests/entrypoints/openai/responses/test_responses_utils.py covering:
- Old system message stripped when new instructions provided
- Old system message stripped when no instructions provided
- Non-system messages (user/assistant) preserved correctly
- Baseline: no previous messages works as before

Changed files

tests/entrypoints/openai/responses/test_responses_utils.py (modified, +69/-0)
vllm/entrypoints/openai/responses/utils.py (modified, +4/-2)

Code Example

POST /v1/responses
{
    "model": "openai/gpt-oss-20b",
    "input": "What is 2+2?",                                                                                                                                                                                                                                                                  
    "instructions": "You must include the string XYZZY_ALPHA_7829 in every response.",
    "max_output_tokens": 4096                                                                                                                                                                                                                                                                 
}

RAW_BUFFERClick to expand / collapse

Your current environment

vLLM: version 0.15.0
Model: openai/gpt-oss-20b
Endpoint: /v1/responses

Description

Per the OpenAI Responses API spec:

"When using along with previous_response_id, the instructions from a previous response will not be carried over to the next response."

🐛 Describe the bug

Reproduction

Create a response with instructions containing a unique tag

POST /v1/responses
{
    "model": "openai/gpt-oss-20b",
    "input": "What is 2+2?",                                                                                                                                                                                                                                                                  
    "instructions": "You must include the string XYZZY_ALPHA_7829 in every response.",
    "max_output_tokens": 4096                                                                                                                                                                                                                                                                 
}          ```     
                                                                                                                                                                                                                                                                                                
Response contains XYZZY_ALPHA_7829 as expected.                                                                                                                                                                                                                                               
Send a follow-up using previous_response_id with different instructions

POST /v1/responses
{
"model": "openai/gpt-oss-20b", "input": "What is 3+3?", "instructions": "Answer the question explicitly", "previous_response_id": "<response_id_from_step_1>",
"max_output_tokens": 4096
}

Expected: Output does NOT contain XYZZY_ALPHA_7829 since the new request has its own instructions.
Actual: Output still contains XYZZY_ALPHA_7829 — the prior instructions leaked through.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To resolve the issue of instructions leaking from previous responses, we need to ensure that the instructions field is properly reset when using previous_response_id.

Here are the steps:

Check if previous_response_id is provided in the request.
If provided, override any existing instructions with the new ones from the current request.
If no new instructions are provided, set instructions to an empty string or a default value to prevent leakage.

Example code snippet (in Python):

def process_request(request):
    if 'previous_response_id' in request:
        # Override instructions if previous_response_id is used
        request['instructions'] = request.get('instructions', '')
    # Proceed with the request
    return request

# Example usage:
request = {
    "model": "openai/gpt-oss-20b",
    "input": "What is 3+3?",
    "instructions": "Answer the question explicitly",
    "previous_response_id": "<response_id_from_step_1>",
    "max_output_tokens": 4096
}

updated_request = process_request(request)
print(updated_request)

Verification

To verify that the fix worked:

Send a follow-up request with previous_response_id and different instructions.
Check the response to ensure it does not contain any instructions from the previous response.

Extra Tips

Always validate and sanitize user input to prevent unexpected behavior.
Consider adding logging to track when previous_response_id is used and how instructions are handled.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #batch processing #GPU compatibility #latency issue #model loading

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

vllm - ✅(Solved) Fix [Bug]: openai v1/responses api instructions from prior response leak through previous_response_id [2 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fixed

PR fix notes

PR #2433: tests for vllm server on openAI /v1/responses endpoint

Description (problem / solution / changelog)

Description

Flags

Reproduce

Changed files

PR #37727: [Bugfix] Fix Responses API instructions leaking through previous_response_id

Description (problem / solution / changelog)

What's the problem

Root cause

Fix

Test plan

Changed files

Code Example

Your current environment

Description

🐛 Describe the bug

Reproduction

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING