vllm - ✅(Solved) Fix [Bug]: truncation: "auto" in Responses API returns 400 instead of truncating input [2 pull requests, 1 participant]

vllm · 2026-03-25 18:30:15


vllm-project/vllm#38132 • Fetched 2026-04-08 01:32:07


Author

lukezTT

Participants

lukezTT

Timeline (top)

referenced ×3, cross-referenced ×2, labeled ×1

When using the Responses API with truncation: "auto", sending input that exceeds the model's context window returns a 400 error instead of truncating the input to fit.

Per the OpenAI Responses API spec: auto: If the input to this Response exceeds the model's context window size, the model will truncate the response to fit the context window by dropping items from the beginning of the conversation.

Currently, vLLM passes the full prompt to the engine without applying truncation, resulting in:
{'error': {'message': 'The engine prompt length 1327246 exceeds the max_model_len 131072. Please reduce prompt.', 'type': 'invalid_request_error', 'param': 'input', 'code': 400}}

Error Message

{'error': {'message': 'The engine prompt length 1327246 exceeds the max_model_len 131072. Please reduce prompt.', 'type': 'invalid_request_error', 'param': 'input', 'code': 400}}

Root Cause

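On the harmony path (used by openai/gpt-oss-* models), _make_request_with_harmony calls render_for_completion() directly to produce token IDs, bypassing the renderer pipeline that normally applies truncation via TokenizeParams. The untruncated token IDs then reach _validate_generator_input, which rejects any prompt at or above max_model_len, so truncation: "auto" has no effect and the request fails with a 400.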

Fix Action

Fix proposed (both linked PRs were open and unmerged when this page was fetched)

Test coverage PR: tests for vllm server on openAI /v1/responses endpoint (https://github.com/tenstorrent/tt-inference-server/pull/2433)
Proposed fix PR: [Bugfix] Apply truncation in Responses API harmony path (https://github.com/vllm-project/vllm/pull/38143)

PR fix notes

PR #2433: tests for vllm server on openAI /v1/responses endpoint

Repository: tenstorrent/tt-inference-server
Author: lukezTT
State: open | merged: False
Link: https://github.com/tenstorrent/tt-inference-server/pull/2433

Description (problem / solution / changelog)

Description

This is a smoke test to make sure the vLLM server correctly accepts and handles each parameter defined for the OpenAI /v1/responses endpoint. The following parameters are tested:

background
include
input
instructions
max_output_tokens
max_tool_calls
metadata
model
parallel_tool_calls
previous_response_id
prompt
reasoning
service_tier
store
stream
temperature
text
tools
top_p
truncation
user

Flags

Note the following parameters have ongoing issues:

top_logprobs is not supported at all with this endpoint https://github.com/vllm-project/vllm/issues/34417
tool_choice = "none" or "required" is not supported https://github.com/vllm-project/vllm/issues/33966
vLLM does not strip prior instructions when using previous_response_id https://github.com/vllm-project/vllm/issues/37697
the prompt parameter (test_prompt) does not appear to be supported by vLLM in /v1/responses
gpt-oss does not support parallel tool calls https://huggingface.co/openai/gpt-oss-120b/discussions/151
vLLM does not support truncation = "auto" https://github.com/vllm-project/vllm/issues/38132

Reproduce

Docker command:

docker run -d --name vllm-server --runtime nvidia --gpus all \
  -v /home/lzhang/models:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<hf_token>" \
  --env VLLM_ENABLE_RESPONSES_API_STORE=1 \
  --env VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS=1 \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-20b \
  --served-model-name openai/gpt-oss-20b \
  --gpu-memory-utilization 0.95 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --tool-call-parser openai \
  --enable-auto-tool-choice

Changed files

tests/run_tests.py (modified, +8/-1)
tests/server_tests/conftest.py (modified, +33/-13)
tests/server_tests/test_cases/test_vllm_chat_completion.py (renamed, +0/-0)
tests/server_tests/test_cases/test_vllm_responses.py (added, +719/-0)
tests/test_config.py (modified, +24/-6)
workflows/model_spec.py (modified, +4/-0)

PR #38143: [Bugfix] Apply truncation in Responses API harmony path

Repository: vllm-project/vllm
Author: saivedant169
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38143

Description (problem / solution / changelog)

Purpose

Fixes #38132

When using the Responses API with truncation: "auto", requests that exceed max_model_len return a 400 error instead of truncating the input. This only affects the harmony path (used by openai/gpt-oss-* models).

The root cause: _make_request_with_harmony calls render_for_completion() directly to produce token IDs, bypassing the renderer pipeline that normally handles truncation via TokenizeParams. The raw token IDs then hit _validate_generator_input, which rejects anything at or above max_model_len.

The non-harmony path already works because it goes through _preprocess_chat -> render_chat_async -> apply_post_tokenization -> _token_truncation, which respects truncate_prompt_tokens=-1.

The fix adds left-side token truncation in the harmony path after render_for_completion() and before validation. This matches the OpenAI spec, which says to "drop items from the beginning of the conversation" when the input exceeds the context window.
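Left-side token truncation can be sketched as follows. This is a minimal illustration, not the code from the PR: `left_truncate` is a hypothetical helper, and the budget of max_model_len minus max_output_tokens is an assumption about how the input limit might be chosen.

```python
from typing import List


def left_truncate(token_ids: List[int], max_input_tokens: int) -> List[int]:
    """Keep only the last `max_input_tokens` token IDs (hypothetical helper)."""
    if max_input_tokens <= 0:
        raise ValueError("max_input_tokens must be positive")
    if len(token_ids) <= max_input_tokens:
        return token_ids
    # Drop the oldest tokens so the most recent context survives.
    return token_ids[-max_input_tokens:]


# Assumed budget: the context window minus the tokens reserved for output.
max_model_len = 131072
max_output_tokens = 1024
prompt = list(range(1327246))  # stand-in for the rendered token IDs
truncated = left_truncate(prompt, max_model_len - max_output_tokens)
print(len(truncated))
```

Note that slicing at the token level can cut through the middle of a message, whereas the spec wording speaks of dropping whole items; whether that difference matters depends on the chat template.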

No existing open PR addresses this (checked via gh pr list --search).

Test Plan

Added unit tests for build_tok_params truncation parameter mapping in tests/entrypoints/openai/responses/test_sampling_params.py:

pytest tests/entrypoints/openai/responses/test_sampling_params.py -v -k TestResponsesRequestTruncation

Tests cover:

truncation="auto" sets truncate_prompt_tokens=-1
truncation="disabled" sets truncate_prompt_tokens=None
Default truncation value is "disabled"
max_input_tokens calculation with max_output_tokens set
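The mapping these tests describe might look like the sketch below. Both functions are hypothetical stand-ins written for illustration, not vLLM's build_tok_params, and the input budget formula is an assumption, since the PR description does not spell out the calculation.

```python
from typing import Optional


def truncate_prompt_tokens_for(truncation: str = "disabled") -> Optional[int]:
    """Hypothetical helper: map the request's truncation mode to a token setting."""
    if truncation == "auto":
        return -1  # value named in the test list above
    if truncation == "disabled":
        return None  # no truncation; oversized prompts are rejected
    raise ValueError(f"unsupported truncation mode: {truncation!r}")


def max_input_tokens(max_model_len: int, max_output_tokens: Optional[int] = None) -> int:
    """Assumed budget: the context window minus any reserved output tokens."""
    return max_model_len - (max_output_tokens or 0)


print(truncate_prompt_tokens_for("auto"))
print(truncate_prompt_tokens_for())
print(max_input_tokens(131072, 1024))
```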

Linting:

ruff check vllm/entrypoints/openai/responses/serving.py
ruff check tests/entrypoints/openai/responses/test_sampling_params.py

Both pass clean.

Test Result

Cannot run integration tests locally (no GPU), but unit tests and linting pass. The harmony path truncation logic mirrors the existing renderer truncation that already works for the non-harmony path.

AI assistance was used in developing this fix.

Changed files

tests/entrypoints/openai/responses/test_sampling_params.py (modified, +50/-0)
vllm/entrypoints/openai/responses/serving.py (modified, +11/-0)

Code Example

vLLM: version 0.15.0
Model: openai/gpt-oss-20b
Endpoint: /v1/responses

---

import requests

filler_message = "This is filler text to consume tokens. " * 50
num_messages = (131072 // 40) + 1  # enough to exceed 131072 context window
oversized_input = [{"role": "user", "content": filler_message} for _ in range(num_messages)]

response = requests.post(
    "http://127.0.0.1:8000/v1/responses",
    json={
        "model": "openai/gpt-oss-20b",
        "input": oversized_input,
        "max_output_tokens": 1024,
        "truncation": "auto",
    },
)
print(response.status_code, response.json())


Expected behavior
The server should drop items from the beginning of the conversation until the input fits within max_model_len, then return a successful response with usage.input_tokens <= max_model_len.

Actual behavior
Returns 400 with The engine prompt length 1327246 exceeds the max_model_len 131072.

Note: truncation: "disabled" correctly returns a 400 for oversized input — this issue is only about the "auto" path.


extent analysis

Fix Plan

To fix the issue, we need to implement the truncation logic when the truncation parameter is set to "auto". We will truncate the input to fit within the model's context window by dropping items from the beginning of the conversation.

Here are the steps to fix the issue:

Check if the truncation parameter is set to "auto" and the input exceeds the model's context window.
If the condition is met, truncate the input by dropping items from the beginning of the conversation until the input fits within the model's context window.
Pass the truncated input to the model.

Example code snippet in Python:

def truncate_input(input_data, max_model_len, count_tokens=len):
    """
    Drop items from the start of the conversation until the input fits.

    Args:
    input_data (list): Conversation items, oldest first.
    max_model_len (int): The maximum length of the model's context window.
    count_tokens (callable): Returns the token count of a string. The default,
        len, counts characters and is only a rough stand-in for a tokenizer.

    Returns:
    list: A new list holding the most recent items that fit.
    """
    sizes = [count_tokens(item["content"]) for item in input_data]
    total_tokens = sum(sizes)
    start = 0
    # Skip the oldest items one by one; the caller's list is left unmodified.
    while start < len(input_data) and total_tokens > max_model_len:
        total_tokens -= sizes[start]
        start += 1
    return input_data[start:]

# Example usage:
max_model_len = 131072
input_data = [{"role": "user", "content": "This is filler text to consume tokens. " * 50} for _ in range(100)]
truncated_input = truncate_input(input_data, max_model_len)

Verification

To verify that the fix worked, send a request to the /v1/responses endpoint with truncation set to "auto" and input that exceeds the model's context window. The server should return a successful response instead of the 400 error.

Example test case:

import requests

filler_message = "This is filler text to consume tokens. " * 50
num_messages = (131072 // 40) + 1  # enough to exceed 131072 context window
oversized_input = [{"role": "user", "content": filler_message} for _ in range(num_messages)]

response = requests.post(
    "http://127.0.0.1:8000/v1/responses",
    json={
        "model": "openai/gpt-oss-20b",
        "input": oversized_input,
        "max_output_tokens": 1024,
        "truncation": "auto",
    },
)

print(response.status_code, response.json())

The response should have a status code of 200, and usage.input_tokens should be less than or equal to max_model_len.
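Those two conditions can be turned into an explicit check. The helper below is a hypothetical sketch that assumes the success body exposes usage.input_tokens and the failure body has the error shape shown in the bug report; it runs here against hand-written stand-in bodies rather than a live server.

```python
def check_auto_truncation(status_code: int, body: dict, max_model_len: int) -> None:
    """Raise AssertionError unless the response looks like a truncated success."""
    if status_code != 200:
        message = body.get("error", {}).get("message", "<no error message>")
        raise AssertionError(f"expected 200, got {status_code}: {message}")
    input_tokens = body["usage"]["input_tokens"]
    if input_tokens > max_model_len:
        raise AssertionError(
            f"input_tokens {input_tokens} exceeds max_model_len {max_model_len}"
        )


# Stand-in bodies for illustration; a real run would pass response.json().
ok_body = {"usage": {"input_tokens": 130048}}
bad_body = {
    "error": {
        "message": "The engine prompt length 1327246 exceeds the max_model_len 131072. Please reduce prompt.",
        "code": 400,
    }
}

check_auto_truncation(200, ok_body, 131072)  # passes silently
try:
    check_auto_truncation(400, bad_body, 131072)
except AssertionError as exc:
    print(exc)
```

Against a running server the call would be check_auto_truncation(response.status_code, response.json(), 131072).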
