vllm - ✅(Solved) Fix [Bug]: truncation: "auto" in Responses API returns 400 instead of truncating input [2 pull requests, 1 participant]

GitHub stats
vllm-project/vllm#38132 (fetched 2026-04-08 01:32:07)
Comments: 0
Participants: 1
Timeline: 6
Reactions: 0
Timeline (top): referenced ×3, cross-referenced ×2, labeled ×1

When using the Responses API with truncation: "auto", sending input that exceeds the model's context window returns a 400 error instead of truncating the input to fit.

Per the OpenAI Responses API spec: auto: If the input to this Response exceeds the model's context window size, the model will truncate the response to fit the context window by dropping items from the beginning of the conversation.

Currently, vLLM passes the full prompt to the engine without applying truncation, resulting in:
{'error': {'message': 'The engine prompt length 1327246 exceeds the max_model_len 131072. Please reduce prompt.', 'type': 'invalid_request_error', 'param': 'input', 'code': 400}}

Error Message

{'error': {'message': 'The engine prompt length 1327246 exceeds the max_model_len 131072. Please reduce prompt.', 'type': 'invalid_request_error', 'param': 'input', 'code': 400}}
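
To see how far over the limit a request is, the two numbers in the error message can be pulled out and subtracted. This is a minimal sketch: `parse_overflow` is a hypothetical helper, and the pattern matches only the message wording quoted in this issue, which other vLLM versions may phrase differently.

```python
import re
from typing import Optional, Tuple

# Matches the wording quoted in this issue; treat it as an assumption elsewhere.
_OVERFLOW = re.compile(r"prompt length (\d+) exceeds the max_model_len (\d+)")


def parse_overflow(message: str) -> Optional[Tuple[int, int, int]]:
    """Return (prompt_len, max_model_len, excess), or None if there is no match."""
    match = _OVERFLOW.search(message)
    if match is None:
        return None
    prompt_len, max_len = int(match.group(1)), int(match.group(2))
    return prompt_len, max_len, prompt_len - max_len


message = (
    "The engine prompt length 1327246 exceeds the max_model_len 131072. "
    "Please reduce prompt."
)
print(parse_overflow(message))  # (1327246, 131072, 1196174)
```

The prompt in this report is roughly ten times the context window, so almost all of the conversation would have to be dropped for the request to fit.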

Fix Action

Fixed

PR fix notes

PR #2433: tests for vllm server on openAI /v1/responses endpoint

Description (problem / solution / changelog)

Description

This is a smoke test to make sure the vLLM server correctly accepts and handles each parameter defined in the OpenAI /v1/responses endpoint. The following parameters are tested:

  • background
  • include
  • input
  • instructions
  • max_output_tokens
  • max_tool_calls
  • metadata
  • model
  • parallel_tool_calls
  • previous_response_id
  • prompt
  • reasoning
  • service_tier
  • store
  • stream
  • temperature
  • text
  • tools
  • top_p
  • truncation
  • user

Flags

Note the following parameters have ongoing issues:

Reproduce

Docker command:

docker run -d --name vllm-server --runtime nvidia --gpus all \
  -v /home/lzhang/models:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<hf_token>" \
  --env VLLM_ENABLE_RESPONSES_API_STORE=1 \
  --env VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS=1 \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-20b \
  --served-model-name openai/gpt-oss-20b \
  --gpu-memory-utilization 0.95 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --tool-call-parser openai \
  --enable-auto-tool-choice

Changed files

  • tests/run_tests.py (modified, +8/-1)
  • tests/server_tests/conftest.py (modified, +33/-13)
  • tests/server_tests/test_cases/test_vllm_chat_completion.py (renamed, +0/-0)
  • tests/server_tests/test_cases/test_vllm_responses.py (added, +719/-0)
  • tests/test_config.py (modified, +24/-6)
  • workflows/model_spec.py (modified, +4/-0)

PR #38143: [Bugfix] Apply truncation in Responses API harmony path

Description (problem / solution / changelog)

Purpose

Fixes #38132

When using the Responses API with truncation: "auto", requests that exceed max_model_len return a 400 error instead of truncating the input. This only affects the harmony path (used by openai/gpt-oss-* models).

The root cause: _make_request_with_harmony calls render_for_completion() directly to produce token IDs, bypassing the renderer pipeline that normally handles truncation via TokenizeParams. The raw token IDs then hit _validate_generator_input, which rejects anything at or above max_model_len.

The non-harmony path already works because it goes through _preprocess_chat -> render_chat_async -> apply_post_tokenization -> _token_truncation, which respects truncate_prompt_tokens=-1.

The fix adds left-side token truncation in the harmony path after render_for_completion() and before validation. This matches the OpenAI spec, which says to "drop items from the beginning of the conversation" when the input exceeds the context window.
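
Left-side token truncation can be sketched as below. This is an illustration, not the code from the PR: `truncate_left` and `max_input_tokens` are hypothetical names. Because validation rejects prompts at or above max_model_len, the budget passed in would need to be strictly smaller than max_model_len.

```python
from typing import List


def truncate_left(token_ids: List[int], max_input_tokens: int) -> List[int]:
    """Keep only the newest max_input_tokens tokens, dropping the oldest ones."""
    if max_input_tokens <= 0:
        raise ValueError("max_input_tokens must be positive")
    if len(token_ids) <= max_input_tokens:
        return token_ids  # already fits, nothing to drop
    # A negative slice start keeps the tail of the sequence.
    return token_ids[-max_input_tokens:]


prompt = list(range(10))  # stand-in for rendered token IDs
print(truncate_left(prompt, 4))  # [6, 7, 8, 9]
```

A cut at the token level can land in the middle of a message, whereas the spec wording describes dropping whole items, so the two approaches may keep slightly different context.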

No existing open PR addresses this (checked via gh pr list --search).

Test Plan

Added unit tests for build_tok_params truncation parameter mapping in tests/entrypoints/openai/responses/test_sampling_params.py:

pytest tests/entrypoints/openai/responses/test_sampling_params.py -v -k TestResponsesRequestTruncation

Tests cover:

  • truncation="auto" sets truncate_prompt_tokens=-1
  • truncation="disabled" sets truncate_prompt_tokens=None
  • Default truncation value is "disabled"
  • max_input_tokens calculation with max_output_tokens set
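
The mapping these tests describe can be sketched as follows. Both function names are hypothetical, and the budget formula is an assumption: the PR text mentions a max_input_tokens calculation but does not state it, so subtracting max_output_tokens from max_model_len is only one plausible reading.

```python
from typing import Optional


def map_truncation(truncation: str = "disabled") -> Optional[int]:
    """Hypothetical helper: map the truncation mode to truncate_prompt_tokens."""
    if truncation == "auto":
        return -1  # value the PR says enables truncation to fit
    if truncation == "disabled":
        return None  # no truncation; oversized input is rejected
    raise ValueError(f"unsupported truncation mode: {truncation!r}")


def max_input_budget(max_model_len: int, max_output_tokens: Optional[int] = None) -> int:
    """ASSUMPTION: input budget = context window minus reserved output tokens."""
    reserved = max_output_tokens or 0
    return max_model_len - reserved


print(map_truncation("auto"))           # -1
print(map_truncation())                 # None
print(max_input_budget(131072, 1024))   # 130048
```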

Linting:

ruff check vllm/entrypoints/openai/responses/serving.py
ruff check tests/entrypoints/openai/responses/test_sampling_params.py

Both pass clean.

Test Result

Cannot run integration tests locally (no GPU), but unit tests and linting pass. The harmony path truncation logic mirrors the existing renderer truncation that already works for the non-harmony path.

AI assistance was used in developing this fix.

Changed files

  • tests/entrypoints/openai/responses/test_sampling_params.py (modified, +50/-0)
  • vllm/entrypoints/openai/responses/serving.py (modified, +11/-0)

Code Example

vLLM: version 0.15.0
Model: openai/gpt-oss-20b
Endpoint: /v1/responses

---

import requests

filler_message = "This is filler text to consume tokens. " * 50
num_messages = (131072 // 40) + 1  # enough to exceed 131072 context window
oversized_input = [{"role": "user", "content": filler_message} for _ in range(num_messages)]

response = requests.post(
    "http://127.0.0.1:8000/v1/responses",
    json={
        "model": "openai/gpt-oss-20b",
        "input": oversized_input,
        "max_output_tokens": 1024,
        "truncation": "auto",
    },
)
print(response.status_code, response.json())
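
The sizing in the snippet above can be checked with a little arithmetic, without contacting a server. The per-message figure below is only an average derived from the prompt length reported in the error; it presumably includes chat-template overhead, which the issue does not break down.

```python
FILLER = "This is filler text to consume tokens. " * 50
MAX_MODEL_LEN = 131072
REPORTED_PROMPT_LEN = 1327246  # taken from the error message in this issue

num_messages = (MAX_MODEL_LEN // 40) + 1
total_chars = len(FILLER) * num_messages
avg_tokens_per_message = REPORTED_PROMPT_LEN / num_messages

print(num_messages)                   # 3277
print(len(FILLER))                    # 1950
print(total_chars)                    # 6390150
print(round(avg_tokens_per_message))  # 405
```

At about 405 tokens per message, only a few hundred of the 3277 messages could fit in a 131072-token window, so a successful response implies most of the conversation was dropped.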

Expected behavior
The server should drop items from the beginning of the conversation until the input fits within max_model_len, then return a successful response with usage.input_tokens <= max_model_len.

Actual behavior
Returns 400 with The engine prompt length 1327246 exceeds the max_model_len 131072.

Note: truncation: "disabled" correctly returns a 400 for oversized input — this issue is only about the "auto" path.

extent analysis

Fix Plan

To fix the issue, we need to implement the truncation logic when the truncation parameter is set to "auto". We will truncate the input to fit within the model's context window by dropping items from the beginning of the conversation.

Here are the steps to fix the issue:

  • Check if the truncation parameter is set to "auto" and the input exceeds the model's context window.
  • If the condition is met, truncate the input by dropping items from the beginning of the conversation until the input fits within the model's context window.
  • Pass the truncated input to the model.

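Illustrative Python sketch. It is not the vLLM implementation: by default it counts characters as a stand-in for tokens, so pass a tokenizer-based counter for real use.

def truncate_input(input_data, max_model_len, count_tokens=len):
    """
    Drop items from the beginning of the conversation until the rest fits.

    Args:
    input_data (list): Conversation items, oldest first.
    max_model_len (int): Budget in the units returned by count_tokens.
    count_tokens (callable): Maps an item's content to its length.
        Defaults to len (characters), which only approximates tokens.

    Returns:
    list: A new list holding the most recent items that fit.
    """
    sizes = [count_tokens(item["content"]) for item in input_data]
    total = sum(sizes)
    start = 0
    # Skip the oldest items, but always keep the latest one.
    while total > max_model_len and start < len(input_data) - 1:
        total -= sizes[start]
        start += 1
    # Slicing returns a copy, so the caller's list is not mutated.
    return input_data[start:]

# Example usage:
max_model_len = 131072
input_data = [{"role": "user", "content": "This is filler text to consume tokens. " * 50} for _ in range(100)]
truncated_input = truncate_input(input_data, max_model_len)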

Verification

To verify that the fix worked, you can send a request to the /v1/responses endpoint with the truncation parameter set to "auto" and input that exceeds the model's context window. The server should return a successful response with the truncated input.

Example test case:

import requests

filler_message = "This is filler text to consume tokens. " * 50
num_messages = (131072 // 40) + 1  # enough to exceed 131072 context window
oversized_input = [{"role": "user", "content": filler_message} for _ in range(num_messages)]

response = requests.post(
    "http://127.0.0.1:8000/v1/responses",
    json={
        "model": "openai/gpt-oss-20b",
        "input": oversized_input,
        "max_output_tokens": 1024,
        "truncation": "auto",
    },
)

print(response.status_code, response.json())

The response should have a status code of 200, and usage.input_tokens should be less than or equal to max_model_len.
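
The two checks can be wrapped in a small helper that works on an already-received status code and JSON body. This is a sketch: `check_truncated_response` is a hypothetical name, and it assumes the body carries the token count at usage.input_tokens, as the expected behavior in the issue describes.

```python
from typing import Any, List, Mapping


def check_truncated_response(
    status_code: int, body: Mapping[str, Any], max_model_len: int
) -> List[str]:
    """Return a list of problems; an empty list means the response looks right."""
    if status_code != 200:
        return [f"expected HTTP 200, got {status_code}"]
    usage = body.get("usage") or {}
    input_tokens = usage.get("input_tokens")
    if not isinstance(input_tokens, int):
        return ["usage.input_tokens is missing"]
    if input_tokens > max_model_len:
        return [f"input_tokens {input_tokens} exceeds max_model_len {max_model_len}"]
    return []


# Example with hand-written bodies instead of a live server.
print(check_truncated_response(200, {"usage": {"input_tokens": 130000}}, 131072))  # []
print(check_truncated_response(400, {"error": {}}, 131072))
```

With the request above, it could be called as check_truncated_response(response.status_code, response.json(), 131072).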
