vllm - ✅(Solved) Fix [Bug]: vLLM rejects requests when max_tokens exceeds available context instead of clamping [1 pull requests, 1 comments, 1 participants]

vllm2026-05-13 02:12:01

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#42474•Fetched 2026-05-14 03:29:55

View on GitHub

Comments

Participants

Timeline

Reactions

Author

v1b3coder

Participants

v1b3coder

Timeline (top)

referenced ×2commented ×1cross-referenced ×1

Error Message

VLLMValidationError: This model's maximum context length is 128000 tokens. However, you requested 65535 output tokens and your prompt contains at least 62466 input tokens, for a total of at least 128001 tokens.

Fix Action

Fix / Workaround

Current workaround: Users configure artificially low max_tokens in their editor/IDE settings to avoid exceeding the context window, sacrificing the intended flexibility of a safety cap.

PR fix notes

PR #42482: [Bugfix] Treat max_tokens as upper bound, not input-space reservation

Repository: vllm-project/vllm
Author: v1b3coder
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/42482

Description (problem / solution / changelog)

Summary

Fixes #42474.

TokenizeParams.max_input_tokens is currently defined as max_total_tokens - max_output_tokens, which treats the client's requested max_tokens as a hard reservation deducted from the model's context window. Per the OpenAI API spec — and how Zed / Factory / opencode actually use it — max_tokens is an upper bound on generation, not a reservation. The real generation budget is already clamped downstream by get_max_tokens() in vllm/entrypoints/utils.py based on actual post-tokenization prompt length, so the output side stays bounded regardless.

Why a one-line property change

The property is consumed in seven places:

Site	Consequence of the bug
`_token_len_check`	rejects valid prompts (the reported symptom)
`_text_len_check`	rejects on char-count before tokenization
`get_encode_kwargs`	sets `tokenizer.encode(max_length=max_input_tokens+1)` — silent truncation
`with_kwargs`	propagates the cap into chained tokenization
`_token_padding` (`-1` sentinel)	pads to wrong length
`_token_truncation` (`-1` sentinel)	truncates to wrong length
`__post_init__`	rejects `truncate_prompt_tokens > max_input_tokens`

Patching only _token_len_check (the obvious whack-a-mole) makes the validator pass, but get_encode_kwargs is still capping the tokenizer. The encoder silently truncates the prompt and the now-lenient validator waves it through — for left-truncation chat models the cut point shifts every turn as the conversation grows, breaking prefix-cache locality (0% hit rate). Right-truncation drops the most recent messages. Either way the surface-level fix masks the bug rather than removing it.

Fixing the property at the source corrects all seven consumers consistently and eliminates the silent-truncation path.

Test plan

Three regression tests in tests/renderers/test_completions.py:

test_large_max_output_tokens_does_not_reject_valid_token_input — max_total_tokens=100, max_output_tokens=80, 70-token prompt → succeeds (would fail under old semantics: max_input_tokens = 20).
test_large_max_output_tokens_does_not_truncate_text_input — same parameters with text input; asserts tokenizer.encode is called with max_length=101 (max_total_tokens + 1) rather than 21, so the prompt isn't silently truncated. This is the actual prefix-cache regression.
test_input_overflowing_context_still_rejected_with_max_output_tokens — sanity check that genuine context overflow (input > max_total_tokens) is still rejected.

All 25 tests in tests/renderers/test_completions.py pass locally.

What stays correct

__post_init__ still rejects max_output_tokens > max_total_tokens (genuine bug).
_token_len_check still rejects len(tokens) > max_total_tokens (genuine context overflow).
_text_len_check still rejects on char count, against the realistic ceiling.
get_max_tokens() in entrypoints/utils.py continues to clamp sampling_params.max_tokens to max_total_tokens - len(prompt_tokens) — output stays bounded.
/v1/responses already does the equivalent clamp at responses/serving.py:682; behavior now converges across endpoints.

Changed files

tests/renderers/test_completions.py (modified, +115/-0)
vllm/renderers/params.py (modified, +68/-25)

Code Example

VLLMValidationError: This model's maximum context length is 128000 tokens. 
However, you requested 65535 output tokens and your prompt contains at 
least 62466 input tokens, for a total of at least 128001 tokens.

---

available = self.max_total_tokens - token_count
effective_max_tokens = min(self.max_output_tokens, available)
# log warning that max_tokens was truncated
return tokens  # with truncated max_tokens

RAW_BUFFERClick to expand / collapse

Bug: vLLM rejects requests when max_tokens exceeds available context, treating it as a hard requirement rather than an upper bound

Environment:

vLLM Version: 0.20.x (latest dev)

Reproduction:

Start vLLM server with --max-model-len 128000
Send chat completion request with:
- max_tokens: 65535 (streaming safety cap)
- Input prompt tokenized to 62466 tokens
vLLM returns 400 error:

VLLMValidationError: This model's maximum context length is 128000 tokens. 
However, you requested 65535 output tokens and your prompt contains at 
least 62466 input tokens, for a total of at least 128001 tokens.

Error location: vllm/renderers/params.py:418 in _token_len_check()

Impact:

Zed Editor: AI assistant using max_tokens: 65535 caps
Factory.ai: Agentic code generation
Opencode: CLI tool with streaming safety limits

GPU-limited deployments cannot use these tools with vLLM. The full context window plus typical token caps exceeds available memory, leaving users stuck at the ceiling.

Current workaround: Users configure artificially low max_tokens in their editor/IDE settings to avoid exceeding the context window, sacrificing the intended flexibility of a safety cap.

Proposed fix: In _token_len_check() at line 418, instead of raising VLLMValidationError, clamp max_output_tokens to available space:

available = self.max_total_tokens - token_count
effective_max_tokens = min(self.max_output_tokens, available)
# log warning that max_tokens was truncated
return tokens  # with truncated max_tokens

This makes vLLM compatible with tools that use max_tokens as an upper bound per the OpenAI API specification.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #tool integration #LLM response #prompt template #agent execution

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - ✅(Solved) Fix [Bug]: vLLM rejects requests when max_tokens exceeds available context instead of clamping [1 pull requests, 1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fix / Workaround

PR fix notes

PR #42482: [Bugfix] Treat max_tokens as upper bound, not input-space reservation

Description (problem / solution / changelog)

Summary

Why a one-line property change

Test plan

What stays correct

Changed files

Code Example

Still need to ship something?

TRENDING

vllm - ✅(Solved) Fix [Bug]: vLLM rejects requests when max_tokens exceeds available context instead of clamping [1 pull requests, 1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fix / Workaround

PR fix notes

PR #42482: [Bugfix] Treat max_tokens as upper bound, not input-space reservation

Description (problem / solution / changelog)

Summary

Why a one-line property change

Test plan

What stays correct

Changed files

Code Example

Still need to ship something?

RELATED_DISCOVERY

TRENDING