vllm - ✅ (Solved) Fix [Bug]: vLLM rejects requests when max_tokens exceeds available context instead of clamping [1 pull request, 1 comment, 1 participant]

GitHub stats
vllm-project/vllm#42474 · Fetched 2026-05-14 03:29:55
Comments: 1 · Participants: 1 · Timeline: 4 · Reactions: 1
Timeline (top): referenced ×2, commented ×1, cross-referenced ×1

Error Message

VLLMValidationError: This model's maximum context length is 128000 tokens. However, you requested 65535 output tokens and your prompt contains at least 62466 input tokens, for a total of at least 128001 tokens.

Fix Action

Fix / Workaround

Current workaround: Users configure artificially low max_tokens in their editor/IDE settings to avoid exceeding the context window, sacrificing the intended flexibility of a safety cap.

PR fix notes

PR #42482: [Bugfix] Treat max_tokens as upper bound, not input-space reservation

Description (problem / solution / changelog)

Summary

Fixes #42474.

TokenizeParams.max_input_tokens is currently defined as max_total_tokens - max_output_tokens, which treats the client's requested max_tokens as a hard reservation deducted from the model's context window. Per the OpenAI API spec — and how Zed / Factory / opencode actually use it — max_tokens is an upper bound on generation, not a reservation. The real generation budget is already clamped downstream by get_max_tokens() in vllm/entrypoints/utils.py based on actual post-tokenization prompt length, so the output side stays bounded regardless.
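
To make the arithmetic concrete, here is a minimal sketch using the numbers from the reported error. The two function names are hypothetical and only model the old and new definitions of the input budget as described above; they are not vLLM's actual code.

```python
def input_budget_old(max_total_tokens: int, max_output_tokens: int) -> int:
    """Old semantics: max_tokens is reserved out of the context window."""
    return max_total_tokens - max_output_tokens


def input_budget_new(max_total_tokens: int, max_output_tokens: int) -> int:
    """New semantics (as described in the PR): max_tokens is only an upper
    bound on generation, so the prompt may use the whole context window."""
    return max_total_tokens


# Numbers taken from the reported error message.
max_total, requested_output, prompt_len = 128000, 65535, 62466

old_budget = input_budget_old(max_total, requested_output)
new_budget = input_budget_new(max_total, requested_output)

print(old_budget)                # 62465
print(prompt_len <= old_budget)  # False: rejected, one token over budget
print(prompt_len <= new_budget)  # True: accepted
```

Under the old definition the prompt is rejected for exceeding the budget by a single token, even though the model could still generate up to 65534 tokens for it.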

Why a one-line property change

The property is consumed in seven places:

Site                            | Consequence of the bug
_token_len_check                | rejects valid prompts (the reported symptom)
_text_len_check                 | rejects on char-count before tokenization
get_encode_kwargs               | sets tokenizer.encode(max_length=max_input_tokens+1) → silent truncation
with_kwargs                     | propagates the cap into chained tokenization
_token_padding (-1 sentinel)    | pads to wrong length
_token_truncation (-1 sentinel) | truncates to wrong length
__post_init__                   | rejects truncate_prompt_tokens > max_input_tokens

Patching only _token_len_check (the obvious whack-a-mole) makes the validator pass, but get_encode_kwargs is still capping the tokenizer. The encoder silently truncates the prompt and the now-lenient validator waves it through — for left-truncation chat models the cut point shifts every turn as the conversation grows, breaking prefix-cache locality (0% hit rate). Right-truncation drops the most recent messages. Either way the surface-level fix masks the bug rather than removing it.

Fixing the property at the source corrects all seven consumers consistently and eliminates the silent-truncation path.
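
The prefix-cache effect of left truncation can be illustrated with a toy model. This is only a sketch under simplifying assumptions: tokens are plain integers, each turn appends to the same conversation, and the helper names and the cap of 20 are hypothetical rather than taken from vLLM.

```python
def left_truncate(tokens: list[int], max_len: int) -> list[int]:
    """Keep only the last max_len tokens (left truncation)."""
    return tokens[-max_len:]


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# A conversation modeled as distinct token ids; turn 2 extends turn 1.
conversation = list(range(100))
turn_1 = conversation[:30]
turn_2 = conversation[:40]

cap = 20  # hypothetical truncation cap, smaller than either prompt

untruncated_shared = common_prefix_len(turn_1, turn_2)
truncated_shared = common_prefix_len(
    left_truncate(turn_1, cap), left_truncate(turn_2, cap)
)

print(untruncated_shared)  # 30
print(truncated_shared)    # 0
```

Without truncation, turn 2 shares its entire first 30 tokens with turn 1, so a prefix cache can reuse them. With left truncation the window slides forward each turn, the first token changes, and no prefix is shared.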

Test plan

Three regression tests in tests/renderers/test_completions.py:

  • test_large_max_output_tokens_does_not_reject_valid_token_input: max_total_tokens=100, max_output_tokens=80, 70-token prompt → succeeds (would fail under old semantics: max_input_tokens = 20).
  • test_large_max_output_tokens_does_not_truncate_text_input: same parameters with text input; asserts tokenizer.encode is called with max_length=101 (max_total_tokens + 1) rather than 21, so the prompt isn't silently truncated. This is the actual prefix-cache regression.
  • test_input_overflowing_context_still_rejected_with_max_output_tokens: sanity check that genuine context overflow (input > max_total_tokens) is still rejected.

All 25 tests in tests/renderers/test_completions.py pass locally.
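
The three scenarios can be mirrored with a toy stand-in class. This is a sketch only: ToyTokenizeParams, its reserve_output flag, and its methods are hypothetical and do not reproduce the real TokenizeParams interface or the actual test code.

```python
from dataclasses import dataclass


@dataclass
class ToyTokenizeParams:
    """Hypothetical stand-in; not vLLM's TokenizeParams."""

    max_total_tokens: int
    max_output_tokens: int
    reserve_output: bool  # True models the old (buggy) semantics

    @property
    def max_input_tokens(self) -> int:
        if self.reserve_output:
            return self.max_total_tokens - self.max_output_tokens
        return self.max_total_tokens

    def encode_max_length(self) -> int:
        # The max_length that would be handed to tokenizer.encode,
        # following the max_input_tokens + 1 pattern quoted in the PR.
        return self.max_input_tokens + 1

    def accepts(self, token_count: int) -> bool:
        return token_count <= self.max_input_tokens


old = ToyTokenizeParams(100, 80, reserve_output=True)
new = ToyTokenizeParams(100, 80, reserve_output=False)

print(old.accepts(70), new.accepts(70))                  # False True
print(old.encode_max_length(), new.encode_max_length())  # 21 101
print(new.accepts(101))                                  # False
```

The last line corresponds to the sanity check: a prompt longer than the full context is still rejected under the new semantics.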

What stays correct

  • __post_init__ still rejects max_output_tokens > max_total_tokens (genuine bug).
  • _token_len_check still rejects len(tokens) > max_total_tokens (genuine context overflow).
  • _text_len_check still rejects on char count, against the realistic ceiling.
  • get_max_tokens() in entrypoints/utils.py continues to clamp sampling_params.max_tokens to max_total_tokens - len(prompt_tokens) — output stays bounded.
  • /v1/responses already does the equivalent clamp at responses/serving.py:682; behavior now converges across endpoints.

Changed files

  • tests/renderers/test_completions.py (modified, +115/-0)
  • vllm/renderers/params.py (modified, +68/-25)

Bug: vLLM rejects requests when max_tokens exceeds available context, treating it as a hard requirement rather than an upper bound

Environment:

  • vLLM Version: 0.20.x (latest dev)

Reproduction:

  1. Start vLLM server with --max-model-len 128000
  2. Send chat completion request with:
    • max_tokens: 65535 (streaming safety cap)
    • Input prompt tokenized to 62466 tokens
  3. vLLM returns 400 error:
VLLMValidationError: This model's maximum context length is 128000 tokens. 
However, you requested 65535 output tokens and your prompt contains at 
least 62466 input tokens, for a total of at least 128001 tokens.

Error location: vllm/renderers/params.py:418 in _token_len_check()

Impact:

  • Zed Editor: AI assistant using max_tokens: 65535 caps
  • Factory.ai: Agentic code generation
  • Opencode: CLI tool with streaming safety limits

GPU-limited deployments cannot use these tools with vLLM. When GPU memory limits how far --max-model-len can be raised, a long prompt plus the tool's fixed max_tokens cap exceeds the configured context length, so requests near the ceiling are rejected.

Current workaround: Users configure artificially low max_tokens in their editor/IDE settings to avoid exceeding the context window, sacrificing the intended flexibility of a safety cap.

Proposed fix: In _token_len_check() at line 418, instead of raising VLLMValidationError, clamp max_output_tokens to available space:

available = self.max_total_tokens - token_count
effective_max_tokens = min(self.max_output_tokens, available)
# log warning that max_tokens was truncated
return tokens  # with truncated max_tokens

This makes vLLM compatible with tools that use max_tokens as an upper bound per the OpenAI API specification.
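
The proposed clamp can be written as a standalone function. This is a minimal sketch, not vLLM code: clamp_max_tokens and the logger name are hypothetical, and ValueError stands in for VLLMValidationError. Note that the linked PR takes a different route and relies on the existing clamp in get_max_tokens() rather than clamping inside _token_len_check().

```python
import logging

logger = logging.getLogger("clamp_sketch")  # hypothetical logger name


def clamp_max_tokens(
    max_total_tokens: int,
    requested_max_tokens: int,
    prompt_token_count: int,
) -> int:
    """Return the generation budget that fits in the remaining context."""
    available = max_total_tokens - prompt_token_count
    if available < 0:
        # The prompt alone overflows the context; clamping cannot help.
        raise ValueError(
            f"prompt has {prompt_token_count} tokens, "
            f"context length is {max_total_tokens}"
        )
    effective = min(requested_max_tokens, available)
    if effective < requested_max_tokens:
        logger.warning(
            "max_tokens clamped from %d to %d",
            requested_max_tokens,
            effective,
        )
    return effective


# Numbers from the reproduction above.
print(clamp_max_tokens(128000, 65535, 62466))  # 65534
```

With the reproduction's numbers the request would proceed with a budget of 65534 tokens instead of failing with a 400 error.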
