vllm - ✅(Solved) Fix [Bug]: Max token length incorrect when /nothink tag on Qwen3.5-4B [1 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#40689Fetched 2026-04-24 05:52:08
View on GitHub
Comments
0
Participants
1
Timeline
1
Reactions
0
Author
Participants
Timeline (top)
labeled ×1

Error Message

ERROR 04-22 14:48:25 [serving.py:311] vllm.exceptions.VLLMValidationError: You passed 1193 input tokens and requested 7000 output tokens. However, the model's context length is only 8192 tokens, resulting in a maximum input length of 1192 tokens. Please reduce the length of the input prompt. (parameter=input_tokens, value=1193)

Root Cause

I am running Qwen3.5-4B with vLLM and the flag {"enable_thinking": False}. When I set max_tokens to a specific number, the loop does not stop early enough, potentially because the /nothink (added from {"enable_thinking": False}) flag is not considered. The resulting number of tokens is always 1 more than the maximum context length I set when I deploy.

PR fix notes

PR #40775: [Bugfix] Don't reserve max_tokens strictly for chat completions

Description (problem / solution / changelog)

Summary

Chat completions hard-failed validation when the chat template prepended tokens the caller could not easily predict. The most reproducible case is Qwen3.5 with enable_thinking=False, where the template injects <think>\n\n</think>\n\n (2 extra tokens vs. the default <think>\n). With a prompt sized to fit max_model_len - max_tokens, the post-template prompt overshoots by a token or two and the renderer raises VLLMValidationError, even though get_max_tokens would have auto-capped the generation budget moments later.

This PR introduces TokenizeParams.reserve_max_output_tokens (default True, preserving the completion / pooling / tokenize / responses behavior) and opts chat completions out of the strict reservation. The renderer still fails when the prompt alone exceeds max_model_len, and truncation/padding semantics tied to max_input_tokens (e.g. truncate_prompt_tokens=-1) are unchanged.

Fixes #40689.

Why this is not a duplicate

  • gh pr list --repo vllm-project/vllm --state open --search "40689 in:body" → no matches
  • gh pr list --repo vllm-project/vllm --state open --search "nothink enable_thinking" → no matches
  • gh pr list --repo vllm-project/vllm --state all --search "max_tokens nothink" → no matches

The closest related change is #36197 (misleading error message wording), which only adjusts the error text and does not address the strict reservation.

What changed

  • vllm/renderers/params.py
    • Add reserve_max_output_tokens: bool = True and _length_check_limit property
    • _text_len_check, _token_len_check, get_encode_kwargs use the new limit; error messages adapt when reservation is disabled
    • with_kwargs propagates the new flag
  • vllm/entrypoints/openai/chat_completion/protocol.py
    • ChatCompletionRequest.build_tok_params sets reserve_max_output_tokens=False (with rationale comment)
  • tests/renderers/test_completions.py
    • Add two regression tests: overshoot is allowed when opted out; prompts above max_model_len still error

Test plan

  • pytest tests/renderers/test_completions.py -q → 24 passed
  • pre-commit run --files vllm/renderers/params.py vllm/entrypoints/openai/chat_completion/protocol.py tests/renderers/test_completions.py → all hooks pass (ruff, ruff format, typos, mypy, SPDX, …)
  • Local repro using Qwen3.5-4B + enable_thinking=False with max_model_len=8192 and max_tokens=7000: pre-fix the request raises VLLMValidationError; post-fix it succeeds. A prompt > max_model_len still raises with an updated message ("the prompt alone exceeds the model's context window"). The default completion path keeps the strict reservation.

Changed files

  • tests/renderers/test_completions.py (modified, +47/-0)
  • vllm/entrypoints/openai/chat_completion/protocol.py (modified, +7/-0)
  • vllm/renderers/params.py (modified, +64/-20)

Code Example

ERROR 04-22 14:48:25 [serving.py:311] vllm.exceptions.VLLMValidationError: You passed 1193 input tokens and requested 7000 output tokens. However, the model's context length is only 8192 tokens, resulting in a maximum input length of 1192 tokens. Please reduce the length of the input prompt. (parameter=input_tokens, value=1193)
RAW_BUFFERClick to expand / collapse

Your current environment

<details> Cannot provide easily as running in Sagemaker </details>

🐛 Describe the bug

I am running Qwen3.5-4B with vLLM and the flag {"enable_thinking": False}. When I set max_tokens to a specific number, the loop does not stop early enough, potentially because the /nothink (added from {"enable_thinking": False}) flag is not considered. The resulting number of tokens is always 1 more than the maximum context length I set when I deploy.

The error is below:

ERROR 04-22 14:48:25 [serving.py:311] vllm.exceptions.VLLMValidationError: You passed 1193 input tokens and requested 7000 output tokens. However, the model's context length is only 8192 tokens, resulting in a maximum input length of 1192 tokens. Please reduce the length of the input prompt. (parameter=input_tokens, value=1193)

Here, I have a max context length of 8192 and set max_tokens to 7000, but since /nothink is added to the input, the input which was clipped at 1192 becomes 1193. If I change max_tokens to 5000, the input will be clipped at 3192, but with /nothink is 3193, again giving me 1 token over the maximum context.

I am using vllm:0.17.1-gpu-py312-cu129-ubuntu22.04-sagemaker

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Adjust the max_tokens parameter to account for the additional token added by the /nothink flag to prevent exceeding the maximum context length.

Guidance

  • Verify that the issue is indeed caused by the /nothink flag adding an extra token by checking the input length with and without the flag.
  • Calculate the effective maximum input length by subtracting 1 from the maximum context length and use this value to set max_tokens.
  • Consider reducing the max_tokens value by 1 to ensure it does not exceed the maximum context length when the /nothink flag is added.
  • Check the documentation for any updates or recommendations on using the enable_thinking flag and its impact on input length.

Example

No code snippet is provided as it is not clearly supported by the issue.

Notes

The solution assumes that the /nothink flag always adds exactly one token to the input. If this is not the case, further investigation may be needed to determine the correct adjustment for max_tokens.

Recommendation

Apply workaround: Adjust the max_tokens parameter to account for the additional token added by the /nothink flag, as this is a specific and targeted solution to the described issue.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING