vllm - ✅(Solved) Fix [Bug]: Max token length incorrect when /nothink tag on Qwen3.5-4B [1 pull requests, 1 participants]

tama-biro · 2026-04-23T08:01:43Z

[vllm] PR 40775: Bugfix Don't reserve max tokens strictly for chat completions - Repository: vllm-project/vllm - Author: hoobnn - State: open | merged: False -… # PR #40775: [Bugfix] Don't reserve `max_tokens` strictly for chat completions - Repository: vllm-project/vllm - Author: hoobnn - State: open | merged: False - Link: https://github.com/vllm-project/vllm/pull/40775 ## Description (problem / solution / changelog) ## Summary Chat completions hard-failed validation when the chat template prepended tokens the caller could not easily predict. The most reproducible case is Qwen3.5 with `enable_thinking=False`, where the template injects ` \n\n \n\n` (2 extra tokens vs. the default ` \n`). With a prompt sized to fit `max_model_len - max_tokens`, the post-template prompt overshoots by a token or two and the renderer raises `VLLMValidationError`, even though `get_max_tokens` would have auto-capped the generation budget moments later. This PR introduces `TokenizeParams.reserve_max_output_tokens` (default `True`, preserving the completion / pooling / tokenize / responses behavior) and opts chat completions out of the strict reservation. The renderer still fails when the prompt alone exceeds `max_model_len`, and truncation/padding semantics tied to `max_input_tokens` (e.g. `truncate_prompt_tokens=-1`) are unchanged. Fixes #40689. ## Why this is not a duplicate - `gh pr list --repo vllm-project/vllm --state open --search "40689 in:body"` → no matches - `gh pr list --repo vllm-project/vllm --state open --search "nothink enable_thinking"` → no matches - `gh pr list --repo vllm-project/vllm --state all --search "max_tokens nothink"` → no matches The closest related change is #36197 (misleading error message wording), which only adjusts the error text and does not address the strict reservation. ## What changed - `vllm/renderers/params.py` - Add `reserve_max_output_tokens: bool = True` and `_length_check_limit` property - `_text_len_check`, `_token_len_check`, `get_encode_kwargs` use the new limit; error messages adapt when reservation is disabled - `with_kwargs` propagates the new flag - `vllm/entrypoints/openai/chat_completion/protocol.py` - `ChatCompletionRequest.build_tok_params` sets `reserve_max_output_tokens=False` (with rationale comment) - `tests/renderers/test_completions.py` - Add two regression tests: overshoot is allowed when opted out; prompts above `max_model_len` still error ## Test plan - [x] `pytest tests/renderers/test_completions.py -q` → 24 passed - [x] `pre-commit run --files vllm/renderers/params.py vllm/entrypoints/openai/chat_completion/protocol.py tests/renderers/test_completions.py` → all hooks pass (ruff, ruff format, typos, mypy, SPDX, …) - [x] Local repro using Qwen3.5-4B + `enable_thinking=False` with `max_model_len=8192` and `max_tokens=7000`: pre-fix the request raises `VLLMValidationError`; post-fix it succeeds. A prompt > `max_model_len` still raises with an updated message ("the prompt alone exceeds the model's context window"). The default completion path keeps the strict reservation. ## Changed files - `tests/renderers/test_completions.py` (modified, +47/-0) - `vllm/entrypoints/openai/chat_completion/protocol.py` (modified, +7/-0) - `vllm/renderers/params.py` (modified, +64/-20) ### Your current environment Cannot provide easily as running in Sagemaker ### 🐛 Describe the bug I am running Qwen3.5-4B with vLLM and the flag `{"enable_thinking": False}`. When I set max_tokens to a specific number, the loop does not stop early enough, potentially because the /nothink (added from `{"enable_thinking": False}`) flag is not considered. The resulting number of tokens is always 1 more than the maximum context length I set when I deploy. The error is below: ``` ERROR 04-22 14:48:25 [serving.py:311] vllm.exceptions.VLLMValidationError: You passed 1193 input tokens and requested 7000 output tokens. However, the model's context length is only 8192 tokens, resulting in a maximum input length of 1192 tokens. Please reduce the length of the input prompt. (parameter=input_tokens, value=1193) ``` Here, I have a max context length of 8192 and set max_tokens to 7000, but since /nothink is added to the input, the input which was clipped at 1192 becomes 1193. If I change max_tokens to 5000, the input will be clipped at 3192, but with /nothink is 3193, again giving me 1 token over the maximum context. I am using `vllm:0.17.1-gpu-py312-cu129-ubuntu22.04-sagemaker` ### Before submitting a new issue... - [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

vllm2026-04-23 08:01:43

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#40689•Fetched 2026-04-24 05:52:08

View on GitHub

Comments

Participants

Timeline

Reactions

Author

tama-biro

Participants

tama-biro

Timeline (top)

labeled ×1

Error Message

ERROR 04-22 14:48:25 [serving.py:311] vllm.exceptions.VLLMValidationError: You passed 1193 input tokens and requested 7000 output tokens. However, the model's context length is only 8192 tokens, resulting in a maximum input length of 1192 tokens. Please reduce the length of the input prompt. (parameter=input_tokens, value=1193)

Root Cause

I am running Qwen3.5-4B with vLLM and the flag {"enable_thinking": False}. When I set max_tokens to a specific number, the loop does not stop early enough, potentially because the /nothink (added from {"enable_thinking": False}) flag is not considered. The resulting number of tokens is always 1 more than the maximum context length I set when I deploy.

PR fix notes

PR #40775: [Bugfix] Don't reserve `max_tokens` strictly for chat completions

Repository: vllm-project/vllm
Author: hoobnn
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40775

Description (problem / solution / changelog)

Summary

Chat completions hard-failed validation when the chat template prepended tokens the caller could not easily predict. The most reproducible case is Qwen3.5 with enable_thinking=False, where the template injects <think>\n\n</think>\n\n (2 extra tokens vs. the default <think>\n). With a prompt sized to fit max_model_len - max_tokens, the post-template prompt overshoots by a token or two and the renderer raises VLLMValidationError, even though get_max_tokens would have auto-capped the generation budget moments later.

This PR introduces TokenizeParams.reserve_max_output_tokens (default True, preserving the completion / pooling / tokenize / responses behavior) and opts chat completions out of the strict reservation. The renderer still fails when the prompt alone exceeds max_model_len, and truncation/padding semantics tied to max_input_tokens (e.g. truncate_prompt_tokens=-1) are unchanged.

Fixes #40689.

Why this is not a duplicate

gh pr list --repo vllm-project/vllm --state open --search "40689 in:body" → no matches
gh pr list --repo vllm-project/vllm --state open --search "nothink enable_thinking" → no matches
gh pr list --repo vllm-project/vllm --state all --search "max_tokens nothink" → no matches

The closest related change is #36197 (misleading error message wording), which only adjusts the error text and does not address the strict reservation.

What changed

vllm/renderers/params.py
- Add reserve_max_output_tokens: bool = True and _length_check_limit property
- _text_len_check, _token_len_check, get_encode_kwargs use the new limit; error messages adapt when reservation is disabled
- with_kwargs propagates the new flag
vllm/entrypoints/openai/chat_completion/protocol.py
- ChatCompletionRequest.build_tok_params sets reserve_max_output_tokens=False (with rationale comment)
tests/renderers/test_completions.py
- Add two regression tests: overshoot is allowed when opted out; prompts above max_model_len still error

Test plan

pytest tests/renderers/test_completions.py -q → 24 passed
pre-commit run --files vllm/renderers/params.py vllm/entrypoints/openai/chat_completion/protocol.py tests/renderers/test_completions.py → all hooks pass (ruff, ruff format, typos, mypy, SPDX, …)
Local repro using Qwen3.5-4B + enable_thinking=False with max_model_len=8192 and max_tokens=7000: pre-fix the request raises VLLMValidationError; post-fix it succeeds. A prompt > max_model_len still raises with an updated message ("the prompt alone exceeds the model's context window"). The default completion path keeps the strict reservation.

Changed files

tests/renderers/test_completions.py (modified, +47/-0)
vllm/entrypoints/openai/chat_completion/protocol.py (modified, +7/-0)
vllm/renderers/params.py (modified, +64/-20)

Code Example

ERROR 04-22 14:48:25 [serving.py:311] vllm.exceptions.VLLMValidationError: You passed 1193 input tokens and requested 7000 output tokens. However, the model's context length is only 8192 tokens, resulting in a maximum input length of 1192 tokens. Please reduce the length of the input prompt. (parameter=input_tokens, value=1193)

RAW_BUFFERClick to expand / collapse

Your current environment

<details> Cannot provide easily as running in Sagemaker </details>

🐛 Describe the bug

The error is below:

ERROR 04-22 14:48:25 [serving.py:311] vllm.exceptions.VLLMValidationError: You passed 1193 input tokens and requested 7000 output tokens. However, the model's context length is only 8192 tokens, resulting in a maximum input length of 1192 tokens. Please reduce the length of the input prompt. (parameter=input_tokens, value=1193)

Here, I have a max context length of 8192 and set max_tokens to 7000, but since /nothink is added to the input, the input which was clipped at 1192 becomes 1193. If I change max_tokens to 5000, the input will be clipped at 3192, but with /nothink is 3193, again giving me 1 token over the maximum context.

I am using vllm:0.17.1-gpu-py312-cu129-ubuntu22.04-sagemaker

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Adjust the max_tokens parameter to account for the additional token added by the /nothink flag to prevent exceeding the maximum context length.

Guidance

Verify that the issue is indeed caused by the /nothink flag adding an extra token by checking the input length with and without the flag.
Calculate the effective maximum input length by subtracting 1 from the maximum context length and use this value to set max_tokens.
Consider reducing the max_tokens value by 1 to ensure it does not exceed the maximum context length when the /nothink flag is added.
Check the documentation for any updates or recommendations on using the enable_thinking flag and its impact on input length.

Example

No code snippet is provided as it is not clearly supported by the issue.

Notes

The solution assumes that the /nothink flag always adds exactly one token to the input. If this is not the case, further investigation may be needed to determine the correct adjustment for max_tokens.

Recommendation

Apply workaround: Adjust the max_tokens parameter to account for the additional token added by the /nothink flag, as this is a specific and targeted solution to the described issue.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#callback error #memory management #API rate limit #retriever error #indexing error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - ✅(Solved) Fix [Bug]: Max token length incorrect when /nothink tag on Qwen3.5-4B [1 pull requests, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

PR fix notes

PR #40775: [Bugfix] Don't reserve `max_tokens` strictly for chat completions

Description (problem / solution / changelog)

Summary

Why this is not a duplicate

What changed

Test plan

Changed files

Code Example

Your current environment

🐛 Describe the bug

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

vllm - ✅(Solved) Fix [Bug]: Max token length incorrect when /nothink tag on Qwen3.5-4B [1 pull requests, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

PR fix notes

PR #40775: [Bugfix] Don't reserve max_tokens strictly for chat completions

Description (problem / solution / changelog)

Summary

Why this is not a duplicate

What changed

Test plan

Changed files

Code Example

Your current environment

🐛 Describe the bug

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING

PR #40775: [Bugfix] Don't reserve `max_tokens` strictly for chat completions