vllm - ✅(Solved) Fix [Bug]: In case chunked prefill is enabled and max-num-batched-tokens > max-model-length the server does not start up and fails [1 pull requests, 1 comments, 2 participants]

BKaurHarpreet · 2026-04-16T07:08:09Z



GitHub stats

vllm-project/vllm#39976 • Fetched 2026-04-17 08:28:03


Author: BKaurHarpreet
Participants: BKaurHarpreet, ianliuy
Timeline (top): renamed ×2, commented ×1, cross-referenced ×1, labeled ×1

Root Cause

This is the official definition that I got. My understanding, therefore, is that max-num-batched-tokens can be larger than max-model-length (the number of tokens in a single request), because batched tokens can include tokens from both the prefill and decode phases and from multiple requests. However, whenever max-num-batched-tokens > max-model-length, the server did not start up.
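To make the reporter's reasoning concrete, here is a toy sketch of how a single scheduling step could fill a token budget from several requests at once. It is not vLLM's scheduler: the function `fill_batch` and every number below are assumptions chosen only for illustration.

```python
def fill_batch(decode_requests, pending_prefill_lens, token_budget):
    """Toy model of one scheduling step with chunked prefill.

    Each running (decode) request contributes one token; the remaining
    budget is filled with chunks of waiting prompts. Returns the number
    of tokens scheduled per request.
    """
    scheduled = []
    remaining = token_budget
    # Decode phase: one new token per running request.
    for _ in range(decode_requests):
        if remaining == 0:
            break
        scheduled.append(1)
        remaining -= 1
    # Prefill phase: take as much of each prompt as the budget allows.
    for prompt_len in pending_prefill_lens:
        if remaining == 0:
            break
        chunk = min(prompt_len, remaining)
        scheduled.append(chunk)
        remaining -= chunk
    return scheduled


max_model_len = 1024            # per-request limit (hypothetical value)
max_num_batched_tokens = 4096   # per-step budget (hypothetical value)

batch = fill_batch(
    decode_requests=8,
    pending_prefill_lens=[1000, 1000, 1000, 1000, 1000],
    token_budget=max_num_batched_tokens,
)
print("tokens in this step:", sum(batch))
print("largest single-request share:", max(batch))
```

In this toy run no single request exceeds the per-request limit, yet the step as a whole carries more tokens than `max_model_len`, which is the situation the reporter expected to be allowed.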

Fix Action

Fixed

Fixed by PR: fix: allow max_num_batched_tokens > max_model_len with chunked prefill (https://github.com/vllm-project/vllm/pull/40063)

PR fix notes

PR #40063: fix: allow max_num_batched_tokens > max_model_len with chunked prefill

Repository: vllm-project/vllm
Author: ianliuy
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40063

Description (problem / solution / changelog)

What's broken?

When chunked prefill is enabled, the server may fail to start or emit misleading warnings if max_num_batched_tokens exceeds max_model_len. Users who explicitly set a large max_num_batched_tokens (which is valid for chunked prefill since tokens are processed in chunks across multiple requests) encounter unnecessary validation friction.

Who is affected?

Users who enable chunked prefill and either:

- Explicitly set `max_num_batched_tokens > max_num_seqs * max_model_len`
- Rely on default `max_num_batched_tokens` with small `max_num_seqs` or `max_model_len` values

This does not affect users with chunked prefill disabled — all existing validation for that case is preserved.

Why does it happen?

Two validation/clamping paths did not account for chunked prefill:

1. `SchedulerConfig.verify_max_model_len`: The warning for `max_num_batched_tokens > max_num_seqs * max_model_len` fired unconditionally, even though with chunked prefill this is a valid configuration (tokens come from partial prefills across multiple requests).
2. `EngineArgs._set_default_max_num_seqs_and_batched_tokens_args`: The default `max_num_batched_tokens` was clamped to `min(max_num_seqs * max_model_len, default)` even when chunked prefill was enabled. This unnecessarily restricted the default batch size.
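The two paths can be illustrated with a simplified stand-alone sketch. This is not the vLLM source: `warns_before_fix`, `default_before_fix`, and the default value of 2048 are hypothetical stand-ins that only mirror the behavior described above.

```python
def warns_before_fix(max_num_batched_tokens, max_num_seqs, max_model_len):
    """Path 1 as described: the warning condition ignored chunked prefill."""
    return max_num_batched_tokens > max_num_seqs * max_model_len


def default_before_fix(default_tokens, max_num_seqs, max_model_len):
    """Path 2 as described: the default was clamped regardless of chunked prefill."""
    return min(max_num_seqs * max_model_len, default_tokens)


# Hypothetical small deployment with chunked prefill enabled.
max_num_seqs, max_model_len = 2, 512

# An explicit budget above max_num_seqs * max_model_len (here 1024)
# satisfies the warning condition.
explicit_warns = warns_before_fix(4096, max_num_seqs, max_model_len)
print("warning fires:", explicit_warns)

# An assumed default of 2048 is cut down to the product as well.
clamped_default = default_before_fix(2048, max_num_seqs, max_model_len)
print("default after clamping:", clamped_default)
```

Both results follow from the same product `max_num_seqs * max_model_len`, which is why small values of either setting are enough to hit the problem.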

How did we fix it?

1. `vllm/config/scheduler.py`: Added `and not self.enable_chunked_prefill` guard to the warning at line 281, consistent with the existing guard on the error check at line 261.
2. `vllm/engine/arg_utils.py`: Moved the default `max_num_batched_tokens` clamping (`min(max_num_seqs * max_model_len, ...)`) inside the existing `if not self.enable_chunked_prefill` block, so it only applies when chunked prefill is disabled.

Both changes are minimal and surgical — no unrelated code is modified.
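A minimal sketch of the guarded logic, assuming the behavior described in the PR. The function names and the default of 2048 are hypothetical; the real changes live in `SchedulerConfig.verify_max_model_len` and `EngineArgs._set_default_max_num_seqs_and_batched_tokens_args`, whose actual code is not reproduced here.

```python
def should_warn(max_num_batched_tokens, max_num_seqs, max_model_len,
                enable_chunked_prefill):
    """Warn only when chunked prefill is disabled (the added guard)."""
    return (
        max_num_batched_tokens > max_num_seqs * max_model_len
        and not enable_chunked_prefill
    )


def resolve_default(default_tokens, max_num_seqs, max_model_len,
                    enable_chunked_prefill):
    """Clamp the default only inside the 'chunked prefill disabled' branch."""
    if not enable_chunked_prefill:
        return min(max_num_seqs * max_model_len, default_tokens)
    return default_tokens


# Compare both settings of the flag for one hypothetical configuration.
results = {}
for chunked in (False, True):
    results[chunked] = (
        should_warn(4096, 2, 512, chunked),
        resolve_default(2048, 2, 512, chunked),
    )
    print("chunked prefill:", chunked, "->", results[chunked])
```

With chunked prefill disabled, the warning and the clamp behave as before; with it enabled, both are skipped.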

How do we know it works?

- All existing validation for non-chunked-prefill cases is preserved (the guards only skip checks when `enable_chunked_prefill=True`)
- `ruff check` and `ruff format --check` pass on both changed files
- The change is consistent with the existing pattern at lines 261-264 of `scheduler.py`, which already has the same `and not self.enable_chunked_prefill` guard

Fixes #39976

Changed files

- `vllm/config/scheduler.py` (modified, +4/-1)
- `vllm/engine/arg_utils.py` (modified, +9/-7)


extent analysis

TL;DR

The server reportedly fails to start when chunked prefill is enabled and max-num-batched-tokens is greater than max-model-length. Adjusting these parameters may work around the issue until the fix in PR #40063, which is still open, is merged.

Guidance

- Review the max-num-batched-tokens and max-model-length settings to confirm they are set as intended and compatible with each other.
- As a workaround, consider reducing max-num-batched-tokens to at most max-model-length so the server can start.
- Verify the result by checking the server startup logs or console output for error messages related to token limits.
- If the issue persists, try adjusting related settings such as max_num_seqs, since the affected validation compares max_num_batched_tokens against max_num_seqs * max_model_len.
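As a sketch of the workaround above, a small helper could cap the requested budget before the server is launched. `capped_batched_tokens` and `build_serve_args` are hypothetical helpers, and the model name is a placeholder. The flag spellings (`--max-model-len`, `--max-num-batched-tokens`, `--enable-chunked-prefill`) are an assumption about the CLI, since the issue text writes the settings as max-model-length and max-num-batched-tokens.

```python
def capped_batched_tokens(requested, max_model_len):
    """Workaround: never request more batched tokens than max_model_len."""
    return min(requested, max_model_len)


def build_serve_args(model, max_model_len, requested_batched_tokens):
    """Assemble a command line with the capped budget (flag names assumed)."""
    tokens = capped_batched_tokens(requested_batched_tokens, max_model_len)
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--max-num-batched-tokens", str(tokens),
        "--enable-chunked-prefill",
    ]


# Placeholder model name and hypothetical limits.
args = build_serve_args("my-org/my-model", 4096, 8192)
print(" ".join(args))
```

Capping the budget gives up the larger per-step batch that the PR describes as valid for chunked prefill, so this is only a stopgap.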

Notes

The exact solution may depend on the specific requirements and constraints of the application, and further experimentation may be needed to find the optimal configuration.

Recommendation

Apply workaround: Adjust the max-num-batched-tokens and max-model-length configuration settings to compatible values, as this is likely to resolve the issue and allow the server to start up.
