vllm - ✅(Solved) Fix [Bug]: In case chunked prefill is enabled and max-num-batched-tokens > max-model-length the server does not start up and fails [1 pull request, 1 comment, 2 participants]

GitHub stats
vllm-project/vllm#39976 · Fetched 2026-04-17 08:28:03
Comments: 1 · Participants: 2 · Timeline events: 6 · Reactions: 0
Timeline (top): renamed ×2, commented ×1, cross-referenced ×1, labeled ×1

Root Cause

This is the official definition that I got. My understanding, therefore, is that max-num-batched-tokens can be larger than max-model-length (the number of tokens in a single request), because batched tokens can include tokens from both the prefill and decode phases and from multiple requests. However, whenever max-num-batched-tokens > max-model-length, the server did not start up.

Fix Action

Fixed

PR fix notes

PR #40063: fix: allow max_num_batched_tokens > max_model_len with chunked prefill

Description (problem / solution / changelog)

What's broken?

When chunked prefill is enabled, the server may fail to start or emit misleading warnings if max_num_batched_tokens exceeds max_model_len. Users who explicitly set a large max_num_batched_tokens (which is valid for chunked prefill since tokens are processed in chunks across multiple requests) encounter unnecessary validation friction.

Who is affected?

Users who enable chunked prefill and either:

  • Explicitly set max_num_batched_tokens > max_num_seqs * max_model_len
  • Rely on default max_num_batched_tokens with small max_num_seqs or max_model_len values

This does not affect users with chunked prefill disabled — all existing validation for that case is preserved.

Why does it happen?

Two validation/clamping paths did not account for chunked prefill:

  1. SchedulerConfig.verify_max_model_len: The warning for max_num_batched_tokens > max_num_seqs * max_model_len fired unconditionally, even though with chunked prefill this is a valid configuration (tokens come from partial prefills across multiple requests).

  2. EngineArgs._set_default_max_num_seqs_and_batched_tokens_args: The default max_num_batched_tokens was clamped to min(max_num_seqs * max_model_len, default) even when chunked prefill was enabled. This unnecessarily restricted the default batch size.
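
The two paths above can be illustrated with a minimal sketch. The function names and example values below are hypothetical stand-ins, not vLLM's actual code; they only mirror the conditions the PR description states.

```python
# Hypothetical stand-ins for the two pre-fix paths described above.
# These are NOT vLLM functions; they only reproduce the stated conditions.

def warns_before_fix(max_num_batched_tokens: int,
                     max_num_seqs: int,
                     max_model_len: int) -> bool:
    """Pre-fix warning condition: chunked prefill is not consulted at all."""
    return max_num_batched_tokens > max_num_seqs * max_model_len


def default_before_fix(default: int,
                       max_num_seqs: int,
                       max_model_len: int) -> int:
    """Pre-fix default: clamped even when chunked prefill is enabled."""
    return min(max_num_seqs * max_model_len, default)


# Assumed example values: 2 sequences, 1024-token context, default of 8192.
print(warns_before_fix(8192, max_num_seqs=2, max_model_len=1024))    # True
print(default_before_fix(8192, max_num_seqs=2, max_model_len=1024))  # 2048
```

With small max_num_seqs or max_model_len values, the product drops below the default, which is why users relying on defaults were affected as well.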

How did we fix it?

  1. vllm/config/scheduler.py: Added an `and not self.enable_chunked_prefill` guard to the warning at line 281, consistent with the existing guard on the error check at line 261.

  2. vllm/engine/arg_utils.py: Moved the default max_num_batched_tokens clamping (`min(max_num_seqs * max_model_len, ...)`) inside the existing `if not self.enable_chunked_prefill` block, so it only applies when chunked prefill is disabled.

Both changes are minimal and surgical — no unrelated code is modified.
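
As a rough sketch of what the two guarded paths look like after the change: the class and function below are hypothetical simplifications written for illustration, assuming only the conditions quoted in this description. They are not the contents of scheduler.py or arg_utils.py.

```python
from dataclasses import dataclass


@dataclass
class SchedulerSketch:
    """Hypothetical, simplified stand-in for the scheduler settings."""
    max_model_len: int
    max_num_seqs: int
    max_num_batched_tokens: int
    enable_chunked_prefill: bool

    def should_warn(self) -> bool:
        # The warning now fires only when chunked prefill is disabled.
        return (
            self.max_num_batched_tokens
            > self.max_num_seqs * self.max_model_len
            and not self.enable_chunked_prefill
        )


def default_batched_tokens(default: int, max_num_seqs: int,
                           max_model_len: int,
                           enable_chunked_prefill: bool) -> int:
    # The clamp now lives inside the "chunked prefill disabled" branch.
    if not enable_chunked_prefill:
        return min(max_num_seqs * max_model_len, default)
    return default


cfg = SchedulerSketch(max_model_len=1024, max_num_seqs=2,
                      max_num_batched_tokens=8192,
                      enable_chunked_prefill=True)
print(cfg.should_warn())                             # False
print(default_batched_tokens(8192, 2, 1024, True))   # 8192
print(default_batched_tokens(8192, 2, 1024, False))  # 2048
```

The last line shows that the non-chunked-prefill behavior is unchanged in this sketch: the clamp still applies there.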

How do we know it works?

  • All existing validation for non-chunked-prefill cases is preserved (the guards only skip checks when enable_chunked_prefill=True)
  • ruff check and ruff format --check pass on both changed files
  • The change is consistent with the existing pattern at lines 261-264 of scheduler.py, which already has the same `and not self.enable_chunked_prefill` guard

Fixes #39976

Changed files

  • vllm/config/scheduler.py (modified, +4/-1)
  • vllm/engine/arg_utils.py (modified, +9/-7)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

<img width="628" height="178" alt="Image" src="https://github.com/user-attachments/assets/d98af805-947e-425b-965a-8e4f8aff2952" /> <img width="940" height="356" alt="Image" src="https://github.com/user-attachments/assets/8c4a1e97-37b3-4dde-b242-d1c01e3c2820" /> <img width="448" height="98" alt="Image" src="https://github.com/user-attachments/assets/35fd9816-9a16-414c-b35b-e679320e9df6" />

This is the official definition that I got. My understanding, therefore, is that max-num-batched-tokens can be larger than max-model-length (the number of tokens in a single request), because batched tokens can include tokens from both the prefill and decode phases and from multiple requests. However, whenever max-num-batched-tokens > max-model-length, the server did not start up.


extent analysis

TL;DR

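With chunked prefill enabled, setting max-num-batched-tokens greater than max-model-length prevented the server from starting. According to PR #40063, the configuration itself is valid; the validation and default-clamping paths simply did not account for chunked prefill. A build that includes that PR should resolve the issue, and lowering max-num-batched-tokens is only a workaround.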

Guidance

  • Check whether chunked prefill is enabled, then compare max-num-batched-tokens against max-model-length and against max_num_seqs * max_model_len, the product used by the checks that PR #40063 changes.
  • Prefer a vLLM build that includes PR #40063, which lets max_num_batched_tokens exceed max_model_len when chunked prefill is enabled.
  • As a workaround on builds without the fix, reduce max-num-batched-tokens to a value no greater than max-model-length so the server can start.
  • Verify the result in the server startup logs or console output, looking for errors or warnings that mention max_num_batched_tokens or max_model_len.
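
A small pre-flight helper can make the relationship between these settings explicit before launching the server. This is a hypothetical sketch: the function name and messages are invented for illustration, and it only compares the values named on this page rather than calling into vLLM.

```python
def preflight_notes(max_num_batched_tokens: int, max_model_len: int,
                    max_num_seqs: int, enable_chunked_prefill: bool) -> list:
    """Report which of the relationships discussed in the issue hold.

    Hypothetical helper: it compares plain integers and does not
    import or call vLLM.
    """
    notes = []
    if max_num_batched_tokens > max_model_len:
        notes.append("max_num_batched_tokens > max_model_len")
    if max_num_batched_tokens > max_num_seqs * max_model_len:
        notes.append("max_num_batched_tokens > max_num_seqs * max_model_len")
    if notes and enable_chunked_prefill:
        notes.append("chunked prefill enabled: matches the reported combination")
    return notes


# Assumed example values, chosen only for illustration.
for note in preflight_notes(8192, 4096, 256, enable_chunked_prefill=True):
    print(note)
```

An empty result means neither comparison holds, so the combination reported in this issue does not apply to that configuration.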

Notes

The exact solution may depend on the specific requirements and constraints of the application, and further experimentation may be needed to find the optimal configuration.

Recommendation

Prefer a vLLM build that includes PR #40063. On builds without it, apply the workaround: set max-num-batched-tokens to a value no greater than max-model-length, which should allow the server to start up.
