vllm - ✅(Solved) Fix [Feature]: Add cap to --max-model-len auto (auto-fit with upper bound) [1 pull requests, 1 participants]

fl0rianr · 2026-04-30T11:04:27Z



GitHub stats

vllm-project/vllm#41364•Fetched 2026-05-01 05:33:59


Author: fl0rianr
Participants: fl0rianr
Timeline (top): cross-referenced ×1, labeled ×1, renamed ×1

Root Cause

vLLM already supports --max-model-len auto, which is very useful because vLLM can determine the maximum context length that fits the current memory budget after loading the model and profiling memory usage.

Fix Action

Fixed

Fixed by PR: [Core] feat: add optional cap for --max-model-len auto (https://github.com/vllm-project/vllm/pull/41391)

PR fix notes

PR #41391: [Core] feat: add optional cap for --max-model-len auto

Repository: vllm-project/vllm
Author: fl0rianr
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41391

Description (problem / solution / changelog)

Purpose

Adds --max-model-len-cap as an optional upper bound for --max-model-len auto.

When --max-model-len auto is used, vLLM still computes the maximum context length that fits in the available KV-cache memory, then applies the cap as:

effective_max_model_len = min(auto_fit_len, max_model_len_cap)

If --max-model-len-cap is not set, the existing auto-fit behavior is unchanged.

This allows deployments to keep the robustness of auto-fit while preventing vLLM from exposing or reserving KV cache for a context length above the intended product/deployment limit.
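
The capping rule above can be sketched in a few lines of Python. This is a minimal illustration of the stated `min(...)` semantics, not the code from the PR; the helper name `apply_max_model_len_cap` and the example lengths are hypothetical.

```python
from typing import Optional


def apply_max_model_len_cap(auto_fit_len: int,
                            max_model_len_cap: Optional[int] = None) -> int:
    """Return the effective max model length after applying an optional cap.

    Hypothetical helper for illustration; not the function used in vLLM.
    """
    if max_model_len_cap is None:
        # No cap configured: the auto-fit result is used unchanged.
        return auto_fit_len
    # Cap configured: never exceed it, but keep a smaller fitted value.
    return min(auto_fit_len, max_model_len_cap)


print(apply_max_model_len_cap(40_960, 2_048))  # cap is below the fitted length
print(apply_max_model_len_cap(1_500, 2_048))   # fitted length is already below the cap
print(apply_max_model_len_cap(40_960))         # no cap set
```

The cap only ever lowers the result: if the fitted length is already below the cap, the fitted length wins.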

Closes #41364.

Test Plan

Build and install vLLM from this branch.
Run syntax checks for the changed Python files.
Run targeted pytest coverage for parser/config/KV-cache auto-fit behavior.
Run single-GPU serving validation with a small model on an RTX 4090.
Run single-GPU serving validation with a small model on an RTX 5090.
Verify that capped auto-fit accepts normal requests and rejects oversized requests at the configured cap.
Verify that --max-model-len auto without --max-model-len-cap still starts and serves requests successfully.
Add minimal documentation for the new option.

Test Result

Built and installed vLLM from this branch:

Successfully installed vllm-0.1.dev16216+gfc6bc6d1e.cu131

Environment:

vllm: 0.1.dev16216+gfc6bc6d1e
torch: 2.11.0+cu128
cuda available: True
cuda version: 12.8
GPU 0: NVIDIA GeForce RTX 4090, capability (8, 9)
GPU 1: NVIDIA GeForce RTX 5090, capability (12, 0)

Syntax check:

python -m py_compile \
  vllm/config/model.py \
  vllm/engine/arg_utils.py \
  vllm/v1/core/kv_cache_utils.py \
  tests/v1/core/test_kv_cache_utils.py \
  tests/engine/test_arg_utils.py

Result: passed.

Targeted pytest coverage:

python -m pytest \
  tests/engine/test_arg_utils.py \
  tests/v1/core/test_kv_cache_utils.py \
  -k "max_model_len_cap or max_model_len or auto" \
  -q

Result:

8 passed, 113 deselected, 16 warnings in 16.39s

Serving validation used Qwen/Qwen3-0.6B.

RTX 4090 capped auto-fit:

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-0.6B \
  --max-model-len auto \
  --max-model-len-cap 2048 \
  --gpu-memory-utilization 0.35 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 2048

RTX 5090 capped auto-fit:

CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-0.6B \
  --max-model-len auto \
  --max-model-len-cap 2048 \
  --gpu-memory-utilization 0.35 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 2048

Verified on both GPUs:

short completion requests succeeded
oversized prompts were rejected with the configured capped maximum context length:

This model's maximum context length is 2048 tokens. However, you requested 16 output tokens and your prompt contains at least 2033 input tokens, for a total of at least 2049 tokens.
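
The arithmetic behind that rejection can be checked directly. The sketch below only reproduces the comparison implied by the error message; the function name `exceeds_context_limit` is hypothetical and this is not vLLM's actual validation code.

```python
def exceeds_context_limit(prompt_tokens: int, output_tokens: int,
                          max_model_len: int) -> bool:
    """Return True when prompt plus requested output exceeds the context limit.

    Hypothetical helper mirroring the numbers in the error message.
    """
    return prompt_tokens + output_tokens > max_model_len


# Numbers from the error message: 2033 input + 16 output = 2049 > 2048.
print(exceeds_context_limit(2033, 16, 2048))  # True
# One token fewer in the prompt lands exactly on the cap and is accepted.
print(exceeds_context_limit(2032, 16, 2048))  # False
```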

Regression check without cap:

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-0.6B \
  --max-model-len auto \
  --gpu-memory-utilization 0.35 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 2048

Result: server started successfully and served a completion request.

Full distributed/tensor-parallel serving tests were not run.

Notes

This PR was prepared with AI assistance. I reviewed the changed code and tests before submission.


Changed files

docs/configuration/conserving_memory.md (modified, +13/-0)
tests/engine/test_arg_utils.py (modified, +13/-0)
tests/v1/core/test_kv_cache_utils.py (modified, +56/-2)
vllm/config/model.py (modified, +24/-0)
vllm/engine/arg_utils.py (modified, +10/-1)

Code Example

vllm serve <model> \
  --max-model-len auto \
  --max-model-len-cap 131072


🚀 The feature, motivation and pitch

I would like to request combining this auto behavior with a user-defined upper bound.

Example:

vllm serve <model> \
  --max-model-len auto \
  --max-model-len-cap 131072

Semantics:

fitted_len = maximum context length vLLM can fit in the configured memory budget
effective_max_model_len = min(fitted_len, max_model_len_cap)

Motivation

auto is great when the goal is “use as much context as possible”. However, on shared or restricted hardware, this can be undesirable. If a model supports a very large context length, e.g. 1M tokens, auto may consume the available vLLM memory budget for KV cache, leaving less room for other local applications or parallel serving processes.

Simply reducing --gpu-memory-utilization or setting a fixed memory budget does not fully solve this. The useful context length depends on model weight size, architecture, KV cache dtype, parallelism, and other runtime details. A small model can consume the same memory budget as a larger model simply by fitting more context.
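
To make the last point concrete, here is a rough back-of-the-envelope sketch. It assumes standard attention where each token stores one key and one value vector per layer and per KV head, a single sequence, and a fixed KV-cache budget; it ignores block granularity and everything vLLM measures during profiling. The two model shapes are hypothetical, not taken from the issue.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Approximate KV-cache bytes needed for one token (simplified)."""
    # Factor 2 accounts for storing both the key and the value.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def fitted_len(kv_budget_bytes: int, bytes_per_token: int) -> int:
    """Largest whole number of tokens that fits in the KV-cache budget."""
    return kv_budget_bytes // bytes_per_token


KV_BUDGET = 4 * 1024**3  # hypothetical 4 GiB KV-cache budget

# Hypothetical shapes with 16-bit KV cache (2 bytes per element).
small = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128, dtype_bytes=2)
large = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128, dtype_bytes=2)

print("small model:", small, "bytes/token ->", fitted_len(KV_BUDGET, small), "tokens")
print("large model:", large, "bytes/token ->", fitted_len(KV_BUDGET, large), "tokens")
```

With the same budget, the shallower model fits more than twice as many tokens, so a memory limit alone does not pin down the served context length.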

Pitch

Use the best context length that fits, but never exceed the deployment/product limit.

This would be useful for local serving, small-business internal deployments, workstations, and shared GPUs.

Since vLLM already has the auto-fit logic and updates the effective max model length internally, this may be a relatively small code change: apply an optional cap after the auto-fit value is computed, before finalizing the engine configuration.

Alternatives

Example: the desired context is 128k tokens, but it may not always fit in memory, and the model's native maximum context is much larger.

Fixed --max-model-len

vllm serve <model> --max-model-len 131072

This works only if the chosen value fits. If it does not fit, startup fails instead of gracefully falling back to a smaller fitting value.

Plain --max-model-len auto

This is robust, but it may select a context length larger than the deployment wants to expose, causing unnecessary KV-cache memory usage.

--gpu-memory-utilization or --kv-cache-memory-bytes

These help limit memory, but they do not express the actual intent: automatically fit the context length, while capping the served maximum context.

External gateway limit

A proxy can reject requests above 128k tokens, but vLLM may still reserve KV cache for a larger auto-selected context length. This gives the right API behavior, but not the desired memory behavior.
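
The difference between a fixed length, plain auto-fit, and the requested capped auto-fit can be sketched as follows. This is an illustrative model of the behavior described in this section, not vLLM code; the function names and the two fitted lengths are hypothetical.

```python
from typing import Optional

DESIRED_LEN = 131_072  # the 128k deployment limit used in the example


def fixed_len(requested: int, fitted: int) -> int:
    """Fixed --max-model-len: fail if the requested length does not fit."""
    if requested > fitted:
        raise ValueError(f"requested {requested} tokens but only {fitted} fit")
    return requested


def auto_len(fitted: int, cap: Optional[int] = None) -> int:
    """Auto-fit, optionally capped as min(fitted, cap)."""
    return fitted if cap is None else min(fitted, cap)


def describe(fitted: int) -> dict:
    """Outcome of each strategy for a given fitted length."""
    try:
        fixed = fixed_len(DESIRED_LEN, fitted)
    except ValueError:
        fixed = "startup fails"
    return {
        "fixed": fixed,
        "auto": auto_len(fitted),
        "auto + cap": auto_len(fitted, DESIRED_LEN),
    }


# First case: more than 128k fits. Second case: only 90k fits.
for fitted in (200_000, 90_000):
    print(fitted, describe(fitted))
```

Only the capped variant gives the intended result in both cases: it stops at 131072 when more would fit and falls back to the fitted value when less fits.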


extent analysis

TL;DR

Implement an optional --max-model-len-cap parameter to limit the maximum context length determined by the --max-model-len auto behavior.

Guidance

Introduce a new command-line parameter --max-model-len-cap to specify the upper bound for the maximum context length.
Modify the auto-fit logic to apply the cap after determining the fitted length, using the formula effective_max_model_len = min(fitted_len, max_model_len_cap).
Update the engine configuration with the capped effective maximum model length.
Document the new parameter and its usage in the vLLM documentation.
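
The first step can be illustrated with a standalone `argparse` sketch. It is deliberately simplified and is not vLLM's argument parser in `vllm/engine/arg_utils.py`: the program name `serve-sketch` and the helper `parse_max_model_len` are hypothetical, and only plain integers or the literal `auto` are handled.

```python
import argparse
from typing import Union


def parse_max_model_len(value: str) -> Union[str, int]:
    """Accept the literal 'auto' or a plain integer (simplified)."""
    return "auto" if value == "auto" else int(value)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="serve-sketch")
    parser.add_argument("--max-model-len", type=parse_max_model_len, default=None)
    # Optional upper bound; None means no cap is applied.
    parser.add_argument("--max-model-len-cap", type=int, default=None)
    return parser


args = build_parser().parse_args(
    ["--max-model-len", "auto", "--max-model-len-cap", "131072"]
)
print(args.max_model_len, args.max_model_len_cap)  # auto 131072
```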

Example

vllm serve <model> \
  --max-model-len auto \
  --max-model-len-cap 131072

This example demonstrates how to use the proposed --max-model-len-cap parameter to limit the maximum context length.

Notes

The implementation of this feature should be relatively small, as it builds upon the existing auto-fit logic. However, thorough testing is necessary to ensure the correct behavior in various scenarios.

Recommendation

Adopt the --max-model-len-cap parameter proposed in PR #41391. It directly addresses the requirement to limit the maximum context length while keeping the auto-fit behavior, which suits deployments with a fixed product or memory limit. The PR is listed as open and not merged, so the option is available only by building from that branch.
