vllm - 💡(How to fix) Fix [Bug]: KV Cache Memory Error with 262K Context on High VRAM Setup (Regression from Previous Version) [4 comments, 4 participants]

GitHub stats
vllm-project/vllm#37951 · Fetched 2026-04-08 01:22:27
Comments: 4 · Participants: 4 · Timeline events: 7 · Reactions: 0
Timeline (top): commented ×4, labeled ×1, mentioned ×1, subscribed ×1

Error Message

The engine fails during initialization with the error below. Given a ~96GB VRAM environment, this configuration should not hit KV cache limits; previous versions were able to run similar configurations without triggering this error.

(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108] EngineCore failed to start.
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     super().__init__(
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 258, in _initialize_kv_caches
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     kv_cache_configs = get_kv_cache_configs(
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1579, in get_kv_cache_configs
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     _check_enough_kv_cache_memory(
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 644, in _check_enough_kv_cache_memory
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     raise ValueError(
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108] ValueError: To serve at least one request with the models's max seq len (262144), (27.45 GiB KV cache is needed, which is larger than the available KV cache memory (23.43 GiB). Based on the available memory, the estimated maximum model length is 220480. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.


Your current environment

Description: When running vLLM with a 262144 max sequence length, the engine fails to initialize due to insufficient KV cache memory, despite having a high VRAM configuration (~96GB). This behavior did not occur in previous versions under similar or identical setups.

Reproduction Command: (see attached log)

Observed Behavior: The engine fails during initialization with the following error:

ValueError: To serve at least one request with the model's max seq len (262144), 27.45 GiB KV cache is needed, but only 23.43 GiB is available.

The system reports:

  • Available KV cache memory: 23.43 GiB
  • Required KV cache memory: 27.45 GiB

This leads to engine startup failure.
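The size of the gap is easier to judge with a little arithmetic. The sketch below uses only the three figures quoted in the error message; the proportional length estimate is a simplifying assumption (KV cost treated as linear in sequence length), not how vLLM computes its own estimate.

```python
GIB = 1024 ** 3

# Figures quoted in the error message.
required_gib = 27.45
available_gib = 23.43
max_model_len = 262_144

# How much more KV cache memory the engine would need.
shortfall_gib = required_gib - available_gib

# Average KV cache cost per token implied by the required total.
bytes_per_token = required_gib * GIB / max_model_len

# Naive proportional estimate of the length the available memory can hold
# (assumption: cost scales linearly with sequence length).
proportional_len = int(max_model_len * available_gib / required_gib)

print(f"shortfall: {shortfall_gib:.2f} GiB")
print(f"implied KV cost: {bytes_per_token / 1024:.1f} KiB per token")
print(f"proportional length estimate: {proportional_len} tokens")
```

The proportional figure comes out somewhat higher than the 220480 tokens the engine reports, so the engine's own estimate presumably accounts for details this arithmetic ignores. Treat 220480 as the safer number.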

Expected Behavior: Given a ~96GB VRAM environment, this configuration should not hit KV cache limits. Previous versions were able to run similar configurations without triggering this error.

Additional Notes:

  • Model loads successfully (~50.76 GiB used) before KV cache allocation failure
  • Prefix caching and speculative decoding are enabled
  • No explicit GPU memory cap was set beyond defaults
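These notes are enough for a rough memory budget. The sketch below assumes the card exposes exactly 96 GiB and that gpu_memory_utilization was left at vLLM's default of 0.9; neither is confirmed by the issue, which only says "~96GB" and "defaults". The weights and KV cache figures are the ones reported above.

```python
# Assumptions (not confirmed by the issue): exactly 96 GiB of VRAM and the
# default gpu_memory_utilization of 0.9.
total_vram_gib = 96.0
gpu_memory_utilization = 0.9

# Figures reported in the issue.
weights_gib = 50.76
kv_available_gib = 23.43

# Memory vLLM allows itself to use on the device.
budget_gib = total_vram_gib * gpu_memory_utilization

# What is left once the model weights are loaded.
after_weights_gib = budget_gib - weights_gib

# Whatever remains was reserved for something other than weights and KV cache.
other_gib = after_weights_gib - kv_available_gib

print(f"budget: {budget_gib:.2f} GiB")
print(f"after weights: {after_weights_gib:.2f} GiB")
print(f"not weights, not KV cache: {other_gib:.2f} GiB")
```

Under these assumptions roughly 12 GiB of the budget goes to neither weights nor KV cache. The issue does not say what consumes it, so any attribution (activation memory measured during profiling, the speculative decoding setup, other processes on the GPU) would be a guess to check against the full startup log.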

Regression: This appears to be a regression, as the same setup did not fail in earlier versions.

Request: Please clarify:

  1. Whether KV cache allocation logic has changed in recent versions
  2. Why available KV cache memory is significantly lower than expected given total VRAM
  3. Whether additional configuration is now required to utilize full GPU memory

Environment:

  • vLLM version: 0.18.1rc1.dev32
  • GPU: ~96GB VRAM
  • Docker + WSL environment
  • Model: Qwen3.5-based FP8 variant

Logs:

(EngineCore pid=47) WARNING 03-24 01:59:47 [kv_cache_utils.py:1059] Add 3 padding layers, may waste at most 4.17% KV cache memory
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108] EngineCore failed to start.
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     super().__init__(
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 258, in _initialize_kv_caches
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     kv_cache_configs = get_kv_cache_configs(
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1579, in get_kv_cache_configs
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     _check_enough_kv_cache_memory(
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 644, in _check_enough_kv_cache_memory
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108]     raise ValueError(
(EngineCore pid=47) ERROR 03-24 01:59:47 [core.py:1108] ValueError: To serve at least one request with the models's max seq len (262144), (27.45 GiB KV cache is needed, which is larger than the available KV cache memory (23.43 GiB). Based on the available memory, the estimated maximum model length is 220480. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore pid=47) Process EngineCore:
(EngineCore pid=47) Traceback (most recent call last):
(EngineCore pid=47)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=47)     self.run()
(EngineCore pid=47)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=47)     self._target(*self._args, **self._kwargs)
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1112, in run_engine_core
(EngineCore pid=47)     raise e
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=47)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=47)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=47)     return func(*args, **kwargs)
(EngineCore pid=47)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=47)     super().__init__(
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=47)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=47)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=47)     return func(*args, **kwargs)
(EngineCore pid=47)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 258, in _initialize_kv_caches
(EngineCore pid=47)     kv_cache_configs = get_kv_cache_configs(
(EngineCore pid=47)                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1579, in get_kv_cache_configs
(EngineCore pid=47)     _check_enough_kv_cache_memory(
(EngineCore pid=47)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 644, in _check_enough_kv_cache_memory
(EngineCore pid=47)     raise ValueError(
(EngineCore pid=47) ValueError: To serve at least one request with the models's max seq len (262144), (27.45 GiB KV cache is needed, which is larger than the available KV cache memory (23.43 GiB). Based on the available memory, the estimated maximum model length is 220480. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
[rank0]:[W324 01:59:47.618056694 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
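When collecting this failure from several runs, the numbers in the ValueError line can be pulled out of the log text with a regular expression. This is a minimal sketch written against the exact wording in the log above (including its "models's" typo and stray parenthesis); the helper name parse_kv_cache_error is hypothetical, and the wording may differ in other vLLM versions.

```python
import re

# The error text as it appears in the log above.
message = (
    "ValueError: To serve at least one request with the models's max seq len "
    "(262144), (27.45 GiB KV cache is needed, which is larger than the "
    "available KV cache memory (23.43 GiB). Based on the available memory, "
    "the estimated maximum model length is 220480."
)

# Pattern keyed to the phrases around each number in that message.
pattern = re.compile(
    r"max seq len \((?P<max_len>\d+)\).*?"
    r"(?P<needed>[\d.]+) GiB KV cache is needed.*?"
    r"available KV cache memory \((?P<available>[\d.]+) GiB\).*?"
    r"estimated maximum model length is (?P<estimate>\d+)",
    re.DOTALL,
)


def parse_kv_cache_error(text):
    """Return the figures from the KV cache error, or None if it is absent."""
    match = pattern.search(text)
    if match is None:
        return None
    return {
        "max_len": int(match["max_len"]),
        "needed_gib": float(match["needed"]),
        "available_gib": float(match["available"]),
        "estimated_max_len": int(match["estimate"]),
    }


info = parse_kv_cache_error(message)
print(info)
```

Returning None for non-matching text lets the helper be run over a whole log file without raising on unrelated lines.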



extent analysis

Fix Plan

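To resolve the insufficient KV cache memory error, try the following steps, which follow the two suggestions in the error message. When launching through the vllm serve CLI (for example in Docker), the equivalent flags are --gpu-memory-utilization and --max-model-len.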

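  • Increase gpu_memory_utilization when initializing the engine. The default is 0.9, so the new value has to be above that to leave more room for the KV cache. A sketch using the offline LLM entry point, with "<model-id>" standing in for the model being served:
from vllm import LLM

# Raise gpu_memory_utilization above the 0.9 default
llm = LLM(model="<model-id>", gpu_memory_utilization=0.95)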
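  • Decrease max_model_len when initializing the engine. The error message estimates 220480 as the largest length that fits in the available memory ("<model-id>" is a placeholder):
from vllm import LLM

# Cap the context length at the estimate reported in the error message
llm = LLM(model="<model-id>", max_model_len=220480)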

Alternatively, combine both ("<model-id>" is a placeholder for the model being served):

from vllm import LLM

# Raise gpu_memory_utilization above the 0.9 default and cap max_model_len
llm = LLM(model="<model-id>", gpu_memory_utilization=0.95, max_model_len=220480)
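A rough way to pick a gpu_memory_utilization value is to ask what fraction of the card the missing KV cache memory represents. The sketch below assumes exactly 96 GiB of VRAM, a starting point at the 0.9 default, and that every extra GiB granted goes to the KV cache; all three are assumptions rather than facts from the issue, and min_utilization is a hypothetical helper.

```python
import math

total_vram_gib = 96.0          # assumption: the issue only says "~96GB"
current_utilization = 0.9      # vLLM's default gpu_memory_utilization
shortfall_gib = 27.45 - 23.43  # required minus available, from the error


def min_utilization(total_gib, current, shortfall):
    """Smallest utilization, rounded up to two decimals, that covers the shortfall.

    Assumes all additional memory is handed to the KV cache.
    """
    needed = current + shortfall / total_gib
    rounded = math.ceil(needed * 100) / 100
    if rounded >= 1.0:
        raise ValueError("shortfall cannot be covered by raising utilization alone")
    return rounded


needed = min_utilization(total_vram_gib, current_utilization, shortfall_gib)
print(f"try gpu_memory_utilization >= {needed:.2f}")
```

Under these assumptions a value around 0.95 would be just enough for the full 262144 context, with little margin. If other processes hold GPU memory, as can happen under WSL, lowering max_model_len may be the more dependable option.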

Verification

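To verify that the fix worked, initialize the engine and run a short generation. If initialization completes without the ValueError, the KV cache fits in the available memory ("<model-id>" is a placeholder):

from vllm import LLM, SamplingParams

# Initialize the engine with the fix
llm = LLM(model="<model-id>", gpu_memory_utilization=0.95, max_model_len=220480)

# Run a short request to confirm the engine can serve output
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)

If the engine initializes successfully, the startup log should also report the size of the KV cache that was allocated.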

Extra Tips

  • Make sure to check the documentation for the latest information on configuring the engine and conserving memory: https://docs.vllm.ai/en/latest/configuration/conserving_memory/
  • If you are still experiencing issues, try reducing the model size or using a different model architecture to reduce the memory requirements.
