vllm - ✅(Solved) Fix [Bug]: `redundancy_buffer_memory` is Never really used [2 pull requests, 1 comments, 1 participants]

GitHub stats
vllm-project/vllm#37419 (fetched 2026-04-08 00:57:36)
Comments: 1 · Participants: 1 · Timeline events: 7 · Reactions: 0
Timeline (top): cross-referenced ×2, commented ×1, labeled ×1, mentioned ×1

Fix Action

Fixed

PR fix notes

PR #37420: Fix redundancy_buffer_memory not taken account in determine_available_memory()

Description (problem / solution / changelog)


Purpose

Fix https://github.com/vllm-project/vllm/issues/37419

Test Plan

Reproduction

Any configuration running close to the memory limit can trigger OOM that the buffer was supposed to prevent. Example:

vllm serve <model> --gpu-memory-utilization=0.95 --max-num-batched-tokens=65536

Under load, if PyTorch caching allocator fragmentation exceeds the implicit margin from 1.0 - gpu_memory_utilization, the process OOMs despite the code claiming to reserve 150 MiB.

A redundancy_buffer_memory value was introduced by an earlier contributor (Lee Jie?), but it is never applied to the allocation. It is only used to compute the --kv-cache-memory= suggestion printed in a debug message.

available_kv_cache_memory_bytes must still leave room for non-KV memory (weights, activations, CUDA graph, NCCL/driver) and allocator fragmentation (reserved - allocated) during runtime. So the suggestion should reserve the same safety margin; otherwise users can still hit OOM near the limit.

I'm working on a more elegant solution than hardcoding 150 MiB; stay tuned.

Test Result

The value logged as "Available KV cache memory:" becomes slightly smaller, and the OOM should be mitigated.



Changed files

  • vllm/v1/worker/gpu_worker.py (modified, +7/-4)

PR #37440: fix: apply redundancy_buffer_memory to KV cache allocation

Description (problem / solution / changelog)

Summary

Fixes #37419.

  • redundancy_buffer_memory (150 MiB safety buffer) was computed in compile_or_warm_up_model() but never subtracted from available memory during KV cache allocation in determine_available_memory() -- it was only used to compute a suggested --kv-cache-memory= value in a debug log message
  • This meant vLLM allocated KV cache with zero safety margin for caching allocator fragmentation and memory profiling inaccuracy, making it vulnerable to runtime OOM when running near the memory limit
  • Move the buffer subtraction into determine_available_memory() where the actual KV cache size is computed, and store on self so compile_or_warm_up_model() reuses the same value for consistency

Test plan

  • Verify existing tests pass (no gpu_worker unit tests exist; the change is a one-line arithmetic addition in a well-understood code path)
  • Manual: run vllm serve near memory limit and confirm 150 MiB less KV cache is allocated vs. before the fix

🤖 Generated with Claude Code

Changed files

  • vllm/v1/worker/gpu_worker.py (modified, +7/-4)


Your current environment

vLLM 0.17.1

🐛 Describe the bug

redundancy_buffer_memory (150 MiB) is never subtracted from KV cache allocation.

The 150 MiB redundancy_buffer_memory in gpu_worker.py (introduced to leave headroom for memory profiling inaccuracy) is not applied to the actual KV cache allocation. It is only used to compute a suggested --kv-cache-memory= value in a logger.debug() message inside compile_or_warm_up_model().

The actual KV cache size is determined in determine_available_memory():

# gpu_worker.py, determine_available_memory()
self.available_kv_cache_memory_bytes = (
    self.requested_memory
    - profile_result.non_kv_cache_memory
    - cudagraph_memory_estimate_applied
    # ← redundancy_buffer_memory is NOT subtracted here
)

While redundancy_buffer_memory is computed much later in a completely different method:

# gpu_worker.py, compile_or_warm_up_model()
redundancy_buffer_memory = 150 * (1 << 20)  # only used for log suggestion
kv_cache_memory_bytes_to_requested_limit = (
    int(self.requested_memory) - non_kv_cache_memory - redundancy_buffer_memory
)
# ↑ this value is **only printed via logger.debug(msg)**, never used for allocation

Impact

  1. Zero safety margin: vLLM allocates KV cache with no buffer for caching allocator fragmentation, making it vulnerable to runtime OOM when the reserved - allocated gap grows beyond profiling conditions
  2. Path inconsistency: If a user follows the logged --kv-cache-memory= suggestion (which includes the 150 MiB deduction), they get a buffer on their next startup. If they use --gpu-memory-utilization (the default), they get no buffer. Two paths produce different safety margins for the same workload
  3. False sense of security: The code comment says "leave a small buffer (=150MiB) to avoid OOM" but the buffer does not actually prevent any OOM
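The path inconsistency in item 2 can be made concrete with a small arithmetic sketch of the pre-fix behavior. The byte counts are made-up illustrative values, not measurements, and kv_cache_bytes is a hypothetical helper that only mirrors the formulas quoted in this issue (with the CUDA graph estimate assumed to be folded into non_kv for simplicity).

```python
MIB = 1 << 20
GIB = 1 << 30
REDUNDANCY_BUFFER_MEMORY = 150 * MIB  # constant quoted in the issue


def kv_cache_bytes(requested: int, non_kv: int, buffer: int = 0) -> int:
    """Hypothetical helper: budget left for KV cache after deductions."""
    return requested - non_kv - buffer


# Illustrative numbers only (assumptions, not profiling output).
requested = 76 * GIB  # e.g. 0.95 of an 80 GiB device
non_kv = 20 * GIB     # weights, activations, etc.

# Path 1: --gpu-memory-utilization (pre-fix): no buffer is deducted.
path_utilization = kv_cache_bytes(requested, non_kv)

# Path 2: user copies the logged --kv-cache-memory= suggestion,
# which already has the 150 MiB deducted.
path_suggestion = kv_cache_bytes(requested, non_kv, REDUNDANCY_BUFFER_MEMORY)

gap_mib = (path_utilization - path_suggestion) // MIB
print(f"Path 1 allocates {gap_mib} MiB more KV cache than path 2")
```

Under these assumptions the two startup paths differ by exactly the buffer size for the same workload.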

Reproduction

Any configuration running close to the memory limit can trigger OOM that the buffer was supposed to prevent. Example:

vllm serve <model> --gpu-memory-utilization=0.95 --max-num-batched-tokens=65536

Under load, if PyTorch caching allocator fragmentation exceeds the implicit margin from 1.0 - gpu_memory_utilization, the process OOMs despite the code claiming to reserve 150 MiB.
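As a rough illustration of the condition described above, the following sketch compares the implicit margin left by gpu_memory_utilization with a fragmentation figure. It is a simplified model of the claim in this issue, not vLLM code: the function names are hypothetical, the inputs are made-up, and it ignores other processes on the device. In a real PyTorch process the reserved and allocated figures would come from torch.cuda.memory_reserved() and torch.cuda.memory_allocated().

```python
GIB = 1 << 30


def implicit_margin_bytes(total_gpu_bytes: int, gpu_memory_utilization: float) -> int:
    """Memory left outside the requested budget: total * (1 - utilization)."""
    requested = round(total_gpu_bytes * gpu_memory_utilization)
    return total_gpu_bytes - requested


def fragmentation_bytes(reserved: int, allocated: int) -> int:
    """Caching-allocator gap: memory reserved but not currently allocated."""
    return reserved - allocated


def exceeds_margin(total: int, utilization: float, reserved: int, allocated: int) -> bool:
    """True when the fragmentation gap is larger than the implicit margin."""
    return fragmentation_bytes(reserved, allocated) > implicit_margin_bytes(total, utilization)


# Illustrative values (assumptions): an 80 GiB device at 0.95 utilization.
margin = implicit_margin_bytes(80 * GIB, 0.95)
risky = exceeds_margin(80 * GIB, 0.95, reserved=75 * GIB, allocated=70 * GIB)
safe = exceeds_margin(80 * GIB, 0.95, reserved=75 * GIB, allocated=74 * GIB)
print(margin // GIB, risky, safe)
```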

Suggested Fix

Move the buffer subtraction into determine_available_memory() where the actual KV cache size is computed:

# In determine_available_memory(), after computing non_kv_cache_memory:
redundancy_buffer_memory = 150 * (1 << 20)
self.available_kv_cache_memory_bytes = (
    self.requested_memory
    - profile_result.non_kv_cache_memory
    - cudagraph_memory_estimate_applied
    - redundancy_buffer_memory              # ← add this
)

And in compile_or_warm_up_model(), reuse the same value (stored on self) instead of recomputing it, to keep both paths consistent.

Follow-up (WIP)

Hardcoding 150 MiB is not a final design. I am working on a profiling-driven buffer based on measured allocator fragmentation from the profiling stage, so the safety margin can adapt to real model/runtime patterns instead of a fixed constant.
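The issue does not describe the planned design, so the following is only one hypothetical shape a profiling-driven buffer could take: scale the fragmentation observed during profiling by a safety factor and never go below the current fixed constant. The function name, the safety_factor parameter, and the choice of a floor are all assumptions made for illustration.

```python
MIB = 1 << 20


def adaptive_buffer_bytes(
    peak_reserved: int,
    peak_allocated: int,
    floor: int = 150 * MIB,      # assumption: keep today's constant as a minimum
    safety_factor: float = 1.5,  # assumption: arbitrary headroom multiplier
) -> int:
    """Hypothetical buffer sized from measured allocator fragmentation."""
    fragmentation = max(0, peak_reserved - peak_allocated)
    return max(floor, int(fragmentation * safety_factor))


# Small measured gap: the floor applies.
small = adaptive_buffer_bytes(1000 * MIB, 950 * MIB)
# Larger measured gap: the buffer grows with it.
large = adaptive_buffer_bytes(1000 * MIB, 800 * MIB)
print(small // MIB, large // MIB)
```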


extent analysis

Fix Plan

To fix the issue, we need to apply the redundancy_buffer_memory to the actual KV cache allocation. Here are the steps:

  • Move the buffer subtraction into determine_available_memory() where the actual KV cache size is computed.
  • Reuse the same redundancy_buffer_memory value in compile_or_warm_up_model() to keep both paths consistent.

Code Changes

# In determine_available_memory(), after computing non_kv_cache_memory:
redundancy_buffer_memory = 150 * (1 << 20)
self.available_kv_cache_memory_bytes = (
    self.requested_memory
    - profile_result.non_kv_cache_memory
    - cudagraph_memory_estimate_applied
    - redundancy_buffer_memory
)
self.redundancy_buffer_memory = redundancy_buffer_memory  # store for reuse

# In compile_or_warm_up_model():
kv_cache_memory_bytes_to_requested_limit = (
    int(self.requested_memory) - non_kv_cache_memory - self.redundancy_buffer_memory
)

Verification

To verify the fix, run the application with a configuration that previously triggered OOM, such as:

vllm serve <model> --gpu-memory-utilization=0.95 --max-num-batched-tokens=65536

Monitor the application's memory usage and verify that it no longer runs out of memory.
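One way to check the effect without waiting for an OOM is to compare the "Available KV cache memory:" log line before and after the change. The sketch below assumes the value is printed as a decimal number followed by "GiB"; that format and both sample lines are assumptions made for illustration, not captured output.

```python
import re

# Assumed log format: "... Available KV cache memory: <number> GiB"
_PATTERN = re.compile(r"Available KV cache memory:\s*([0-9.]+)\s*GiB")


def parse_available_kv_gib(log_line: str) -> float:
    """Extract the GiB figure from a log line in the assumed format."""
    match = _PATTERN.search(log_line)
    if match is None:
        raise ValueError("line does not contain an available-KV-cache figure")
    return float(match.group(1))


# Made-up sample lines for illustration only.
before = parse_available_kv_gib("INFO Available KV cache memory: 55.50 GiB")
after = parse_available_kv_gib("INFO Available KV cache memory: 55.35 GiB")

drop_mib = (before - after) * 1024
print(f"KV cache budget dropped by about {drop_mib:.0f} MiB")
```

With a 150 MiB buffer the expected drop is roughly 0.15 GiB, so rounding in the logged value can make the difference appear slightly above or below 150 MiB.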

Extra Tips

  • Consider replacing the hardcoded redundancy_buffer_memory value with a profiling-driven buffer based on measured allocator fragmentation.
  • Review the application's memory allocation and usage patterns to ensure that the fixed buffer size is sufficient for all scenarios.
