vllm - 💡(How to fix) Fix [Bug]: profile_cudagraph_memory() ignores GPU memory clamp on sliced GPUs (HAMi/MIG/MPS) — --gpu-memory-utilization is inert with AutoRound INT4 + fp8_e5m2 KV + FlashInfer + CUDA graphs [1 comments, 1 participants]

vllm-project/vllm#40937, fetched 2026-04-27 05:29:12

Error Message

File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py",
  line 381, in determine_available_memory
    cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
...
torch.OutOfMemoryError: CUDA out of memory.
  Tried to allocate 1.53 GiB.
  GPU has a total capacity of 23.00 GiB of which 1.49 GiB is free.
  Process has 21.59 GiB memory in use.


Your current environment

<details> <summary>Environment</summary>
vLLM:        v0.19.1 (image: vllm/vllm-openai:v0.19.1)
Model:       Intel/Qwen3.6-27B-int4-AutoRound (also reproduced on the
             byte-identical Lorbus/Qwen3.6-27B-int4-AutoRound)
GPU:         NVIDIA RTX 3090 24 GiB, sm_86 (Ampere)
Slicing:     HAMi-core fractional sharing on Volcano scheduler under
             Kubernetes. Pod claims the entire physical card via
               volcano.sh/vgpu-memory: 24576
               volcano.sh/vgpu-cores:  100
             HAMi clamps the visible memory to ~23.56 GiB inside the
             container (libvgpu.so LD_PRELOAD).
Quantization: AutoRound INT4 weights
KV cache:    fp8_e5m2
Attention:   FlashInfer
CUDA graphs: enabled (no --enforce-eager)
Speculative: optional --speculative-config '{"method":"mtp",
             "num_speculative_tokens":3}' (MTP is NOT required to
             trigger the bug)

We cannot run collect_env.py directly inside the failing container because the engine crashes before the OpenAI server is reachable, but the entire process is the upstream vllm/vllm-openai:v0.19.1 image with no patches applied. The HAMi side is volcano-vgpu-device-plugin with the libvgpu.so postStart seeding fix in place, so the slice clamp is being enforced correctly inside the container.

</details>

🐛 Describe the bug

Summary

On a HAMi-sliced 24 GiB Ampere GPU, vLLM's profile_cudagraph_memory() warmup step OOMs because the memory budget computed from init_snapshot.total_memory * gpu_memory_utilization is derived from the physical card capacity rather than the slice's clamped visible memory. The CUDA graph profiler then tries to allocate on top of a budget that already exceeds what the slice can deliver. Lowering --gpu-memory-utilization has no effect on the failure (identical numbers at 0.95 / 0.92 / 0.88), so the dial that should normally mitigate this is inert at this codepath.

Reproducer

Minimal flag set that triggers the crash on a HAMi-sliced 24 GiB Ampere card running the upstream vllm/vllm-openai:v0.19.1 image:

vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
  --quantization auto_round \
  --kv-cache-dtype fp8_e5m2 \
  --attention-backend flashinfer \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768

The same flags also reproduce with Lorbus/Qwen3.6-27B-int4-AutoRound (byte-identical to the Intel publish), and with --speculative-config '{"method":"mtp", "num_speculative_tokens":3}' appended. MTP is not required. --enable-chunked-prefill is also not required — the failure fires with or without it.

Failure

File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py",
  line 381, in determine_available_memory
    cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
...
torch.OutOfMemoryError: CUDA out of memory.
  Tried to allocate 1.53 GiB.
  GPU has a total capacity of 23.00 GiB of which 1.49 GiB is free.
  Process has 21.59 GiB memory in use.

PyTorch sees the slice (23.00 GiB total reported by torch.cuda.mem_get_info()[1] inside the HAMi-clamped container). At the moment profile_cudagraph_memory() runs, ~21.68 GiB of the slice is already occupied by weights + activations + KV cache reservation, leaving ~1.39 GiB free. The CUDA graph profiler asks for 1.53 GiB on top of that, which exceeds the remaining headroom and faults.
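The headroom arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch using the approximate GiB figures quoted above; the variable names are illustrative and do not come from vLLM.

```python
# Approximate figures quoted in the paragraph above (GiB).
free_gib = 1.39        # free memory when profile_cudagraph_memory() runs
requested_gib = 1.53   # allocation the CUDA graph profiler attempts

# The allocation can only succeed if it fits in the remaining headroom.
shortfall_gib = requested_gib - free_gib
fits = shortfall_gib <= 0

print(f"free={free_gib:.2f} GiB, requested={requested_gib:.2f} GiB")
print(f"fits={fits}, shortfall={max(shortfall_gib, 0.0):.2f} GiB")
```

The request misses by roughly 0.14 GiB, which is small enough that any budget actually bounded by the slice would be expected to avoid the fault.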

What we tested

| Variable | Result |
|---|---|
| --gpu-memory-utilization 0.95 | OOM, 21.68/1.39/1.53 |
| --gpu-memory-utilization 0.92 | OOM, 21.68/1.39/1.53 |
| --gpu-memory-utilization 0.88 | OOM, 21.68/1.39/1.53 |
| Lorbus/Qwen3.6-27B-int4-AutoRound | identical failure |
| Intel/Qwen3.6-27B-int4-AutoRound | identical failure |
| with MTP --speculative-config | identical failure |
| without MTP | identical failure |
| with --enable-chunked-prefill | identical failure |
| without --enable-chunked-prefill | identical failure |
| --enforce-eager (CUDA graphs off) | starts and serves OK |

Three different --gpu-memory-utilization (GMU) values, 0.95, 0.92 and 0.88, produce identical peak/free/needed numbers (21.68 / 1.39 / 1.53 GiB). That is the signal that the dial is not being honored at this codepath for this combination.

Hypothesis

Looking at vllm/v1/worker/utils.py::request_memory():

requested_memory = math.ceil(
    init_snapshot.total_memory * cache_config.gpu_memory_utilization
)

init_snapshot.total_memory is populated via current_platform.mem_get_info(device) in vllm/utils/mem_utils.py::MemorySnapshot.measure(). On a HAMi-sliced GPU, the value returned for total depends on which API HAMi has hooked. If cudaMemGetInfo returns the clamped slice (~23.56 GiB) but a different code path (e.g. NVML, device properties, or driver queries inside CUDA graph capture) leaks the physical card capacity (24.00 GiB), then:

  • requested_memory = 24.00 GiB × 0.92 = 22.08 GiB (uses physical)
  • The slice can only deliver ~23.56 GiB to CUDA contexts.
  • After weights + activations + KV reservation, the remaining headroom inside the slice is smaller than what profile_cudagraph_memory() was budgeted to allocate.
  • Lowering GMU from 0.95 to 0.88 still produces a budget that is bigger than the slice can absorb after weights/KV, so the OOM numbers do not change in any visible way — the budget is being capped elsewhere (possibly by init_snapshot.free_memory at line 412 of utils.py), which is why the dial appears inert.
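To make the arithmetic in these bullets concrete, the sketch below evaluates the quoted request_memory() expression for the three GMU values that were tested, against both the physical capacity (24.00 GiB) and the clamped slice (~23.56 GiB). The helper name `requested_bytes` is hypothetical; only the `math.ceil(total * utilization)` expression comes from the snippet above.

```python
import math

GIB = 1 << 30

def requested_bytes(total_memory: int, gpu_memory_utilization: float) -> int:
    """Same expression as the request_memory() snippet quoted above."""
    return math.ceil(total_memory * gpu_memory_utilization)

physical_total = 24 * GIB        # physical card capacity
slice_total = int(23.56 * GIB)   # approximate HAMi-clamped capacity

for gmu in (0.95, 0.92, 0.88):
    phys = requested_bytes(physical_total, gmu) / GIB
    clamped = requested_bytes(slice_total, gmu) / GIB
    print(f"GMU={gmu:.2f}: physical budget={phys:.2f} GiB, "
          f"slice budget={clamped:.2f} GiB")
```

Whichever total is used, the budget moves by more than 1.5 GiB between 0.95 and 0.88, so identical OOM numbers across those settings suggest that this value is not what bounds the allocation at the point of failure.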

We have not fully nailed which call leaks the physical capacity (NVML vs cudaMemGetInfo vs device properties), but the empirical behaviour matches a budget computation that isn't bounded by the slice's actual visible memory.
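One way to narrow down which call leaks the physical capacity is to print the total reported by each API from inside the slice. The script below is a diagnostic sketch, not part of vLLM: the function names `collect_totals` and `disagreeing_sources` and the 64 MiB tolerance are assumptions made for illustration, and it assumes `torch` and `pynvml` are importable in the container (missing ones are skipped).

```python
GIB = 1 << 30

def collect_totals(device_index: int = 0) -> dict:
    """Read the total memory reported by each API, skipping unavailable ones."""
    totals = {}
    try:
        import torch
        if torch.cuda.is_available():
            _free, total = torch.cuda.mem_get_info(device_index)
            totals["torch.cuda.mem_get_info"] = total
            props = torch.cuda.get_device_properties(device_index)
            totals["device_properties.total_memory"] = props.total_memory
    except Exception:
        pass  # torch missing or CUDA not usable here
    try:
        import pynvml
        pynvml.nvmlInit()
        try:
            # Note: the NVML index can differ from the CUDA index when
            # CUDA_VISIBLE_DEVICES reorders or hides devices.
            handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
            info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            totals["nvmlDeviceGetMemoryInfo"] = int(info.total)
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        pass  # pynvml missing or NVML unavailable inside the container
    return totals

def disagreeing_sources(totals: dict, tolerance_bytes: int = 64 << 20) -> list:
    """Sources whose total exceeds the smallest reading by more than the tolerance."""
    if not totals:
        return []
    smallest = min(totals.values())
    return sorted(name for name, value in totals.items()
                  if value - smallest > tolerance_bytes)

if __name__ == "__main__":
    readings = collect_totals()
    for name, value in readings.items():
        print(f"{name}: {value / GIB:.2f} GiB")
    print("sources above the smallest reading:", disagreeing_sources(readings))
```

Any source listed on the last line reports more memory than the smallest reading and is a candidate for the leak described above.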

What works

--enforce-eager disables CUDA graphs and removes the profile_cudagraph_memory() call from gpu_worker.py::determine_available_memory(), so the engine starts. Throughput on Qwen3.6-27B-int4-AutoRound drops from the ~70-80 t/s reported on a non-sliced 24 GiB consumer 3090 (see references) to ~23 t/s with eager mode. The throughput target is gated on this bug.
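For reference, the workaround command can be assembled from the reproducer flags as follows. This sketch assumes the eager run used the reproducer's flags unchanged apart from the added `--enforce-eager`, which the report implies but does not spell out.

```python
import shlex

# Flags taken from the reproducer in this report.
base_cmd = [
    "vllm", "serve", "Intel/Qwen3.6-27B-int4-AutoRound",
    "--quantization", "auto_round",
    "--kv-cache-dtype", "fp8_e5m2",
    "--attention-backend", "flashinfer",
    "--gpu-memory-utilization", "0.92",
    "--max-model-len", "32768",
]

# Workaround: turn CUDA graphs off so profile_cudagraph_memory() is skipped.
workaround_cmd = base_cmd + ["--enforce-eager"]

print(shlex.join(workaround_cmd))
```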

The same flag combination works on a non-sliced consumer RTX 3090 24 GiB card per the public AutoRound recipe write-ups linked below.

Why this matters

Sliced-GPU deployments (HAMi, MIG, MPS) are an increasingly common production pattern for serving multiple models on a single physical accelerator. With the current behaviour, CUDA graphs and speculative decoding — the two ~2× throughput recipes for Qwen3.5/Qwen3.6 — are unreachable on HAMi-sliced infrastructure. The user-facing dial that should mitigate this (--gpu-memory-utilization) does not, in fact, mitigate it, because the budget appears to be computed against physical capacity rather than visible-after-clamp capacity.

Suggested fix direction

This is up to the maintainers, but two directions seem reasonable:

  1. Have MemorySnapshot.total_memory always reflect the smallest plausible visible memory (e.g. take min(cudaMemGetInfo_total, nvmlDeviceGetMemoryInfo_total, cudaDeviceProp.totalGlobalMem)), so the GMU budget computation is bounded by the slice on sliced GPUs.
  2. Pass the profile_cudagraph_memory() allocation through the same requested_memory envelope rather than letting it allocate outside the GMU budget, so the dial is honoured at this codepath.

A third option would be to expose an absolute-bytes budget knob so operators can set the slice ceiling explicitly — closing #20256 along the way. We have no preference; whatever the team thinks is the right architectural answer.
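The first and third directions can be sketched in a few lines. This is an illustration under stated assumptions, not a patch against vLLM: `bounded_total_memory`, `memory_budget` and the `max_bytes` parameter are hypothetical names, and the readings passed in are example values rather than measurements.

```python
import math
from typing import Optional

GIB = 1 << 30

def bounded_total_memory(*reported_totals: int) -> int:
    """Hypothetical helper: smallest positive total among the readings."""
    valid = [t for t in reported_totals if t > 0]
    if not valid:
        raise ValueError("no usable total-memory reading")
    return min(valid)

def memory_budget(total_memory: int, gpu_memory_utilization: float,
                  max_bytes: Optional[int] = None) -> int:
    """GMU budget, optionally capped by a hypothetical absolute-bytes limit."""
    budget = math.ceil(total_memory * gpu_memory_utilization)
    return budget if max_bytes is None else min(budget, max_bytes)

# Example readings: one API reports the clamp, two report the physical card.
total = bounded_total_memory(int(23.56 * GIB), 24 * GIB, 24 * GIB)
print(f"bounded total: {total / GIB:.2f} GiB")
print(f"budget at 0.92: {memory_budget(total, 0.92) / GIB:.2f} GiB")
print(f"budget with 20 GiB cap: {memory_budget(total, 0.92, 20 * GIB) / GIB:.2f} GiB")
```

Taking the minimum keeps the budget inside the slice even when only one API honors the clamp, and the explicit cap lets an operator state the ceiling directly when no API does.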

Cross-references

  • AutoRound Qwen3.6 INT4 + fp8 KV + FlashInfer speedup recipe on non-sliced 24 GiB consumer cards, ~70-80 t/s: https://www.reddit.com/r/Qwen_AI/comments/1svixj8
  • Closely related but distinct OOMs on KV cache allocation (not profile_cudagraph_memory) and on non-sliced cards: #38486
  • Different root cause but adjacent symptom (PDL + Inductor synchronize inside graph capture): #40742
  • Closed feature request asking for an absolute-bytes memory cap, same underlying need: #20256

Willing to help

Happy to test patches against our HAMi-sliced cluster (6× RTX 3090 24 GiB workers, libvgpu.so clamp confirmed working, full HAMi/Volcano stack). If a maintainer wants additional traces (VLLM_LOGGING_LEVEL=DEBUG, VLLM_TRACE_FUNCTION=1, NVML vs cudaMemGetInfo deltas captured from inside the slice, etc.), please ask in this thread.


extent analysis

TL;DR

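The most likely fix direction, still unconfirmed, is to make the MemorySnapshot.total_memory calculation reflect the smallest plausible visible memory, so that the GPU memory utilization budget is bounded by the slice on sliced GPUs. Until a fix lands, --enforce-eager is the only configuration reported to start and serve.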

Guidance

  • Investigate the MemorySnapshot.measure() function in vllm/utils/mem_utils.py to determine why init_snapshot.total_memory is not reflecting the clamped slice memory.
  • Consider modifying the request_memory() function in vllm/v1/worker/utils.py to use the smallest plausible visible memory for budget calculation.
  • Test the suggested fix directions, such as passing the profile_cudagraph_memory() allocation through the requested_memory envelope or exposing an absolute-bytes budget knob.
  • Verify the fix by running the reproducer with the modified code and checking for the absence of OOM errors.

Example

No code example is provided as the issue requires investigation and modification of the existing codebase.

Notes

The issue is specific to HAMi-sliced GPUs and the calculation of the GPU memory utilization budget. The suggested fix directions are based on the hypothesis that the budget is being computed against physical capacity rather than visible-after-clamp capacity.

Recommendation

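As a workaround today, run with --enforce-eager: it is the only configuration reported to start and serve, at the cost of throughput (~23 t/s instead of the ~70-80 t/s reported on non-sliced cards). For a real fix, the most likely direction is to change the MemorySnapshot.total_memory calculation to reflect the smallest plausible visible memory, so that the GPU memory utilization budget is bounded by the slice on sliced GPUs. That direction rests on the unconfirmed hypothesis above and has not been tested.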
