vllm - 💡(How to fix) Fix [Bug]: profile_cudagraph_memory() ignores GPU memory clamp on sliced GPUs (HAMi/MIG/MPS) — --gpu-memory-utilization is inert with AutoRound INT4 + fp8_e5m2 KV + FlashInfer + CUDA graphs [1 comments, 1 participants]

vllm · 2026-04-26 21:05:07

vllm-project/vllm#40937•Fetched 2026-04-27 05:29:12

Author

bjornmage

Error Message

File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py",
  line 381, in determine_available_memory
    cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
...
torch.OutOfMemoryError: CUDA out of memory.
  Tried to allocate 1.53 GiB.
  GPU has a total capacity of 23.00 GiB of which 1.49 GiB is free.
  Process has 21.59 GiB memory in use.

Root Cause

We cannot run collect_env.py directly inside the failing container because the engine crashes before the OpenAI server is reachable, but the container runs the unmodified upstream vllm/vllm-openai:v0.19.1 image with no patches applied. On the HAMi side, we use volcano-vgpu-device-plugin with the libvgpu.so postStart seeding fix in place, so the slice clamp is enforced correctly inside the container.

Fix / Workaround

Happy to test patches against our HAMi-sliced cluster (6× RTX 3090 24 GiB workers, libvgpu.so clamp confirmed working, full HAMi/Volcano stack). If a maintainer wants additional traces (VLLM_LOGGING_LEVEL=DEBUG, VLLM_TRACE_FUNCTION=1, NVML vs cudaMemGetInfo deltas captured from inside the slice, etc.), please ask in this thread.

Your current environment

<details> <summary>Environment</summary>

vLLM:        v0.19.1 (image: vllm/vllm-openai:v0.19.1)
Model:       Intel/Qwen3.6-27B-int4-AutoRound (also reproduced on the
             byte-identical Lorbus/Qwen3.6-27B-int4-AutoRound)
GPU:         NVIDIA RTX 3090 24 GiB, sm_86 (Ampere)
Slicing:     HAMi-core fractional sharing on Volcano scheduler under
             Kubernetes. Pod claims the entire physical card via
               volcano.sh/vgpu-memory: 24576
               volcano.sh/vgpu-cores:  100
             HAMi clamps the visible memory to ~23.56 GiB inside the
             container (libvgpu.so LD_PRELOAD).
Quantization: AutoRound INT4 weights
KV cache:    fp8_e5m2
Attention:   FlashInfer
CUDA graphs: enabled (no --enforce-eager)
Speculative: optional --speculative-config '{"method":"mtp",
             "num_speculative_tokens":3}' (MTP is NOT required to
             trigger the bug)

</details>

🐛 Describe the bug

Summary

On a HAMi-sliced 24 GiB Ampere GPU, vLLM's profile_cudagraph_memory() warmup step OOMs because the memory budget computed from init_snapshot.total_memory * gpu_memory_utilization is derived from the physical card capacity rather than the slice's clamped visible memory. The CUDA graph profiler then tries to allocate on top of a budget that already exceeds what the slice can deliver. Lowering --gpu-memory-utilization has no effect on the failure (identical numbers at 0.95 / 0.92 / 0.88), so the dial that should normally mitigate this is inert at this codepath.

Reproducer

Minimal flag set that triggers the crash on a HAMi-sliced 24 GiB Ampere card running the upstream vllm/vllm-openai:v0.19.1 image:

vllm serve Intel/Qwen3.6-27B-int4-AutoRound \
  --quantization auto_round \
  --kv-cache-dtype fp8_e5m2 \
  --attention-backend flashinfer \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768

The same flags also reproduce with Lorbus/Qwen3.6-27B-int4-AutoRound (byte-identical to the Intel publish), and with --speculative-config '{"method":"mtp", "num_speculative_tokens":3}' appended. MTP is not required. --enable-chunked-prefill is also not required — the failure fires with or without it.

Failure

File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py",
  line 381, in determine_available_memory
    cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
...
torch.OutOfMemoryError: CUDA out of memory.
  Tried to allocate 1.53 GiB.
  GPU has a total capacity of 23.00 GiB of which 1.49 GiB is free.
  Process has 21.59 GiB memory in use.

PyTorch sees the slice (23.00 GiB total reported by torch.cuda.mem_get_info()[1] inside the HAMi-clamped container). At the moment profile_cudagraph_memory() runs, ~21.68 GiB of the slice is already occupied by weights + activations + KV cache reservation, leaving ~1.39 GiB free. The CUDA graph profiler asks for 1.53 GiB on top of that, which exceeds the remaining headroom and faults.
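To compare runs without eyeballing logs, the figures in the OOM text can be pulled out programmatically. The following is a minimal sketch that assumes the message has the shape quoted above; a full PyTorch message may carry extra text that this pattern does not cover. The names parse_oom and shortfall_gib are hypothetical helpers, not part of vLLM or PyTorch.

```python
import re
from typing import Dict, Optional

# Pattern for the OOM text as quoted in this report (assumption: same wording).
_OOM_RE = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) GiB\.\s*"
    r"GPU has a total capacity of (?P<total>[\d.]+) GiB "
    r"of which (?P<free>[\d.]+) GiB is free\.\s*"
    r"Process has (?P<in_use>[\d.]+) GiB memory in use\."
)


def parse_oom(message: str) -> Optional[Dict[str, float]]:
    """Return the four GiB figures from an OOM message, or None if absent."""
    match = _OOM_RE.search(message)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


def shortfall_gib(figures: Dict[str, float]) -> float:
    """How far the requested allocation exceeds the free memory, in GiB."""
    return round(figures["requested"] - figures["free"], 2)


if __name__ == "__main__":
    text = (
        "torch.OutOfMemoryError: CUDA out of memory.\n"
        "  Tried to allocate 1.53 GiB.\n"
        "  GPU has a total capacity of 23.00 GiB of which 1.49 GiB is free.\n"
        "  Process has 21.59 GiB memory in use.\n"
    )
    figures = parse_oom(text)
    print(figures, shortfall_gib(figures))
```

Running this over the logs from each --gpu-memory-utilization setting makes it easy to check whether any of the four figures moves between runs.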

What we tested

| Variable | Result (peak in use / free / needed, GiB) |
| --- | --- |
| `--gpu-memory-utilization 0.95` | OOM, 21.68 / 1.39 / 1.53 |
| `--gpu-memory-utilization 0.92` | OOM, 21.68 / 1.39 / 1.53 |
| `--gpu-memory-utilization 0.88` | OOM, 21.68 / 1.39 / 1.53 |
| `Lorbus/Qwen3.6-27B-int4-AutoRound` | identical failure |
| `Intel/Qwen3.6-27B-int4-AutoRound` | identical failure |
| with MTP `--speculative-config` | identical failure |
| without MTP | identical failure |
| with `--enable-chunked-prefill` | identical failure |
| without `--enable-chunked-prefill` | identical failure |
| `--enforce-eager` (CUDA graphs off) | starts and serves OK |

Stepping --gpu-memory-utilization (GMU) from 0.95 to 0.92 to 0.88 produces identical peak/free/needed numbers. That is the signal that the dial is not being honored at this codepath with this combination.
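For comparison, here is a small sketch of what the budget would be at each GMU value if it were simply total × GMU. It mirrors the ceil(total_memory * gpu_memory_utilization) expression quoted in the Hypothesis section; the helper names requested_gib and budget_table are hypothetical, and the two totals (23.00 GiB from the OOM message, 24.00 GiB for the physical card) are taken from this report.

```python
import math

GIB = 1024 ** 3


def requested_gib(total_gib: float, gmu: float) -> float:
    """Budget in GiB, mirroring ceil(total_memory * gpu_memory_utilization)."""
    requested_bytes = math.ceil(total_gib * GIB * gmu)
    return round(requested_bytes / GIB, 2)


def budget_table(totals_gib, gmus):
    """Map each (total, gmu) pair to its budget in GiB."""
    return {(t, g): requested_gib(t, g) for t in totals_gib for g in gmus}


if __name__ == "__main__":
    # 23.00 GiB: total reported in the OOM message; 24.00 GiB: physical card.
    table = budget_table([23.00, 24.00], [0.95, 0.92, 0.88])
    for (total, gmu), budget in table.items():
        print(f"total={total:.2f} GiB  gmu={gmu:.2f}  budget={budget:.2f} GiB")
```

On a 23.00 GiB total the three budgets work out to 21.85, 21.16, and 20.24 GiB, a spread of about 1.6 GiB, whereas the reported in-use figure stays at 21.68 GiB in all three runs.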

Hypothesis

Looking at vllm/v1/worker/utils.py::request_memory():

requested_memory = math.ceil(
    init_snapshot.total_memory * cache_config.gpu_memory_utilization
)

init_snapshot.total_memory is populated via current_platform.mem_get_info(device) in vllm/utils/mem_utils.py::MemorySnapshot.measure(). On a HAMi-sliced GPU, the value returned for total depends on which API HAMi has hooked. If cudaMemGetInfo returns the clamped slice (~23.56 GiB) but a different code path (e.g. NVML, device properties, or driver queries inside CUDA graph capture) leaks the physical card capacity (24.00 GiB), then:

- requested_memory = 24.00 GiB × 0.92 = 22.08 GiB (uses physical).
- The slice can only deliver ~23.56 GiB to CUDA contexts.
- After weights + activations + KV reservation, the remaining headroom inside the slice is smaller than what profile_cudagraph_memory() was budgeted to allocate.
- Lowering GMU from 0.95 to 0.88 still produces a budget that is bigger than the slice can absorb after weights/KV, so the OOM numbers do not change in any visible way. The budget is being capped elsewhere (possibly by init_snapshot.free_memory at line 412 of utils.py), which is why the dial appears inert.

We have not fully nailed which call leaks the physical capacity (NVML vs cudaMemGetInfo vs device properties), but the empirical behaviour matches a budget computation that isn't bounded by the slice's actual visible memory.
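One way to narrow down which call leaks the physical capacity is to print the total from each API side by side from inside the slice. The sketch below is an assumption-laden diagnostic, not vLLM code: it relies on torch.cuda.mem_get_info, torch.cuda.get_device_properties, and the optional pynvml bindings being available, and it assumes NVML index 0 and CUDA device 0 refer to the same card. The helper names (collect_totals, report) are hypothetical.

```python
from typing import Callable, Dict, Optional

GIB = 1024 ** 3


def torch_mem_get_info_total() -> int:
    """Total bytes as reported by cudaMemGetInfo via PyTorch."""
    import torch
    return torch.cuda.mem_get_info(0)[1]


def torch_device_props_total() -> int:
    """Total bytes from the CUDA device properties via PyTorch."""
    import torch
    return torch.cuda.get_device_properties(0).total_memory


def nvml_total() -> int:
    """Total bytes from NVML (needs the optional pynvml bindings)."""
    import pynvml
    pynvml.nvmlInit()
    try:
        # Assumption: NVML index 0 is the same card as CUDA device 0.
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return int(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
    finally:
        pynvml.nvmlShutdown()


DEFAULT_SOURCES: Dict[str, Callable[[], int]] = {
    "cudaMemGetInfo": torch_mem_get_info_total,
    "device_properties": torch_device_props_total,
    "nvml": nvml_total,
}


def collect_totals(
    sources: Dict[str, Callable[[], int]] = DEFAULT_SOURCES,
) -> Dict[str, Optional[int]]:
    """Query every source; a source that fails is recorded as None."""
    totals: Dict[str, Optional[int]] = {}
    for name, probe in sources.items():
        try:
            totals[name] = probe()
        except Exception:
            totals[name] = None
    return totals


def report(totals: Dict[str, Optional[int]]) -> str:
    """Format each total and its excess over the smallest reported total."""
    known = [v for v in totals.values() if v is not None]
    if not known:
        return "no source available"
    smallest = min(known)
    lines = []
    for name in sorted(totals):
        value = totals[name]
        if value is None:
            lines.append(f"{name}: unavailable")
        else:
            delta = (value - smallest) / GIB
            lines.append(f"{name}: {value / GIB:.2f} GiB (+{delta:.2f} GiB vs smallest)")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report(collect_totals()))
```

If one source reports a larger total than the others inside the HAMi-clamped container, that source is a candidate for the unhooked path.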

What works

--enforce-eager disables CUDA graphs and removes the profile_cudagraph_memory() call from gpu_worker.py::determine_available_memory(), so the engine starts. Throughput on Qwen3.6-27B-int4-AutoRound drops from the ~70-80 t/s reported on a non-sliced 24 GiB consumer 3090 (see references) to ~23 t/s with eager mode. The throughput target is gated on this bug.
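For scripted deployments, the reproducer and the eager-mode workaround differ by a single flag. The sketch below builds the argument list for both; it only uses flags that appear in this report, and build_serve_argv is a hypothetical helper name.

```python
import shlex
from typing import List


def build_serve_argv(
    model: str,
    gpu_memory_utilization: float = 0.92,
    max_model_len: int = 32768,
    enforce_eager: bool = False,
) -> List[str]:
    """Build the `vllm serve` argv used in the reproducer above."""
    argv = [
        "vllm", "serve", model,
        "--quantization", "auto_round",
        "--kv-cache-dtype", "fp8_e5m2",
        "--attention-backend", "flashinfer",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]
    if enforce_eager:
        # Workaround from this report: disables CUDA graphs, so the
        # profile_cudagraph_memory() step is not reached.
        argv.append("--enforce-eager")
    return argv


if __name__ == "__main__":
    model = "Intel/Qwen3.6-27B-int4-AutoRound"
    print(shlex.join(build_serve_argv(model)))
    print(shlex.join(build_serve_argv(model, enforce_eager=True)))
```

The resulting list can be passed to subprocess.run or used as a container command; the printed form is shell-quoted for copy and paste.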

The same flag combination works on a non-sliced consumer RTX 3090 24 GiB card per the public AutoRound recipe write-ups linked below.

Why this matters

Sliced-GPU deployments (HAMi, MIG, MPS) are an increasingly common production pattern for serving multiple models on a single physical accelerator. With the current behaviour, CUDA graphs and speculative decoding — the two ~2× throughput recipes for Qwen3.5/Qwen3.6 — are unreachable on HAMi-sliced infrastructure. The user-facing dial that should mitigate this (--gpu-memory-utilization) does not, in fact, mitigate it, because the budget appears to be computed against physical capacity rather than visible-after-clamp capacity.

Suggested fix direction

This is up to the maintainers, but two directions seem reasonable:

1. Have MemorySnapshot.total_memory always reflect the smallest plausible visible memory (e.g. take min(cudaMemGetInfo_total, nvmlDeviceGetMemoryInfo_total, cudaDeviceProp.totalGlobalMem)), so the GMU budget computation is bounded by the slice on sliced GPUs.
2. Pass the profile_cudagraph_memory() allocation through the same requested_memory envelope rather than letting it allocate outside the GMU budget, so the dial is honoured at this codepath.

A third option would be to expose an absolute-bytes budget knob so operators can set the slice ceiling explicitly — closing #20256 along the way. We have no preference; whatever the team thinks is the right architectural answer.
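The three directions can be sketched in a few lines of plain Python. This is an illustration of the idea only, not vLLM's implementation: bounded_total, requested_memory with a cap_bytes parameter, and fits_in_envelope are all hypothetical names, and the example figures are the approximate ones quoted in this report.

```python
import math
from typing import Iterable, Optional

GIB = 1024 ** 3


def bounded_total(candidates: Iterable[Optional[int]]) -> int:
    """Direction 1: smallest positive total among the available sources."""
    known = [c for c in candidates if c is not None and c > 0]
    if not known:
        raise ValueError("no memory total available")
    return min(known)


def requested_memory(
    total_bytes: int, gmu: float, cap_bytes: Optional[int] = None
) -> int:
    """GMU budget, optionally limited by an absolute cap (third option)."""
    budget = math.ceil(total_bytes * gmu)
    if cap_bytes is not None:
        budget = min(budget, cap_bytes)
    return budget


def fits_in_envelope(in_use_bytes: int, extra_bytes: int, budget_bytes: int) -> bool:
    """Direction 2: check an extra allocation against the same budget."""
    return in_use_bytes + extra_bytes <= budget_bytes


if __name__ == "__main__":
    physical = 24 * GIB            # physical card capacity
    slice_total = int(23.56 * GIB)  # approximate HAMi-clamped total
    total = bounded_total([physical, slice_total, None])
    budget = requested_memory(total, 0.92)
    in_use = int(21.68 * GIB)       # in-use figure reported in the issue
    graph_request = int(1.53 * GIB)  # CUDA graph profiler request
    print(total == slice_total, fits_in_envelope(in_use, graph_request, budget))
```

With a check like fits_in_envelope in place, an oversized profiling request could be rejected or scaled down with a clear message instead of surfacing as a CUDA OOM.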

Cross-references

- AutoRound Qwen3.6 INT4 + fp8 KV + FlashInfer speedup recipe on non-sliced 24 GiB consumer cards, ~70-80 t/s: https://www.reddit.com/r/Qwen_AI/comments/1svixj8
- Closely related but distinct OOMs on KV cache allocation (not profile_cudagraph_memory) and on non-sliced cards: #38486
- Different root cause but adjacent symptom (PDL + Inductor synchronize inside graph capture): #40742
- Closed feature request asking for an absolute-bytes memory cap, same underlying need: #20256

Willing to help

extent analysis

TL;DR

The most likely fix is to modify the MemorySnapshot.total_memory calculation to reflect the smallest plausible visible memory, ensuring the GPU memory utilization budget is bounded by the slice on sliced GPUs.

Guidance

- Investigate the MemorySnapshot.measure() function in vllm/utils/mem_utils.py to determine why init_snapshot.total_memory is not reflecting the clamped slice memory.
- Consider modifying the request_memory() function in vllm/v1/worker/utils.py to use the smallest plausible visible memory for budget calculation.
- Test the suggested fix directions, such as passing the profile_cudagraph_memory() allocation through the requested_memory envelope or exposing an absolute-bytes budget knob.
- Verify the fix by running the reproducer with the modified code and checking for the absence of OOM errors.

Example

No code example is provided as the issue requires investigation and modification of the existing codebase.

Notes

The issue is specific to HAMi-sliced GPUs and the calculation of the GPU memory utilization budget. The suggested fix directions are based on the hypothesis that the budget is being computed against physical capacity rather than visible-after-clamp capacity.

Recommendation

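Until a fix lands, the only workaround confirmed in the report is --enforce-eager, which lets the engine start at reduced throughput (~23 t/s versus the ~70-80 t/s reported on a non-sliced card). Bounding the MemorySnapshot.total_memory calculation by the smallest plausible visible memory is the most likely fix direction, but it is an untested hypothesis and requires a code change in vLLM. If the hypothesis holds, it would keep the GPU memory utilization budget within the slice on sliced GPUs and avoid this OOM.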
