vllm - [Bug]: cuda graph takes too much memory for qwen 3.5 [11 comments, 2 participants]

GitHub stats
vllm-project/vllm#38486 (fetched 2026-04-08 01:49:04)
Comments: 11 · Participants: 2 · Timeline events: 17 · Reactions: 0
Timeline (top): commented ×11, subscribed ×3, mentioned ×2, labeled ×1

Error Message

(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6229, in _allocate_kv_cache_tensors
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] tensor = torch.zeros(
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] ^^^^^^^^^^^^
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.03 GiB. GPU 0 has a total capacity of 44.40 GiB of which 858.31 MiB is free. Process 12865 has 43.56 GiB memory in use. Of the allocated memory 42.94 GiB is allocated by PyTorch, and 101.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Root Cause

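Not identified in the issue. The reporter states only that the `--enforce-eager` flag cannot be removed, because startup without it fails with an out-of-memory error, and suspects that CUDA graphs are responsible.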


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

Hi,

Everything works with `--enforce-eager`:

docker run --runtime nvidia --gpus all -d --name vllm-Qwen35_35B_fp8_v20 --restart unless-stopped -v ~/.cache/huggingface/hub:/models -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 vllm/vllm-openai:v0.18.0 --model Qwen/Qwen3.5-35B-A3B-FP8 --served-model-name llm vllm-local vllm --enforce-eager

But generation is very slow: 14.43 tokens per second.

The Int4 version (Qwen/Qwen3.5-35B-A3B-GPTQ-Int4) gives me 109.97 tokens per second, but that is with CUDA graphs enabled.

For the FP8 model I can't remove the `--enforce-eager` flag, because startup then runs out of memory.

There is probably a problem with CUDA graphs.

From the logs: Model loading took 33.38 GiB memory and 21.582860 seconds

So after loading the weights, about 33 GiB of memory is occupied, but the GPU has about 44 GiB:

From the logs: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.03 GiB. GPU 0 has a total capacity of 44.40 GiB of which 858.31 MiB is free.
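The figures quoted from the logs can be put side by side with a little arithmetic. The sketch below uses only the numbers reported in this issue (44.40 GiB total, 33.38 GiB for weights, 858.31 MiB free at failure, 1.03 GiB requested); the function name `memory_budget` is made up for illustration and is not part of vLLM.

```python
def memory_budget(total_gib, weights_gib, free_mib, requested_gib):
    """Summarize GPU memory at the moment of the OOM. All results are in GiB."""
    free_gib = free_mib / 1024  # the log reports free memory in MiB
    return {
        # Memory left on the GPU once the weights are loaded.
        "headroom_after_weights": total_gib - weights_gib,
        # Memory in use at failure time that is not accounted for by weights.
        "used_beyond_weights": total_gib - free_gib - weights_gib,
        # How far the failed allocation exceeded the free memory.
        "shortfall": requested_gib - free_gib,
    }


# Values copied from the log lines quoted in the issue.
budget = memory_budget(total_gib=44.40, weights_gib=33.38,
                       free_mib=858.31, requested_gib=1.03)
for name, value in budget.items():
    print(f"{name}: {value:.2f} GiB")
```

With the logged values, roughly 11 GiB remained after the weights were loaded, and about 10 GiB of that was already in use for something other than weights when the 1.03 GiB allocation failed. The logs do not say how that memory is split between profiling, compilation, and cache tensors.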

Even when the max model length is set to 1024:

docker run --runtime nvidia --gpus all -d --name vllm-Qwen35_35B_fp8_v23 --restart unless-stopped -v ~/.cache/huggingface/hub:/models -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 vllm/vllm-openai:v0.18.0 --model Qwen/Qwen3.5-35B-A3B-FP8 --served-model-name llm vllm-local vllm --max-model-len 1024 --language-model-only

logs:

(EngineCore pid=86) INFO 03-29 18:32:46 [backends.py:371] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=86) INFO 03-29 18:33:05 [backends.py:387] Compiling a graph for compile range (1, 2048) takes 22.84 s
(EngineCore pid=86) INFO 03-29 18:33:07 [monitor.py:48] torch.compile took 33.00 s in total
(EngineCore pid=86) INFO 03-29 18:34:26 [monitor.py:76] Initial profiling/warmup run took 78.59 s

and then:

(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6229, in _allocate_kv_cache_tensors
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] tensor = torch.zeros(
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] ^^^^^^^^^^^^
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.03 GiB. GPU 0 has a total capacity of 44.40 GiB of which 858.31 MiB is free. Process 12865 has 43.56 GiB memory in use. Of the allocated memory 42.94 GiB is allocated by PyTorch, and 101.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
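The numbers inside the out-of-memory message are easier to compare once they are pulled out and converted to one unit. The following is a minimal sketch that assumes the message wording shown in this log; `parse_cuda_oom` and the key names are hypothetical helpers, not a PyTorch or vLLM API.

```python
import re

# Conversion factors so that every figure ends up in MiB.
_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# One pattern per figure, written against the wording of the message above.
_PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"total capacity of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "in_use": r"has ([\d.]+) (MiB|GiB) memory in use",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch but unallocated",
}


def parse_cuda_oom(message):
    """Return the memory figures found in a CUDA OOM message, in MiB."""
    result = {}
    for key, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            result[key] = float(value) * _UNIT_TO_MIB[unit]
    return result


# Message text copied from the log in this issue.
message = (
    "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.03 GiB. "
    "GPU 0 has a total capacity of 44.40 GiB of which 858.31 MiB is free. "
    "Process 12865 has 43.56 GiB memory in use. Of the allocated memory "
    "42.94 GiB is allocated by PyTorch, and 101.36 MiB is reserved by PyTorch "
    "but unallocated."
)
stats = parse_cuda_oom(message)
for key, mib in stats.items():
    print(f"{key}: {mib:.2f} MiB")
```

Read this way, the request is larger than the free memory, while the reserved-but-unallocated figure is small. That second number is the one the message itself ties to fragmentation.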


extent analysis

Fix Plan

To address the CUDA out-of-memory error, the plan is to reduce how much GPU memory is in use when vLLM allocates its KV cache tensors, which is where the traceback fails (`_allocate_kv_cache_tensors`). The error message suggests setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid memory fragmentation. The same message reports only 101.36 MiB reserved but unallocated, so fragmentation is probably not the main cause here; the setting is cheap to try but may not be enough on its own. Here are the steps:

1. **Set Environment Variable**: Pass PYTORCH_ALLOC_CONF=expandable_segments:True into the container with the -e flag of docker run. The command omits --enforce-eager, since the goal is to start with CUDA graphs enabled:

docker run -e PYTORCH_ALLOC_CONF="expandable_segments:True" --runtime nvidia --gpus all -d --name vllm-Qwen35_35B_fp8_v20 --restart unless-stopped -v ~/.cache/huggingface/hub:/models -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 vllm/vllm-openai:v0.18.0 --model Qwen/Qwen3.5-35B-A3B-FP8 --served-model-name llm vllm-local vllm

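2. **Reduce Model Memory**: The FP8 weights take 33.38 GiB of the 44.40 GiB on the GPU. The issue reports that the Int4 variant (Qwen/Qwen3.5-35B-A3B-GPTQ-Int4) runs with CUDA graphs at 109.97 tokens per second, so a smaller quantization is a workable fallback if the FP8 model cannot be made to fit.
3. **Adjust Batch Size**: Limit the number of concurrent sequences, for example with vLLM's `--max-num-seqs` option, to decrease memory usage during inference. This may also reduce how many batch sizes are captured as CUDA graphs.
4. **Free GPU Memory Between Attempts**: Stop and remove earlier containers before starting a new one. The commands in the issue use --restart unless-stopped, so a previous container may still be holding GPU memory.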

### Verification
To verify that the fix worked:
- Monitor GPU memory usage during model loading, warmup, and KV cache allocation.
- Check that no `torch.OutOfMemoryError` appears in the container logs.
- Measure the generation speed and compare it with the 14.43 tokens per second seen with `--enforce-eager`.
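One way to measure generation speed is to time a single request against the server's OpenAI-compatible completions endpoint. This is a rough sketch: it assumes the server is reachable on port 8000 with the served model name `llm`, as in the commands from the issue. The prompt, token limit, and function names are illustrative. The timing is end to end, so it includes prompt processing and is only an approximation of decode speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure_throughput(base_url="http://localhost:8000", model="llm",
                       prompt="Write a short story about a robot.",
                       max_tokens=256):
    """Send one completion request and return end-to-end tokens per second."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # The response reports how many tokens were actually generated.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


# Example call, to be run only while the server is up:
#     print(f"{measure_throughput():.2f} tokens per second")
```

Running it once with `--enforce-eager` and once without gives two numbers that can be compared directly, provided the prompt and `max_tokens` stay the same.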

### Extra Tips
- PyTorch and CUDA ship inside the `vllm/vllm-openai` image, so picking up newer memory management changes means trying a newer image tag than v0.18.0.
- Use `nvidia-smi` to watch GPU memory usage in real time while the container starts.
- If issues persist, try lowering vLLM's `--gpu-memory-utilization` setting, which should shrink the KV cache that vLLM tries to allocate.
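The `nvidia-smi` monitoring mentioned above can be scripted so that memory use is sampled while the container starts. The sketch below relies on the `--query-gpu` and `--format` options of `nvidia-smi`; the helper names are made up for illustration.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, as plain numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Turn one output line such as '43560, 45467' into (used, total) in MiB."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def gpu_memory_mib():
    """Return a list of (used_mib, total_mib) tuples, one per visible GPU."""
    output = subprocess.run(
        QUERY, capture_output=True, text=True, check=True
    ).stdout
    return [parse_memory_line(line) for line in output.splitlines() if line.strip()]


# Example polling loop, to be run on the host while the container starts:
#     import time
#     while True:
#         print(gpu_memory_mib())
#         time.sleep(2)
```

Sampling every couple of seconds should show how memory grows between weight loading and the point where the allocation fails.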
