vllm - 💡(How to fix) Fix [Bug]: cuda graph takes too much memory for qwen 3.5 [11 comments, 2 participants]

Error Message

(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6229, in _allocate_kv_cache_tensors
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] tensor = torch.zeros(
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] ^^^^^^^^^^^^
(EngineCore pid=86) ERROR 03-29 18:34:32 [core.py:1099] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.03 GiB. GPU 0 has a total capacity of 44.40 GiB of which 858.31 MiB is free. Process 12865 has 43.56 GiB memory in use. Of the allocated memory 42.94 GiB is allocated by PyTorch, and 101.36 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

Hi,

Everything works with --enforce-eager:

docker run --runtime nvidia --gpus all -d --name vllm-Qwen35_35B_fp8_v20 --restart unless-stopped -v ~/.cache/huggingface/hub:/models -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 vllm/vllm-openai:v0.18.0 --model Qwen/Qwen3.5-35B-A3B-FP8 --served-model-name llm vllm-local vllm --enforce-eager

but generation is very slow: 14.43 tokens per second.

The int4 version (Qwen/Qwen3.5-35B-A3B-GPTQ-Int4) gives me 109.97 tokens per second, but that is with CUDA graphs enabled.

I can't remove the --enforce-eager flag for the FP8 model, because it then runs out of memory.

There is probably a problem with CUDA graph memory usage.

from logs: Model loading took 33.38 GiB memory and 21.582860 seconds

So after loading the weights, 33 GB of memory is occupied, but I have 44 GB of memory:

from logs: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.03 GiB. GPU 0 has a total capacity of 44.40 GiB of which 858.31 MiB is free.

This happens even when max model length is set to 1024:

docker run --runtime nvidia --gpus all -d --name vllm-Qwen35_35B_fp8_v23 --restart unless-stopped -v ~/.cache/huggingface/hub:/models -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 vllm/vllm-openai:v0.18.0 --model Qwen/Qwen3.5-35B-A3B-FP8 --served-model-name llm vllm-local vllm --max-model-len 1024 --language-model-only

logs:

(EngineCore pid=86) INFO 03-29 18:32:46 [backends.py:371] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=86) INFO 03-29 18:33:05 [backends.py:387] Compiling a graph for compile range (1, 2048) takes 22.84 s
(EngineCore pid=86) INFO 03-29 18:33:07 [monitor.py:48] torch.compile took 33.00 s in total
(EngineCore pid=86) INFO 03-29 18:34:26 [monitor.py:76] Initial profiling/warmup run took 78.59 s

and then it fails with the out-of-memory error shown under Error Message above.
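The figures quoted from the logs can be put side by side to see how tight the budget is. The following is a minimal sketch that only does arithmetic on the numbers reported in this issue; the variable names are illustrative and do not come from vLLM, and the logs do not say how the non-weight memory splits between activations, compilation artifacts, and KV cache.

```python
# Back-of-the-envelope memory budget using only figures quoted in the logs.
MIB_PER_GIB = 1024

total_gib = 44.40      # "GPU 0 has a total capacity of 44.40 GiB"
in_use_gib = 43.56     # "Process 12865 has 43.56 GiB memory in use"
weights_gib = 33.38    # "Model loading took 33.38 GiB memory"
free_mib = 858.31      # "of which 858.31 MiB is free"
requested_gib = 1.03   # "Tried to allocate 1.03 GiB"

# Memory held by everything except the weights at the moment of failure.
non_weight_gib = in_use_gib - weights_gib

# How far the failed allocation overshoots the memory that was still free.
shortfall_mib = requested_gib * MIB_PER_GIB - free_mib

print(f"in use besides weights: {non_weight_gib:.2f} GiB")
print(f"shortfall of failed allocation: {shortfall_mib:.2f} MiB")
```

By these numbers, roughly 10.18 GiB is held on top of the weights when the allocation fails, and the failing 1.03 GiB request misses by about 196 MiB.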


extent analysis

Fix Plan

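To address the CUDA out of memory issue, we'll focus on reducing memory use during startup. The error message suggests setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid memory fragmentation. Note that the same message reports only 101.36 MiB reserved but unallocated against a 1.03 GiB request, so fragmentation is unlikely to be the whole story and this setting may not be sufficient on its own. Here are the steps: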

1. **Set Environment Variable**: Pass PYTORCH_ALLOC_CONF=expandable_segments:True to the container with the -e flag of docker run. Leave out --enforce-eager, since the goal is to start with CUDA graphs enabled:

   docker run -e PYTORCH_ALLOC_CONF="expandable_segments:True" --runtime nvidia --gpus all -d --name vllm-Qwen35_35B_fp8_v20 --restart unless-stopped -v ~/.cache/huggingface/hub:/models -v ~/.cache/vllm:/root/.cache/vllm -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 vllm/vllm-openai:v0.18.0 --model Qwen/Qwen3.5-35B-A3B-FP8 --served-model-name llm vllm-local vllm

2. **Optimize Model Loading**: The weights alone take 33.38 GiB of the 44.40 GiB card. A smaller quantization reduces this; the issue reports that Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 runs with CUDA graphs at 109.97 tokens per second.
3. **Adjust Batch Size**: Try lowering the number of sequences batched concurrently (in vLLM, `--max-num-seqs`) to reduce memory use during warmup and inference.
4. **Regularly Clean Up**: Make sure nothing else is holding GPU memory. Containers from earlier attempts that were started with `-d` and `--restart unless-stopped` keep running until they are stopped and removed.
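When trying several variants of the launch command, it can help to build it programmatically so that only one setting changes per attempt. The sketch below assembles the `docker run` command from this issue; the helper `build_docker_cmd` and the container name are hypothetical, and only flags that already appear in the issue are used.

```python
import os
import shlex

IMAGE = "vllm/vllm-openai:v0.18.0"
MODEL = "Qwen/Qwen3.5-35B-A3B-FP8"


def build_docker_cmd(name, env=None, enforce_eager=False, max_model_len=None):
    """Return the docker argv from the issue, with optional tweaks applied."""
    home = os.path.expanduser("~")  # expand here; a quoted "~" is not expanded
    cmd = ["docker", "run", "--runtime", "nvidia", "--gpus", "all", "-d",
           "--name", name, "--restart", "unless-stopped"]
    # Environment variables such as PYTORCH_ALLOC_CONF go before the image name.
    for key, value in (env or {}).items():
        cmd += ["-e", f"{key}={value}"]
    for host_dir, container_dir in [
        (".cache/huggingface/hub", "/models"),
        (".cache/vllm", "/root/.cache/vllm"),
        (".cache/huggingface", "/root/.cache/huggingface"),
    ]:
        cmd += ["-v", f"{os.path.join(home, host_dir)}:{container_dir}"]
    cmd += ["-p", "8000:8000", IMAGE,
            "--model", MODEL,
            "--served-model-name", "llm", "vllm-local", "vllm"]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    if enforce_eager:
        cmd.append("--enforce-eager")
    return cmd


cmd = build_docker_cmd(
    "vllm-Qwen35_35B_fp8_test",  # hypothetical container name
    env={"PYTORCH_ALLOC_CONF": "expandable_segments:True"},
    max_model_len=1024,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`.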

### Verification
To verify that the fix worked:
- Monitor the memory usage of your GPU during model loading and inference.
- Check for any `torch.OutOfMemoryError` in your logs.
- Measure the generation speed and compare it with the 14.43 tokens per second seen with `--enforce-eager`.
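Checking the logs can be partly automated. The sketch below pulls the key figures out of a `torch.OutOfMemoryError` message and applies a rough heuristic: fragmentation is only a plausible culprit when the reserved-but-unallocated memory is at least as large as the failed request. The regular expression is written against the message quoted in this issue and may need adjusting for other PyTorch versions; the function names are illustrative.

```python
import re

# Pattern matched to the message format quoted in this issue (an assumption
# for any other PyTorch version).
OOM_RE = re.compile(
    r"Tried to allocate (?P<req>[\d.]+) (?P<req_unit>[GM]iB)\."
    r".*?of which (?P<free>[\d.]+) (?P<free_unit>[GM]iB) is free"
    r".*?and (?P<slack>[\d.]+) (?P<slack_unit>[GM]iB) is reserved by PyTorch"
    r" but unallocated",
    re.DOTALL,
)

SAMPLE = (
    "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.03 GiB. "
    "GPU 0 has a total capacity of 44.40 GiB of which 858.31 MiB is free. "
    "Process 12865 has 43.56 GiB memory in use. Of the allocated memory "
    "42.94 GiB is allocated by PyTorch, and 101.36 MiB is reserved by PyTorch "
    "but unallocated."
)


def to_mib(value, unit):
    """Convert a value given in GiB or MiB to MiB."""
    return float(value) * 1024 if unit == "GiB" else float(value)


def parse_oom(text):
    """Return requested/free/reserved-unallocated sizes in MiB, or None."""
    m = OOM_RE.search(text)
    if m is None:
        return None
    return {
        "requested_mib": to_mib(m["req"], m["req_unit"]),
        "free_mib": to_mib(m["free"], m["free_unit"]),
        "reserved_unallocated_mib": to_mib(m["slack"], m["slack_unit"]),
    }


info = parse_oom(SAMPLE)
# Heuristic only: slack smaller than the request cannot cover it by itself.
fragmentation_plausible = (
    info["reserved_unallocated_mib"] >= info["requested_mib"]
)
print(info, "fragmentation plausible:", fragmentation_plausible)
```

For the message in this issue the heuristic comes out negative, which suggests the GPU is genuinely full rather than fragmented.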

### Extra Tips
- Regularly update your PyTorch and CUDA versions to ensure you have the latest memory management optimizations.
- Consider using tools like `nvidia-smi` to monitor GPU memory usage in real-time.
- If issues persist, explore vLLM's own memory settings, such as `--gpu-memory-utilization`, or a different allocator configuration.
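For monitoring, `nvidia-smi` can emit machine-readable output through its `--query-gpu` and `--format` options. The following is a small sketch that polls used and total memory from Python; the function names are illustrative, and `query_gpu_memory` assumes `nvidia-smi` is on the PATH of the machine where it runs.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' rows (in MiB, one row per GPU) into tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for (used MiB, total MiB) of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling `query_gpu_memory()` once after model loading and again after warmup gives a rough picture of how much memory each startup phase takes.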
