vLLM 0.20.1 hard-pins torch 2.11.0, which OOMs during CUDA initialization on RTX 4090 / cu130




Summary

vllm==0.20.1 hard-pins torch==2.11.0. On our Ubuntu server with 3x RTX 4090, this dependency stack appears to consume ~22-23 GiB of VRAM on each 24 GB GPU before any model weights are loaded, so vLLM fails with a CUDA OOM error regardless of the model.

The same machine works with torch==2.10.0+cu130, but vLLM 0.20.1 cannot be used with torch 2.10.0 because of dependency and ABI incompatibility.

Related upstream PyTorch issue: https://github.com/pytorch/pytorch/issues/182941
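To confirm which torch version an installed vLLM wheel actually pins, the package metadata can be inspected. The following is a minimal sketch using only the standard library; the helper names `pinned_version` and `vllm_torch_pin` are hypothetical, and the regular expression only recognizes a plain `torch==X` requirement (not extras or other specifier forms).

```python
import re
from importlib import metadata


def pinned_version(requirements, package="torch"):
    """Return the version from an exact `package==X` pin, or None if there is none."""
    pattern = re.compile(rf"^{re.escape(package)}\s*==\s*([^\s;,]+)", re.IGNORECASE)
    for req in requirements or []:
        match = pattern.match(req.strip())
        if match:
            return match.group(1)
    return None


def vllm_torch_pin():
    """Look up the torch pin declared by the installed vLLM distribution, if any."""
    try:
        return pinned_version(metadata.requires("vllm"))
    except metadata.PackageNotFoundError:
        return None  # vLLM is not installed in this environment


if __name__ == "__main__":
    print("vllm pins torch ==", vllm_torch_pin())
```

If the function returns a version string, pip will refuse to resolve any other torch release alongside that vLLM wheel unless the pin is overridden, which matches the behavior described in this report.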

Environment

  • OS: Ubuntu
  • Python: 3.11
  • GPU: 3x NVIDIA RTX 4090 24 GB
  • Driver: R580
  • CUDA: 13.0.3
  • vLLM: 0.20.1
  • torch: 2.11.0+cu130
  • NCCL: 2.30.4 (as reported by torch)

Actual behavior

vLLM fails before model weights are loaded:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 23.04 GiB (GPU 0; 23.52 GiB total; 0 bytes free)

Before vLLM reaches the model loading phase, the torch environment already reports about 23 GiB of GPU memory in use.
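The numbers in the error message show how much of the card the failing allocation asked for. The sketch below parses the message text exactly as quoted in this report; the helper name `requested_fraction` is hypothetical, and other torch versions word their OOM messages differently, so the pattern is an assumption tied to this report.

```python
import re

# Pattern for the OOM message exactly as quoted in this report (an assumption:
# other torch versions phrase the message differently and will not match).
_OOM_RE = re.compile(
    r"Tried to allocate (?P<tried>\d+(?:\.\d+)?) GiB "
    r"\(GPU (?P<gpu>\d+); (?P<total>\d+(?:\.\d+)?) GiB total"
)


def requested_fraction(message):
    """Return (gpu_index, tried_gib / total_gib), or None if the text does not match."""
    match = _OOM_RE.search(message)
    if match is None:
        return None
    tried = float(match["tried"])
    total = float(match["total"])
    return int(match["gpu"]), tried / total


if __name__ == "__main__":
    msg = (
        "torch.cuda.OutOfMemoryError: CUDA out of memory.\n"
        "Tried to allocate 23.04 GiB (GPU 0; 23.52 GiB total; 0 bytes free)"
    )
    print(requested_fraction(msg))
```

For the message above, the single allocation covers roughly 98% of the device (23.04 of 23.52 GiB), which is consistent with the failure happening before any model weights are involved.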

Reproduction

from vllm import LLM

llm = LLM(
    model="/home/sinoma/models/google/gemma-4-31B-it",
    tensor_parallel_size=1,
)

The same failure occurs with other models, including smaller and quantized ones, because almost no VRAM remains when model loading starts.

Minimal torch-level check

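import torch

# Report the versions in play; torch.cuda.nccl.version() returns a tuple.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())

# Select GPU 0, then read the memory counters before any tensor is allocated.
torch.cuda.set_device(0)
print("allocated GiB:", torch.cuda.memory_allocated(0) / 2**30)  # live tensors in this process
print("reserved GiB:", torch.cuda.memory_reserved(0) / 2**30)  # held by the caching allocator
print("device used GiB:", torch.cuda.device_memory_used(0) / 2**30)  # device-wide usage
print("mem_get_info (free, total) GiB:", tuple(x / 2**30 for x in torch.cuda.mem_get_info(0)))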

Observed comparison:

torch 2.10.0+cu130:
  initialization GPU memory: ~0 GiB

torch 2.11.0+cu130:
  initialization GPU memory: ~22-23 GiB
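The comparison above can be turned into a small check that runs under each torch install. This is a sketch under stated assumptions: the helper names and the 2 GiB threshold are hypothetical, and `torch.cuda.mem_get_info` reports device-wide usage, so the GPU should be otherwise idle when the check runs.

```python
GIB = 2**30


def used_gib(free_bytes, total_bytes):
    """Device memory in use, in GiB, from a (free, total) pair in bytes."""
    return (total_bytes - free_bytes) / GIB


def startup_overhead_report(threshold_gib=2.0):
    """Initialize CUDA and report memory in use on device 0 before any tensor exists.

    Returns None when torch or a CUDA device is unavailable.
    The threshold is an arbitrary illustration, not a documented limit.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.init()  # force CUDA initialization without allocating tensors
    free, total = torch.cuda.mem_get_info(0)
    used = used_gib(free, total)
    return {
        "torch": torch.__version__,
        "used_gib": round(used, 2),
        "total_gib": round(total / GIB, 2),
        "suspicious": used > threshold_gib,
    }


if __name__ == "__main__":
    print(startup_overhead_report())
```

Running the same script in a torch 2.10.0+cu130 environment and a torch 2.11.0+cu130 environment on the same idle GPU gives two reports that can be compared directly.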

Attempted workarounds

No improvement with:

  • enforce_eager=True
  • gpu_memory_utilization=0.5
  • single GPU via CUDA_VISIBLE_DEVICES=1
  • tensor_parallel_size=2
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments
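For anyone re-testing these settings, the sketch below keeps each one as a separate variant, since some of them (single GPU versus `tensor_parallel_size=2`) cannot be combined. The variant names and the `build_run` helper are hypothetical, and `expandable_segments:True` is the `key:value` spelling of the allocator option, which is an assumption about what the shorthand in the list meant.

```python
import os

# Each entry is (extra environment variables, extra LLM keyword arguments).
# Variant names are hypothetical labels; the values mirror the list above.
VARIANTS = {
    "eager": ({}, {"enforce_eager": True}),
    "half_util": ({}, {"gpu_memory_utilization": 0.5}),
    "single_gpu": ({"CUDA_VISIBLE_DEVICES": "1"}, {}),
    "tp2": ({}, {"tensor_parallel_size": 2}),
    # Assumed full spelling of the allocator option named in the report.
    "expandable": ({"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}, {}),
}


def build_run(name, base_env=None):
    """Return (env, llm_kwargs) for one variant without modifying os.environ."""
    extra_env, llm_kwargs = VARIANTS[name]
    env = dict(os.environ if base_env is None else base_env)
    env.update(extra_env)
    return env, dict(llm_kwargs)


if __name__ == "__main__":
    for name in VARIANTS:
        env, kwargs = build_run(name, base_env={})
        print(name, env, kwargs)
```

The environment variables only take effect if they are set before the process initializes CUDA, so one option is to pass `env` to a fresh child process (for example via `subprocess.run(..., env=env)`) and forward `llm_kwargs` to `LLM(...)` inside it.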

Also tried:

  • torch==2.10.0 + vllm==0.20.1: fails due to binary/ABI incompatibility
  • torch==2.10.0 + older vLLM: possible workaround, but hard to sustain because support for Qwen3.5 and other new models is landing quickly in newer vLLM releases

Request

Could vLLM provide one of the following?

  1. A vLLM 0.20.x compatible wheel using the torch 2.10.0/cu130 stack.
  2. A documented workaround for 24 GB Ada GPUs affected by torch 2.11.0/cu130 startup memory usage.
  3. Guidance on whether users should use vLLM 0.19.1, cu129, or a torch 2.12/nightly build until this is fixed upstream.

This affects local inference deployments on RTX 4090-class GPUs, where a model fits only if the runtime does not consume almost the entire GPU before weights are loaded.
