vLLM 0.20.1 hard-pins torch 2.11.0, which OOMs during CUDA initialization on RTX 4090 / cu130

vllm · 2026-05-08 09:27:01

vllm==0.20.1 hard-pins torch==2.11.0. On our 3x RTX 4090 Ubuntu server, this dependency stack appears to consume ~22-23 GiB VRAM per 24 GB GPU before model weights are loaded, causing vLLM to fail with CUDA OOM for any model.

The same machine works with torch==2.10.0+cu130, but vLLM 0.20.1 cannot be used with torch 2.10.0 because of dependency and ABI incompatibility.

Related upstream PyTorch issue: https://github.com/pytorch/pytorch/issues/182941

Error Message

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.04 GiB (GPU 0; 23.52 GiB total; 0 bytes free)

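Root Cause

Not yet confirmed. The evidence points at the torch 2.11.0+cu130 stack rather than at vLLM's own model loading: a torch-only check on the same machine reports ~22-23 GiB in use right after CUDA initialization, versus ~0 GiB with torch==2.10.0+cu130. Because vllm==0.20.1 pins torch==2.11.0 and is ABI-incompatible with torch 2.10.0, downgrading torch alone is not an option.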
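
To confirm the pin on a given install, the requirement strings in the package metadata can be inspected. The following is a minimal sketch using only the standard library; `find_torch_pin` and `installed_torch_pin` are hypothetical helper names, and the sketch assumes the pin appears in the metadata as an exact `torch==X` requirement.

```python
import re
from importlib import metadata


def find_torch_pin(requirements):
    """Return the exact torch version pinned in a list of requirement strings, or None."""
    for req in requirements or []:
        # Match "torch==X" or "torch (==X)" but not torchvision / torchaudio.
        match = re.match(r"^torch\s*\(?\s*==\s*([^\s;,)]+)", req.strip())
        if match:
            return match.group(1)
    return None


def installed_torch_pin(dist_name="vllm"):
    """Look up the pin in the installed distribution's metadata, if it is installed."""
    try:
        return find_torch_pin(metadata.requires(dist_name))
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_torch_pin()` in the affected environment should print the pinned version if the assumption about the metadata format holds; a `None` result means either vLLM is not installed or the requirement is expressed differently.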

Environment

OS: Ubuntu
Python: 3.11
GPU: 3x NVIDIA RTX 4090 24 GB
Driver: R580
CUDA: 13.0.3
vLLM: 0.20.1
torch: 2.11.0+cu130
NCCL: 2.30.4, from torch environment
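
For anyone comparing their own setup against the list above, the Python-side versions can be gathered with the standard library. This is a rough sketch, not part of the original report; `collect_versions` and `format_report` are hypothetical names, and the driver and CUDA toolkit versions are not covered because they are not visible through package metadata.

```python
import platform
from importlib import metadata


def collect_versions(dist_names=("vllm", "torch")):
    """Return interpreter, OS, and installed package versions (None if missing)."""
    versions = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in dist_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence instead of failing, so the report stays complete.
            versions[name] = None
    return versions


def format_report(versions):
    """Render the versions as 'key: value' lines, matching the list style above."""
    return "\n".join(f"{key}: {value}" for key, value in versions.items())
```

`print(format_report(collect_versions()))` produces a block that can be pasted next to the environment list for comparison.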

Actual behavior

vLLM fails before model weights are loaded:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 23.04 GiB (GPU 0; 23.52 GiB total; 0 bytes free)

Before vLLM reaches the model loading phase, the torch environment already reports about 23 GiB GPU memory consumed.
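
The numbers in the error are worth reading closely: the failed request (23.04 GiB) is about 98% of the card's reported capacity (23.52 GiB). A small parser makes that ratio explicit. This is a sketch that assumes the message format quoted above; other PyTorch versions word the error differently, and `parse_oom` is a hypothetical helper name.

```python
import re

_UNITS = {"bytes": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}
_UNIT_RE = r"bytes|KiB|MiB|GiB"
_PATTERN = re.compile(
    rf"Tried to allocate (?P<tried>[\d.]+) (?P<tried_unit>{_UNIT_RE})"
    rf".*?(?P<total>[\d.]+) (?P<total_unit>{_UNIT_RE}) total"
    rf".*?(?P<free>[\d.]+) (?P<free_unit>{_UNIT_RE}) free",
    re.DOTALL,  # the message may be wrapped across lines
)


def parse_oom(message):
    """Extract requested, total, and free sizes (in bytes) from the OOM text."""
    match = _PATTERN.search(message)
    if match is None:
        return None

    def to_bytes(value_key, unit_key):
        return float(match.group(value_key)) * _UNITS[match.group(unit_key)]

    tried = to_bytes("tried", "tried_unit")
    total = to_bytes("total", "total_unit")
    return {
        "tried_bytes": tried,
        "total_bytes": total,
        "free_bytes": to_bytes("free", "free_unit"),
        # Share of the whole card that the single failed request asked for.
        "tried_fraction": tried / total,
    }
```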

Reproduction

from vllm import LLM

llm = LLM(
    model="/home/sinoma/models/google/gemma-4-31B-it",
    tensor_parallel_size=1,
)

The same issue happens with different models and with smaller/quantized models because almost no VRAM remains before model load starts.

Minimal torch-level check

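import torch

# Report the installed stack so results can be compared across torch versions.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())

# Select GPU 0; no tensors are allocated before the queries below.
torch.cuda.set_device(0)
# Memory held by this process's PyTorch caching allocator.
print("allocated GiB:", torch.cuda.memory_allocated(0) / 2**30)
print("reserved GiB:", torch.cuda.memory_reserved(0) / 2**30)
# Device-wide usage as reported by the driver.
print("device used GiB:", torch.cuda.device_memory_used(0) / 2**30)
# (free, total) for the device, converted to GiB.
print("mem_get_info:", tuple(x / 2**30 for x in torch.cuda.mem_get_info(0)))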

Observed comparison:

torch 2.10.0+cu130:
  initialization GPU memory: ~0 GiB

torch 2.11.0+cu130:
  initialization GPU memory: ~22-23 GiB
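
The comparison above can be turned into a pass/fail check that runs before vLLM is started. The sketch below is an illustration only: the helper names and the 2 GiB threshold are assumptions, not values from the report. Note that `torch.cuda.mem_get_info` reports device-wide free memory, so other processes on the same GPU also count toward the "used" figure.

```python
GIB = 2**30


def used_gib(free_bytes, total_bytes):
    """Device memory already in use, in GiB, from a (free, total) pair."""
    return (total_bytes - free_bytes) / GIB


def looks_bloated(free_bytes, total_bytes, threshold_gib=2.0):
    """True if usage right after CUDA init exceeds the (assumed) threshold."""
    return used_gib(free_bytes, total_bytes) > threshold_gib


def measure_startup_usage(device=0):
    """Query the GPU after CUDA init; requires torch and a CUDA device."""
    import torch  # imported lazily so the helpers above work without torch

    torch.cuda.set_device(device)
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return used_gib(free_bytes, total_bytes)
```

On the affected stack, `measure_startup_usage()` would be expected to return a value near the ~22-23 GiB reported above, and near zero on torch 2.10.0+cu130 with an otherwise idle GPU.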

Attempted workarounds

No improvement with:

enforce_eager=True
gpu_memory_utilization=0.5
single GPU via CUDA_VISIBLE_DEVICES=1
tensor_parallel_size=2
PYTORCH_CUDA_ALLOC_CONF=expandable_segments
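
For anyone re-running these attempts, the settings can be written down as data and applied one at a time. This is a sketch under assumptions: the report does not show the exact invocations, the `expandable_segments:True` spelling is the documented form of that allocator option and is assumed here, and `ATTEMPTS` and `run_attempt` are hypothetical names. Each attempt should run in a fresh process, because the environment variables must be set before CUDA is initialized.

```python
import os

# One entry per setting listed above; env vars and LLM kwargs are kept separate.
ATTEMPTS = [
    {"name": "eager mode", "env": {}, "llm_kwargs": {"enforce_eager": True}},
    {"name": "lower memory target", "env": {},
     "llm_kwargs": {"gpu_memory_utilization": 0.5}},
    {"name": "single visible GPU", "env": {"CUDA_VISIBLE_DEVICES": "1"},
     "llm_kwargs": {}},
    {"name": "two-way tensor parallel", "env": {},
     "llm_kwargs": {"tensor_parallel_size": 2}},
    {"name": "expandable segments",
     "env": {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"},
     "llm_kwargs": {}},
]


def run_attempt(attempt, model_path):
    """Apply one attempt's settings, then construct the engine."""
    # Set the environment first: it is read when torch initializes CUDA.
    os.environ.update(attempt["env"])
    from vllm import LLM  # imported after the environment is in place

    return LLM(model=model_path, **attempt["llm_kwargs"])
```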

Also tried:

torch==2.10.0 + vllm==0.20.1: fails due to binary/ABI incompatibility
torch==2.10.0 + older vLLM: possible workaround, but difficult because Qwen3.5/new model support is moving quickly

Request

Could vLLM provide one of the following?

A vLLM 0.20.x compatible wheel using the torch 2.10.0/cu130 stack.
A documented workaround for 24 GB Ada GPUs affected by torch 2.11.0/cu130 startup memory usage.
Guidance on whether users should use vLLM 0.19.1, cu129, or a torch 2.12/nightly build until this is fixed upstream.

This affects RTX 4090-class local inference deployments where the model would otherwise fit only if the runtime does not consume almost the entire GPU before loading weights.
