pytorch: CUDA 13 / torch 2.11.0 appears to consume ~23 GiB on RTX 4090 before model weights are loaded

pytorch · 2026-05-08 09:26:48


On a 3x RTX 4090 Ubuntu inference server, torch==2.11.0+cu130 appears to consume almost all VRAM on each 24 GB GPU during CUDA initialization / early model loading, before vLLM starts loading model weights. This makes vLLM and even plain torch.load() + model.to("cuda") fail with OOM on workloads that worked with torch==2.10.0+cu130.

This looks like a regression between torch 2.10.0 and 2.11.0 on CUDA 13.0 wheels, possibly related to the CUDA/NCCL dependency update in the 2.11 release stack.


Environment

OS: Ubuntu
Python: 3.11
GPU: 3x NVIDIA RTX 4090, 24 GB each
NVIDIA driver: R580
CUDA runtime/toolkit: 13.0.3
torch bad version: 2.11.0+cu130
torch working version: 2.10.0+cu130
vLLM: 0.20.1
NCCL reported by torch 2.11 environment: 2.30.4

Actual behavior

With torch==2.11.0+cu130, loading any model through vLLM fails before model weight loading with:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 23.04 GiB (GPU 0; 23.52 GiB total; 0 bytes free)
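When triaging logs, it can help to pull the figures out of this message and compare the requested size with what the device reported as free. The sketch below parses only the message format quoted in this report; other PyTorch versions word the OOM message differently, and `parse_oom` is a hypothetical helper, not a PyTorch API.

```python
import re

# Unit multipliers for the sizes that appear in the quoted message.
_UNITS = {"bytes": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}

# Matches the message format quoted in this report only (an assumption).
_PATTERN = re.compile(
    r"Tried to allocate (?P<req>[\d.]+) (?P<req_u>\w+) "
    r"\(GPU (?P<gpu>\d+); (?P<tot>[\d.]+) (?P<tot_u>\w+) total; "
    r"(?P<free>[\d.]+) (?P<free_u>\w+) free\)"
)


def parse_oom(message):
    """Return requested/total/free sizes in bytes, or None if the text does not match."""
    m = _PATTERN.search(message)
    if m is None:
        return None
    try:
        return {
            "gpu": int(m["gpu"]),
            "requested_bytes": float(m["req"]) * _UNITS[m["req_u"]],
            "total_bytes": float(m["tot"]) * _UNITS[m["tot_u"]],
            "free_bytes": float(m["free"]) * _UNITS[m["free_u"]],
        }
    except KeyError:  # unknown unit string
        return None


MESSAGE = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory.\n"
    "Tried to allocate 23.04 GiB (GPU 0; 23.52 GiB total; 0 bytes free)"
)
parsed = parse_oom(MESSAGE)
print(parsed)
```

Note that the message describes a single 23.04 GiB request against a device that already reports 0 bytes free, so both numbers are worth recording when comparing torch versions.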

Before the model loading phase is reached, torch.cuda.memory_allocated() already reports about 23 GiB in use on the target GPU.

Observed example:

>>> import torch
>>> torch.cuda.set_device(0)
>>> torch.cuda.memory_allocated() / 2**30
23.04

The same environment with torch==2.10.0+cu130 reports approximately 0.00 GiB after initialization and does not show this pre-weight-load OOM behavior.
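The counters involved here measure different things, which matters when deciding where the 23 GiB lives. `torch.cuda.memory_allocated()` counts bytes held by live tensors in PyTorch's caching allocator, `torch.cuda.memory_reserved()` counts bytes that allocator has obtained from the driver, and `torch.cuda.mem_get_info()` reports device-wide free and total memory, which also reflects the CUDA context and any other process. A minimal sketch of splitting a set of readings along those lines follows; `attribute_device_memory` and the sample numbers are hypothetical and are not measurements from this report.

```python
GIB = 2**30


def attribute_device_memory(allocated, reserved, free, total):
    """Split device usage (all inputs in bytes) into allocator-owned memory and the rest."""
    used = total - free
    cached_not_in_use = max(reserved - allocated, 0)  # reserved by the allocator but idle
    outside_allocator = max(used - reserved, 0)       # CUDA context, libraries, other processes
    return {
        "used_gib": used / GIB,
        "allocated_gib": allocated / GIB,
        "cached_gib": cached_not_in_use / GIB,
        "outside_allocator_gib": outside_allocator / GIB,
    }


# Hypothetical readings: the device is nearly full, yet the allocator holds nothing.
sample = attribute_device_memory(
    allocated=0,
    reserved=0,
    free=int(0.5 * GIB),
    total=int(23.5 * GIB),
)
print(sample)
```

With torch installed, the inputs would come from `torch.cuda.memory_allocated(0)`, `torch.cuda.memory_reserved(0)`, and `torch.cuda.mem_get_info(0)`. A large `outside_allocator_gib` points away from tensors created by the current process, whereas a large `allocated_gib` points at an allocation made through PyTorch itself.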

Expected behavior

CUDA initialization should not consume almost the entire 24 GB device before model weights, KV cache, activations, or user tensors are loaded. The behavior should be close to torch 2.10.0, where initialization leaves nearly all VRAM available.

Minimal diagnostic script

Please run this on torch 2.10.0+cu130 and torch 2.11.0+cu130 in clean environments:

import torch

# Record the exact build under test.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
try:
    print("nccl:", torch.cuda.nccl.version())
except Exception as exc:
    print("nccl: unavailable", repr(exc))

# Force CUDA context creation on GPU 0 before reading any counters.
torch.cuda.set_device(0)
torch.cuda.synchronize()

# Caching-allocator counters: bytes in live tensors, and bytes reserved from the driver.
print("memory_allocated GiB:", torch.cuda.memory_allocated(0) / 2**30)
print("memory_reserved GiB:", torch.cuda.memory_reserved(0) / 2**30)

# Device-wide usage as reported by NVML; not available in every environment.
try:
    print("device_memory_used GiB:", torch.cuda.device_memory_used(0) / 2**30)
except Exception as exc:
    print("device_memory_used unavailable:", repr(exc))

# Device-wide free/total as seen by the CUDA driver.
free, total = torch.cuda.mem_get_info(0)
print("mem_get_info free GiB:", free / 2**30)
print("mem_get_info total GiB:", total / 2**30)

Expected comparison:

torch 2.10.0+cu130:
  memory_allocated: ~0.00 GiB
  usable free VRAM: ~23 GiB

torch 2.11.0+cu130:
  memory_allocated or device used: ~22-23 GiB
  usable free VRAM: near 0 GiB
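Because the device-wide numbers include every process on the GPU, it may be worth confirming which process owns the memory while the diagnostic is running. The sketch below shells out to `nvidia-smi --query-compute-apps` and parses its CSV output. The helper names and the sample text are hypothetical, and the query returns `None` on machines where `nvidia-smi` is missing or fails.

```python
import csv
import io
import subprocess

# Per-process GPU memory, one CSV row per compute process, memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, used_mib' rows; rows with non-numeric fields are skipped."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue
        pid, name, used_mib = (f.strip() for f in fields)
        if not (pid.isdigit() and used_mib.isdigit()):
            continue
        rows.append({"pid": int(pid), "name": name, "used_mib": int(used_mib)})
    return rows


def query_compute_apps():
    """Run nvidia-smi and return parsed rows, or None if the tool is unavailable."""
    try:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, check=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_compute_apps(result.stdout)


# Hypothetical output, used only to show the parsed shape.
sample_rows = parse_compute_apps("4242, python, 23040\n5151, /usr/bin/other, 312\n")
print(sample_rows)
```

If the process running the diagnostic script is the one holding roughly 23 GiB, the usage originates inside that process. If a different PID holds it, the OOM may have a cause unrelated to the torch upgrade.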

vLLM reproduction

from vllm import LLM

llm = LLM(
    model="/home/sinoma/models/google/gemma-4-31B-it",
    tensor_parallel_size=1,
)

This fails immediately with the OOM shown above before model weights can be loaded.

The same class of failure also occurs with smaller models that should fit after quantization, because the device has already been almost fully consumed before the model load phase.

Things already tried

The following did not change the behavior under torch==2.11.0+cu130:

enforce_eager=True
lower gpu_memory_utilization, including 0.5
CUDA_VISIBLE_DEVICES=1 single-GPU run
tensor_parallel_size=2
PYTORCH_CUDA_ALLOC_CONF=expandable_segments
rebuilding torch 2.11.0 from source in the same CUDA/NCCL stack
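One detail worth double-checking in the list above: `PYTORCH_CUDA_ALLOC_CONF` takes comma-separated `option:value` pairs, so the expandable-segments setting is normally written as `expandable_segments:True`. A minimal sketch of setting it before torch is imported, plus a small checker for entries that lack a value, is shown below. `parse_alloc_conf` is a hypothetical helper that handles simple options only and is not PyTorch's own parser.

```python
import os

# Set the option before importing torch so the caching allocator sees it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"


def parse_alloc_conf(value):
    """Split 'key:value,key:value' into (options, malformed_entries)."""
    options, malformed = {}, []
    for entry in filter(None, (part.strip() for part in value.split(","))):
        key, sep, val = entry.partition(":")
        if sep and key.strip() and val.strip():
            options[key.strip()] = val.strip()
        else:
            malformed.append(entry)  # e.g. an option name with no ':value'
    return options, malformed


# An option name without a value is reported as malformed by this checker.
print(parse_alloc_conf("expandable_segments"))
print(parse_alloc_conf(os.environ["PYTORCH_CUDA_ALLOC_CONF"]))
```

Whether the setting changes the behavior reported here is unknown; the point is only to make sure the attempt used a form the allocator accepts.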

Also verified:

torch==2.10.0+cu130 does not show this behavior.
vllm==0.20.1 requires torch==2.11.0, so vLLM 0.20.1 cannot be used with the known-good torch 2.10.0 environment.
Trying to combine vLLM 0.20.1 with torch 2.10.0 fails due to binary/ABI incompatibility.
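The version constraint can be confirmed from package metadata without importing either library. The following sketch reads installed versions and the torch requirement a distribution declares, using only the standard library. The helper names are hypothetical, and the requirement strings in the example are illustrative rather than copied from any vLLM release.

```python
import re
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def torch_requirements(requirement_strings):
    """Return the requirement strings whose project name is exactly 'torch'."""
    hits = []
    for req in requirement_strings or []:
        name = re.match(r"[A-Za-z0-9._-]+", req.strip())
        if name and name.group(0).lower() == "torch":
            hits.append(req)
    return hits


def declared_torch_pin(dist_name="vllm"):
    """Return the torch requirement(s) declared by dist_name, or None if not installed."""
    try:
        return torch_requirements(metadata.requires(dist_name))
    except metadata.PackageNotFoundError:
        return None


# Illustrative requirement strings only.
print(torch_requirements(["torch==2.11.0", "torchvision", "numpy"]))
print("torch installed:", installed_version("torch"))
print("vllm declares:", declared_torch_pin("vllm"))
```

Running this in both the known-good and the failing environment gives a compact record of which torch build each one actually resolved to.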

Impact

This breaks local inference serving on 24 GB consumer Ada GPUs, especially RTX 4090 systems. In this configuration, a pre-model-load 22-23 GiB allocation leaves no room for even small/quantized LLM weights, KV cache, or runtime buffers.

This is a regression for users who were able to run the same inference stack on torch 2.10.0+cu130.
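A rough back-of-the-envelope check illustrates the impact. The sketch takes the 23.52 GiB total and 23.04 GiB in-use figures quoted in this report and estimates weight size from parameter count and bits per parameter. It counts weights only, ignoring KV cache, activations, and runtime buffers, and the model sizes are hypothetical examples.

```python
GIB = 2**30

# Figures quoted in this report.
total_gib = 23.52
taken_gib = 23.04
headroom_gib = total_gib - taken_gib  # roughly 0.48 GiB left


def weights_gib(n_params, bits_per_param):
    """Approximate size of model weights in GiB."""
    return n_params * bits_per_param / 8 / GIB


def fits(n_params, bits_per_param, headroom=headroom_gib):
    """True if the weights alone fit in the given headroom (GiB)."""
    return weights_gib(n_params, bits_per_param) <= headroom


# Hypothetical 7B-parameter model quantized to 4 bits per parameter.
print("headroom GiB:", round(headroom_gib, 2))
print("7B @ 4-bit GiB:", round(weights_gib(7e9, 4), 2))
print("fits in remaining headroom:", fits(7e9, 4))
print("fits on an otherwise empty device:", fits(7e9, 4, headroom=23.0))
```

Under these assumptions the same model that fits comfortably on an empty 24 GB card cannot load at all once the pre-load usage is in place, which matches the failure described for quantized models.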

Request

Could the PyTorch team confirm whether this is expected behavior in the torch 2.11.0 CUDA 13.0 wheel stack, or a regression in CUDA/NCCL allocator behavior?

If it is related to the NCCL/CUDA dependency update, would it be possible to identify a workaround or backport a fix to the 2.11 stable line?

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia
