pytorch: CUDA 13 / torch 2.11.0 appears to consume ~23 GiB on RTX 4090 before model weights are loaded

Summary

On a 3x RTX 4090 Ubuntu inference server, torch==2.11.0+cu130 appears to consume almost all VRAM on each 24 GB GPU during CUDA initialization / early model loading, before vLLM starts loading model weights. This makes vLLM and even plain torch.load() + model.to("cuda") fail with OOM on workloads that worked with torch==2.10.0+cu130.

This looks like a regression between torch 2.10.0 and 2.11.0 on CUDA 13.0 wheels, possibly related to the CUDA/NCCL dependency update in the 2.11 release stack.

Environment

  • OS: Ubuntu
  • Python: 3.11
  • GPU: 3x NVIDIA RTX 4090, 24 GB each
  • NVIDIA driver: R580
  • CUDA runtime/toolkit: 13.0.3
  • torch bad version: 2.11.0+cu130
  • torch working version: 2.10.0+cu130
  • vLLM: 0.20.1
  • NCCL reported by torch 2.11 environment: 2.30.4

Actual behavior

With torch==2.11.0+cu130, loading any model through vLLM fails before model weight loading with:

torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 23.04 GiB (GPU 0; 23.52 GiB total; 0 bytes free)

Before model loading is reached, torch.cuda.memory_allocated() already reports about 23 GiB in use on the target GPU.

Observed example:

>>> import torch
>>> torch.cuda.set_device(0)
>>> torch.cuda.memory_allocated() / 2**30
23.04

The same environment with torch==2.10.0+cu130 reports approximately 0.00 GiB after initialization and does not show this pre-weight-load OOM behavior.
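One way to check whether the ~23 GiB is held by the current process or by a different one is to ask the driver for per-process usage. The sketch below parses the CSV printed by `nvidia-smi --query-compute-apps`; the helper names and the sample rows are hypothetical and only illustrate the expected shape of the output, not measurements from this system.

```python
import csv
import io
import subprocess


def parse_compute_apps(csv_text):
    """Parse `--format=csv,noheader,nounits` rows into (pid, name, used_mib) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = (f.strip() for f in fields)
        try:
            rows.append((int(pid), name, int(used)))
        except ValueError:
            continue  # e.g. "[N/A]" when usage cannot be reported
    return rows


def query_compute_apps():
    """Run nvidia-smi and return per-process GPU memory usage in MiB."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_apps(out)


# Hypothetical sample output, used here only to show the parsing.
sample = "1234, python, 22016\n5678, /usr/bin/python3, 512\n"
for pid, name, mib in parse_compute_apps(sample):
    print(f"pid={pid} {name}: {mib / 1024:.2f} GiB")
```

Calling `query_compute_apps()` from a second shell while the failing process is paused after `torch.cuda.set_device(0)` would show which PID owns the memory.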

Expected behavior

CUDA initialization should not consume almost the entire 24 GB device before model weights, KV cache, activations, or user tensors are loaded. The behavior should be close to torch 2.10.0, where initialization leaves nearly all VRAM available.

Minimal diagnostic script

Please run this on torch 2.10.0+cu130 and torch 2.11.0+cu130 in clean environments:

import torch

# Report the exact build under test.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
try:
    print("nccl:", torch.cuda.nccl.version())
except Exception as exc:
    print("nccl: unavailable", repr(exc))

# Select the device and synchronize so CUDA is initialized before reading counters.
torch.cuda.set_device(0)
torch.cuda.synchronize()

# Memory tracked by PyTorch's caching allocator in this process.
print("memory_allocated GiB:", torch.cuda.memory_allocated(0) / 2**30)
print("memory_reserved GiB:", torch.cuda.memory_reserved(0) / 2**30)

# Device-wide usage; guarded because it may be unavailable in some environments.
try:
    print("device_memory_used GiB:", torch.cuda.device_memory_used(0) / 2**30)
except Exception as exc:
    print("device_memory_used unavailable:", repr(exc))

# Free and total device memory as seen by the CUDA driver.
free, total = torch.cuda.mem_get_info(0)
print("mem_get_info free GiB:", free / 2**30)
print("mem_get_info total GiB:", total / 2**30)

Expected comparison:

torch 2.10.0+cu130:
  memory_allocated: ~0.00 GiB
  usable free VRAM: ~23 GiB

torch 2.11.0+cu130:
  memory_allocated or device used: ~22-23 GiB
  usable free VRAM: near 0 GiB
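When comparing the two versions, it may help to separate memory held by PyTorch's caching allocator from memory used elsewhere on the device. `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` only cover the caching allocator of the current process, while `torch.cuda.mem_get_info()` reports device-wide free and total memory. The following is a minimal sketch of that breakdown; the function names `breakdown` and `collect` are hypothetical.

```python
GIB = 2**30


def breakdown(free, total, allocated, reserved):
    """Split device usage (all arguments in bytes) into allocator and non-allocator parts."""
    used = total - free          # everything in use on the device, across all processes
    outside = used - reserved    # CUDA context, library workspaces, other processes
    return {
        "device_used_gib": used / GIB,
        "allocator_reserved_gib": reserved / GIB,
        "tensor_allocated_gib": allocated / GIB,
        "outside_allocator_gib": outside / GIB,
    }


def collect(device=0):
    """Read the counters from a live GPU; torch is imported lazily."""
    import torch

    torch.cuda.set_device(device)
    torch.cuda.synchronize()
    free, total = torch.cuda.mem_get_info(device)
    return breakdown(
        free,
        total,
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )
```

A large `outside_allocator_gib` with a small `allocator_reserved_gib` would point away from the caching allocator and toward the CUDA context, a library workspace, or a different process.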

vLLM reproduction

from vllm import LLM

llm = LLM(
    model="/home/sinoma/models/google/gemma-4-31B-it",
    tensor_parallel_size=1,
)

This fails immediately with the OOM shown above before model weights can be loaded.

The same class of failure also occurs with smaller models that should fit after quantization, because the device has already been almost fully consumed before the model load phase.

Things already tried

The following did not change the behavior under torch==2.11.0+cu130:

  • enforce_eager=True
  • lower gpu_memory_utilization, including 0.5
  • CUDA_VISIBLE_DEVICES=1 single-GPU run
  • tensor_parallel_size=2
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  • rebuilding torch 2.11.0 from source in the same CUDA/NCCL stack
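For anyone retrying these settings, a sketch of how they might be applied follows. The environment variables have to be set before the CUDA runtime is initialized, so the example builds an environment for a fresh subprocess. `PYTORCH_CUDA_ALLOC_CONF` takes `key:value` pairs, hence `expandable_segments:True`. The helper names, the script path argument, and the particular combination of values are assumptions for illustration.

```python
import os
import subprocess
import sys


def build_env(base=None, gpu="1", alloc_conf="expandable_segments:True"):
    """Return a copy of the environment with the GPU and allocator settings applied."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = gpu            # expose a single GPU to the child process
    env["PYTORCH_CUDA_ALLOC_CONF"] = alloc_conf  # caching-allocator option, key:value form
    return env


def run_script(path, **env_overrides):
    """Run a diagnostic script (hypothetical path) in a clean process with these settings."""
    return subprocess.run([sys.executable, path], env=build_env(**env_overrides))


# vLLM engine arguments corresponding to the items in the list above
# (one hypothetical combination; they would be passed as LLM(model=..., **LLM_KWARGS)).
LLM_KWARGS = {
    "enforce_eager": True,
    "gpu_memory_utilization": 0.5,
    "tensor_parallel_size": 1,
}
```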

Also verified:

  • torch==2.10.0+cu130 does not show this behavior.
  • vllm==0.20.1 requires torch==2.11.0, so vLLM 0.20.1 cannot be used with the known-good torch 2.10.0 environment.
  • Trying to combine vLLM 0.20.1 with torch 2.10.0 fails due to binary/ABI incompatibility.

Impact

This breaks local inference serving on 24 GB consumer Ada GPUs, especially RTX 4090 systems. In this configuration, a pre-model-load 22-23 GiB allocation leaves no room for even small/quantized LLM weights, KV cache, or runtime buffers.
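The arithmetic behind this can be sketched with the figures from the error message (23.52 GiB total, 23.04 GiB requested). The parameter counts and bit widths below are hypothetical examples, and the estimate covers weights only, ignoring KV cache and runtime buffers.

```python
GIB = 2**30


def headroom_gib(total_gib, consumed_gib):
    """VRAM left after the pre-load allocation, in GiB."""
    return total_gib - consumed_gib


def weights_gib(num_params, bits_per_param):
    """Approximate size of model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


left = headroom_gib(23.52, 23.04)
print(f"headroom after pre-load allocation: {left:.2f} GiB")

# Hypothetical model sizes, used only to compare against the headroom.
for num_params in (1e9, 7e9):
    for bits in (16, 4):
        need = weights_gib(num_params, bits)
        verdict = "fits" if need <= left else "does not fit"
        print(f"{num_params / 1e9:.0f}B params @ {bits}-bit: {need:.2f} GiB ({verdict})")
```

Under these assumptions, even a 1B-parameter model quantized to 4 bits needs roughly as much memory for its weights as the remaining headroom, which leaves nothing for the KV cache.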

This is a regression for users who were able to run the same inference stack on torch 2.10.0+cu130.

Request

Could the PyTorch team confirm whether this is expected behavior in the torch 2.11.0 CUDA 13.0 wheel stack, or a regression in CUDA/NCCL allocator behavior?

If it is related to the NCCL/CUDA dependency update, would it be possible to identify a workaround or backport a fix to the 2.11 stable line?

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia
