transformers - ✅ (Solved) Increased CUDA reserved memory in Transformers 5.x under int4 quantization leads to OOM [1 pull request, 16 comments, 5 participants]

GitHub stats

huggingface/transformers#44387 (fetched 2026-04-08 00:28:48)
Comments: 16 · Participants: 5 · Timeline events: 45 · Reactions: 0
Timeline (top): commented ×16, subscribed ×14, mentioned ×11, closed ×1

Error Message

I encountered a CUDA OOM error when using LlamaFactory to load the Qwen3.5-35B-A3B model with int4 quantization for inference.

  • In an environment with 2 × RTX 5090 32GB GPUs, loading the Qwen3.5-35B-A3B model with int4 quantization using LlamaFactory should not result in an OOM error.

Fix Action

Fixed

PR fix notes

PR #44576: Disable async loading when quantizing on the fly

Description (problem / solution / changelog)

What does this PR do?

Fixes https://github.com/huggingface/transformers/issues/44387.

This PR disables async loading when the model is quantized on the fly. In testing, this was actually faster than bounding the loads with a semaphore. If a quantizer ever quantizes faster than the weights can be materialized, we might consider adding a semaphore for that case.

For future reference:

I tried a semaphore-based approach in which at most GLOBAL_WORKERS (4) full-precision tensors sit on the GPU waiting to be quantized. As the main thread quantizes one and releases a permit, a worker loads the next. We can tune this number in the future if that makes more sense.

This is problematic when weight converters are used, which is the case for some models such as mixtral, qwen2_moe, qwen3_vl_moe, lfm2_moe, ernie4_5_moe, solar_open, jamba, and others. For these, we don't use the semaphore and just warn the user to set `HF_DEACTIVATE_ASYNC_LOAD` to avoid memory issues.
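
The semaphore idea described in the PR notes can be illustrated with a small, library-independent sketch. This is not the code in `core_model_loading.py`; `load_and_quantize`, `materialize`, `quantize`, and `max_in_flight` are hypothetical names, and the default of 4 only mirrors the GLOBAL_WORKERS figure quoted above. A permit is taken before a load is scheduled and returned once that tensor has been quantized, so no more than `max_in_flight` full-precision tensors exist at a time.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def load_and_quantize(names, materialize, quantize, max_in_flight=4):
    """Quantize tensors in order, keeping at most `max_in_flight` materialized.

    `materialize(name)` and `quantize(tensor)` are caller-supplied callables
    (hypothetical stand-ins for weight loading and on-the-fly quantization).
    """
    permits = threading.BoundedSemaphore(max_in_flight)
    pending = deque()  # (name, future) pairs, oldest first
    results = {}

    def quantize_oldest():
        name, future = pending.popleft()
        # result() blocks until the worker has finished loading this tensor.
        results[name] = quantize(future.result())
        # The full-precision copy is no longer needed, so free a slot.
        permits.release()

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for name in names:
            # Take a permit before scheduling a load. If none is free,
            # quantize the oldest pending tensor to release one.
            while not permits.acquire(blocking=False):
                quantize_oldest()
            pending.append((name, pool.submit(materialize, name)))
        # Drain whatever is still waiting once every load is scheduled.
        while pending:
            quantize_oldest()
    return results
```

Taking the permit on the scheduling side, rather than inside the worker, keeps permits assigned in submission order, which avoids a worker for a later tensor holding the last permit while the consumer waits on an earlier one.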

Results using Qwen/Qwen2.5-7B-Instruct

┌───────────────────────────────┬───────┬───────────┬──────────┐
│         Configuration         │ Time  │ Allocated │ Reserved │
├───────────────────────────────┼───────┼───────────┼──────────┤
│ Baseline (no fix, async ON)   │ 30.5s │ 5.46 GB   │ 13.47 GB │
├───────────────────────────────┼───────┼───────────┼──────────┤
│ Baseline (async OFF)          │ 29.6s │ 5.45 GB   │ 5.63 GB  │
├───────────────────────────────┼───────┼───────────┼──────────┤
│ With semaphore fix (async ON) │ 30.3s │ 5.45 GB   │ 5.94 GB  │
└───────────────────────────────┴───────┴───────────┴──────────┘
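
To make the comparison concrete, the gap between reserved and allocated memory can be computed directly from the three rows above. The sketch below only restates the table's numbers; `reserved_overhead` is a hypothetical helper name, and the result is the memory held by the CUDA caching allocator that is not backing live tensors.

```python
# (allocated GB, reserved GB) copied from the Qwen/Qwen2.5-7B-Instruct results.
runs = {
    "Baseline (no fix, async ON)": (5.46, 13.47),
    "Baseline (async OFF)": (5.45, 5.63),
    "With semaphore fix (async ON)": (5.45, 5.94),
}


def reserved_overhead(allocated_gb, reserved_gb):
    """Reserved memory that is not currently backing any tensor."""
    return round(reserved_gb - allocated_gb, 2)


overheads = {name: reserved_overhead(a, r) for name, (a, r) in runs.items()}
for name, gb in overheads.items():
    print(f"{name}: {gb:.2f} GB reserved but unallocated")
```

Allocated memory is essentially identical across the three runs, so the difference between them lies entirely in this overhead.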

Results using Qwen/Qwen3-30B-A3B with FP8

┌─────────┬────────────────┬──────────────────────────────────┐
│         │     Async      │              Sync                │
│         │   (default)    │   (HF_DEACTIVATE_ASYNC_LOAD=1)   │
├─────────┼────────────────┼──────────────────────────────────┤
│ Time    │ 12.7s          │ 15.4s                            │
├─────────┼────────────────┼──────────────────────────────────┤
│ Warning │ Yes            │ No                               │
└─────────┴────────────────┴──────────────────────────────────┘

Changed files

  • src/transformers/core_model_loading.py (modified, +16/-5)


System Info

  • transformers version: 5.2.0
  • Platform: Linux-6.12.57+deb13-amd64-x86_64-with-glibc2.41
  • Python version: 3.12.12
  • Huggingface_hub version: 0.36.2
  • Safetensors version: 0.7.0
  • Accelerate version: 1.11.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.10.0+cu128 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: No
  • GPU type: NVIDIA GeForce RTX 5090

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Problem Description

I encountered a CUDA OOM error when using LlamaFactory to load the Qwen3.5-35B-A3B model with int4 quantization for inference.

My hardware setup consists of 2 × RTX 5090 32GB GPUs. Under normal circumstances, this configuration should be sufficient to load the model without OOM when using int4 quantization.

To further investigate the issue, I used a minimal reproduction script and observed that under Transformers 5.2.0, the CUDA reserved memory appears to be allocated as if the model were not quantized. This results in excessive reserved memory usage, leading to memory waste and potential OOM, even though the actual allocated memory remains consistent with int4 quantization expectations.

This behavior was not observed in earlier versions, where the model could be loaded successfully under the same hardware conditions.

import torch
import gc
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def print_mem(tag):
    torch.cuda.synchronize()
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag}")
    print(f"  allocated: {allocated:.2f} GB")
    print(f"  reserved : {reserved:.2f} GB")


# Clean up GPU memory
gc.collect()
torch.cuda.empty_cache()

print_mem("Before loading")

model_name = "Qwen2.5-7B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

print_mem("After loading")

Output Result

  • Transformers 5.2.0
Before loading
  allocated: 0.00 GB
  reserved : 0.00 GB
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:02<00:00, 155.08it/s, Materializing param=model.norm.weight]
After loading
  allocated: 5.46 GB
  reserved : 13.83 GB
  • Transformers 4.57.5
Before loading
  allocated: 0.00 GB
  reserved : 0.00 GB
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.43s/it]
After loading
  allocated: 5.46 GB
  reserved : 6.70 GB
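
A rough estimate helps check the claim that reserved memory looks "as if the model were not quantized". The sketch below assumes about 7.6 billion parameters for this 7B-class model (an approximation that is not stated in the issue) and counts weight storage only; `weight_footprint_gib` is a hypothetical helper. Like the script above, it divides by 1024**3.

```python
GIB = 1024**3


def weight_footprint_gib(n_params, bits_per_param):
    """Approximate storage for the weights alone, ignoring any overhead."""
    return n_params * bits_per_param / 8 / GIB


N_PARAMS = 7.6e9  # assumption: approximate parameter count, not from the issue

bf16_gib = weight_footprint_gib(N_PARAMS, 16)
int4_gib = weight_footprint_gib(N_PARAMS, 4)

print(f"bf16 weights: {bf16_gib:.2f}")
print(f"int4 weights: {int4_gib:.2f}")
```

Under that assumption the unquantized bf16 weights come to roughly 14, close to the 13.83 GB reserved on Transformers 5.2.0, while the 4-bit figure is well below the 5.46 GB actually allocated. The remaining gap is plausibly due to layers that are left unquantized and to quantization metadata, though the issue does not break this down.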

Expected behavior

  • In an environment with 2 × RTX 5090 32GB GPUs, loading the Qwen3.5-35B-A3B model with int4 quantization using LlamaFactory should not result in an OOM error.
  • The CUDA reserved memory usage during model loading in Transformers 5.2.0 should be consistent with that of Transformers 4.57.5.

extent analysis

Fix Plan

1. Update Transformers to the Latest Version

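Upgrade transformers to a release that contains PR #44576, which disables async loading during on-the-fly quantization (5.3.0 or later is suggested; confirm against the release notes). On an affected version, setting `HF_DEACTIVATE_ASYNC_LOAD=1` disables async loading manually.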

pip install --upgrade transformers
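
If upgrading is not an option, the PR notes name the `HF_DEACTIVATE_ASYNC_LOAD` environment variable as the way to turn async loading off. A minimal sketch follows; `disable_async_loading` is a hypothetical helper, the value "1" is taken from the FP8 results table, and setting the variable before importing transformers is a conservative assumption, since the page does not say when the library reads it.

```python
import os


def disable_async_loading():
    """Request synchronous weight loading via the variable named in PR #44576."""
    os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"


disable_async_loading()

# Assumption: import transformers only after the variable is set, so the
# setting is visible whenever the library chooses to read it.
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
```

The same effect can be had by exporting the variable in the shell that launches the script.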

2. Configure CUDA Memory Management

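Before loading, release any memory that earlier work in the same process left cached. This does not change how the loader reserves memory during quantization, so it complements step 1 rather than replacing it:

import gc

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Release cached blocks left over from earlier work in this process
torch.cuda.set_device(0)  # Set the default CUDA device
gc.collect()              # Drop unreachable Python objects first
torch.cuda.empty_cache()  # Then return cached, unused blocks to the driver

# ... (rest of the code remains the same)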

3. Use device_map to Optimize Memory Usage

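Let device_map="auto" spread the quantized model across both GPUs, and optionally pass max_memory to leave headroom on each card (the caps below are illustrative values for 2 × 32GB GPUs, not tested recommendations):

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across all visible GPUs
    max_memory={0: "30GiB", 1: "30GiB"},  # illustrative per-GPU caps
)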

4. Monitor CUDA Memory Usage

Log CUDA memory usage before and after model loading:

import logging

import torch

logging.basicConfig(level=logging.INFO)

def print_mem(tag):
    torch.cuda.synchronize()
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    logging.info(f"{tag}")
    logging.info(f"  allocated: {allocated:.2f} GB")
    logging.info(f"  reserved : {reserved:.2f} GB")

print_mem("Before loading")
# ... load the model with from_pretrained(...) here
print_mem("After loading")

Verification

  1. Run the reproduction script with the updated code and verify that the CUDA OOM error is resolved.
  2. Monitor CUDA memory usage during model loading and ensure that the reserved memory usage is consistent with the expected value (about 6.70 GB for Qwen2.5-7B-Instruct, as measured on Transformers 4.57.5).
