- In an environment with **2 × RTX 5090 32GB GPUs**, loading the **Qwen3.5-35B-A3B** model with **int4 quantization** using LlamaFactory should not result in an OOM error.
- The CUDA **reserved memory** usage during model loading in **Transformers 5.2.0** should be consistent with that of **Transformers 4.57.5**.

transformers - ✅(Solved) Fix Increased CUDA reserved memory in Transformers 5.x under int4 quantization leads to OOM [1 pull requests, 16 comments, 5 participants]

transformers · 2026-03-02 11:16:43

GitHub stats

huggingface/transformers#44387 • Fetched 2026-04-08 00:28:48

Timeline (top): commented ×16, subscribed ×14, mentioned ×11, closed ×1

Error Message

I encountered a CUDA OOM error when using LlamaFactory to load the Qwen3.5-35B-A3B model with int4 quantization for inference.

In an environment with 2 × RTX 5090 32GB GPUs, loading the Qwen3.5-35B-A3B model with int4 quantization using LlamaFactory should not result in an OOM error.

Fix Action

Fixed

Fixed by PR: Better loading when quantizing on the fly (https://github.com/huggingface/transformers/pull/44576)

PR fix notes

PR #44576: Disable async loading when quantizing on the fly

Repository: huggingface/transformers
Author: SunMarc
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/44576

Description (problem / solution / changelog)

What does this PR do?

Fixes https://github.com/huggingface/transformers/issues/44387.

This PR disables async loading when the model is quantized on the fly. It is actually faster than using a semaphore. If a quantizer happens to quantize faster than the weights are materialized, we might consider adding a semaphore for that.

For future reference:

I tried a semaphore so that at most GLOBAL_WORKERS (4) full-precision tensors sit on the GPU waiting to be quantized. As the main thread quantizes one and releases a permit, one worker loads the next. We can tune the number later if that makes more sense.

This is problematic when weight converters are used, which happens for some models such as mixtral, qwen2_moe, qwen3_vl_moe, lfm2_moe, ernie4_5_moe, solar_open, jamba, and others. For these, we don't use the semaphore and we just warn the user to set `HF_DEACTIVATE_ASYNC_LOAD` to avoid memory issues.
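
The semaphore idea described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the PR: `materialize`, `quantize`, `MAX_IN_FLIGHT` and the list payloads are hypothetical stand-ins for loading a full-precision tensor and quantizing it. Results are consumed in completion order so that every held permit is eventually released.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_IN_FLIGHT = 4  # mirrors the GLOBAL_WORKERS (4) figure quoted above

permits = threading.Semaphore(MAX_IN_FLIGHT)
lock = threading.Lock()
in_flight = 0        # tensors materialized but not yet quantized
peak_in_flight = 0   # highest value in_flight ever reached


def materialize(name):
    """Worker: stand-in for loading one full-precision tensor."""
    global in_flight, peak_in_flight
    permits.acquire()  # block while MAX_IN_FLIGHT tensors await quantization
    with lock:
        in_flight += 1
        peak_in_flight = max(peak_in_flight, in_flight)
    return name, [1.0] * 8  # hypothetical "full-precision" payload


def quantize(tensor):
    """Main thread: stand-in for quantization (here, a cast to int)."""
    global in_flight
    result = [int(x) for x in tensor]
    with lock:
        in_flight -= 1
    permits.release()  # free one slot so a worker can load the next tensor
    return result


names = [f"layer.{i}.weight" for i in range(16)]
quantized = {}
with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
    futures = [pool.submit(materialize, n) for n in names]
    # Completion order (not submission order) avoids waiting on a task
    # that is itself blocked waiting for a permit.
    for fut in as_completed(futures):
        name, tensor = fut.result()
        quantized[name] = quantize(tensor)

print(f"peak full-precision tensors in flight: {peak_in_flight}")
```

The peak never exceeds `MAX_IN_FLIGHT`, which is the property that would keep full-precision copies from piling up on the GPU while they wait to be quantized.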

Results using Qwen/Qwen2.5-7B-Instruct

┌───────────────────────────────┬───────┬───────────┬──────────┐
│         Configuration         │ Time  │ Allocated │ Reserved │
├───────────────────────────────┼───────┼───────────┼──────────┤
│ Baseline (no fix, async ON)   │ 30.5s │ 5.46 GB   │ 13.47 GB │
├───────────────────────────────┼───────┼───────────┼──────────┤
│ Baseline (async OFF)          │ 29.6s │ 5.45 GB   │ 5.63 GB  │
├───────────────────────────────┼───────┼───────────┼──────────┤
│ With semaphore fix (async ON) │ 30.3s │ 5.45 GB   │ 5.94 GB  │
└───────────────────────────────┴───────┴───────────┴──────────┘

Results using Qwen/Qwen3-30B-A3B with FP8

┌─────────┬────────────────┬──────────────────────────────────┐
│         │     Async      │              Sync                │
│         │   (default)    │   (HF_DEACTIVATE_ASYNC_LOAD=1)   │
├─────────┼────────────────┼──────────────────────────────────┤
│ Time    │ 12.7s          │ 15.4s                            │
├─────────┼────────────────┼──────────────────────────────────┤
│ Warning │ Yes            │ No                               │
└─────────┴────────────────┴──────────────────────────────────┘
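
For models where the warning appears, the synchronous path can be selected through the `HF_DEACTIVATE_ASYNC_LOAD` variable named in the PR description, using the value `1` shown in the table. The sketch below is a hedged illustration: setting the variable before `transformers` is imported is an assumption made to be safe, and `load_quantized` is a hypothetical helper, not part of the library.

```python
import os

# Assumption: set the variable before transformers is imported, so it is
# visible whenever the loader reads it.
os.environ["HF_DEACTIVATE_ASYNC_LOAD"] = "1"


def load_quantized(model_name):
    """Hypothetical helper: load a model with 4-bit bitsandbytes quantization."""
    # Imported lazily so the environment variable above is already in place.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )


if __name__ == "__main__":
    # Model id taken from the PR's benchmark; requires a CUDA GPU.
    model = load_quantized("Qwen/Qwen2.5-7B-Instruct")
```

The same effect can be had from the shell by exporting the variable before starting Python.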

Changed files

src/transformers/core_model_loading.py (modified, +16/-5)

System Info

transformers version: 5.2.0
Platform: Linux-6.12.57+deb13-amd64-x86_64-with-glibc2.41
Python version: 3.12.12
Huggingface_hub version: 0.36.2
Safetensors version: 0.7.0
Accelerate version: 1.11.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.10.0+cu128 (CUDA)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: No
Using GPU in script?: No
GPU type: NVIDIA GeForce RTX 5090

Reproduction

Problem Description

I encountered a CUDA OOM error when using LlamaFactory to load the Qwen3.5-35B-A3B model with int4 quantization for inference.

My hardware setup consists of 2 × RTX 5090 32GB GPUs. Under normal circumstances, this configuration should be sufficient to load the model without OOM when using int4 quantization.

To further investigate the issue, I used a minimal reproduction script and observed that under Transformers 5.2.0, the CUDA reserved memory appears to be allocated as if the model were not quantized. This results in excessive reserved memory usage, leading to memory waste and potential OOM, even though the actual allocated memory remains consistent with int4 quantization expectations.

This behavior was not observed in earlier versions, where the model could be loaded successfully under the same hardware conditions.

import torch
import gc
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def print_mem(tag):
    torch.cuda.synchronize()
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag}")
    print(f"  allocated: {allocated:.2f} GB")
    print(f"  reserved : {reserved:.2f} GB")


# Clean up GPU memory
gc.collect()
torch.cuda.empty_cache()

print_mem("Before loading")

model_name = "Qwen2.5-7B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

print_mem("After loading")

Output Result

Transformers 5.2.0

Before loading
  allocated: 0.00 GB
  reserved : 0.00 GB
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:02<00:00, 155.08it/s, Materializing param=model.norm.weight]
After loading
  allocated: 5.46 GB
  reserved : 13.83 GB

Transformers 4.57.5

Before loading
  allocated: 0.00 GB
  reserved : 0.00 GB
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.43s/it]
After loading
  allocated: 5.46 GB
  reserved : 6.70 GB

Expected behavior

In an environment with 2 × RTX 5090 32GB GPUs, loading the Qwen3.5-35B-A3B model with int4 quantization using LlamaFactory should not result in an OOM error.
The CUDA reserved memory usage during model loading in Transformers 5.2.0 should be consistent with that of Transformers 4.57.5.
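
The gap between reserved and allocated memory is the part held by PyTorch's caching allocator without backing a live tensor. A short calculation on the figures reported above makes the regression concrete; the dictionary layout and the `cache_overhead_gb` helper are illustrative names only.

```python
# "After loading" figures reported in the Output Result section (GB).
measurements = {
    "5.2.0": {"allocated": 5.46, "reserved": 13.83},
    "4.57.5": {"allocated": 5.46, "reserved": 6.70},
}


def cache_overhead_gb(m):
    """Reserved-but-unallocated memory, rounded to two decimals."""
    return round(m["reserved"] - m["allocated"], 2)


overhead = {version: cache_overhead_gb(m) for version, m in measurements.items()}
extra = round(overhead["5.2.0"] - overhead["4.57.5"], 2)

for version, gb in overhead.items():
    print(f"Transformers {version}: {gb:.2f} GB reserved but not allocated")
print(f"Extra reserved memory in 5.2.0: {extra:.2f} GB")
```

With identical allocated memory in both versions, the 5.2.0 run holds 8.37 GB of cached memory against 1.24 GB in 4.57.5, a difference of 7.13 GB for a 7B model.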

extent analysis

Fix Plan

1. Update Transformers to the Latest Version

Upgrade transformers to a release that includes the fix from PR #44576 (suggested here as 5.3.0 or later; confirm against the release notes).

pip install --upgrade transformers

2. Configure CUDA Memory Management

Before loading the model, release unreferenced Python objects and cached CUDA blocks so that measurements start from a clean state. This does not change how the loader allocates memory, so treat it as hygiene rather than a fix:

import gc
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

torch.cuda.set_device(0)  # make GPU 0 the default CUDA device
gc.collect()              # drop unreferenced Python objects first
torch.cuda.empty_cache()  # then return cached, unused blocks to the driver

# ... (rest of the code remains the same)

3. Use `device_map` to Optimize Memory Usage

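device_map accepts "auto" or a dictionary that maps module names to devices. To place the whole model on a single GPU, map the root module (the empty string) to that device; with two GPUs, keep device_map="auto" so the layers are split across both:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},  # entire model on GPU 0; use "auto" to span several GPUs
)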

4. Monitor CUDA Memory Usage

Add logging to monitor CUDA memory usage before and after model loading:

import logging
import torch

logging.basicConfig(level=logging.INFO)

def print_mem(tag):
    torch.cuda.synchronize()  # wait for pending GPU work before measuring
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    logging.info(f"{tag}")
    logging.info(f"  allocated: {allocated:.2f} GB")
    logging.info(f"  reserved : {reserved:.2f} GB")

print_mem("Before loading")
# ... load the model here with from_pretrained(...)
print_mem("After loading")

Verification

Run the reproduction script with the updated code and verify that the CUDA OOM error is resolved.
Monitor CUDA memory usage during model loading and ensure that the reserved memory usage is consistent with the expected behavior.
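
One way to turn this check into a pass/fail test is to compare the measured reserved memory with a known-good baseline. The sketch below uses the Qwen/Qwen2.5-7B-Instruct figures from the PR table; the `reserved_matches_baseline` helper and the 10% tolerance are assumptions chosen for illustration, not thresholds from the issue or the PR.

```python
def reserved_matches_baseline(reserved_gb, baseline_gb, rel_tol=0.10):
    """Return True if reserved memory is within rel_tol above the baseline."""
    return reserved_gb <= baseline_gb * (1 + rel_tol)


# Reserved memory (GB) from the PR's Qwen/Qwen2.5-7B-Instruct table.
baseline_async_off = 5.63
no_fix_async_on = 13.47
semaphore_async_on = 5.94

regressed = not reserved_matches_baseline(no_fix_async_on, baseline_async_off)
fixed = reserved_matches_baseline(semaphore_async_on, baseline_async_off)

print(f"no fix, async ON regressed: {regressed}")
print(f"semaphore fix within tolerance: {fixed}")
```

In a real run, the first argument would come from `torch.cuda.memory_reserved() / 1024**3` after loading, and the baseline from a run on Transformers 4.57.5 or with async loading disabled.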


