transformers - ✅(Solved) Fix load_best_model_at_end reloads PEFT adapter weights onto CUDA and can OOM under low remaining GPU memory [1 pull requests, 7 comments, 3 participants]

GitHub stats
huggingface/transformers#44637 (fetched 2026-04-08 00:43:03)
Comments: 7
Participants: 3
Timeline: 14
Reactions: 0
Timeline (top): commented ×7, mentioned ×3, subscribed ×3, cross-referenced ×1

Error Message

Traceback (most recent call last):
  File "<LOCAL_REPRO_ROOT>/oom_repro_large_tail_qwen35_gsm8k.py", line 226, in run_single
    train_result = trainer.train()
  File "<LOCAL_TRANSFORMERS_ROOT>/src/transformers/trainer.py", line 1424, in train
    return inner_training_loop(
  File "<LOCAL_TRANSFORMERS_ROOT>/src/transformers/trainer.py", line 1522, in _inner_training_loop
    return self._finalize_training(trial, num_train_samples, start_time)
  File "<LOCAL_TRANSFORMERS_ROOT>/src/transformers/trainer.py", line 1841, in _finalize_training
    self._load_best_model()
  File "<LOCAL_REPRO_ROOT>/oom_repro_large_tail_qwen35_gsm8k.py", line 142, in _load_best_model
    return super()._load_best_model()
  File "<LOCAL_REPRO_ROOT>/mini_repro_qwen35_gsm8k.py", line 259, in _load_best_model
    out = super()._load_best_model()
  File "<LOCAL_TRANSFORMERS_ROOT>/src/transformers/trainer.py", line 3449, in _load_best_model
    model.load_adapter(self.state.best_model_checkpoint, active_adapter)
  File "<LOCAL_REPRO_ROOT>/mini_repro_qwen35_gsm8k.py", line 185, in wrapped_load_adapter
    return original_load_adapter(self, model_id, adapter_name, *args, **kwargs)
  File "<LOCAL_VENV_SITE_PACKAGES>/peft/peft_model.py", line 1362, in load_adapter
    adapters_weights = load_peft_weights(
  File "<LOCAL_REPRO_ROOT>/mini_repro_qwen35_gsm8k.py", line 200, in wrapped_load_peft_weights
    out = original_load_peft_weights(model_id, device=device, **kwargs)
  File "<LOCAL_VENV_SITE_PACKAGES>/peft/utils/save_and_load.py", line 693, in load_peft_weights
    adapters_weights = safe_load_file(filename, device=device)
  File "<LOCAL_VENV_SITE_PACKAGES>/safetensors/torch.py", line 338, in load_file
    result[k] = f.get_tensor(k)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 47.40 GiB of which 17.19 MiB is free. Including non-PyTorch memory, this process has 47.38 GiB memory in use. Of the allocated memory 46.96 GiB is allocated by PyTorch, and 93.13 MiB is reserved by PyTorch but unallocated.

Root Cause

This is also the key reason I consider this a bug rather than just expected memory pressure: under the same settings, training without load_best_model_at_end is fine, but enabling load_best_model_at_end introduces a late OOM because adapter weights are reloaded onto CUDA at the end.

Fix Action

Fixed

PR fix notes

PR #44660: Fix: avoid late CUDA OOM in load_best_model_at_end with PEFT models

Description (problem / solution / changelog)

What does this PR do?

Fixes #44637

This PR makes the PEFT load_best_model_at_end path in Trainer use a CPU-first adapter reload path during best-model loading.

Previously, when training a PEFT model, Trainer could reload the best adapter through a path that materialized adapter weights on CUDA during the final best-model load. Under low remaining GPU memory, this could trigger a late OOM even though the training loop had already completed.

To be specific:

The OOM happens because PeftModel.load_adapter() does not load weights directly into the existing adapter parameters in place. Instead, it first calls load_peft_weights(), which materializes a full temporary adapter state_dict on the target device, and only then passes that state_dict into set_peft_model_state_dict() / model.load_state_dict(...) to copy the values into the actual model parameters.

When torch_device is not specified, the current PEFT path infers cuda, so the checkpoint tensors are first loaded as a separate set of CUDA tensors. Under low remaining GPU memory, this extra device-side materialization can OOM before the weights are fully copied into the model, even though training itself has already finished.

A CPU-first load path is more memory-safe here: load the adapter checkpoint onto CPU first, then copy the weights into the model parameters. That avoids creating a full temporary CUDA state_dict at the most memory-constrained point of load_best_model_at_end.
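The difference between the two load paths can be illustrated with a toy accounting model. This is only a sketch and not PEFT code: the function names are hypothetical, the 512 MiB adapter size is an assumed figure, and treating the CPU-first path as needing zero extra GPU memory is a simplification. The 17.19 MiB free figure comes from the OOM message in the issue.

```python
# Toy accounting model (hypothetical helpers, not PEFT code).
# All sizes are in MiB.

def peak_extra_gpu_mib(adapter_mib: float, load_device: str) -> float:
    """Extra GPU memory needed on top of what the model already holds."""
    if load_device == "cuda":
        # A temporary state_dict is materialized on the GPU before its
        # values are copied into the existing adapter parameters.
        return adapter_mib
    # CPU-first: the temporary state_dict lives in host memory and is
    # copied into parameters that already exist on the GPU.
    # Simplifying assumption: no additional GPU allocation is needed.
    return 0.0


def reload_fits(free_gpu_mib: float, adapter_mib: float, load_device: str) -> bool:
    """Return True if the reload fits into the remaining GPU memory."""
    return peak_extra_gpu_mib(adapter_mib, load_device) <= free_gpu_mib


FREE_MIB = 17.19      # free GPU memory reported in the OOM message
ADAPTER_MIB = 512.0   # assumed adapter size, for illustration only

print("cuda-first fits:", reload_fits(FREE_MIB, ADAPTER_MIB, "cuda"))
print("cpu-first fits: ", reload_fits(FREE_MIB, ADAPTER_MIB, "cpu"))
```

In the real run the failure already occurs on a single 32.00 MiB tensor, so the OOM does not depend on the total adapter size assumed here.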

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc @BenjaminBossan

Changed files

  • src/transformers/trainer.py (modified, +5/-1)


System Info

  • transformers version: local current checkout (5.3.0.dev0)
  • Python: 3.12.12
  • PyTorch: 2.10.0+cu128
  • CUDA available: True
  • CUDA device count: 8
  • torchvision: 0.25.0+cu128
  • Pillow: 12.1.1
  • PEFT: 0.18.1

I can also provide the full transformers env output if needed.

Who can help?

@SunMarc @BenjaminBossan

Information

The problem arises when using:

  • TrainingArguments(load_best_model_at_end=True)
  • a PEFT / LoRA model
  • the built-in best checkpoint reload path at the end of training
  • a situation where very little GPU memory remains at that point

This is not based on the original private training pipeline. I reduced it to a minimal local repro using:

  • local transformers.Trainer
  • local peft
  • local Qwen3.5-0.8B
  • local GSM8K parquet files

Keeping GPU memory usage relatively high during training is a common and intentional setup when trying to maximize utilization.

Under the same training setup, disabling load_best_model_at_end does not cause a failure, while enabling it can trigger a late OOM during the best-model reload stage that should ideally not happen.

Tasks

This is not from an official example script. It is a minimal self-contained repro intended to isolate the bug.

Reproduction

I reduced this to a minimal repro in a separate folder that does not depend on the original training pipeline.

The key idea is:

  1. train a PEFT / LoRA model with load_best_model_at_end=True
  2. keep tail GPU memory very tight near the end of training
  3. let the built-in best-model reload happen
  4. observe that PEFT adapter loading goes directly to CUDA and can OOM at that stage

The one-shot repro script runs two cases:

  1. default built-in best-model reload path
  2. CPU-safe control run

The default repro parameters are:

  • tail_margin_mb=32
  • allocator_slack_mb=300
  • lora_r=2048
  • lora_alpha=4096
  • train_samples=2
  • eval_samples=2
  • max_steps=1
  • eval_steps=1
  • max_length=128
  • dtype=fp16

Command:

bash run_oom_repro_latest_transformers.sh

The default path runs:

python oom_repro_large_tail_qwen35_gsm8k.py \
  --single-run \
  --tail-margin-mb 32 \
  --allocator-slack-mb 300 \
  --best-load-device default \
  --lora-r 2048 \
  --lora-alpha 4096 \
  --model-path <LOCAL_MODEL_PATH> \
  --gsm8k-root <LOCAL_GSM8K_ROOT> \
  --output-root <LOCAL_OUTPUT_ROOT>/oom_runs_default \
  --train-samples 2 \
  --eval-samples 2 \
  --max-steps 1 \
  --eval-steps 1 \
  --max-length 128 \
  --dtype fp16

The control run changes only the best-load device to CPU (and writes to a separate output directory):

python oom_repro_large_tail_qwen35_gsm8k.py \
  --single-run \
  --tail-margin-mb 32 \
  --allocator-slack-mb 300 \
  --best-load-device cpu \
  --lora-r 2048 \
  --lora-alpha 4096 \
  --model-path <LOCAL_MODEL_PATH> \
  --gsm8k-root <LOCAL_GSM8K_ROOT> \
  --output-root <LOCAL_OUTPUT_ROOT>/oom_runs_cpu \
  --train-samples 2 \
  --eval-samples 2 \
  --max-steps 1 \
  --eval-steps 1 \
  --max-length 128 \
  --dtype fp16

Observed result:

  • default path: status = "cuda_oom"
  • CPU-safe control path: status = "ok"

The important part is that training itself finishes. The failure happens during the built-in best-model reload stage.

This is also the key reason I consider this a bug rather than just expected memory pressure: under the same settings, training without load_best_model_at_end is fine, but enabling load_best_model_at_end introduces a late OOM because adapter weights are reloaded onto CUDA at the end.

From the recorded events in the default-path repro:

  • trainer._load_best_model.enter
    • is_in_train = true
    • optimizer_is_none = false
    • lr_scheduler_is_none = false
    • callback_optimizer_is_none = false
  • peft.load_adapter.enter
    • torch_device = None
  • peft.load_peft_weights.enter
    • device = "cuda"

So under low remaining GPU memory, the PEFT best-load path observed in this repro is loading adapter weights onto CUDA during load_best_model_at_end, and that can trigger a late OOM.

Key event excerpts from the failing run:

{"event": "trainer._load_best_model.enter", "is_in_train": true, "optimizer_is_none": false, "lr_scheduler_is_none": false, "callback_optimizer_is_none": false, "callback_lr_scheduler_is_none": false}

{"event": "peft.load_adapter.enter", "torch_device": "None"}

{"event": "peft.load_peft_weights.enter", "device": "cuda"}
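The event excerpts above can be checked mechanically. The following is a minimal sketch that parses the three JSON lines exactly as recorded and reports which device load_peft_weights resolved to; the helper name resolved_load_device is hypothetical and not part of the repro scripts.

```python
import json

# The three event lines recorded in the failing default-path run.
EVENT_LINES = [
    '{"event": "trainer._load_best_model.enter", "is_in_train": true, '
    '"optimizer_is_none": false, "lr_scheduler_is_none": false, '
    '"callback_optimizer_is_none": false, "callback_lr_scheduler_is_none": false}',
    '{"event": "peft.load_adapter.enter", "torch_device": "None"}',
    '{"event": "peft.load_peft_weights.enter", "device": "cuda"}',
]


def resolved_load_device(lines):
    """Return the device logged by the load_peft_weights event, or None."""
    for line in lines:
        event = json.loads(line)
        if event.get("event") == "peft.load_peft_weights.enter":
            return event.get("device")
    return None


print(resolved_load_device(EVENT_LINES))
```

Note that torch_device is logged as the string "None": the caller passed no device, and the device was chosen further down in the PEFT load path.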

The actual OOM in the failing repro was:

CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 47.40 GiB of which 17.19 MiB is free.
Including non-PyTorch memory, this process has 47.38 GiB memory in use.
Of the allocated memory 46.96 GiB is allocated by PyTorch, and 93.13 MiB is reserved by PyTorch but unallocated.
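To make the size of the gap explicit, the two figures in the message can be compared directly. This is a small sketch using only the standard library; oom_shortfall_mib is a hypothetical helper, and the regular expressions assume the message wording shown above.

```python
import re

OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 32.00 MiB. "
    "GPU 0 has a total capacity of 47.40 GiB of which 17.19 MiB is free."
)

# Conversion factors to MiB for the units that appear in the message.
_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def _to_mib(value: str, unit: str) -> float:
    return float(value) * _UNIT_TO_MIB[unit]


def oom_shortfall_mib(message: str) -> float:
    """Requested allocation minus reported free memory, in MiB."""
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", message)
    free = re.search(r"of which ([\d.]+) (MiB|GiB) is free", message)
    if requested is None or free is None:
        raise ValueError("message does not match the expected OOM format")
    return _to_mib(*requested.groups()) - _to_mib(*free.groups())


print(f"shortfall: {oom_shortfall_mib(OOM_MESSAGE):.2f} MiB")
```

The request exceeds the reported free memory by roughly 15 MiB, which is consistent with the failure occurring on the first adapter tensor loaded onto the GPU.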

Full traceback from the failing default-path run:

Traceback (most recent call last):
  File "<LOCAL_REPRO_ROOT>/oom_repro_large_tail_qwen35_gsm8k.py", line 226, in run_single
    train_result = trainer.train()
                   ^^^^^^^^^^^^^^^
  File "<LOCAL_TRANSFORMERS_ROOT>/src/transformers/trainer.py", line 1424, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "<LOCAL_TRANSFORMERS_ROOT>/src/transformers/trainer.py", line 1522, in _inner_training_loop
    return self._finalize_training(trial, num_train_samples, start_time)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<LOCAL_TRANSFORMERS_ROOT>/src/transformers/trainer.py", line 1841, in _finalize_training
    self._load_best_model()
  File "<LOCAL_REPRO_ROOT>/oom_repro_large_tail_qwen35_gsm8k.py", line 142, in _load_best_model
    return super()._load_best_model()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<LOCAL_REPRO_ROOT>/mini_repro_qwen35_gsm8k.py", line 259, in _load_best_model
    out = super()._load_best_model()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<LOCAL_TRANSFORMERS_ROOT>/src/transformers/trainer.py", line 3449, in _load_best_model
    model.load_adapter(self.state.best_model_checkpoint, active_adapter)
  File "<LOCAL_REPRO_ROOT>/mini_repro_qwen35_gsm8k.py", line 185, in wrapped_load_adapter
    return original_load_adapter(self, model_id, adapter_name, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<LOCAL_VENV_SITE_PACKAGES>/peft/peft_model.py", line 1362, in load_adapter
    adapters_weights = load_peft_weights(
                       ^^^^^^^^^^^^^^^^^^
  File "<LOCAL_REPRO_ROOT>/mini_repro_qwen35_gsm8k.py", line 200, in wrapped_load_peft_weights
    out = original_load_peft_weights(model_id, device=device, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<LOCAL_VENV_SITE_PACKAGES>/peft/utils/save_and_load.py", line 693, in load_peft_weights
    adapters_weights = safe_load_file(filename, device=device)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<LOCAL_VENV_SITE_PACKAGES>/safetensors/torch.py", line 338, in load_file
    result[k] = f.get_tensor(k)
                ^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 47.40 GiB of which 17.19 MiB is free. Including non-PyTorch memory, this process has 47.38 GiB memory in use. Of the allocated memory 46.96 GiB is allocated by PyTorch, and 93.13 MiB is reserved by PyTorch but unallocated.

For comparison, under the same memory pressure, using a CPU-safe reload path succeeds:

model.load_adapter(
    ...
    torch_device="cpu",
    ...
)

In the successful control run:

  • peft.load_peft_weights.enter shows device = "cpu"
  • the run completes successfully
  • status = "ok"
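On versions without the fix, one way to get a similar CPU-first reload is to wrap load_adapter so that torch_device defaults to "cpu", much like the wrapped_load_adapter frames visible in the traceback. The sketch below is an assumption-laden illustration: force_cpu_adapter_load and _FakePeftModel are hypothetical names, the fake class only stands in for a PEFT model so the snippet runs without PEFT installed, and whether an instance-level patch is picked up depends on which object Trainer calls load_adapter on.

```python
import functools


def force_cpu_adapter_load(model):
    """Wrap model.load_adapter so torch_device defaults to "cpu".

    An explicitly passed torch_device keyword is left untouched.
    """
    original = model.load_adapter

    @functools.wraps(original)
    def load_adapter_cpu_first(model_id, adapter_name, *args, **kwargs):
        kwargs.setdefault("torch_device", "cpu")
        return original(model_id, adapter_name, *args, **kwargs)

    model.load_adapter = load_adapter_cpu_first
    return model


class _FakePeftModel:
    """Stand-in for a PEFT model, used only to demonstrate the wrapper."""

    def load_adapter(self, model_id, adapter_name, torch_device=None, **kwargs):
        return {
            "model_id": model_id,
            "adapter_name": adapter_name,
            "torch_device": torch_device,
        }


fake = force_cpu_adapter_load(_FakePeftModel())
result = fake.load_adapter("checkpoint-1", "default")
print(result["torch_device"])
```

The wrapper only sets a default for the keyword argument, so it assumes callers pass at most model_id and adapter_name positionally, as the Trainer call in the traceback does. It does not set autocast_adapter_dtype=False, which the proposed fix also passes.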

Expected behavior

When load_best_model_at_end=True is used with a PEFT model, the best-model reload stage should not fail solely because reloading the adapter weights onto the GPU requires additional GPU memory.

Possible fix

One possible fix would be to make the PEFT load_best_model_at_end path follow a more memory-safe loading pattern, similar to the non-PEFT model case, by loading the best model state on CPU first, for example:

model.load_adapter(
    self.state.best_model_checkpoint,
    active_adapter,
    torch_device="cpu",
    autocast_adapter_dtype=False,
)

oom_repro_large_tail_qwen35_gsm8k.py run_oom_repro_latest_transformers.sh

extent analysis

Fix Plan

To fix the issue, the PEFT branch of Trainer._load_best_model (the method that runs when load_best_model_at_end=True) should load the best adapter state on CPU first, similar to the non-PEFT model case.

Here are the steps:

  • Modify the PEFT branch of Trainer._load_best_model; load_best_model_at_end itself is a TrainingArguments flag, not a method.
  • Leave PeftModel.load_adapter unchanged: it already accepts a torch_device parameter, as the torch_device="cpu" control run in the issue shows.
  • Pass torch_device="cpu" to load_adapter when reloading the best checkpoint.

Example code:

class Trainer:
    # ...

    def _load_best_model(self):
        # ...
        model.load_adapter(
            self.state.best_model_checkpoint,
            active_adapter,
            torch_device="cpu",
            autocast_adapter_dtype=False,
        )
        # ...

Alternatively, load_peft_weights in the peft library could default to loading weights on CPU. This is only a sketch of that idea; PR #44660 changed only src/transformers/trainer.py:

def load_peft_weights(model_id, device="cpu", **kwargs):
    # ...
    adapters_weights = safe_load_file(filename, device=device)
    # ...

Verification

To verify that the fix worked, you can run the same reproduction script with the modified code and check that the status is "ok" and there are no CUDA out-of-memory errors.

Extra Tips

  • Make sure to test the modified code with different models and datasets to ensure that it works as expected.
  • Consider adding a configuration option to allow users to choose whether to load the best model on CPU or GPU.
  • If you're using a custom Trainer class, make sure to update it accordingly.
