transformers: T5 silently uses apex.FusedRMSNorm, which has a memory leak (NVIDIA/apex#1999)

Fix Action

Workaround

import sys
sys.modules['apex'] = None
sys.modules['apex.normalization'] = None
sys.modules['apex.normalization.fused_layer_norm'] = None
 
# now import transformers / diffusers / ...

This forces T5 to fall back to the native PyTorch implementation of RMSNorm, and the leak disappears completely (verified across thousands of calls).

System Info

transformers version: 4.57.6
Platform: Linux-6.8.0-100-generic-x86_64-with-glibc2.39
Python version: 3.12.3
Huggingface_hub version: 0.36.2
Safetensors version: 0.7.0
Accelerate version: 1.13.0
Accelerate config: not found
DeepSpeed version: 0.18.9
PyTorch version (accelerator?): 2.11.0+cu130 (CUDA)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: No (single-GPU minimal repro)
Using GPU in script?: Yes (single NVIDIA B200, cuda:0)
GPU type: NVIDIA B200

Who can help?

@ArthurZucker @Cyrilvallez

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

When apex is importable in the current environment, transformers silently uses apex.normalization.FusedRMSNorm (or FusedLayerNorm) as the layer-norm implementation in T5. There is a known memory leak in apex.FusedRMSNorm that I have separately reported here:

NVIDIA/apex#1999 — FusedRMSNorm leaks 2 CUDA tensors per forward call under torch.no_grad()

Because the T5-XXL encoder makes 49 layer-norm calls per forward(...) (24 blocks with 2 layer norms each, plus the final layer norm), and each call leaks 2 tensors, 98 CUDA tensors are leaked per T5EncoderModel.forward(...) in a typical transformers environment where apex is installed.

This is hard to discover from the user side because:

The apex usage is conditional and silent — there is no log message, no warning, no flag.
The leak is invisible in single-shot inference (it's only ~0.1 GB per call), and only becomes visible after many calls.
The user's pipeline code (e.g. FluxPipeline.encode_prompt(...)) does not mention apex at all.

In our case, this surfaced as an OOM after ~90 batches of FLUX inference on a single B200 (180GB VRAM):

# This loop OOMs after ~90 iterations with batch_size=4 on a 180GB B200,
# because every encode_prompt call leaks ~0.1 GB.
from diffusers import FluxPipeline
import torch
 
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16,
).to("cuda:0")
 
for i in range(1000):
    with torch.no_grad():
        pipe.encode_prompt(
            prompt="test", prompt_2="test", device="cuda:0",
            num_images_per_prompt=1, max_sequence_length=256,
        )
    # gc.collect() and torch.cuda.empty_cache() do NOT free the leaked tensors

We confirmed that the leak originates from apex's FusedRMSNorm by isolating each pipeline stage:

| Component | CUDA tensors leaked per call |
| --- | --- |
| `pipe.encode_prompt(...)` (uses T5) | +98 |
| `pipe.scheduler.set_timesteps(...)` | 0 |
| `pipe.vae.decode(...)` | 0 |
| `pipe.transformer(...)` | 0 |
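
For anyone who wants to gather per-call counts like the ones in the table, here is a minimal sketch of one common approach: count matching live objects through Python's garbage collector before and after repeated calls. The helper names `count_live` and `leaked_per_call` are made up for this example, and this is not necessarily how the numbers above were measured. The demo uses a plain Python class so that it runs without a GPU.

```python
import gc


def count_live(predicate):
    """Count GC-tracked objects for which predicate(obj) is true."""
    gc.collect()
    total = 0
    for obj in gc.get_objects():
        try:
            if predicate(obj):
                total += 1
        except Exception:
            # Some objects raise on attribute access; skip them.
            continue
    return total


def leaked_per_call(fn, predicate, warmup=1, calls=5):
    """Average growth in matching live objects per call of fn()."""
    for _ in range(warmup):
        fn()  # let one-time allocations (caches, lazy init) settle first
    before = count_live(predicate)
    for _ in range(calls):
        fn()
    after = count_live(predicate)
    return (after - before) / calls


# Stand-in demo: Blob plays the role of a CUDA tensor.
class Blob:
    pass


_kept = []


def leaky():
    _kept.extend([Blob(), Blob()])  # keeps 2 objects alive per call


def clean():
    Blob()  # dropped immediately


leaky_rate = leaked_per_call(leaky, lambda o: isinstance(o, Blob))
clean_rate = leaked_per_call(clean, lambda o: isinstance(o, Blob))
print(leaky_rate, clean_rate)
```

With PyTorch, the predicate would be something like `lambda o: torch.is_tensor(o) and o.is_cuda`, and `fn` would wrap the stage under test (for example `pipe.encode_prompt(...)` inside `torch.no_grad()`). This only sees tensors that have a Python-side object, so it is worth cross-checking against `torch.cuda.memory_allocated()`.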

We further confirmed it by blocking apex's import before transformers is loaded — see workaround below.

Workaround

import sys
sys.modules['apex'] = None
sys.modules['apex.normalization'] = None
sys.modules['apex.normalization.fused_layer_norm'] = None
 
# now import transformers / diffusers / ...

This forces T5 to fall back to the native PyTorch implementation of RMSNorm, and the leak disappears completely (verified across thousands of calls).
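
Why this works: a `None` entry in `sys.modules` makes any later import of that name raise `ModuleNotFoundError` (a subclass of `ImportError`), so a conditional `try: import ... except ImportError:` takes its fallback branch. The sketch below shows the mechanism in isolation. `fake_apex` is a made-up module name standing in for apex so the demo runs anywhere, and `pick_norm_backend` only mimics a conditional import; it is not the transformers code.

```python
import importlib
import sys


def block_modules(*names):
    """Make later imports of the given module names fail with ImportError."""
    for name in names:
        # A None entry tells the import system to raise ModuleNotFoundError.
        sys.modules[name] = None


def pick_norm_backend(module_name):
    """Mimic a conditional import: prefer the optional module, else fall back."""
    try:
        importlib.import_module(module_name)
        return "fused"
    except ImportError:
        return "native"


# "fake_apex" is a hypothetical stand-in for apex.
block_modules("fake_apex", "fake_apex.normalization")
backend = pick_norm_backend("fake_apex.normalization")
print(backend)
```

The blocking has to happen before the first import of the library that does the conditional import, because that choice is usually made once at module import time.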

Expected behavior

I'd like to suggest one of the following (in order of how invasive they are):

1. An opt-out mechanism: for example, an environment variable like TRANSFORMERS_NO_APEX=1 or USE_APEX_LAYER_NORM=0, that forces T5 (and any other model that conditionally imports apex) to use the native PyTorch path even when apex is importable. This seems like the lowest-friction option: it doesn't change any default behavior, doesn't risk regressions for users who rely on apex's speed, and gives users a documented way to avoid the leak without resorting to sys.modules hacks.
2. A warning / log: at minimum, log once at model load time which layer-norm implementation is being used, so that users in environments with pre-installed apex (e.g. NVIDIA NGC containers, ML clusters) can at least notice it.
3. A config flag on the model: e.g. T5Config(use_apex_layernorm=False), for users who construct models programmatically.

I'd be happy to put together a PR for option 1 (env var) if maintainers think this is the right direction. Please let me know.
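
To make the env-var proposal more concrete, here is a rough sketch of what the selection logic could look like, combined with the logging idea. Everything in it is an assumption for illustration: `TRANSFORMERS_NO_APEX` is only the name proposed in this issue, not an existing setting, and `apex_opted_out` / `choose_norm_backend` are hypothetical helpers rather than transformers functions.

```python
import importlib
import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical variable name, taken from the proposal in this issue.
NO_APEX_ENV = "TRANSFORMERS_NO_APEX"
_TRUTHY = {"1", "true", "yes", "on"}


def apex_opted_out(environ=None):
    """Return True if the user asked to skip apex via the env var."""
    environ = os.environ if environ is None else environ
    return environ.get(NO_APEX_ENV, "").strip().lower() in _TRUTHY


def choose_norm_backend(environ=None):
    """Return "apex" or "native" and log which one was picked."""
    if apex_opted_out(environ):
        logger.warning("%s is set; using the native RMSNorm path", NO_APEX_ENV)
        return "native"
    try:
        importlib.import_module("apex.normalization")
    except Exception:
        # Covers both a missing apex and an apex build that fails to load.
        logger.warning("apex not usable; using the native RMSNorm path")
        return "native"
    logger.warning("apex found; using apex's fused RMSNorm")
    return "apex"


backend = choose_norm_backend({NO_APEX_ENV: "1"})
print(backend)
```

Calling something like this once at module import time would keep the default behavior unchanged for users who want apex, while giving everyone else a documented switch and a visible log line.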

extent analysis

TL;DR

To fix the memory leak issue in T5 models caused by apex.FusedRMSNorm, use the provided workaround to block apex's import before loading transformers.

Guidance

The memory leak originates from apex.FusedRMSNorm used in T5 models when apex is importable.
To verify the leak, run the provided example code and monitor CUDA tensor memory usage.
To mitigate the issue, use the workaround: sys.modules['apex'] = None before importing transformers.
Consider implementing an opt-out mechanism, such as an environment variable, to force T5 to use the native PyTorch implementation of RMSNorm.

Notes

The issue is specific to environments where apex is installed and importable.
The leak is only visible after many calls to T5EncoderModel.forward() and can cause OOM errors.

Recommendation

Apply the workaround by blocking apex's import before loading transformers, as it is a non-invasive and effective solution to mitigate the memory leak issue.
