transformers: T5 silently uses apex.FusedRMSNorm, which has a memory leak (NVIDIA/apex#1999)

huggingface/transformers#45704 (fetched 2026-04-30 06:18:20)
Root Cause

apex's FusedRMSNorm leaks 2 CUDA tensors per forward call under torch.no_grad() (NVIDIA/apex#1999). Because T5-XXL has 49 layer-norm calls per forward(...), 98 CUDA tensors are leaked per T5EncoderModel.forward(...) in a typical transformers environment where apex is installed.

Fix Action

Workaround

import sys
sys.modules['apex'] = None
sys.modules['apex.normalization'] = None
sys.modules['apex.normalization.fused_layer_norm'] = None
 
# now import transformers / diffusers / ...

This forces T5 to fall back to the native PyTorch implementation of RMSNorm, and the leak disappears completely (verified across thousands of calls).


System Info

  • transformers version: 4.57.6
  • Platform: Linux-6.8.0-100-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 0.36.2
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: 0.18.9
  • PyTorch version (accelerator?): 2.11.0+cu130 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No (single-GPU minimal repro)
  • Using GPU in script?: Yes (single NVIDIA B200, cuda:0)
  • GPU type: NVIDIA B200

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When apex is importable in the current environment, transformers silently uses apex.normalization.FusedRMSNorm (or FusedLayerNorm) as the layer-norm implementation in T5. There is a known memory leak in apex.FusedRMSNorm that I have separately reported here:

NVIDIA/apex#1999 — FusedRMSNorm leaks 2 CUDA tensors per forward call under torch.no_grad()

Because T5-XXL has 49 layer-norm calls per forward(...), this means 98 CUDA tensors are leaked per T5EncoderModel.forward(...) in a typical transformers environment where apex is installed.

This is hard to discover from the user side because:

  1. The apex usage is conditional and silent — there is no log message, no warning, no flag.
  2. The leak is invisible in single-shot inference (it's only ~0.1 GB per call), and only becomes visible after many calls.
  3. The user's pipeline code (e.g. FluxPipeline.encode_prompt(...)) does not mention apex at all.

In our case, this surfaced as an OOM after ~90 batches of FLUX inference on a single B200 (180GB VRAM):

# This loop OOMs after ~90 iterations with batch_size=4 on a 180GB B200,
# because every encode_prompt call leaks ~0.1 GB.
from diffusers import FluxPipeline
import torch
 
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16,
).to("cuda:0")
 
for i in range(1000):
    with torch.no_grad():
        pipe.encode_prompt(
            prompt="test", prompt_2="test", device="cuda:0",
            num_images_per_prompt=1, max_sequence_length=256,
        )
    # gc.collect() and torch.cuda.empty_cache() do NOT free the leaked tensors

We confirmed that the leak originates from apex's FusedRMSNorm by isolating each pipeline stage:

| Component | CUDA tensors leaked per call |
| --- | --- |
| pipe.encode_prompt(...) (uses T5) | +98 |
| pipe.scheduler.set_timesteps(...) | 0 |
| pipe.vae.decode(...) | 0 |
| pipe.transformer(...) | 0 |
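
A per-stage count like the one in the table can be obtained with a small helper that counts live objects before and after repeated calls. The following is a minimal sketch, not the exact script used for this report: the names `count_live`, `leaked_per_call`, and `is_cuda_tensor` are hypothetical, and it assumes the leaked tensors remain visible to `gc.get_objects()`.

```python
import gc
from typing import Callable


def count_live(predicate: Callable[[object], bool]) -> int:
    """Count gc-tracked objects for which predicate(obj) is true."""
    gc.collect()
    count = 0
    for obj in gc.get_objects():
        try:
            if predicate(obj):
                count += 1
        except Exception:
            # Some objects raise on inspection; skip them.
            pass
    return count


def leaked_per_call(fn: Callable[[], object],
                    predicate: Callable[[object], bool],
                    warmup: int = 1,
                    calls: int = 3) -> float:
    """Average number of new matching objects left behind by each call to fn."""
    for _ in range(warmup):
        fn()  # let one-time allocations (caches, buffers) settle first
    before = count_live(predicate)
    for _ in range(calls):
        fn()
    after = count_live(predicate)
    return (after - before) / calls


def is_cuda_tensor(obj: object) -> bool:
    """Predicate for CUDA tensors; torch is imported lazily on first use."""
    import torch
    return torch.is_tensor(obj) and obj.is_cuda
```

With a loaded pipeline, each stage could then be wrapped in a zero-argument function and passed in, for example `leaked_per_call(lambda: pipe.encode_prompt(...), is_cuda_tensor)`, ideally inside `torch.no_grad()` to match the conditions described above.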

We further confirmed it by blocking apex's import before transformers is loaded — see workaround below.

Workaround

import sys
sys.modules['apex'] = None
sys.modules['apex.normalization'] = None
sys.modules['apex.normalization.fused_layer_norm'] = None
 
# now import transformers / diffusers / ...

This forces T5 to fall back to the native PyTorch implementation of RMSNorm, and the leak disappears completely (verified across thousands of calls).
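
The workaround relies on standard Python import behavior: a `None` entry in `sys.modules` makes any later import of that name raise `ModuleNotFoundError`, a subclass of `ImportError`, so a library's `try/except ImportError` probe takes its fallback path. A minimal sketch of the mechanism follows, using the stdlib module `colorsys` as a harmless stand-in for the optional package; the helper names `block_imports` and `optional_backend_available` are hypothetical.

```python
import importlib
import sys


def block_imports(*module_names: str) -> None:
    """Make later imports of these modules fail with ImportError."""
    for name in module_names:
        # A None entry tells the import system to halt the import.
        sys.modules[name] = None


def optional_backend_available(module_name: str) -> bool:
    """Mimic a library's try/except ImportError probe for an optional dependency."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# "colorsys" stands in for the optional package being blocked.
block_imports("colorsys")
print(optional_backend_available("colorsys"))  # prints False
```

The block only affects imports that run after it, which is why it has to come before the first import of transformers or diffusers.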

Expected behavior

I'd like to suggest one of the following (in order of how invasive they are):

  1. An opt-out mechanism — for example, an environment variable like TRANSFORMERS_NO_APEX=1 or USE_APEX_LAYER_NORM=0, that forces T5 (and any other model that conditionally imports apex) to use the native PyTorch path even when apex is importable. This seems like the lowest-friction option: it doesn't change any default behavior, doesn't risk regressions for users who rely on apex's speed, and gives users a documented way to avoid the leak without resorting to sys.modules hacks.
  2. A warning / log — at minimum, log once at model load time which layer-norm implementation is being used, so that users in environments with pre-installed apex (e.g. NVIDIA NGC containers, ML clusters) can at least notice it.
  3. A config flag on the model — e.g. T5Config(use_apex_layernorm=False), for users who construct models programmatically.

I'd be happy to put together a PR for option 1 (env var) if maintainers think this is the right direction. Please let me know.
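
To make options 1 and 2 concrete, here is a rough sketch of what the selection logic might look like. It is illustrative only: `TRANSFORMERS_NO_APEX` is the variable name proposed above and is not an existing transformers setting, and `apex_disabled` and `select_layer_norm` are hypothetical helpers, not transformers APIs.

```python
import logging
import os
from typing import Mapping

logger = logging.getLogger(__name__)

_TRUTHY = {"1", "true", "yes", "on"}


def apex_disabled(environ: Mapping[str, str] = os.environ) -> bool:
    """True if the (proposed, hypothetical) opt-out variable is set."""
    return environ.get("TRANSFORMERS_NO_APEX", "0").strip().lower() in _TRUTHY


def select_layer_norm(native_cls: type,
                      environ: Mapping[str, str] = os.environ) -> type:
    """Pick the layer-norm class and log the choice (options 1 and 2)."""
    if apex_disabled(environ):
        logger.info("TRANSFORMERS_NO_APEX is set; using %s", native_cls.__name__)
        return native_cls
    try:
        from apex.normalization import FusedRMSNorm
    except ImportError:
        logger.info("apex not importable; using %s", native_cls.__name__)
        return native_cls
    logger.warning("Using apex FusedRMSNorm instead of %s", native_cls.__name__)
    return FusedRMSNorm
```

Under this sketch, default behavior is unchanged when the variable is unset, and the choice of implementation is logged whichever path is taken.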

extent analysis

TL;DR

To fix the memory leak issue in T5 models caused by apex.FusedRMSNorm, use the provided workaround to block apex's import before loading transformers.

Guidance

  • The memory leak originates from apex.FusedRMSNorm used in T5 models when apex is importable.
  • To verify the leak, run the provided example code and monitor CUDA tensor memory usage.
  • To mitigate the issue, use the workaround: set sys.modules['apex'] and its normalization submodules to None before importing transformers.
  • Consider implementing an opt-out mechanism, such as an environment variable, to force T5 to use the native PyTorch implementation of RMSNorm.

Notes

  • The issue is specific to environments where apex is installed and importable.
  • The leak is only visible after many calls to T5EncoderModel.forward() and can cause OOM errors.

Recommendation

Apply the workaround by blocking apex's import before loading transformers, as it is a non-invasive and effective solution to mitigate the memory leak issue.
