transformers - 💡(How to fix) Fix Gemma4 31B-IT Multi-GPU inference CUDA OOM [1 comments, 1 participants]

transformers2026-04-03 21:07:41

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

huggingface/transformers#45229•Fetched 2026-04-08 02:43:51

View on GitHub

Comments

Participants

Timeline

Reactions

Author

vaibhavBh-0

Participants

vaibhavBh-0

Timeline (top)

commented ×1labeled ×1subscribed ×1

I updated my transformers module to 5.5.0 from 4.53.0 to try google/gemma-4-31B-it model. I was using meta-llama/Llama-3.3-70B-Instruct for the same set of prompts. The Llama model is able to process the prompt without any problems despite occupying more VRAM than Gemma4. Gemma4 on the other hand crashes giving CUDA OOM error in an isocompute multi-GPU setting of 4 A100s (80GB each)

logs.txt

Error Message

Root Cause

logs.txt

Code Example

model_id = MODEL_PATH_DICT['gemma-4-31B-it']
model_pipe = pipeline('text-generation', model=model_id, dtype=torch.bfloat16, device_map='auto') # 'auto' -> 'balanced' -> 'sequential'

def do_inference_gemma_4(model_pipe: TextGenerationPipeline, prompts_dict : dict):
    model_pipe.generation_config.max_new_tokens = 256 * 10
    model_pipe.generation_config.do_sample = False

    section_outputs = {}

    for section_key, prompts in prompts_dict.items():
        try:
            output = model_pipe(prompts)
            response = output[0]['generated_text'][-1]['content']
            torch.cuda.empty_cache()
            section_outputs[section_key] = response
        except (TypeError, KeyboardInterrupt) as e:
            print(output)
            torch.cuda.empty_cache()
            raise e
        
    torch.cuda.empty_cache()

    return section_outputs

md_notes = [note for note in pt_notes if note.note_type != NoteType.DS]
patient_prompt_dicts = {
       'Hospital Management': summarize_hospital_management_stay_md_notes_no_incorrect_acronyms_sorted(md_notes),
}

# length of single prompt is 26558
#len(tokenizer.apply_chat_template(summarize_hospital_management_stay_md_notes_no_incorrect_acronyms_sorted(md_notes))['input_ids'])

output = do_inference_gemma_4(model_pipe, patient_prompt_dicts)

RAW_BUFFERClick to expand / collapse

System Info

Description

logs.txt

Environment

transformers 5.5.0
PyTorch 2.5.1
Python 3.12.9
Linux
4 A100s (80GB each)

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Steps to reproduce:

Load the model with device_map='auto' for text-generation pipeline.
Prepare a long (less than 128K tokens; gemma 4 32B IT supports 256K tokens) prompt following the chat template format.
Create some prompt_dict prompt_dict = {'dummy section': prompt}

In my use case, I am focusing on summarization of a private dataset which I cannot share.

Code Snippets

model_id = MODEL_PATH_DICT['gemma-4-31B-it']
model_pipe = pipeline('text-generation', model=model_id, dtype=torch.bfloat16, device_map='auto') # 'auto' -> 'balanced' -> 'sequential'

def do_inference_gemma_4(model_pipe: TextGenerationPipeline, prompts_dict : dict):
    model_pipe.generation_config.max_new_tokens = 256 * 10
    model_pipe.generation_config.do_sample = False

    section_outputs = {}

    for section_key, prompts in prompts_dict.items():
        try:
            output = model_pipe(prompts)
            response = output[0]['generated_text'][-1]['content']
            torch.cuda.empty_cache()
            section_outputs[section_key] = response
        except (TypeError, KeyboardInterrupt) as e:
            print(output)
            torch.cuda.empty_cache()
            raise e
        
    torch.cuda.empty_cache()

    return section_outputs

md_notes = [note for note in pt_notes if note.note_type != NoteType.DS]
patient_prompt_dicts = {
       'Hospital Management': summarize_hospital_management_stay_md_notes_no_incorrect_acronyms_sorted(md_notes),
}

# length of single prompt is 26558
#len(tokenizer.apply_chat_template(summarize_hospital_management_stay_md_notes_no_incorrect_acronyms_sorted(md_notes))['input_ids'])

output = do_inference_gemma_4(model_pipe, patient_prompt_dicts)

Crash log

logs.txt

Expected behavior

Expected Behavior

google/gemma-4-31B-it should process the prompt and generate the response without throwing any errors.

Observed Behavior

google/gemma-4-31B-it loads using auto device_map, however, while generating VRAM is occupied for the first GPU leading to OutOfMemoryError . meta-llama/Llama-3.3-70B-Instruct evenly distributes the kv cache across multiple GPUs and does not focus on the first GPU.

extent analysis

TL;DR

The most likely fix for the CUDA OOM error with the google/gemma-4-31B-it model is to adjust the device_map configuration to better distribute the model's memory usage across the multiple GPUs.

Guidance

Verify that the issue persists when using a different device_map configuration, such as 'balanced' or 'sequential', to determine if the problem is specific to the 'auto' setting.
Consider reducing the max_new_tokens parameter in the generation_config to decrease the memory requirements for the model.
Check if the torch.cuda.empty_cache() calls are effective in freeing up GPU memory and consider adding more frequent calls to mitigate memory accumulation.
Investigate if the dtype=torch.bfloat16 setting is contributing to the memory usage issue, potentially by comparing performance with a different data type.

Example

model_pipe = pipeline('text-generation', model=model_id, dtype=torch.bfloat16, device_map='balanced')

Notes

The provided code and logs suggest that the issue is related to the memory distribution across the GPUs, but without more detailed information about the system's memory usage and the model's architecture, it's difficult to provide a more specific solution.

Recommendation

Apply a workaround by adjusting the device_map configuration to 'balanced' or 'sequential' to better distribute the model's memory usage across the multiple GPUs, as this may help mitigate the CUDA OOM error.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#mixed precision #training loop #device allocation #model download #tokenizer error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

transformers - 💡(How to fix) Fix Gemma4 31B-IT Multi-GPU inference CUDA OOM [1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

System Info

Description

Environment

Who can help?

Information

Tasks

Reproduction

Steps to reproduce:

Code Snippets

Crash log

Expected behavior

Expected Behavior

Observed Behavior

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

transformers - 💡(How to fix) Fix Gemma4 31B-IT Multi-GPU inference CUDA OOM [1 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

System Info

Description

Environment

Who can help?

Information

Tasks

Reproduction

Steps to reproduce:

Code Snippets

Crash log

Expected behavior

Expected Behavior

Observed Behavior

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING