transformers - 💡(How to fix) Fix Gemma4 31B-IT Multi-GPU inference CUDA OOM [1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
huggingface/transformers#45229Fetched 2026-04-08 02:43:51
View on GitHub
Comments
1
Participants
1
Timeline
3
Reactions
0
Participants
Timeline (top)
commented ×1labeled ×1subscribed ×1

I updated my transformers module to 5.5.0 from 4.53.0 to try google/gemma-4-31B-it model. I was using meta-llama/Llama-3.3-70B-Instruct for the same set of prompts. The Llama model is able to process the prompt without any problems despite occupying more VRAM than Gemma4. Gemma4 on the other hand crashes giving CUDA OOM error in an isocompute multi-GPU setting of 4 A100s (80GB each)

logs.txt

Error Message

I updated my transformers module to 5.5.0 from 4.53.0 to try google/gemma-4-31B-it model. I was using meta-llama/Llama-3.3-70B-Instruct for the same set of prompts. The Llama model is able to process the prompt without any problems despite occupying more VRAM than Gemma4. Gemma4 on the other hand crashes giving CUDA OOM error in an isocompute multi-GPU setting of 4 A100s (80GB each)

Root Cause

I updated my transformers module to 5.5.0 from 4.53.0 to try google/gemma-4-31B-it model. I was using meta-llama/Llama-3.3-70B-Instruct for the same set of prompts. The Llama model is able to process the prompt without any problems despite occupying more VRAM than Gemma4. Gemma4 on the other hand crashes giving CUDA OOM error in an isocompute multi-GPU setting of 4 A100s (80GB each)

logs.txt

Code Example

model_id = MODEL_PATH_DICT['gemma-4-31B-it']
model_pipe = pipeline('text-generation', model=model_id, dtype=torch.bfloat16, device_map='auto') # 'auto' -> 'balanced' -> 'sequential'

def do_inference_gemma_4(model_pipe: TextGenerationPipeline, prompts_dict : dict):
    model_pipe.generation_config.max_new_tokens = 256 * 10
    model_pipe.generation_config.do_sample = False

    section_outputs = {}

    for section_key, prompts in prompts_dict.items():
        try:
            output = model_pipe(prompts)
            response = output[0]['generated_text'][-1]['content']
            torch.cuda.empty_cache()
            section_outputs[section_key] = response
        except (TypeError, KeyboardInterrupt) as e:
            print(output)
            torch.cuda.empty_cache()
            raise e
        
    torch.cuda.empty_cache()

    return section_outputs

md_notes = [note for note in pt_notes if note.note_type != NoteType.DS]
patient_prompt_dicts = {
       'Hospital Management': summarize_hospital_management_stay_md_notes_no_incorrect_acronyms_sorted(md_notes),
}

# length of single prompt is 26558
#len(tokenizer.apply_chat_template(summarize_hospital_management_stay_md_notes_no_incorrect_acronyms_sorted(md_notes))['input_ids'])

output = do_inference_gemma_4(model_pipe, patient_prompt_dicts)
RAW_BUFFERClick to expand / collapse

System Info

Description

I updated my transformers module to 5.5.0 from 4.53.0 to try google/gemma-4-31B-it model. I was using meta-llama/Llama-3.3-70B-Instruct for the same set of prompts. The Llama model is able to process the prompt without any problems despite occupying more VRAM than Gemma4. Gemma4 on the other hand crashes giving CUDA OOM error in an isocompute multi-GPU setting of 4 A100s (80GB each)

logs.txt

Environment

  • transformers 5.5.0
  • PyTorch 2.5.1
  • Python 3.12.9
  • Linux
  • 4 A100s (80GB each)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce:

  1. Load the model with device_map='auto' for text-generation pipeline.
  2. Prepare a long (less than 128K tokens; gemma 4 32B IT supports 256K tokens) prompt following the chat template format.
  3. Create some prompt_dict prompt_dict = {'dummy section': prompt}

In my use case, I am focusing on summarization of a private dataset which I cannot share.

Code Snippets

model_id = MODEL_PATH_DICT['gemma-4-31B-it']
model_pipe = pipeline('text-generation', model=model_id, dtype=torch.bfloat16, device_map='auto') # 'auto' -> 'balanced' -> 'sequential'

def do_inference_gemma_4(model_pipe: TextGenerationPipeline, prompts_dict : dict):
    model_pipe.generation_config.max_new_tokens = 256 * 10
    model_pipe.generation_config.do_sample = False

    section_outputs = {}

    for section_key, prompts in prompts_dict.items():
        try:
            output = model_pipe(prompts)
            response = output[0]['generated_text'][-1]['content']
            torch.cuda.empty_cache()
            section_outputs[section_key] = response
        except (TypeError, KeyboardInterrupt) as e:
            print(output)
            torch.cuda.empty_cache()
            raise e
        
    torch.cuda.empty_cache()

    return section_outputs

md_notes = [note for note in pt_notes if note.note_type != NoteType.DS]
patient_prompt_dicts = {
       'Hospital Management': summarize_hospital_management_stay_md_notes_no_incorrect_acronyms_sorted(md_notes),
}

# length of single prompt is 26558
#len(tokenizer.apply_chat_template(summarize_hospital_management_stay_md_notes_no_incorrect_acronyms_sorted(md_notes))['input_ids'])

output = do_inference_gemma_4(model_pipe, patient_prompt_dicts)

Crash log

logs.txt

Expected behavior

Expected Behavior

google/gemma-4-31B-it should process the prompt and generate the response without throwing any errors.

Observed Behavior

google/gemma-4-31B-it loads using auto device_map, however, while generating VRAM is occupied for the first GPU leading to OutOfMemoryError . meta-llama/Llama-3.3-70B-Instruct evenly distributes the kv cache across multiple GPUs and does not focus on the first GPU.

extent analysis

TL;DR

The most likely fix for the CUDA OOM error with the google/gemma-4-31B-it model is to adjust the device_map configuration to better distribute the model's memory usage across the multiple GPUs.

Guidance

  • Verify that the issue persists when using a different device_map configuration, such as 'balanced' or 'sequential', to determine if the problem is specific to the 'auto' setting.
  • Consider reducing the max_new_tokens parameter in the generation_config to decrease the memory requirements for the model.
  • Check if the torch.cuda.empty_cache() calls are effective in freeing up GPU memory and consider adding more frequent calls to mitigate memory accumulation.
  • Investigate if the dtype=torch.bfloat16 setting is contributing to the memory usage issue, potentially by comparing performance with a different data type.

Example

model_pipe = pipeline('text-generation', model=model_id, dtype=torch.bfloat16, device_map='balanced')

Notes

The provided code and logs suggest that the issue is related to the memory distribution across the GPUs, but without more detailed information about the system's memory usage and the model's architecture, it's difficult to provide a more specific solution.

Recommendation

Apply a workaround by adjusting the device_map configuration to 'balanced' or 'sequential' to better distribute the model's memory usage across the multiple GPUs, as this may help mitigate the CUDA OOM error.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING