transformers - 💡(How to fix) Fix Sam3Video: CUDA out of memory [3 comments, 2 participants]

GitHub stats
huggingface/transformers#44617 · Fetched 2026-04-08 00:27:24
Comments: 3 · Participants: 2 · Timeline events: 6 · Reactions: 0
Timeline (top): commented ×3, labeled ×1, mentioned ×1, subscribed ×1

System Info

transformers 5.3.0, Python 3.10.12, torch 2.4.0+cu124

Tracking multiple targets simultaneously (typically dozens) results in a CUDA out-of-memory error.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import Sam3VideoConfig, Sam3VideoModel, Sam3VideoProcessor
from transformers.video_utils import load_video
import sys
import os
import cv2
import numpy as np
import json
import time
import gc
import torch
import math
from datetime import datetime

import inspect
import os
def print_memory(message=""):
    if not torch.cuda.is_available():
        print(f"[Line {inspect.currentframe().f_back.f_lineno}] {message} - CUDA not available")
        return
    
    caller_frame = inspect.currentframe().f_back
    line_no = caller_frame.f_lineno
    
    allocated = torch.cuda.memory_allocated() / 1024**2  # MB
    reserved = torch.cuda.memory_reserved() / 1024**2    # MB
    max_allocated = torch.cuda.max_memory_allocated() / 1024**2
    
    print(f"[Line {line_no:4d}] {message} - Allocated: {allocated}")

if __name__ == "__main__":
    sam_path = "./sam3/"
    video_path = "./reverse.mp4"
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    #	device = Accelerator().device
    config = Sam3VideoConfig.from_pretrained(sam_path)
    config.image_size = 1008
    model = Sam3VideoModel.from_pretrained(sam_path).to(device, dtype=torch.bfloat16)
    # processor = Sam3VideoProcessor.from_pretrained(sam_path)
    processor = Sam3VideoProcessor.from_pretrained(sam_path, size={"height": 1008, "width": 1008})

    frame_count = 0
    detection_results = []
    track_ids_set = set()
    color_map = {}

    frame_mask_status = []

    batch_counter = 0

    video_frames, _ = load_video(video_path)
    frames_num = video_frames.shape[0]

    inference_session = processor.init_video_session(
        video=video_frames[0],
        inference_device=device,
        processing_device=device,
        video_storage_device='cpu',
        dtype=torch.bfloat16,
    )
    text = "person"
    inference_session = processor.add_text_prompt(
        inference_session=inference_session,
        text=text,
    )
    for idx in range(1, frames_num):
        processed_video = processor.video_processor(videos=video_frames[idx], device=device, return_tensors="pt")
        pixel_values_video = processed_video.pixel_values_videos[0]
        inference_session.add_new_frame(pixel_values_video)

    with torch.no_grad():
        total_model_outputs = model.propagate_in_video_iterator(
            inference_session=inference_session
        )
        print(f"111")

        for model_outputs in total_model_outputs:
            print_memory("test!!!")

            result = processor.postprocess_outputs(inference_session, model_outputs)

            print_memory("before total_model_outputs!!!")
        
        # print(f"{result['object_ids']}")
        test = 0

Expected behavior

[Line 77] test!!! - Allocated: 2696.203125
[Line 81] before total_model_outputs!!! - Allocated: 2838.58642578125
[Line 77] test!!! - Allocated: 2898.85205078125
[Line 81] before total_model_outputs!!! - Allocated: 2898.85205078125
[Line 77] test!!! - Allocated: 2958.947265625
[Line 81] before total_model_outputs!!! - Allocated: 2958.947265625
[Line 77] test!!! - Allocated: 3017.658203125
[Line 81] before total_model_outputs!!! - Allocated: 3017.658203125
[Line 77] test!!! - Allocated: 3077.51611328125
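The growth rate can be read straight off this log. The sketch below parses the readings taken at the top of each loop iteration (the "Line 77" entries) and averages the increase per frame. It skips the first step on the assumption that it includes one-off warm-up allocations. The function names are illustrative and not part of transformers.

```python
import re

# Readings copied from the log in this report
# (MB, as printed from torch.cuda.memory_allocated()).
LOG = """\
[Line 77] test!!! - Allocated: 2696.203125
[Line 81] before total_model_outputs!!! - Allocated: 2838.58642578125
[Line 77] test!!! - Allocated: 2898.85205078125
[Line 81] before total_model_outputs!!! - Allocated: 2898.85205078125
[Line 77] test!!! - Allocated: 2958.947265625
[Line 81] before total_model_outputs!!! - Allocated: 2958.947265625
[Line 77] test!!! - Allocated: 3017.658203125
[Line 81] before total_model_outputs!!! - Allocated: 3017.658203125
[Line 77] test!!! - Allocated: 3077.51611328125
"""


def readings_at_loop_start(log: str) -> list[float]:
    """Return the 'Allocated' values logged at the top of each iteration."""
    pattern = re.compile(r"\[Line\s+77\].*?Allocated:\s*([0-9.]+)")
    return [float(value) for value in pattern.findall(log)]


def step_deltas(readings: list[float]) -> list[float]:
    """Return the increase between consecutive readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


readings = readings_at_loop_start(LOG)
deltas = step_deltas(readings)

# Skip the first step: it is assumed to include one-off warm-up allocations.
steady = deltas[1:]
steady_growth = sum(steady) / len(steady)

print(f"steady growth: {steady_growth:.1f} MB per frame")
```

A roughly constant increase per frame means the total grows linearly with video length, so a long enough clip will exhaust any fixed amount of GPU memory.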

How do I fix it?

extent analysis

Fix Plan

1. Reduce Memory Allocation

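The log shows torch.cuda.memory_allocated() rising by roughly 60 MB per frame while propagate_in_video_iterator runs, and the value does not drop after postprocess_outputs. That pattern points to live tensors being retained from frame to frame rather than to allocator caching. torch.cuda.empty_cache() only returns cached, unused blocks to the driver: it lowers reserved memory but cannot free tensors that are still referenced, so on its own it will not stop this growth. It can still be called once per iteration, at some cost in speed, to reduce fragmentation.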

2. Use Efficient Data Structures

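The inference_session object accumulates state as frames are processed, and with dozens of tracked objects that state is the most likely source of the growth. Moving frames into a torch.cuda.FloatTensor would not help: it is a legacy GPU tensor type, so it would put more data on the GPU, not less. The reproduction already passes video_storage_device='cpu' and dtype=torch.bfloat16 to init_video_session, which keeps the raw frames in host memory at half the size of float32. To bound the session state, process long videos in shorter segments with a fresh session for each one (at the cost of losing track identities across segments), or track fewer objects per pass.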
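A quick size estimate shows why the stored video frames are unlikely to explain the growth seen in the log. The sketch assumes 3-channel frames at the 1008 x 1008 size passed to the processor in the reproduction, stored as bfloat16 (2 bytes per value). The helper name is illustrative.

```python
# Assumptions (taken from the reproduction script, not measured):
# 1008 x 1008 frames, 3 colour channels, bfloat16 storage.
HEIGHT = 1008
WIDTH = 1008
CHANNELS = 3
BYTES_PER_VALUE = 2  # bfloat16


def frame_size_mib(height: int, width: int, channels: int, bytes_per_value: int) -> float:
    """Return the size of one dense frame tensor in MiB."""
    return height * width * channels * bytes_per_value / 1024**2


per_frame_mib = frame_size_mib(HEIGHT, WIDTH, CHANNELS, BYTES_PER_VALUE)
print(f"{per_frame_mib:.2f} MiB per stored frame")
```

Under these assumptions one frame is under 6 MiB, about a tenth of the roughly 60 MB per-frame increase in the log, so most of the growth has to come from other per-frame state.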

3. Optimize Model Outputs

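propagate_in_video_iterator is consumed one frame at a time in the reproduction, and result is overwritten on every pass, so the loop itself does not accumulate outputs. If results are collected for later use (for example in detection_results), move them to the CPU or write them to disk first, and keep only the fields that are needed, so that no GPU tensors stay referenced between iterations.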
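When per-frame results need to be kept, one option is to copy them to host memory before storing them. The helper below is a minimal sketch and not part of transformers. It assumes postprocess_outputs returns a dict, as the commented-out result['object_ids'] line in the reproduction suggests, possibly with nested lists or tuples. It treats any object that has both detach() and cpu() methods as a tensor.

```python
from typing import Any


def to_cpu(value: Any) -> Any:
    """Recursively copy tensor-like values to host memory.

    Objects exposing both .detach() and .cpu() (torch.Tensor does) are moved;
    dicts, lists and tuples are traversed; everything else is returned as-is.
    """
    if hasattr(value, "detach") and hasattr(value, "cpu"):
        return value.detach().cpu()
    if isinstance(value, dict):
        return {key: to_cpu(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_cpu(item) for item in value]
    if isinstance(value, tuple):
        return tuple(to_cpu(item) for item in value)
    return value
```

Inside the loop this would be used as detection_results.append(to_cpu(result)), so the list only ever holds CPU copies.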

Code Changes

# Process outputs one frame at a time and keep no GPU references between iterations
with torch.no_grad():
    for model_outputs in model.propagate_in_video_iterator(
        inference_session=inference_session
    ):
        result = processor.postprocess_outputs(inference_session, model_outputs)
        # Consume or save `result` here, then drop the references
        del result, model_outputs
        # Optional: returns cached, unused blocks to the driver.
        # This lowers reserved memory, not allocated memory.
        torch.cuda.empty_cache()

4. Monitor Memory Usage

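The print_memory function in the reproduction computes reserved and peak memory but prints only the allocated value. Printing all three at each iteration makes it possible to tell retained tensors (allocated grows) from allocator caching (only reserved grows).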

print_memory("test!!!")
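A variant of print_memory that reports all three counters might look like the sketch below. It uses torch.cuda.memory_allocated(), torch.cuda.memory_reserved() and torch.cuda.max_memory_allocated(), the same calls as the original script. The format_memory helper is an illustrative addition, and the caller line number from the original is left out for brevity.

```python
def format_memory(message: str, allocated: int, reserved: int, peak: int) -> str:
    """Format three byte counts as one log line in MiB."""
    mib = 1024**2
    return (
        f"{message} - allocated: {allocated / mib:.1f} MiB, "
        f"reserved: {reserved / mib:.1f} MiB, "
        f"peak allocated: {peak / mib:.1f} MiB"
    )


def print_memory(message: str = "") -> None:
    """Print allocated, reserved and peak allocated CUDA memory."""
    try:
        import torch  # imported lazily so the formatter works without torch
    except ImportError:
        print(f"{message} - torch not installed")
        return
    if not torch.cuda.is_available():
        print(f"{message} - CUDA not available")
        return
    print(
        format_memory(
            message,
            torch.cuda.memory_allocated(),
            torch.cuda.memory_reserved(),
            torch.cuda.max_memory_allocated(),
        )
    )
```

If allocated stays flat while reserved keeps climbing, caching or fragmentation is the more likely cause, and torch.cuda.empty_cache() becomes relevant.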

Verification

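To verify a fix, log allocated memory for a few dozen frames and check that it levels off after the first iterations instead of rising by about 60 MB per frame. nvidia-smi is useful as a second view, but it reports memory held by the whole process, including the CUDA context and the allocator cache, so its numbers will be higher than torch.cuda.memory_allocated().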
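The check can be automated with a small helper that flags sustained growth in a list of per-iteration readings. This is a sketch: the function name, the warm-up count and the 5 MB threshold are arbitrary illustrative choices, not values from the report.

```python
def grows_steadily(readings_mb: list[float], max_growth_mb: float, warmup: int = 1) -> bool:
    """Return True if average growth per step after warm-up exceeds the limit."""
    steady = readings_mb[warmup:]
    if len(steady) < 2:
        return False  # not enough data to judge
    average_growth = (steady[-1] - steady[0]) / (len(steady) - 1)
    return average_growth > max_growth_mb


# Readings taken at the top of each iteration in the reported log (MB).
reported = [2696.203125, 2898.85205078125, 2958.947265625,
            3017.658203125, 3077.51611328125]

# Hypothetical threshold: flag anything above 5 MB of growth per frame.
print(grows_steadily(reported, max_growth_mb=5.0))
```

Applied to the reported readings the helper flags the run as growing, while a fixed run should produce readings that stay below the chosen threshold.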

Extra Tips

  • torch.cuda.empty_cache() does not fix a leak; find and drop the references that keep tensors alive.
  • Move per-frame results to the CPU before storing them, and keep only the fields you need.
