vllm - 💡(How to fix) Fix [CI Failure]: MultiModal Models Extended 2 - isaac test case OOMs [1 participants]

GitHub stats
vllm-project/vllm#37459, fetched 2026-04-08 00:58:33
Comments: 0 · Participants: 1 · Timeline events: 8 · Reactions: 0
Timeline (top): mentioned ×3, subscribed ×3, added_to_project_v2 ×1, labeled ×1


Name of failing test

models/multimodal/generation/test_common.py::test_single_image_models[isaac-test_case35]

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

The OOMs started recently, with the nightly run triggered Tuesday at 2 AM EST (March 17th): https://buildkite.com/vllm/ci/builds/56555#019cfa63-99c7-4529-b370-c2b45ea05393

Curiously, it showed up when other tests in the batch started failing, so it may be flaky.

📝 History of failing test

The OOMs started recently, with the nightly run triggered Tuesday at 2 AM (March 17th): https://buildkite.com/vllm/ci/builds/56555#019cfa63-99c7-4529-b370-c2b45ea05393

CC List.

cc @AkshatSh @ywang96 @oscardev256

extent analysis

Fix Plan

The following generic steps are a starting point for narrowing down the Out of Memory (OOM) failure in the test_single_image_models test:

  • Increase the batch size gradually to identify the threshold where OOM occurs
  • Reduce memory usage by:
    • Reducing model size or complexity
    • Using mixed precision
    • Gradient checkpointing (only applicable when gradients are computed, so it does not help a pure inference test)
  • Wrap the test call in a try-except block that catches OOM errors, reports a meaningful message, and re-raises anything else

Example Code

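import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical stand-in for the failing test; swap in the real invocation.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 16).to(device)

def run_test(batch_size=4):
    # Forward pass without gradients, as in an inference test
    with torch.no_grad():
        return model(torch.randn(batch_size, 16, device=device))

# Increase batch size gradually
for batch_size in [4, 8, 16, 32]:
    try:
        run_test(batch_size)
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise  # not an OOM; do not swallow unrelated errors
        print(f"OOM occurred at batch size {batch_size}")
        break

# Optimize memory usage
# Mixed precision: autocast runs eligible ops in a lower-precision dtype
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    run_test()

# Gradient checkpointing: only helps when gradients are computed (training)
x = torch.randn(4, 16, device=device, requires_grad=True)
checkpoint(model, x, use_reentrant=False).sum().backward()

# Implement try-except block
try:
    run_test()
except RuntimeError as e:
    if "out of memory" not in str(e):
        raise
    print("OOM occurred. Please reduce batch size or model complexity.")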

Verification

To verify the fix, rerun the failing test case with the adjusted batch size and model configuration. Monitor memory usage and test execution time to confirm that the OOM no longer occurs.
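Because the issue suspects the failure may be flaky, a single passing run is weak evidence. The sketch below reruns the test ID quoted in the issue several times in fresh processes and classifies the outcome. The helper names (build_pytest_cmd, rerun, summarize) and the attempt count are illustrative assumptions, not part of vLLM or its CI; the test path is assumed to be relative to the directory you run from.

```python
import subprocess
import sys

# Test ID copied from the issue; the path is assumed relative to the working directory.
TEST_ID = (
    "models/multimodal/generation/test_common.py"
    "::test_single_image_models[isaac-test_case35]"
)


def build_pytest_cmd(test_id: str) -> list[str]:
    """Build a command that runs one pytest test ID quietly (hypothetical helper)."""
    return [sys.executable, "-m", "pytest", test_id, "-q"]


def rerun(test_id: str, attempts: int = 5) -> list[int]:
    """Run the test `attempts` times, each in a fresh process; return the exit codes."""
    return [
        subprocess.run(build_pytest_cmd(test_id)).returncode
        for _ in range(attempts)
    ]


def summarize(exit_codes: list[int]) -> str:
    """Classify a run history; pytest exits with 0 only when every test passed."""
    failures = sum(1 for code in exit_codes if code != 0)
    if failures == 0:
        return "stable"
    if failures == len(exit_codes):
        return "consistently failing"
    return "flaky"
```

Calling summarize(rerun(TEST_ID)) on a GPU machine gives a rough signal; running each attempt in its own process avoids GPU memory left over from a previous attempt skewing the next one.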

Extra Tips

  • Regularly monitor test execution time and memory usage to detect potential issues early
  • Use tools like nvidia-smi to monitor GPU memory usage during test execution
  • Consider using a more efficient model architecture or optimizing the existing model for better performance
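
The nvidia-smi tip above can be scripted so memory readings are logged alongside test runs. This is a minimal sketch that assumes nvidia-smi is on PATH; the function names are illustrative, and the query reports used and total memory per GPU in MiB.

```python
import subprocess

# Query used and total memory per GPU as bare comma-separated numbers (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse lines like '1024, 81920' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_mib() -> list[tuple[int, int]]:
    """Call nvidia-smi once and return one (used, total) pair per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)
```

Sampling gpu_memory_mib() before and after the failing test case can show whether memory held by earlier tests in the batch is still allocated when the isaac case starts.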
