vllm - ✅(Solved) Fix [CI Failure]: Gemma3 OOMs with transformers backend [1 pull request, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#37736 (fetched 2026-04-08 01:08:28)
Comments: 2
Participants: 2
Timeline: 19
Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, added_to_project_v2 ×3, project_v2_item_status_changed ×3

Fix Action

Fixed

PR fix notes

PR #37717: [ROCm][CI] Add large_gpu_mark to test_max_tokens_none for ROCm

Description (problem / solution / changelog)

Follow-up for:

  • #34839

Marks the max_tokens test (test_max_tokens_none, which uses distilbert/distilgpt2) as a large-GPU test. Addresses the failure in the mi250_1: Regression test group.

Motivation: https://buildkite.com/vllm/amd-ci/builds/6721/steps/canvas?sid=019d09d4-708e-44b0-a0d0-ccf0e3c00a94&tab=output

cc @kenroche

Changed files

  • tests/test_regression.py (modified, +14/-2)
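
The change above gates a test on available GPU memory. As a rough illustration of that idea only, the sketch below uses the standard library's unittest.skipIf. The helper name requires_gpu_memory, the 80 GB threshold, and the hard-coded AVAILABLE_GB value are hypothetical and are not taken from the PR, which uses vLLM's own large_gpu_mark helper.

```python
import unittest


def requires_gpu_memory(min_gb: float, available_gb: float):
    """Hypothetical helper: skip a test unless the GPU has at least min_gb of memory."""
    return unittest.skipIf(
        available_gb < min_gb,
        f"Need at least {min_gb} GB of GPU memory, found {available_gb} GB",
    )


# Placeholder value (assumption); a real suite would query the device instead.
AVAILABLE_GB = 64.0


# The 80 GB threshold is illustrative, not the value used in the PR.
@requires_gpu_memory(min_gb=80, available_gb=AVAILABLE_GB)
def test_max_tokens_none():
    return "ran"
```

With these placeholder numbers the test is skipped instead of failing, which is the behavior a large-GPU marker is meant to provide on smaller devices.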

Test group

mi250_1: Multi-Modal Models (Standard) 2: qwen3 + gemma

Describe the failing test

This is not exactly a test failure, but it has been recommended that we investigate the out-of-memory (OOM) event with Gemma3, which is a 4B model. The intuition is that the fake tensor used for profiling is large enough to exceed the 64 GB of memory on MI250 GPUs. However, it has been suggested that an OOM is still unexpected for a model of this size.
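
To make the "still unexpected" point concrete, here is a back-of-the-envelope estimate of the memory taken by the weights alone. It assumes that "4B" means roughly 4e9 parameters stored at 16-bit precision; neither figure is stated in the issue, and the function name is made up for illustration.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: about 4e9 parameters at 2 bytes each (16-bit weights).
gemma3_weights_gib = weight_memory_gib(4e9, bytes_per_param=2)
print(f"Approximate weight memory: {gemma3_weights_gib:.1f} GiB")
```

Under these assumptions the weights come to about 7.5 GiB, a small fraction of 64 GB, so the remaining memory would have to be consumed by something else, such as the profiling input.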

📝 History of failing test

https://github.com/vllm-project/vllm/pull/37610#issuecomment-4102286515

CC List.

@DarkLight1337 (for transparency)

extent analysis

Fix Plan

To address the Gemma3 OOM, the plan is to reduce the memory footprint of the fake tensor used for profiling. This rests on the intuition from the issue description, which has not been confirmed as the root cause.

Steps to Fix

  • Reduce the size of the fake tensor to fit within the 64 GB memory limit of the MI250 GPUs.
  • Implement a more efficient profiling method that doesn't require large tensors.

Example Code

import torch

# Before (illustrative shape only; 4096 x 4096 float32 values occupy 64 MiB)
fake_tensor = torch.randn(4096, 4096)

# After (1024 x 1024 float32 values occupy 4 MiB)
fake_tensor = torch.randn(1024, 1024)
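
To put the two shapes in perspective, the following sketch computes the footprint of a dense tensor from its shape. It assumes 4 bytes per element (float32, the default dtype of torch.randn); the helper name tensor_bytes is hypothetical.

```python
import math


def tensor_bytes(shape, bytes_per_element: int = 4) -> int:
    """Size in bytes of a dense tensor with the given shape."""
    return math.prod(shape) * bytes_per_element


before = tensor_bytes((4096, 4096))  # shape from the "Before" snippet
after = tensor_bytes((1024, 1024))   # shape from the "After" snippet

print(f"before: {before / 2**20:.0f} MiB, after: {after / 2**20:.0f} MiB")
```

Both footprints are in the MiB range, so the shapes in the snippet should be read as placeholders: a profiling input that actually exhausts 64 GB would have to be several orders of magnitude larger.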

Alternatively, consider using a more memory-efficient profiling method, such as:

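# Profile one 1024-row slice of the 4096 x 4096 input at a time
for start_row in range(0, 4096, 1024):
    fake_tensor = torch.randn(1024, 4096)
    # profiling code here
    # Note: peak memory measured this way reflects one slice, not the full input
    del fake_tensor  # release the slice before allocating the next one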

Verification

Confirm that the OOM no longer occurs by re-running the mi250_1: Multi-Modal Models (Standard) 2: qwen3 + gemma test group with the reduced tensor size or the new profiling method.

Extra Tips

  • Monitor memory usage during profiling to ensure the fix is effective.
  • Consider implementing a dynamic tensor sizing mechanism to adapt to different GPU memory limits.
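
The dynamic sizing tip could be sketched as follows. This is a minimal illustration, not vLLM's profiling logic: the function name, the square-tensor shape, and the 50% budget are all assumptions. In practice the free-memory figure could come from a call such as torch.cuda.mem_get_info(), which returns free and total bytes for the current device.

```python
import math


def max_square_side(free_bytes: int,
                    bytes_per_element: int = 4,
                    budget_fraction: float = 0.5) -> int:
    """Largest n such that an n x n tensor fits in a fraction of free memory."""
    if free_bytes <= 0:
        raise ValueError("free_bytes must be positive")
    if not 0 < budget_fraction <= 1:
        raise ValueError("budget_fraction must be in (0, 1]")
    # Number of elements that fit inside the memory budget.
    budget_elements = int(free_bytes * budget_fraction) // bytes_per_element
    return math.isqrt(budget_elements)


# Example: 8 MiB free, float32 elements, half of it reserved for the tensor.
side = max_square_side(8 * 2**20, bytes_per_element=4, budget_fraction=0.5)
```

Sizing the profiling input from the memory that is actually free lets the same code run on devices with different capacities, at the cost of profiling a smaller input on smaller GPUs.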
