vllm - 💡(How to fix) Fix [CI Failure]: Fusion E2E TP2 Quick (H100) [1 participants]

GitHub stats
vllm-project/vllm#41156 · Fetched 2026-04-29 06:11:59
Comments: 0 · Participants: 1 · Timeline events: 22 · Reactions: 0
Timeline (top): mentioned ×6, subscribed ×6, labeled ×4, project_v2_item_status_changed ×3


Name of failing test

tests/compile/fusions_e2e/test_tp2_ar_rms.py::test_tp2_ar_rms_fp8_fusions

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

The test fails consistently with an out-of-memory (OOM) error; I haven't tried to reproduce it locally. The DSV4 PR touched many files, and one of those changes probably introduced a new allocation that is not accounted for. Our memory tracking is known to be unsafe.
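
To make the suspected failure mode concrete, here is a minimal arithmetic sketch of how an allocation that the memory accounting never saw can end in an OOM. It is a simplified model, not vLLM's actual accounting code: the function names, the 80 GiB device size, the 90% budget, and the allocation sizes are all illustrative assumptions, not measurements from this CI run.

```python
GiB = 1024 ** 3


def tracked_budget(total_bytes: int, utilization_pct: int) -> int:
    """Bytes the engine allows itself to use (integer math, no rounding drift)."""
    return total_bytes * utilization_pct // 100


def untracked_fits(total_bytes: int, utilization_pct: int, untracked_bytes: int) -> bool:
    """True if an allocation made outside the tracked budget still fits.

    Assumption: the tracked budget is fully consumed (weights, activations,
    KV cache), so untracked allocations only have the leftover headroom.
    """
    headroom = total_bytes - tracked_budget(total_bytes, utilization_pct)
    return untracked_bytes <= headroom


total = 80 * GiB          # hypothetical device size
utilization_pct = 90      # hypothetical budget: 90% of the device

budget = tracked_budget(total, utilization_pct)
print(budget // GiB)                                   # 72
print(untracked_fits(total, utilization_pct, 2 * GiB))   # True
print(untracked_fits(total, utilization_pct, 10 * GiB))  # False
```

In this toy model every tracked number can look healthy while a new buffer that bypasses the accounting exceeds the remaining headroom, which would match a test that started failing consistently after one large PR.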

📝 History of failing test

Started failing with the DeepSeek V4 PR (#40860):

Been failing consistently on main since:

CC List.

cc @zou3519 @youkaichao @pavanimajety @mgoin

cc @ywang96 @WoosukKwon just fyi

extent analysis

TL;DR

Review the changes introduced in the DeepSeek V4 PR (#40860) to find the memory allocation that makes the test_tp2_ar_rms_fp8_fusions test fail consistently with OOM errors.

Guidance

  • Investigate the memory tracking mechanism to understand why it's considered unsafe and how it might be improved to provide more accurate allocation information.
  • Analyze the changes made in the DSV4 PR, focusing on any new allocations or memory-intensive operations that could be causing the OOM errors.
  • Consider running the test with more GPU memory headroom, or under a memory profiling tool, to identify the specific allocation causing the issue.
  • Review the test case test_tp2_ar_rms_fp8_fusions to see if there are any opportunities to optimize memory usage or reduce allocations.
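
The profiling suggestion above could be sketched as a small helper that records how much memory each stage adds, so a new allocation shows up against a known-good run. This is a hedged sketch: `track_allocation` is a hypothetical helper, not part of vLLM or its test suite, and the memory reader is passed in as a parameter so the example runs without a GPU.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


@contextmanager
def track_allocation(
    label: str,
    read_bytes: Callable[[], int],
    report: Dict[str, int],
) -> Iterator[None]:
    """Record how many bytes the wrapped block added, under `label`."""
    before = read_bytes()
    try:
        yield
    finally:
        # Net growth only: memory allocated and freed inside the block
        # does not show up here.
        report[label] = read_bytes() - before


# Stand-in for a real memory counter so the sketch runs anywhere.
fake_usage = {"bytes": 0}
report: Dict[str, int] = {}

with track_allocation("load_weights", lambda: fake_usage["bytes"], report):
    fake_usage["bytes"] += 4096

with track_allocation("warmup", lambda: fake_usage["bytes"], report):
    fake_usage["bytes"] += 1024

print(report)  # {'load_weights': 4096, 'warmup': 1024}
```

On a CUDA machine, `torch.cuda.memory_allocated` can be passed as `read_bytes` to see what PyTorch's caching allocator holds. Allocations made outside that allocator would not appear there; a reader built on `torch.cuda.mem_get_info()`, which returns free and total device bytes, is one way to catch those as well.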

Notes

The issue has not yet been reproduced locally (the reporter had not tried), and memory tracking is known to be unsafe, so no definitive fix can be given yet. Further investigation into the changes introduced by the DSV4 PR and the memory usage patterns of the test is necessary.

Recommendation

Apply a workaround: give the test_tp2_ar_rms_fp8_fusions test more GPU memory headroom, or reduce its memory usage, to mitigate the OOM errors until the root cause is identified and fixed.
