vllm - 💡(How to fix) Fix [CI Failure]: mi300_2: Distributed Compile Unit Tests (2xH100-2xMI300) [2 comments, 1 participants]

GitHub stats
vllm-project/vllm#41581, fetched 2026-05-04 04:58:41
Comments: 2
Participants: 1
Timeline events: 8
Reactions: 0
Timeline (top): added_to_project_v2 ×2, commented ×2, labeled ×1, mentioned ×1

Root Cause

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

Code Example

FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[True-dtype0-16-16-8-TestMMRSModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[True-dtype0-16-16-8-TestAGMMModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[True-dtype1-16-16-8-TestMMRSModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[True-dtype1-16-16-8-TestAGMMModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[True-dtype1-16-16-8-TestScaledMMRSModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[True-dtype1-16-16-8-TestAGScaledMMModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[True-dtype1-16-16-8-TestCutlassScaledMMRSModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[True-dtype1-16-16-8-TestAGCutlassScaledMMModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[False-dtype0-16-16-8-TestMMRSModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[False-dtype0-16-16-8-TestAGMMModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[False-dtype1-16-16-8-TestMMRSModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[False-dtype1-16-16-8-TestAGMMModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[False-dtype1-16-16-8-TestScaledMMRSModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[False-dtype1-16-16-8-TestAGScaledMMModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[False-dtype1-16-16-8-TestCutlassScaledMMRSModel]
FAILED tests/compile/passes/distributed/test_async_tp.py::test_async_tp_pass_replace[False-dtype1-16-16-8-TestAGCutlassScaledMMModel]
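
A quick way to see whether the failures cluster on one parameter is to tally the bracketed test ids. The following is a minimal sketch, not part of the original report: the field names `flag`, `dtype`, and `model` are labels chosen here for the positions in the id, since the report does not say what each position means.

```python
import re
from collections import Counter

# Matches ids of the form shown above:
#   FAILED <path>::<test>[<True|False>-<dtypeN>-<n>-<n>-<n>-<ModelClass>]
# The group names are labels for this sketch only (assumption).
FAILED_RE = re.compile(
    r"^FAILED (?P<path>\S+?)::(?P<test>\w+)"
    r"\[(?P<flag>True|False)-(?P<dtype>dtype\d+)-\d+-\d+-\d+-(?P<model>\w+)\]$"
)


def summarize_failures(log_text):
    """Count failures per flag, dtype, and model class in a pytest summary."""
    summary = {"flag": Counter(), "dtype": Counter(), "model": Counter()}
    for line in log_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a FAILED line of this shape
        for key in summary:
            summary[key][match.group(key)] += 1
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize_failures.py < pytest_summary.txt
    for key, counts in summarize_failures(sys.stdin.read()).items():
        print(key, dict(counts))
```

If the 16 lines above are fed in, an even split across both values of the first field would suggest that parameter is not what triggers the failure.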

Name of failing test

(command rocm-smi || true) \
  && export VLLM_TEST_GROUP_NAME=mi300_2-distributed-compile-unit-tests-2xh100-2xmi300 \
  && export VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 \
  && cd /vllm-workspace/ \
  && export VLLM_TEST_CLEAN_GPU_MEMORY=1 \
  && VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/passes/distributed/test_async_tp.py \
  && pytest -v -s tests/compile/passes/distributed/test_sequence_parallelism.py \
  && pytest -v -s tests/compile/passes/distributed/test_tp2_ar_rms.py::test_tp2_ar_rms_fusions
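
Because the command chains its steps with `&&`, the shell stops at the first pytest invocation that fails, so a failure in test_async_tp.py means the other two targets never run. A minimal sketch for running all three independently when reproducing locally follows. It assumes the environment variables from the command above are already exported and that the working directory is the repository root; `run_pytest` and `run_all` are hypothetical helper names, and `python -m pytest` is used in place of the bare `pytest` entry point.

```python
import subprocess
import sys

# The three pytest targets taken from the CI command above.
TARGETS = [
    "tests/compile/passes/distributed/test_async_tp.py",
    "tests/compile/passes/distributed/test_sequence_parallelism.py",
    "tests/compile/passes/distributed/test_tp2_ar_rms.py::test_tp2_ar_rms_fusions",
]


def run_pytest(target):
    """Run one pytest target and return its exit code."""
    cmd = [sys.executable, "-m", "pytest", "-v", "-s", target]
    return subprocess.run(cmd).returncode


def run_all(targets, runner=run_pytest):
    """Run every target even if an earlier one fails; map target to exit code."""
    return {target: runner(target) for target in targets}


if __name__ == "__main__":
    results = run_all(TARGETS)
    for target, code in results.items():
        status = "ok" if code == 0 else "FAIL"
        print(f"{status} (exit {code}): {target}")
    # Exit non-zero if any target failed, mirroring the CI step's overall result.
    sys.exit(1 if any(code != 0 for code in results.values()) else 0)
```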

📝 History of failing test

  • Current streak start: 2026-05-02
  • First failure in 60d window: 2026-05-02
  • Last successful nightly: 2026-05-01
  • Break frequency (60d, pass↔fail flips): 1
  • Latest nightly date: 2026-05-09
  • Latest build(s): amd-ci #8371
  • Latest hardware status: mi300_2=fail
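
The dates above bound the regression window between the last successful nightly (2026-05-01) and the first failure (2026-05-02). A minimal sketch for listing candidate commits in that window follows. The one-day padding is an assumption, since the report does not say when on each date the nightly was built; the `vllm/compilation` path is also an assumption about where relevant source changes might live; `regression_log_command` is a hypothetical helper name.

```python
import datetime as dt
import subprocess


def regression_log_command(last_pass, first_fail, paths=(), pad_days=1):
    """Build a `git log` command covering the window between two nightlies."""
    since = dt.date.fromisoformat(last_pass) - dt.timedelta(days=pad_days)
    until = dt.date.fromisoformat(first_fail) + dt.timedelta(days=pad_days)
    cmd = [
        "git", "log", "--oneline",
        f"--since={since.isoformat()}",
        f"--until={until.isoformat()}",
    ]
    if paths:
        cmd += ["--", *paths]  # restrict the log to the given paths
    return cmd


if __name__ == "__main__":
    cmd = regression_log_command(
        "2026-05-01",
        "2026-05-02",
        paths=[
            "tests/compile/passes/distributed/test_async_tp.py",
            "vllm/compilation",  # assumed location of related source changes
        ],
    )
    # Run from inside a checkout of the repository.
    subprocess.run(cmd, check=False)
```

An empty result would suggest the change came from outside these paths, for example a dependency or CI image update.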

extent analysis

TL;DR

Investigate and fix the issue with shards collecting 0 items in the distributed test setup.

Guidance

  • Review the test configuration and setup for test_async_tp.py, test_sequence_parallelism.py, and test_tp2_ar_rms.py to ensure that shards are properly defined and data is being correctly distributed.
  • Check the test environment and hardware setup, specifically the mi300_2 configuration, to identify any potential issues that could be causing the shards to collect 0 items.
  • Verify that the test data is correctly prepared and available for the shards to collect.
  • Investigate the test history and build logs to see if there were any changes or updates around the time the test started failing.
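
One way to start on the first two points is to check how many tests pytest collects for each target on the affected machine. The sketch below is only a baseline check under stated assumptions: the report does not show how the CI shards are configured, so this runs plain collection without any sharding, and the line-counting heuristic (`"::"` marks a test id) is an assumption about the quiet collection output. `count_collected` and `collect` are hypothetical helper names.

```python
import subprocess
import sys


def count_collected(collect_output):
    """Count test ids in the output of `pytest --collect-only -q`.

    Heuristic: in quiet mode each collected test is printed as one
    unindented line containing "::"; summary and blank lines do not match.
    """
    return sum(
        1
        for line in collect_output.splitlines()
        if "::" in line and not line.startswith((" ", "ERROR"))
    )


def collect(target):
    """Collect (without running) the tests under one target and count them."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "--collect-only", "-q", target],
        capture_output=True,
        text=True,
    )
    return count_collected(proc.stdout)


if __name__ == "__main__":
    for target in (
        "tests/compile/passes/distributed/test_async_tp.py",
        "tests/compile/passes/distributed/test_sequence_parallelism.py",
        "tests/compile/passes/distributed/test_tp2_ar_rms.py",
    ):
        print(f"{collect(target):4d} collected: {target}")
```

A count of 0 here would point at collection itself (import errors, skips at module level), while a non-zero count would point at how the tests are split across shards.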

Notes

The issue seems to be related to the test setup and environment, but without more information about the test code and configuration, it's difficult to provide a more specific solution.

Recommendation

No workaround is identified yet. Investigate why the shards collect 0 items first, since this is more likely a configuration or environment issue than a code problem.
