vllm - ✅(Solved) Fix [CI Failure]: mi325_2: Distributed Tests (2 GPUs)(H100-MI325) [1 pull requests, 2 comments, 2 participants]

vllm · 2026-03-20 17:47:32


GitHub stats

vllm-project/vllm#37709•Fetched 2026-04-08 01:08:39


Author

AndreasKaratzas

Participants

AndreasKaratzas

github-actions[bot]

Timeline (top)

mentioned ×4, subscribed ×4, added_to_project_v2 ×2, commented ×2

Root Cause

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

PR fix notes

PR #38396: [AMD][CI] Update DeepEP branch

Repository: vllm-project/vllm
Author: rjrock
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38396

Description (problem / solution / changelog)

Purpose

Update the DeepEP branch to a version that correctly ahead-of-time compiles for gfx942 and gfx950. This partially addresses #37709

Also, move the testcase to MI325 in order to verify the change, since there are currently no MI355 agents.
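As a rough illustration of the architecture coverage this PR is after, the sketch below checks whether a semicolon-separated list of ROCm build targets includes both gfx942 and gfx950. `PYTORCH_ROCM_ARCH` is a common way to pass such a list to ROCm builds, but whether the DeepEP build reads it is an assumption here, and `missing_archs` is a hypothetical helper, not part of vLLM or DeepEP.

```python
import os

# Architectures named in the PR description.
REQUIRED_ARCHS = ("gfx942", "gfx950")


def missing_archs(arch_list: str, required=REQUIRED_ARCHS) -> list[str]:
    """Return the required architectures absent from a ';'-separated list."""
    built = {arch.strip() for arch in arch_list.split(";") if arch.strip()}
    return [arch for arch in required if arch not in built]


if __name__ == "__main__":
    # Assumption: the build's target list is exposed via PYTORCH_ROCM_ARCH.
    archs = os.environ.get("PYTORCH_ROCM_ARCH", "")
    print("missing:", missing_archs(archs))
```

An empty result means both targets named in the PR appear in the list; it says nothing about whether the compiled kernels actually work on those GPUs.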

Test Plan

python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput

Test Result

Exit code of 0 with the below stdout.

DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Josh and I will be the teacher for the Braille and Computers class for this'
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The President is the head of the executive'
DP rank 0, Prompt: 'The capital of France is', Generated text: '______.\nA. London\nB. Paris\nC. New York\n'
DP rank 0, Prompt: 'The future of AI is', Generated text: ' being decided in Cambridge\nArtificial intelligence (AI) is one of the most'
DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Belinda and I am a 43 year old woman who is passionate about'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Josh and I will be the teacher for the Braille and Literacy Teaching in the 21'
DP rank 1, Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The President is the head of the executive branch of the U'
DP rank 1, Prompt: 'The capital of France is', Generated text: ' ______.\nA. Berlin\nB. London\nC. Madrid\nD. Paris\n答案:\n'
DP rank 1, Prompt: 'The future of AI is', Generated text: ' being determined right now – by you.\nFor all the excitement about the transformative power of artificial intelligence,'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Alan Belcher and I am a 43 year old male. I am a 20'
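To confirm that both data-parallel ranks produced output, the stdout above can be grouped by rank. This is a minimal sketch that assumes each record sits on its own line in the `DP rank N, Prompt: ..., Generated text: ...` form shown in the test result; `group_by_rank` is a hypothetical helper, not part of vLLM.

```python
import re
from collections import defaultdict

# Matches records such as:
#   DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Josh ...'
LINE_RE = re.compile(
    r"DP rank (?P<rank>\d+), Prompt: (?P<prompt>.+?), Generated text: (?P<text>.*)"
)


def group_by_rank(lines):
    """Map each DP rank to its list of (prompt, generated_text) pairs."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # Skip log lines that are not generation records.
            grouped[int(match["rank"])].append((match["prompt"], match["text"]))
    return dict(grouped)
```

With `-dp=2`, one would expect the returned dictionary to contain the keys 0 and 1.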


Changed files

.buildkite/test-amd.yaml (modified, +1/-1)
docker/Dockerfile.rocm (modified, +7/-8)


Name of failing test

pytest -v -s tests/distributed/test_context_parallel.py && VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput && pytest -v -s tests/v1/distributed/test_dbo.py
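Because the three commands are chained with `&&`, a failure stops the chain without making it obvious which step broke. The sketch below runs the same steps one at a time and reports the first failing step. It is an illustration only: `first_failure` and the `STEPS` layout are hypothetical, and the `run` parameter exists so the logic can be exercised without GPUs.

```python
import os
import shlex
import subprocess

# Each step is (extra environment variables, command line), taken from the
# failing test group above.
STEPS = [
    ({}, "pytest -v -s tests/distributed/test_context_parallel.py"),
    (
        {"VLLM_LOGGING_LEVEL": "DEBUG"},
        "python3 examples/offline_inference/data_parallel.py "
        "--model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 "
        "--all2all-backend=deepep_high_throughput",
    ),
    ({}, "pytest -v -s tests/v1/distributed/test_dbo.py"),
]


def first_failure(steps, run=subprocess.run):
    """Run steps in order; return (index, returncode) of the first failure.

    Returns None when every step exits with code 0. Like `&&`, later steps
    are not run after a failure.
    """
    for index, (extra_env, command) in enumerate(steps):
        env = {**os.environ, **extra_env}
        result = run(shlex.split(command), env=env)
        if result.returncode != 0:
            return index, result.returncode
    return None
```

Calling `first_failure(STEPS)` from the repository root runs the real commands; a return value of `(1, ...)` would point at the DeepEP data-parallel example.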

Basic information

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

There is a DeepEP issue in this test group (TG).

📝 History of failing test

https://buildkite.com/vllm/amd-ci/builds/6721/steps/canvas?sid=019d09d4-710f-4752-845d-a402e44ef028&tab=output

extent analysis

Fix Plan

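The fix, merged as PR #38396, updates the DeepEP branch used by the ROCm image (`docker/Dockerfile.rocm`) to a version that ahead-of-time compiles correctly for gfx942 and gfx950. The PR describes itself as only partially addressing #37709.

*   Pick up the updated DeepEP by rebuilding the image from `docker/Dockerfile.rocm` on a vLLM checkout that includes PR #38396. The PR changes the branch referenced in the Dockerfile, so a `pip install --upgrade` of a DeepEP package is not the fix described there.
*   No edit to `data_parallel.py` is needed to select the backend. The `deepep_high_throughput` backend is chosen on the command line, as in the PR's test plan:
    ```bash
    python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
    ```
*   Upgrading transformers is not part of the merged fix. The PR touches only `.buildkite/test-amd.yaml` (moving the test case to MI325) and `docker/Dockerfile.rocm`.

### Verification
To verify that the fix worked, re-run the failing test group:
```bash
pytest -v -s tests/distributed/test_context_parallel.py && VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput && pytest -v -s tests/v1/distributed/test_dbo.py
```

If all three commands exit with code 0, the test group passes. Because the PR only partially addresses #37709, a passing DeepEP example does not by itself confirm that every failure in the group is resolved.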

Extra Tips

Regularly update dependencies to prevent compatibility issues.
Monitor test results and investigate failures promptly to avoid regressions.
Consider adding automated tests for DeepEP configuration and version checks.
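One lightweight form of the automated check suggested above is a test that fails fast when DeepEP is missing from the environment, before any GPU work starts. The sketch assumes DeepEP installs a top-level Python module named `deep_ep`; that name, `module_available`, and `test_deepep_importable` are all illustrative and should be checked against the image actually in use.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def test_deepep_importable():
    # Assumption: DeepEP installs a top-level module named "deep_ep".
    assert module_available("deep_ep"), "DeepEP module not found in this environment"
```

This only shows that the module can be found. It does not prove the kernels were compiled for the GPU architecture in the CI agent, which is the failure the PR targets.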
