vllm - ✅(Solved) Fix [CI Failure]: mi325_2: Distributed Tests (2 GPUs)(H100-MI325) [1 pull request, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#37709 · Fetched 2026-04-08 01:08:39
Comments: 2 · Participants: 2 · Timeline events: 17 · Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, added_to_project_v2 ×2, commented ×2

Root Cause

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

PR fix notes

PR #38396: [AMD][CI] Update DeepEP branch

Description (problem / solution / changelog)

Purpose

Update the DeepEP branch to a version that correctly ahead-of-time compiles for gfx942 and gfx950. This partially addresses #37709

Also, move the testcase to MI325 in order to verify the change, since there are currently no MI355 agents.
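
As a rough illustration of the target check implied above, the sketch below tests whether a GPU architecture string is one of the two targets named in the PR. It assumes the architecture is reported in a form such as `gfx942:sramecc+:xnack-` (name followed by optional feature suffixes); the helper names `base_arch` and `is_aot_target` are hypothetical and not part of vLLM or DeepEP.

```python
# Targets named in the PR description; nothing else is assumed to be covered.
AOT_TARGETS = {"gfx942", "gfx950"}


def base_arch(arch_name: str) -> str:
    """Strip feature suffixes such as ':sramecc+:xnack-' from an arch string."""
    return arch_name.split(":", 1)[0].strip().lower()


def is_aot_target(arch_name: str) -> bool:
    """Return True if the architecture is one the updated DeepEP branch targets."""
    return base_arch(arch_name) in AOT_TARGETS
```

A check like this could run before the DeepEP test step so that an agent with a different architecture is skipped rather than reported as a failure.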

Test Plan

python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput

Test Result

Exit code of 0 with the below stdout.

DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Josh and I will be the teacher for the Braille and Computers class for this'
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The President is the head of the executive'
DP rank 0, Prompt: 'The capital of France is', Generated text: '______.\nA. London\nB. Paris\nC. New York\n'
DP rank 0, Prompt: 'The future of AI is', Generated text: ' being decided in Cambridge\nArtificial intelligence (AI) is one of the most'
DP rank 0, Prompt: 'Hello, my name is', Generated text: ' Belinda and I am a 43 year old woman who is passionate about'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Josh and I will be the teacher for the Braille and Literacy Teaching in the 21'
DP rank 1, Prompt: 'The president of the United States is', Generated text: ' the most powerful person in the world. The President is the head of the executive branch of the U'
DP rank 1, Prompt: 'The capital of France is', Generated text: ' ______.\nA. Berlin\nB. London\nC. Madrid\nD. Paris\n答案:\n'
DP rank 1, Prompt: 'The future of AI is', Generated text: ' being determined right now – by you.\nFor all the excitement about the transformative power of artificial intelligence,'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Alan Belcher and I am a 43 year old male. I am a 20'
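
Beyond the exit code, one way to sanity-check a run is to confirm that every data-parallel rank actually printed results. The sketch below counts `DP rank N, Prompt:` entries in captured stdout; it relies only on the line format shown in the output above, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches the start of each result entry in the stdout format shown above.
_RANK_RE = re.compile(r"DP rank (\d+), Prompt:")


def outputs_per_rank(stdout: str) -> dict[int, int]:
    """Count how many generations each DP rank reported."""
    return dict(Counter(int(rank) for rank in _RANK_RE.findall(stdout)))


def all_ranks_reported(stdout: str, dp_size: int) -> bool:
    """True if every rank in range(dp_size) printed at least one result."""
    counts = outputs_per_rank(stdout)
    return all(counts.get(rank, 0) > 0 for rank in range(dp_size))
```

For the run above (`-dp=2`), `all_ranks_reported(stdout, 2)` would be the relevant check. Because the pattern does not depend on line breaks, it also works on output that has been flattened into a single line.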


<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
</details>

Changed files

  • .buildkite/test-amd.yaml (modified, +1/-1)
  • docker/Dockerfile.rocm (modified, +7/-8)

Name of failing test

pytest -v -s tests/distributed/test_context_parallel.py && VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput && pytest -v -s tests/v1/distributed/test_dbo.py
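
Because the three steps are chained with `&&`, a failure in any one of them fails the whole group and hides which step was responsible. The sketch below runs the same steps one at a time and reports the first that fails; `run_chain` and `STEPS` are hypothetical names, and the commands are copied from the line above.

```python
import os
import subprocess

# The three steps from the failing command, with per-step environment overrides.
STEPS = [
    (["pytest", "-v", "-s", "tests/distributed/test_context_parallel.py"], None),
    (
        [
            "python3",
            "examples/offline_inference/data_parallel.py",
            "--model=Qwen/Qwen1.5-MoE-A2.7B",
            "-tp=1",
            "-dp=2",
            "--max-model-len=2048",
            "--all2all-backend=deepep_high_throughput",
        ],
        {"VLLM_LOGGING_LEVEL": "DEBUG"},
    ),
    (["pytest", "-v", "-s", "tests/v1/distributed/test_dbo.py"], None),
]


def run_chain(steps):
    """Run steps in order, like `a && b && c`.

    Returns (index, returncode) for the first failing step, or None if all pass.
    """
    for index, (argv, step_env) in enumerate(steps):
        env = {**os.environ, **(step_env or {})}
        result = subprocess.run(argv, env=env)
        if result.returncode != 0:
            return index, result.returncode
    return None
```

Calling `run_chain(STEPS)` from the vLLM repository root would show whether the DeepEP example (index 1) is the step that fails, as the issue description suggests.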

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

There is a DeepEP issue in this test group (TG).

📝 History of failing test

https://buildkite.com/vllm/amd-ci/builds/6721/steps/canvas?sid=019d09d4-710f-4752-845d-a402e44ef028&tab=output

extent analysis

Fix Plan

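The issue attributes the failure to DeepEP, and PR #38396 partially addresses it by updating the DeepEP branch to a version that correctly ahead-of-time compiles for gfx942 and gfx950.

  • Pick up the updated DeepEP branch. According to the PR's changed files, the update is made in docker/Dockerfile.rocm, so rebuilding the ROCm image from a checkout that includes the PR is the likely way to get it. Running pip install --upgrade deepep is not what the PR does.

  • The deepep_high_throughput backend needs no code change in data_parallel.py. It is selected on the command line with --all2all-backend=deepep_high_throughput. To confirm that DeepEP is importable in the environment before running the script, a quick check such as the following can help (assuming the module is named deep_ep):

```python
import importlib.util

# Assumption: DeepEP installs a Python module named `deep_ep`;
# adjust the name if your build differs.
if importlib.util.find_spec("deep_ep") is None:
    raise SystemExit("DeepEP is not importable in this environment")
print("DeepEP module found")
```

  • Upgrading transformers is not indicated here. The issue points to DeepEP as the external library involved, not transformers.

Verification

To verify the fix, re-run the failing test group:

```bash
pytest -v -s tests/distributed/test_context_parallel.py && VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput && pytest -v -s tests/v1/distributed/test_dbo.py
```

If the command exits with code 0, the test group passes. Note that the PR describes itself as only partially addressing #37709.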

Extra Tips

  • Regularly update dependencies to prevent compatibility issues.
  • Monitor test results and investigate failures promptly to avoid regressions.
  • Consider adding automated tests for DeepEP configuration and version checks.
