vllm - 💡(How to fix) Fix [CI Failure]: mi300_4: Distributed Torchrun + Examples (4 GPUs)

vllm · 2026-05-09 21:00:01

Error Message

RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

Root Cause

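Not yet identified. The issue template offers the following categories, and this extract does not record which of them were checked:
Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)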

Name of failing test

(command rocm-smi || true) && \
export VLLM_TEST_GROUP_NAME=mi300_4-distributed-torchrun---examples-4-gpus && \
export VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 && \
cd /vllm-workspace/tests && \
torchrun --nproc-per-node=4 distributed/test_torchrun_example.py && \
PP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py && \
TP_SIZE=4 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py && \
PP_SIZE=2 TP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py && \
DP_SIZE=4 ENABLE_EP=1 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py && \
TP_SIZE=2 DP_SIZE=2 ENABLE_EP=1 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py && \
python3 ../examples/features/data_parallel/data_parallel_offline.py --enforce-eager && \
VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 ../examples/rl/rlhf_nccl.py && \
VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 ../examples/rl/rlhf_ipc.py
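Because the test group is a single `&&` chain, it stops at the first failing step, and the report does not say which step that was. The sketch below is one possible way to find out locally: it splits a chain on `&&` and runs each step on its own, replaying any earlier `export` and `cd` steps first so later steps see the same environment. The helper names `split_chain` and `first_failing_step` are hypothetical, and the naive split assumes no `&&` appears inside quotes or subshells, which holds for the command above.

```python
import subprocess

# Steps that only change shell state; they are replayed before every later step.
STATEFUL_PREFIXES = ("export ", "cd ")


def split_chain(command):
    """Split a shell command into steps on '&&' (naive: ignores quoting)."""
    return [step.strip() for step in command.split("&&") if step.strip()]


def first_failing_step(steps, shell="sh"):
    """Run steps one at a time; return (index, step) of the first failure.

    Returns None when every step exits with status 0. If a replayed
    'export' or 'cd' step itself fails, the failure is attributed to the
    step that was being run at that point.
    """
    preamble = []
    for index, step in enumerate(steps):
        if step.startswith(STATEFUL_PREFIXES):
            preamble.append(step)
            continue
        script = " && ".join(preamble + [step])
        result = subprocess.run([shell, "-c", script])
        if result.returncode != 0:
            return index, step
    return None


if __name__ == "__main__":
    # Stand-in chain for illustration only; substitute the real test command.
    demo = 'export DEMO_FLAG=1 && test "$DEMO_FLAG" = 1 && false && true'
    print(first_failing_step(split_chain(demo)))
```

Running every step in a fresh shell costs a little time but keeps one step's crash from hiding the results of the steps after it, which matters when several torchrun configurations share the same group.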

Basic information

Flaky test
Can reproduce locally
Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

https://buildkite.com/vllm/amd-ci/builds/8371/canvas?sid=019e0bf7-4aec-42bb-85d9-00364d4e4194&tab=output
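The error text itself suggests rerunning with `NCCL_DEBUG=INFO`. A minimal sketch of doing that for a single step follows; the helper names `nccl_debug_env` and `rerun_with_nccl_debug` are hypothetical, and it is an assumption here that the collective library used on the MI300 runners honors `NCCL_DEBUG` in the same way NCCL does.

```python
import os
import subprocess


def nccl_debug_env(base=None):
    """Return a copy of the environment with NCCL_DEBUG=INFO set."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    return env


def rerun_with_nccl_debug(argv, cwd=None):
    """Run one command with verbose NCCL logging; return its exit status."""
    completed = subprocess.run(argv, env=nccl_debug_env(), cwd=cwd)
    return completed.returncode


if __name__ == "__main__":
    # First torchrun step of the failing group, taken from the command above.
    # Requires the 4-GPU test environment; not runnable elsewhere.
    status = rerun_with_nccl_debug(
        ["torchrun", "--nproc-per-node=4", "distributed/test_torchrun_example.py"],
        cwd="/vllm-workspace/tests",
    )
    print("exit status:", status)
```

The verbose log is written to the process output, so capturing it per rank makes it easier to see which rank reported the underlying device error first.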

📝 History of failing test

Current streak start: 2026-05-06
First failure in 60d window: 2026-05-06
Last successful nightly: 2026-05-04
Break frequency (60d, pass↔fail flips): 1
Latest nightly date: 2026-05-09
Latest build(s): amd-ci #8371
Latest hardware status: mi300_4=fail
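The dates above bound when the regression was introduced. As a small worked example, the sketch below computes the gap between the last successful nightly and the first failure, and the length of the current streak. It assumes all dates are calendar days in the same timezone; the history does not say what happened on any date in between.

```python
from datetime import date, timedelta

# Dates copied from the history above.
last_success = date(2026, 5, 4)
first_failure = date(2026, 5, 6)
latest_nightly = date(2026, 5, 9)


def dates_strictly_between(start, end):
    """List the calendar dates after start and before end."""
    span = (end - start).days
    return [start + timedelta(days=offset) for offset in range(1, span)]


# Days separating the last pass from the first recorded failure.
regression_window_days = (first_failure - last_success).days

# Days the failure has persisted, up to the latest nightly.
streak_days = (latest_nightly - first_failure).days

# Dates with no result recorded in the history above.
unreported = dates_strictly_between(last_success, first_failure)

if __name__ == "__main__":
    print(regression_window_days, streak_days, unreported)
```

Any change merged after the 2026-05-04 nightly and before the 2026-05-06 nightly is a candidate, which narrows a bisect to roughly two days of commits.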
