vllm: [CI Failure]: mi300_4: Distributed Torchrun + Examples (4 GPUs)

Error Message

RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

Name of failing test

(command rocm-smi || true) &&
export VLLM_TEST_GROUP_NAME=mi300_4-distributed-torchrun---examples-4-gpus &&
export VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 &&
cd /vllm-workspace/tests &&
torchrun --nproc-per-node=4 distributed/test_torchrun_example.py &&
PP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py &&
TP_SIZE=4 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py &&
PP_SIZE=2 TP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py &&
DP_SIZE=4 ENABLE_EP=1 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py &&
TP_SIZE=2 DP_SIZE=2 ENABLE_EP=1 torchrun --nproc-per-node=4 distributed/test_torchrun_example_moe.py &&
python3 ../examples/features/data_parallel/data_parallel_offline.py --enforce-eager &&
VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 ../examples/rl/rlhf_nccl.py &&
VLLM_ALLOW_INSECURE_SERIALIZATION=1 python3 ../examples/rl/rlhf_ipc.py
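
The error message itself suggests rerunning with NCCL_DEBUG=INFO. Because the steps are chained with `&&`, the chain stops at the first failing step, so it may help to rerun individual steps with that variable set. The following is a minimal sketch, assuming a POSIX-compatible shell inside the CI container; `run_with_nccl_debug` is a hypothetical helper name, not part of vLLM or its CI scripts.

```shell
# Hypothetical helper (not part of vLLM): run a single command with
# NCCL_DEBUG=INFO set only for that command's environment.
run_with_nccl_debug() {
  NCCL_DEBUG=INFO "$@"
}

# Example usage, run from /vllm-workspace/tests (commented out so that
# sourcing this snippet does not launch any GPU job):
#   run_with_nccl_debug torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
#   run_with_nccl_debug env PP_SIZE=2 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
```

Passing per-step variables such as PP_SIZE through `env` keeps them scoped to the single step, which mirrors the prefix assignments in the original command.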

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

https://buildkite.com/vllm/amd-ci/builds/8371/canvas?sid=019e0bf7-4aec-42bb-85d9-00364d4e4194&tab=output

📝 History of failing test

  • Current streak start: 2026-05-06
  • First failure in 60d window: 2026-05-06
  • Last successful nightly: 2026-05-04
  • Break frequency (60d, pass↔fail flips): 1
  • Latest nightly date: 2026-05-09
  • Latest build(s): amd-ci #8371
  • Latest hardware status: mi300_4=fail
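
The report does not say how the "pass↔fail flips" figure is computed. One plausible reading, shown here as a sketch under that assumption, is the number of times consecutive nightly results differ. `count_flips` is a hypothetical helper, and the status sequence in the usage comment is illustrative rather than the actual nightly data.

```shell
# Hypothetical helper: count transitions between consecutive statuses.
# Arguments are an ordered list of results, oldest first (e.g. pass/fail).
count_flips() {
  prev=""
  flips=0
  for status in "$@"; do
    # A flip is any result that differs from the one before it.
    if [ -n "$prev" ] && [ "$status" != "$prev" ]; then
      flips=$((flips + 1))
    fi
    prev=$status
  done
  echo "$flips"
}

# Illustrative sequence: passing nightlies followed by a failing streak.
#   count_flips pass pass fail fail fail   # prints 1
```

Under this reading, a single flip with an ongoing failure streak would point to a regression that started at one point in time rather than an intermittent failure.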
