vllm - [Bug]: CUDA graph capture crashes with hipErrorCapturedEvent on AMD ROCm when LoRA is enabled [3 comments, 3 participants]

GitHub stats
vllm-project/vllm#41622 · Fetched 2026-05-05 05:44:37
Comments: 3
Participants: 3
Timeline: 15
Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, commented ×3, labeled ×2

Error Message

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 0%| | 0/14 [00:00<?, ?it/s]
(Worker_TP0 pid=26771) WARNING 05-04 03:41:42 [utils.py:267] Using default LoRA kernel configs
[rank7]:[E504 03:41:43.748340662 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x700a65eddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x700a65eddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x700a34482d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x700a34492920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x700a34497b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x700a3449a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)
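The same trace repeats for every rank with only the addresses changing, so it can help to reduce the log to "which ranks failed, with which HIP error, through which call chain". The following is a minimal Python sketch of that reduction. The function names (`summarize`, `hip_error_name`, `first_call_chain`) and the `SAMPLE` constant are hypothetical helpers written for this page, not part of vLLM or PyTorch, and `SAMPLE` is an abridged excerpt of the rank 7 trace above (frames #3 and #4 only).

```python
import re

# Per-rank header printed by the ProcessGroupNCCL watchdog in the log above.
RANK_RE = re.compile(
    r"\[PG ID \d+ PG GUID \d+ Rank (\d+)\] Process group watchdog thread "
    r"terminated with exception: (.+?)(?= Search for |$)"
)
# One stack frame: "frame #N: <symbol> + 0xOFFSET (0xADDR in /path/lib.so)".
FRAME_RE = re.compile(
    r"frame #(\d+): (.+?) \+ 0x[0-9a-f]+ \(0x[0-9a-f]+ in ([^)]+)\)"
)

# Abridged excerpt of the rank 7 trace (hypothetical constant for the demo).
SAMPLE = (
    "[rank7]:[E504 03:41:43.748340662 ProcessGroupNCCL.cpp:2119] "
    "[PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated "
    "with exception: CUDA error: operation not permitted on an event last "
    "recorded in a capturing stream Search for `hipErrorCapturedEvent' in "
    "https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more "
    "information. frame #3: c10d::ProcessGroupNCCL::WorkNCCL::"
    "finishedGPUExecutionInternal() const + 0x56 (0x700a34482d96 in "
    "/home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/"
    "libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::"
    "isCompleted() + 0x70 (0x700a34492920 in "
    "/home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/"
    "libtorch_hip.so)"
)


def _flatten(log_text):
    # The scraped log wraps mid-frame, so collapse all whitespace first.
    return " ".join(log_text.split())


def summarize(log_text):
    """Map each failing rank to the first error message reported for it."""
    failures = {}
    for match in RANK_RE.finditer(_flatten(log_text)):
        failures.setdefault(int(match.group(1)), match.group(2).strip())
    return failures


def hip_error_name(log_text):
    """Return the HIP error identifier the log says to search for, if any."""
    match = re.search(r"Search for `(\w+)'", _flatten(log_text))
    return match.group(1) if match else None


def first_call_chain(log_text):
    """Return resolved symbols of the first trace, innermost frame first."""
    chain, last = [], -1
    for match in FRAME_RE.finditer(_flatten(log_text)):
        index = int(match.group(1))
        if index <= last:  # frame numbers restarted: a new trace begins
            break
        last = index
        symbol = match.group(2)
        if symbol != "<unknown function>":
            chain.append(symbol.split("(")[0])  # drop the argument list
    return chain


print(summarize(SAMPLE))
print(hip_error_name(SAMPLE))
print(first_call_chain(SAMPLE))
```

Run against the full log, the chain shows the failure surfacing in the NCCL watchdog thread (`Watchdog::runLoop` calling `WorkNCCL::isCompleted`, which queries an event) rather than in the thread performing the graph capture.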

terminate called after throwing an instance of 'c10::DistBackendError' [rank0]:[E504 03:41:43.748550299 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x7f27c16bdf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7f27c16bdafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f278de82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7f278de92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7f278de97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7f278de9a33f in 
/home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x700a65eddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x700a65eddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x700a34482d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x700a34492920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x700a34497b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x700a3449a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown 
function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xe0b5e1 (0x700a31a0b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #2: <unknown function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)

what(): [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x7f27c16bdf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7f27c16bdafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f278de82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7f278de92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7f278de97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7f278de9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xe0b5e1 (0x7f278b40b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #2: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank4]:[E504 03:41:43.749268130 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x79a2e8a88f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x79a2e8a88afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x79a2b5282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x79a2b5292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x79a2b5297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x79a2b529a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 
(0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' [rank1]:[E504 03:41:43.749451627 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x7df2832ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7df2832ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7df251882d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7df251892920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7df251897b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7df25189a33f in 
/home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' [rank6]:[E504 03:41:43.749530306 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x7cb0b7af0f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7cb0b7af0afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7cb084282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7cb084292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7cb084297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7cb08429a33f in 
/home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' [rank3]:[E504 03:41:43.749652165 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x77c188cddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x77c188cddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x77c157282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x77c157292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x77c157297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x77c15729a33f in 
/home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x79a2e8a88f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x79a2e8a88afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x79a2b5282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x79a2b5292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x79a2b5297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x79a2b529a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown 
function> + 0xecdb4 (0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xe0b5e1 (0x79a2b280b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #2: <unknown function> + 0xecdb4 (0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

what(): [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x7df2832ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7df2832ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7df251882d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7df251892920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7df251897b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7df25189a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xe0b5e1 (0x7df24ee0b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #2: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

what(): [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x7cb0b7af0f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7cb0b7af0afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7cb084282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7cb084292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7cb084297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7cb08429a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xe0b5e1 (0x7cb08180b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #2: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

what(): [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing AMD_SERIALIZE_KERNEL=3 Device-side assertion tracking was not enabled by user. Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0x4bf9a (0x77c188cddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x77c188cddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x77c157282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x77c157292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x77c157297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x77c15729a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #7: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in 
/lib/x86_64-linux-gnu/libstdc++.so.6) frame #8: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #9: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so) frame #1: <unknown function> + 0xe0b5e1 (0x77c15480b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so) frame #2: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6) frame #3: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[E504 03:41:43.750397375 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x75e64b2f4f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x75e64b2f4afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x75e617a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x75e617a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x75e617a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x75e617a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x75e64b2f4f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x75e64b2f4afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x75e617a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x75e617a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x75e617a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x75e617a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x75e61500b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E504 03:41:43.752109544 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7b73bb4ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7b73bb4ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b7389a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7b7389a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7b7389a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7b7389a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7b73bb4ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7b73bb4ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b7389a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7b7389a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7b7389a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7b7389a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7b738700b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

(EngineCore pid=26621) ERROR 05-04 03:41:44 [multiproc_executor.py:283] Worker proc VllmWorker-7 died unexpectedly, shutting down executor.
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] EngineCore failed to start.
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     super().__init__(
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     compilation_times: list[CompilationTimes] = self.collective_rpc(
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                                                 ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return future if non_block else future.result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                                     ^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return super().result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return self.__get_result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     raise self._exception
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     response = self.aggregate(self.get_response())
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 386, in get_response
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return next(self.gen)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 677, in acquire_read
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     raise RuntimeError("cancelled")
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] RuntimeError: cancelled
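
Every tensor-parallel rank prints a near-identical watchdog trace, so the log is easier to read once it is collapsed to one entry per rank. The sketch below is a hypothetical helper (the name `summarize_watchdog_errors` and both regular expressions are assumptions based only on the log lines shown in this issue, not part of vLLM or PyTorch). It records the first timestamp and the HIP error name reported by each rank.

```python
import re

# Per-rank header as it appears in this log, for example:
# "[rank7]:[E504 03:41:43.748340662 ProcessGroupNCCL.cpp:2119] ..."
RANK_HEADER = re.compile(r"\[rank(\d+)\]:\[E\d+ (\d{2}:\d{2}:\d{2}\.\d+) ")
# The HIP error name follows "Search for `" in the PyTorch message.
HIP_ERROR = re.compile(r"Search for `(hipError\w+)'")


def summarize_watchdog_errors(log_text: str) -> dict[int, dict[str, str]]:
    """Map each rank to the time and HIP error name of its first report."""
    summary: dict[int, dict[str, str]] = {}
    headers = list(RANK_HEADER.finditer(log_text))
    for i, header in enumerate(headers):
        rank = int(header.group(1))
        if rank in summary:
            continue  # keep only the first report per rank
        # Search for the error name only up to the next rank header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
        error = HIP_ERROR.search(log_text, header.end(), end)
        summary[rank] = {
            "time": header.group(2),
            "error": error.group(1) if error else "unknown",
        }
    return summary
```

Run against the log in this issue, each rank that reported maps to `hipErrorCapturedEvent`, which suggests a single failure shared by all workers rather than several independent ones.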

Fix Action

Fix / Workaround
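
The issue text captured here does not state a confirmed fix. Because the crash happens while CUDA graphs are being captured, one diagnostic step is to rerun the reported command with vLLM's `--enforce-eager` flag, which skips graph capture. This is an assumption about where to look, not a verified workaround for this issue, and eager mode usually costs throughput. A minimal sketch that rebuilds the reported command with the flag added (`with_eager_mode` is a hypothetical helper name):

```python
import shlex

# Arguments copied from the `vllm serve` invocation reported in this issue.
BASE_ARGS = [
    "vllm", "serve", "/home/arli/models/Llama-3.3-70B-Instruct-GPTQModel-Int8",
    "--port", "8000",
    "--scheduling-policy", "priority",
    "--max-num-seqs", "16",
    "-tp", "8",
    "--attention-backend", "TRITON_ATTN",
    "--dtype", "float16",
    "--gpu-memory-utilization", "0.95",
    "--max-model-len", "32768",
    "--max-num-batched-tokens", "1024",
    "--served-model-name", "Llama-3.3-70B-Instruct",
    "--enable-lora",
    "--max-lora-rank", "64",
    "--max-loras", "1",
    "--max-cpu-loras", "1",
    "--lora-modules",
    "test-lora=/home/arli/loras-70b/Llama-3.3-70B-ArliAI-RPMax-v3-LoRA",
]


def with_eager_mode(args: list[str]) -> list[str]:
    """Return a copy of args with --enforce-eager present exactly once."""
    if "--enforce-eager" in args:
        return list(args)
    return [*args, "--enforce-eager"]


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(with_eager_mode(BASE_ARGS)))
```

If the server starts in eager mode with LoRA enabled, that narrows the problem to graph capture rather than LoRA loading itself.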


Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : version 4.3.2
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+rocm7.2
Is debug build               : False
CUDA used to build PyTorch   : N/A
ROCM used to build PyTorch   : 7.2.26015
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.17.0-23-generic-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : AMD Radeon PRO W6800 (gfx1030)
Nvidia driver version        : Could not collect
cuDNN version                : Could not collect
HIP runtime version          : 7.2.26015
MIOpen runtime version       : 3.5.1
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  48
On-line CPU(s) list:                     0-47
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 7443 24-Core Processor
CPU family:                              25
Model:                                   1
Thread(s) per core:                      2
Core(s) per socket:                      24
Socket(s):                               1
Stepping:                                1
Frequency boost:                         enabled
CPU(s) scaling MHz:                      54%
CPU max MHz:                             4035.6440
CPU min MHz:                             1500.0000
BogoMIPS:                                5700.49
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization:                          AMD-V
L1d cache:                               768 KiB (24 instances)
L1i cache:                               768 KiB (24 instances)
L2 cache:                                12 MiB (24 instances)
L3 cache:                                128 MiB (4 instances)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-47
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Mitigation; Clear CPU buffers
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.1.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.1
[pip3] onnxscript==0.7.0
[pip3] onnxslim==0.1.92
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+rocm7.2
[pip3] torchvision==0.26.0+rocm7.2
[pip3] transformers==5.7.0
[pip3] triton==3.0.0+git0ec280cf
[pip3] triton-rocm==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : 7.2.53211-671d39a71e
vLLM Version                 : 0.1.dev16294+gd0fb16d6d (git sha: d0fb16d6d)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
  ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            40           40           40           40           40           40           40           
GPU1   40           0            40           40           40           40           40           40           
GPU2   40           40           0            40           40           40           40           40           
GPU3   40           40           40           0            40           40           40           40           
GPU4   40           40           40           40           0            40           40           40           
GPU5   40           40           40           40           40           0            40           40           
GPU6   40           40           40           40           40           40           0            40           
GPU7   40           40           40           40           40           40           40           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            2            2            2            2            2            2            2            
GPU1   2            0            2            2            2            2            2            2            
GPU2   2            2            0            2            2            2            2            2            
GPU3   2            2            2            0            2            2            2            2            
GPU4   2            2            2            2            0            2            2            2            
GPU5   2            2            2            2            2            0            2            2            
GPU6   2            2            2            2            2            2            0            2            
GPU7   2            2            2            2            2            2            2            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         
GPU1   PCIE         0            PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         
GPU2   PCIE         PCIE         0            PCIE         PCIE         PCIE         PCIE         PCIE         
GPU3   PCIE         PCIE         PCIE         0            PCIE         PCIE         PCIE         PCIE         
GPU4   PCIE         PCIE         PCIE         PCIE         0            PCIE         PCIE         PCIE         
GPU5   PCIE         PCIE         PCIE         PCIE         PCIE         0            PCIE         PCIE         
GPU6   PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         0            PCIE         
GPU7   PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: -1
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: -1
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: -1
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: -1
GPU[4]          : (Topology) Numa Node: 0
GPU[4]          : (Topology) Numa Affinity: -1
GPU[5]          : (Topology) Numa Node: 0
GPU[5]          : (Topology) Numa Affinity: -1
GPU[6]          : (Topology) Numa Node: 0
GPU[6]          : (Topology) Numa Affinity: -1
GPU[7]          : (Topology) Numa Node: 0
GPU[7]          : (Topology) Numa Affinity: -1
================================== End of ROCm SMI Log ===================================

==============================
     Environment Variables
==============================
PYTORCH_ROCM_ARCH=gfx1030
LD_LIBRARY_PATH=/opt/rocm-7.2.2/lib
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_arli

---

vllm serve /home/arli/models/Llama-3.3-70B-Instruct-GPTQModel-Int8 \
--port 8000 \
--scheduling-policy priority \
--max-num-seqs 16 \
-tp 8 \
--attention-backend TRITON_ATTN \
--dtype float16 \
--gpu-memory-utilization 0.95 --max-model-len 32768 \
--max-num-batched-tokens 1024 \
--served-model-name Llama-3.3-70B-Instruct \
--enable-lora --max-lora-rank 64 --max-loras 1 --max-cpu-loras 1 --lora-modules \
test-lora=/home/arli/loras-70b/Llama-3.3-70B-ArliAI-RPMax-v3-LoRA

---

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/14 [00:00<?, ?it/s]
(Worker_TP0 pid=26771) WARNING 05-04 03:41:42 [utils.py:267] Using default LoRA kernel configs
[rank7]:[E504 03:41:43.748340662 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x700a65eddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x700a65eddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x700a34482d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x700a34492920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x700a34497b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x700a3449a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank0]:[E504 03:41:43.748550299 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7f27c16bdf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7f27c16bdafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f278de82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7f278de92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7f278de97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7f278de9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x700a65eddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x700a65eddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x700a34482d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x700a34492920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x700a34497b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x700a3449a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x700a31a0b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7f27c16bdf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7f27c16bdafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f278de82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7f278de92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7f278de97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7f278de9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7f278b40b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank4]:[E504 03:41:43.749268130 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x79a2e8a88f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x79a2e8a88afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x79a2b5282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x79a2b5292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x79a2b5297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x79a2b529a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank1]:[E504 03:41:43.749451627 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7df2832ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7df2832ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7df251882d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7df251892920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7df251897b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7df25189a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank6]:[E504 03:41:43.749530306 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7cb0b7af0f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7cb0b7af0afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7cb084282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7cb084292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7cb084297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7cb08429a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank3]:[E504 03:41:43.749652165 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x77c188cddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x77c188cddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x77c157282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x77c157292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x77c157297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x77c15729a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x79a2e8a88f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x79a2e8a88afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x79a2b5282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x79a2b5292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x79a2b5297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x79a2b529a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x79a2b280b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7df2832ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7df2832ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7df251882d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7df251892920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7df251897b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7df25189a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7df24ee0b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7cb0b7af0f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7cb0b7af0afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7cb084282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7cb084292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7cb084297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7cb08429a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7cb08180b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x77c188cddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x77c188cddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x77c157282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x77c157292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x77c157297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x77c15729a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x77c15480b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[E504 03:41:43.750397375 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x75e64b2f4f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x75e64b2f4afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x75e617a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x75e617a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x75e617a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x75e617a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x75e64b2f4f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x75e64b2f4afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x75e617a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x75e617a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x75e617a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x75e617a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x75e61500b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E504 03:41:43.752109544 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7b73bb4ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7b73bb4ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b7389a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7b7389a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7b7389a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7b7389a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7b73bb4ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7b73bb4ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b7389a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7b7389a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7b7389a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7b7389a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7b738700b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

(EngineCore pid=26621) ERROR 05-04 03:41:44 [multiproc_executor.py:283] Worker proc VllmWorker-7 died unexpectedly, shutting down executor.
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] EngineCore failed to start.
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     super().__init__(
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     compilation_times: list[CompilationTimes] = self.collective_rpc(
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                                                 ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return future if non_block else future.result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                                     ^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return super().result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return self.__get_result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     raise self._exception
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     response = self.aggregate(self.get_response())
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 386, in get_response
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return next(self.gen)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 677, in acquire_read
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     raise RuntimeError("cancelled")
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] RuntimeError: cancelled
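
The watchdog message above is repeated once per rank, which makes it hard to see at a glance which ranks failed. The following is a minimal sketch, assuming the log has been saved as text, that extracts the rank number and exception message from each watchdog line. `failed_ranks`, `WATCHDOG_RE` and `SAMPLE` are hypothetical names for illustration, not part of vLLM or PyTorch; the sample lines are copied from the log above.

```python
import re

# Matches the per-rank header printed by the ProcessGroupNCCL watchdog, e.g.
# "[PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: ..."
WATCHDOG_RE = re.compile(
    r"\[PG ID \d+ PG GUID \d+ Rank (\d+)\] Process group watchdog thread "
    r"terminated with exception: (.+)"
)


def failed_ranks(log_text: str) -> dict[int, str]:
    """Map each rank to the first watchdog exception message it logged."""
    ranks: dict[int, str] = {}
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            # Keep only the first message per rank; later repeats are ignored.
            ranks.setdefault(int(match.group(1)), match.group(2).strip())
    return dict(sorted(ranks.items()))


# Three header lines taken from the log above (ranks 5 and 2).
SAMPLE = """\
  what():  [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
[rank2]:[E504 03:41:43.752109544 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
  what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
"""

if __name__ == "__main__":
    for rank, message in failed_ranks(SAMPLE).items():
        print(f"rank {rank}: {message}")
```

Running the helper over the full log should show whether every tensor-parallel rank hit the same error or only a subset did.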

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : version 4.3.2
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+rocm7.2
Is debug build               : False
CUDA used to build PyTorch   : N/A
ROCM used to build PyTorch   : 7.2.26015
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.17.0-23-generic-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : AMD Radeon PRO W6800 (gfx1030)
Nvidia driver version        : Could not collect
cuDNN version                : Could not collect
HIP runtime version          : 7.2.26015
MIOpen runtime version       : 3.5.1
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  48
On-line CPU(s) list:                     0-47
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 7443 24-Core Processor
CPU family:                              25
Model:                                   1
Thread(s) per core:                      2
Core(s) per socket:                      24
Socket(s):                               1
Stepping:                                1
Frequency boost:                         enabled
CPU(s) scaling MHz:                      54%
CPU max MHz:                             4035.6440
CPU min MHz:                             1500.0000
BogoMIPS:                                5700.49
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization:                          AMD-V
L1d cache:                               768 KiB (24 instances)
L1i cache:                               768 KiB (24 instances)
L2 cache:                                12 MiB (24 instances)
L3 cache:                                128 MiB (4 instances)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-47
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Mitigation; Clear CPU buffers
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.1.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.1
[pip3] onnxscript==0.7.0
[pip3] onnxslim==0.1.92
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+rocm7.2
[pip3] torchvision==0.26.0+rocm7.2
[pip3] transformers==5.7.0
[pip3] triton==3.0.0+git0ec280cf
[pip3] triton-rocm==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : 7.2.53211-671d39a71e
vLLM Version                 : 0.1.dev16294+gd0fb16d6d (git sha: d0fb16d6d)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
  ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            40           40           40           40           40           40           40           
GPU1   40           0            40           40           40           40           40           40           
GPU2   40           40           0            40           40           40           40           40           
GPU3   40           40           40           0            40           40           40           40           
GPU4   40           40           40           40           0            40           40           40           
GPU5   40           40           40           40           40           0            40           40           
GPU6   40           40           40           40           40           40           0            40           
GPU7   40           40           40           40           40           40           40           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            2            2            2            2            2            2            2            
GPU1   2            0            2            2            2            2            2            2            
GPU2   2            2            0            2            2            2            2            2            
GPU3   2            2            2            0            2            2            2            2            
GPU4   2            2            2            2            0            2            2            2            
GPU5   2            2            2            2            2            0            2            2            
GPU6   2            2            2            2            2            2            0            2            
GPU7   2            2            2            2            2            2            2            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         
GPU1   PCIE         0            PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         
GPU2   PCIE         PCIE         0            PCIE         PCIE         PCIE         PCIE         PCIE         
GPU3   PCIE         PCIE         PCIE         0            PCIE         PCIE         PCIE         PCIE         
GPU4   PCIE         PCIE         PCIE         PCIE         0            PCIE         PCIE         PCIE         
GPU5   PCIE         PCIE         PCIE         PCIE         PCIE         0            PCIE         PCIE         
GPU6   PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         0            PCIE         
GPU7   PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         PCIE         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: -1
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: -1
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: -1
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: -1
GPU[4]          : (Topology) Numa Node: 0
GPU[4]          : (Topology) Numa Affinity: -1
GPU[5]          : (Topology) Numa Node: 0
GPU[5]          : (Topology) Numa Affinity: -1
GPU[6]          : (Topology) Numa Node: 0
GPU[6]          : (Topology) Numa Affinity: -1
GPU[7]          : (Topology) Numa Node: 0
GPU[7]          : (Topology) Numa Affinity: -1
================================== End of ROCm SMI Log ===================================

==============================
     Environment Variables
==============================
PYTORCH_ROCM_ARCH=gfx1030
LD_LIBRARY_PATH=/opt/rocm-7.2.2/lib
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_arli
</details>

🐛 Describe the bug

When running vLLM on AMD ROCm GPUs with LoRA enabled and tensor parallelism > 1, the engine crashes during CUDA graph capture with hipErrorCapturedEvent across all ranks. This happens for any model with LoRA enabled. The same configuration on NVIDIA CUDA GPUs starts without error.

Passing --enforce-eager lets the server start, but performance is predictably terrible without graph capture. Running the same models without LoRA also starts without error.

vLLM commit 6ec9bbec384b14401f901af189754f7d8a6754e2, installed from source.

Start command used:

vllm serve /home/arli/models/Llama-3.3-70B-Instruct-GPTQModel-Int8 \
--port 8000 \
--scheduling-policy priority \
--max-num-seqs 16 \
-tp 8 \
--attention-backend TRITON_ATTN \
--dtype float16 \
--gpu-memory-utilization 0.95 --max-model-len 32768 \
--max-num-batched-tokens 1024 \
--served-model-name Llama-3.3-70B-Instruct \
--enable-lora --max-lora-rank 64 --max-loras 1 --max-cpu-loras 1 --lora-modules \
test-lora=/home/arli/loras-70b/Llama-3.3-70B-ArliAI-RPMax-v3-LoRA
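
The description above reports that `--enforce-eager` avoids the crash by skipping graph capture. The following is a minimal sketch, assuming the server is launched from a Python wrapper, of adding that flag to an existing command exactly once. `with_enforce_eager` and `REPORTED` are hypothetical names for illustration, and `REPORTED` is a shortened, single-line form of the command above.

```python
import shlex


def with_enforce_eager(command: str) -> list[str]:
    """Return the argv for `command` with --enforce-eager present exactly once."""
    argv = shlex.split(command)
    if "--enforce-eager" not in argv:
        argv.append("--enforce-eager")
    return argv


# Shortened form of the reported command (assumption: remaining flags unchanged).
REPORTED = (
    "vllm serve /home/arli/models/Llama-3.3-70B-Instruct-GPTQModel-Int8 "
    "-tp 8 --enable-lora --max-lora-rank 64 --max-loras 1"
)

if __name__ == "__main__":
    # Print the command that could be run as a temporary workaround.
    print(shlex.join(with_enforce_eager(REPORTED)))
```

This only works around the crash at the cost of the performance loss described above; it does not address the underlying capture error.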

Crash dump:

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/14 [00:00<?, ?it/s]
(Worker_TP0 pid=26771) WARNING 05-04 03:41:42 [utils.py:267] Using default LoRA kernel configs
[rank7]:[E504 03:41:43.748340662 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x700a65eddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x700a65eddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x700a34482d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x700a34492920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x700a34497b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x700a3449a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank0]:[E504 03:41:43.748550299 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7f27c16bdf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7f27c16bdafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f278de82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7f278de92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7f278de97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7f278de9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 7] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x700a65eddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x700a65eddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x700a34482d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x700a34492920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x700a34497b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x700a3449a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x700a67970648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x700a31a0b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7009b8cecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x700a6949caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x700a69529c6c in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7f27c16bdf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7f27c16bdafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f278de82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7f278de92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7f278de97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7f278de9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f27c1370648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7f278b40b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7f27126ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7f27c2e9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7f27c2f29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank4]:[E504 03:41:43.749268130 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x79a2e8a88f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x79a2e8a88afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x79a2b5282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x79a2b5292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x79a2b5297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x79a2b529a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank1]:[E504 03:41:43.749451627 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7df2832ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7df2832ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7df251882d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7df251892920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7df251897b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7df25189a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank6]:[E504 03:41:43.749530306 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7cb0b7af0f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7cb0b7af0afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7cb084282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7cb084292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7cb084297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7cb08429a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[rank3]:[E504 03:41:43.749652165 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x77c188cddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x77c188cddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x77c157282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x77c157292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x77c157297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x77c15729a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 4] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x79a2e8a88f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x79a2e8a88afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x79a2b5282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x79a2b5292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x79a2b5297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x79a2b529a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x79a2e8770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x79a2b280b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x79a239aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x79a2ea29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x79a2ea329c6c in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7df2832ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7df2832ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7df251882d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7df251892920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7df251897b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7df25189a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7df284d70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7df24ee0b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7df1d60ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7df28689caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7df286929c6c in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 2 PG GUID 3 Rank 6] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7cb0b7af0f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7cb0b7af0afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7cb084282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7cb084292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7cb084297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7cb08429a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7cb0b7770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7cb08180b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7cb008aecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7cb0b929caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7cb0b9329c6c in /lib/x86_64-linux-gnu/libc.so.6)

  what():  [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x77c188cddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x77c188cddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x77c157282d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x77c157292920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x77c157297b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x77c15729a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x77c18a770648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x77c15480b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x77c0dbaecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x77c18c29caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x77c18c329c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[E504 03:41:43.750397375 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x75e64b2f4f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x75e64b2f4afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x75e617a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x75e617a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x75e617a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x75e617a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 5] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x75e64b2f4f9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x75e64b2f4afd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x75e617a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x75e617a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x75e617a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x75e617a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75e64af70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x75e61500b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x75e59c2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x75e64ca9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x75e64cb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E504 03:41:43.752109544 ProcessGroupNCCL.cpp:2119] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7b73bb4ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7b73bb4ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b7389a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7b7389a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7b7389a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7b7389a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: operation not permitted on an event last recorded in a capturing stream
Search for `hipErrorCapturedEvent' in https://rocm.docs.amd.com/projects/HIP/en/latest/index.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Device-side assertion tracking was not enabled by user.
Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x4bf9a (0x7b73bb4ddf9a in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1fd (0x7b73bb4ddafd in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7b7389a82d96 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7b7389a92920 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xadc (0x7b7389a97b8c in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0x14f (0x7b7389a9a33f in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2125 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7b73bcf70648 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe0b5e1 (0x7b738700b5e1 in /home/arli/vllm-amd/vllm/.venv/lib/python3.12/site-packages/torch/lib/libtorch_hip.so)
frame #2: <unknown function> + 0xecdb4 (0x7b730e2ecdb4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9caa4 (0x7b73bea9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x129c6c (0x7b73beb29c6c in /lib/x86_64-linux-gnu/libc.so.6)

(EngineCore pid=26621) ERROR 05-04 03:41:44 [multiproc_executor.py:283] Worker proc VllmWorker-7 died unexpectedly, shutting down executor.
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] EngineCore failed to start.
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     super().__init__(
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/engine/core.py", line 283, in _initialize_kv_caches
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/abstract.py", line 124, in initialize_from_config
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     compilation_times: list[CompilationTimes] = self.collective_rpc(
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                                                 ^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 403, in collective_rpc
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return future if non_block else future.result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                                     ^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 90, in result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return super().result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return self.__get_result()
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     raise self._exception
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 94, in _wait_for_response
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     response = self.aggregate(self.get_response())
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/v1/executor/multiproc_executor.py", line 386, in get_response
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     return next(self.gen)
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]            ^^^^^^^^^^^^^^
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]   File "/home/arli/vllm-amd/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 677, in acquire_read
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136]     raise RuntimeError("cancelled")
(EngineCore pid=26621) ERROR 05-04 03:41:51 [core.py:1136] RuntimeError: cancelled


extent analysis

TL;DR

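During CUDA graph capture on ROCm with LoRA enabled, the ProcessGroupNCCL watchdog thread terminates with hipErrorCapturedEvent ("operation not permitted on an event last recorded in a capturing stream"). The affected workers die and EngineCore fails to start. The likely cause is a compatibility problem between PyTorch, ROCm, and the specific GPU architecture; the crash is reported when running vLLM with LoRA enabled and tensor parallelism greater than 1.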

Guidance

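  • Check that the installed PyTorch and ROCm versions are compatible. The error is raised while CUDA graphs are being captured, when the NCCL watchdog queries an event that was last recorded in a capturing stream ("Exception raised from query at /pytorch/c10/hip/HIPEvent.h:112").
  • Verify that the GPU architecture (AMD Radeon PRO W6800) is supported by the current PyTorch and ROCm versions.
  • Set the AMD_SERIALIZE_KERNEL=3 environment variable while debugging, as the error message suggests. The log warns that kernel errors may be reported asynchronously, so the stack trace may otherwise point at the wrong call.
  • Disable LoRA or reduce the tensor parallelism, one change at a time, to see which one makes the crash go away.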
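
The guidance above amounts to a small bisection: rerun the same launch with one setting changed at a time. The sketch below only builds the command lines to try; it does not run them. The model name and tensor-parallel size are hypothetical placeholders, and the helper names are invented for illustration. The `--enforce-eager` variant is an assumption that is not mentioned in the issue: it is the vLLM flag that skips CUDA graph capture, so it may help confirm whether capture is the trigger.

```python
import shlex

# Hypothetical placeholders: substitute the model and TP size from your own launch.
MODEL = "your-model"
BASELINE_TP = 8

# Debug variable suggested by the error message in the log.
DEBUG_ENV = "AMD_SERIALIZE_KERNEL=3"


def build_command(enable_lora: bool, tp_size: int, enforce_eager: bool = False) -> str:
    """Return one shell command line for a single bisection run."""
    args = ["vllm", "serve", MODEL, "--tensor-parallel-size", str(tp_size)]
    if enable_lora:
        args.append("--enable-lora")
    if enforce_eager:
        # Assumption (not from the issue): skip CUDA graph capture entirely.
        args.append("--enforce-eager")
    return DEBUG_ENV + " " + shlex.join(args)


def bisect_matrix() -> dict:
    """Change one setting per run so the trigger can be isolated."""
    return {
        "baseline": build_command(enable_lora=True, tp_size=BASELINE_TP),
        "no_lora": build_command(enable_lora=False, tp_size=BASELINE_TP),
        "tp1": build_command(enable_lora=True, tp_size=1),
        "eager": build_command(enable_lora=True, tp_size=BASELINE_TP, enforce_eager=True),
    }


if __name__ == "__main__":
    for name, command in bisect_matrix().items():
        print(f"{name}: {command}")
```

If only the baseline run crashes, the combination of LoRA, tensor parallelism, and graph capture is the trigger, which is worth stating explicitly in the bug report.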

Example

The issue does not include a code-level fix; the failure looks like a configuration or compatibility problem rather than an error in user code.

Notes

The stack traces place the failure in c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal(), called from the watchdog's runLoop(), not in vLLM's Python code. The EngineCore traceback that ends in "RuntimeError: cancelled" is a downstream effect of a worker process dying, not the root cause. The same configuration reportedly works on NVIDIA CUDA GPUs, which suggests the problem is specific to the AMD ROCm path.
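
When triaging a log like this one, it can help to confirm that every failing rank reports the same watchdog error rather than several different ones. The following is a minimal sketch that groups the "[rankN]" watchdog lines by error message. It assumes the line layout seen in this log; the function name is hypothetical and not part of vLLM or PyTorch.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [rank2]:[E504 03:41:43.752109544 ProcessGroupNCCL.cpp:2119] [PG ID 2 ...]
#   Process group watchdog thread terminated with exception: CUDA error: ...
WATCHDOG_RE = re.compile(
    r"^\[rank(?P<rank>\d+)\]:\[E\d+ [\d:.]+ ProcessGroupNCCL\.cpp:\d+\].*?"
    r"terminated with exception: (?P<error>.+)$"
)


def summarize_watchdog_errors(lines):
    """Map each watchdog error message to the sorted list of ranks that hit it."""
    ranks_by_error = defaultdict(set)
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if match:
            ranks_by_error[match.group("error")].add(int(match.group("rank")))
    return {error: sorted(ranks) for error, ranks in ranks_by_error.items()}
```

Calling it on the lines of a saved log file returns one entry per distinct error. A single entry that covers all failing ranks points to one shared cause instead of independent per-rank failures.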

Recommendation

As a workaround, disable LoRA or reduce the tensor parallelism and check whether the crash persists. If it goes away, that points to a compatibility issue in the ROCm graph-capture path that still needs to be addressed.
