pytorch: [ROCm] Segfault in matmul backward with Kineto / rocprofiler-sdk (regression from kineto@dadc7be) [2 comments, 1 participant]

GitHub issue: pytorch/pytorch#182719 (fetched 2026-05-07 03:30:32)

Running a minimal matmul + backward loop on HIP/ROCm segfaults. A wheel/source bisect implicates the Kineto bump that introduced rocprofiler-sdk (dadc7be / kineto#1249); the parent revision does not reproduce the crash.

In our tests, the issue appears to occur on MI350.

Error logs

Thread 132 "pt_autograd_0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff4713ff6c0 (LWP 236073)]
0x00007ff47162eb00 in ?? ()
(gdb) bt
#0  0x00007ff47162eb00 in ?? ()
#1  0x00007fffa438a179 in rocr::core::InterceptQueue::StoreRelaxed(long) () from /opt/rocm/lib/libhsa-runtime64.so.1
#2  0x00007fffa437ad31 in rocr::HSA::hsa_signal_store_screlease(hsa_signal_s, long) () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007fffa94cfb5f in ?? () from /opt/rocm/lib/libamdhip64.so.7
#4  0x00007fffa94cd566 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#5  0x00007fffa94cdaed in ?? () from /opt/rocm/lib/libamdhip64.so.7
#6  0x00007fffa9491bb3 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#7  0x00007fffa93143b2 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#8  0x00007fffa9316cb5 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#9  0x00007fffa8903666 in auto rocprofiler::hip::hip_api_impl<1ul, 388ul>::exec<hipError_t (*&)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int), ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int>(hipError_t (*&)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int), ihipModuleSymbol_t*&&, unsigned int&&, unsigned int&&, unsigned int&&, unsigned int&&, unsigned int&&, unsigned int&&, unsigned long&&, ihipStream_t*&&, void**&&, void**&&, ihipEvent_t*&&, ihipEvent_t*&&, unsigned int&&) ()
   from /opt/rocm/lib/librocprofiler-sdk.so.1
#10 0x00007fffa89029c5 in hipError_t rocprofiler::hip::hip_api_impl<1ul, 388ul>::functor<hipError_t, ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int>(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int) () from /opt/rocm/lib/librocprofiler-sdk.so.1
#11 0x00007fffa8a97cc9 in rocprofiler::hip::stream::create_read_functor<1ul, 388ul, hipError_t, ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int, hipError_t (*)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int)>(hipError_t (*)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int))::{lambda(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int)#1}::operator()(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int) const () from /opt/rocm/lib/librocprofiler-sdk.so.1
#12 0x00007fffa8a97946 in rocprofiler::hip::stream::create_read_functor<1ul, 388ul, hipError_t, ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int, hipError_t (*)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int)>(hipError_t (*)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int))::{lambda(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int)#1}::__invoke(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int) () from /opt/rocm/lib/librocprofiler-sdk.so.1
#13 0x00007fffa94291f3 in hipExtModuleLaunchKernel () from /opt/rocm/lib/libamdhip64.so.7
#14 0x00007fffa4dce5c1 in TensileLite::hip::SolutionAdapter::launchKernel(TensileLite::KernelInvocation const&, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, bool) () from /opt/rocm/lib/libhipblaslt.so.1
#15 0x00007fffa4dcf20e in TensileLite::hip::SolutionAdapter::launchKernels(std::vector<TensileLite::KernelInvocation, std::allocator<TensileLite::KernelInvocation> > const&, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, bool) () from /opt/rocm/lib/libhipblaslt.so.1
#16 0x00007fffa4e84896 in runContractionProblem(_rocblaslt_handle*, _rocblaslt_matmul_algo const*, RocblasltContractionProblem const&, std::shared_ptr<void>) () from /opt/rocm/lib/libhipblaslt.so.1
#17 0x00007fffa4e76d0c in rocblaslt_matmul_impl () from /opt/rocm/lib/libhipblaslt.so.1
#18 0x00007fffa4e7c425 in rocblaslt_matmul () from /opt/rocm/lib/libhipblaslt.so.1
#19 0x00007fffa4f6d85e in hipblasLtMatmul () from /opt/rocm/lib/libhipblaslt.so.1
#20 0x00007fffe1ff7b05 in bool at::cuda::blas::bgemm_internal_cublaslt<float, float>(char, char, long, long, long, at::OpMathType<float>::type, float const*, long, long, float const*, long, long, at::OpMathType<float>::type, float*, long, long, long) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#21 0x00007fffe200245b in void at::cuda::blas::gemm_internal<float, float>(char, char, long, long, long, at::OpMathType<float>::type, float const*, long, float const*, long, at::OpMathType<float>::type, float*, long) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#22 0x00007fffe20ec84c in at::native::(anonymous namespace)::addmm_out_cuda_impl(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, at::native::(anonymous namespace)::Activation, bool) [clone .isra.0] ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#23 0x00007fffe20eee67 in at::native::structured_mm_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#24 0x00007fffe256b095 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CUDA_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#25 0x00007fffe9b1a790 in at::_ops::mm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fffec65977b in torch::autograd::VariableType::(anonymous namespace)::mm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007fffec659eba in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007fffe9bb339a in at::_ops::mm::call(at::Tensor const&, at::Tensor const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#29 0x00007fffedc4b910 in torch::autograd::generated::details::mm_mat1_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::Layout, c10::Scalar const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#30 0x00007fffec029845 in torch::autograd::generated::MmBackward0_apply_functional(std::vector<at::Tensor, std::allocator<at::Tensor> >&&, std::array<bool, 2ul>, at::Tensor&, c10::Layout&, std::vector<c10::SymInt, std::allocator<c10::SymInt> >&, std::vector<c10::SymInt, std::allocator<c10::SymInt> >&, at::Tensor&, c10::Layout&, std::vector<c10::SymInt, std::allocator<c10::SymInt> >&, std::vector<c10::SymInt, std::allocator<c10::SymInt> >&) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#31 0x00007fffec029a05 in torch::autograd::generated::MmBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#32 0x00007fffed0778d8 in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#33 0x00007fffed070b59 in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#34 0x00007fffed071b72 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#35 0x00007fffed066364 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#36 0x00007ffff666e9c6 in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_python.so
#37 0x00007fffa8dfddb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#38 0x00007ffff7d11aa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#39 0x00007ffff7d9ec6c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb)

Code Example

import sys
import torch
from contextlib import nullcontext

USE_PROFILER = "--use-profiler" in sys.argv

a = torch.rand(4096, 256, device="cuda").t().requires_grad_(True)
b = torch.rand(4096, 4096, device="cuda").requires_grad_(True)

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) if USE_PROFILER else nullcontext():
    loss = torch.matmul(a, b).mean()
    for _ in range(10000):
        loss.backward(retain_graph=True)
    torch.cuda.synchronize()


🐛 Describe the bug

Summary

Segfault when running a minimal matmul + backward loop on HIP/ROCm. Wheel/source bisect implicates the Kineto bump that introduced rocprofiler-sdk (dadc7be / kineto#1249); parent revision does not repro.

Issue seems to happen on MI350 according to our tests.

Repro

Save as repro.py and run (profiler flag optional):

import sys
import torch
from contextlib import nullcontext

USE_PROFILER = "--use-profiler" in sys.argv

a = torch.rand(4096, 256, device="cuda").t().requires_grad_(True)
b = torch.rand(4096, 4096, device="cuda").requires_grad_(True)

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) if USE_PROFILER else nullcontext():
    loss = torch.matmul(a, b).mean()
    for _ in range(10000):
        loss.backward(retain_graph=True)
    torch.cuda.synchronize()
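
When checking many builds during a bisect, it can help to run the repro in a child process and turn its exit status into a pass/fail verdict, so that a segfault does not take down the calling script. The sketch below is one way to do that, not the procedure used in the issue; `describe_exit` and `run_repro` are hypothetical helper names, and the script path `repro.py` follows the instruction above. It relies on the POSIX behaviour of Python's `subprocess` module, where a child killed by a signal reports the negative signal number as its return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a short verdict."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, death by signal is reported as the negative signal number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


def run_repro(script: str = "repro.py", use_profiler: bool = False) -> str:
    """Run the repro in a child process and describe how it ended."""
    cmd = [sys.executable, script]
    if use_profiler:
        # Same flag the repro script checks for in sys.argv.
        cmd.append("--use-profiler")
    return describe_exit(subprocess.run(cmd).returncode)
```

Calling `run_repro()` or `run_repro(use_profiler=True)` from a bisect driver then gives one line per build; a crashing build would be expected to report `killed by SIGSEGV`.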

Error logs

Thread 132 "pt_autograd_0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff4713ff6c0 (LWP 236073)]
0x00007ff47162eb00 in ?? ()
(gdb) bt
#0  0x00007ff47162eb00 in ?? ()
#1  0x00007fffa438a179 in rocr::core::InterceptQueue::StoreRelaxed(long) () from /opt/rocm/lib/libhsa-runtime64.so.1
#2  0x00007fffa437ad31 in rocr::HSA::hsa_signal_store_screlease(hsa_signal_s, long) () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007fffa94cfb5f in ?? () from /opt/rocm/lib/libamdhip64.so.7
#4  0x00007fffa94cd566 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#5  0x00007fffa94cdaed in ?? () from /opt/rocm/lib/libamdhip64.so.7
#6  0x00007fffa9491bb3 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#7  0x00007fffa93143b2 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#8  0x00007fffa9316cb5 in ?? () from /opt/rocm/lib/libamdhip64.so.7
#9  0x00007fffa8903666 in auto rocprofiler::hip::hip_api_impl<1ul, 388ul>::exec<hipError_t (*&)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int), ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int>(hipError_t (*&)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int), ihipModuleSymbol_t*&&, unsigned int&&, unsigned int&&, unsigned int&&, unsigned int&&, unsigned int&&, unsigned int&&, unsigned long&&, ihipStream_t*&&, void**&&, void**&&, ihipEvent_t*&&, ihipEvent_t*&&, unsigned int&&) ()
   from /opt/rocm/lib/librocprofiler-sdk.so.1
#10 0x00007fffa89029c5 in hipError_t rocprofiler::hip::hip_api_impl<1ul, 388ul>::functor<hipError_t, ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int>(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int) () from /opt/rocm/lib/librocprofiler-sdk.so.1
#11 0x00007fffa8a97cc9 in rocprofiler::hip::stream::create_read_functor<1ul, 388ul, hipError_t, ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int, hipError_t (*)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int)>(hipError_t (*)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int))::{lambda(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int)#1}::operator()(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int) const () from /opt/rocm/lib/librocprofiler-sdk.so.1
#12 0x00007fffa8a97946 in rocprofiler::hip::stream::create_read_functor<1ul, 388ul, hipError_t, ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int, hipError_t (*)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int)>(hipError_t (*)(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int))::{lambda(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int)#1}::__invoke(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int) () from /opt/rocm/lib/librocprofiler-sdk.so.1
#13 0x00007fffa94291f3 in hipExtModuleLaunchKernel () from /opt/rocm/lib/libamdhip64.so.7
#14 0x00007fffa4dce5c1 in TensileLite::hip::SolutionAdapter::launchKernel(TensileLite::KernelInvocation const&, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, bool) () from /opt/rocm/lib/libhipblaslt.so.1
#15 0x00007fffa4dcf20e in TensileLite::hip::SolutionAdapter::launchKernels(std::vector<TensileLite::KernelInvocation, std::allocator<TensileLite::KernelInvocation> > const&, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, bool) () from /opt/rocm/lib/libhipblaslt.so.1
#16 0x00007fffa4e84896 in runContractionProblem(_rocblaslt_handle*, _rocblaslt_matmul_algo const*, RocblasltContractionProblem const&, std::shared_ptr<void>) () from /opt/rocm/lib/libhipblaslt.so.1
#17 0x00007fffa4e76d0c in rocblaslt_matmul_impl () from /opt/rocm/lib/libhipblaslt.so.1
#18 0x00007fffa4e7c425 in rocblaslt_matmul () from /opt/rocm/lib/libhipblaslt.so.1
#19 0x00007fffa4f6d85e in hipblasLtMatmul () from /opt/rocm/lib/libhipblaslt.so.1
#20 0x00007fffe1ff7b05 in bool at::cuda::blas::bgemm_internal_cublaslt<float, float>(char, char, long, long, long, at::OpMathType<float>::type, float const*, long, long, float const*, long, long, at::OpMathType<float>::type, float*, long, long, long) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#21 0x00007fffe200245b in void at::cuda::blas::gemm_internal<float, float>(char, char, long, long, long, at::OpMathType<float>::type, float const*, long, float const*, long, at::OpMathType<float>::type, float*, long) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#22 0x00007fffe20ec84c in at::native::(anonymous namespace)::addmm_out_cuda_impl(at::Tensor&, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, at::native::(anonymous namespace)::Activation, bool) [clone .isra.0] ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#23 0x00007fffe20eee67 in at::native::structured_mm_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#24 0x00007fffe256b095 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::wrapper_CUDA_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_hip.so
#25 0x00007fffe9b1a790 in at::_ops::mm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fffec65977b in torch::autograd::VariableType::(anonymous namespace)::mm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007fffec659eba in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::mm>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#28 0x00007fffe9bb339a in at::_ops::mm::call(at::Tensor const&, at::Tensor const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#29 0x00007fffedc4b910 in torch::autograd::generated::details::mm_mat1_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::Layout, c10::Scalar const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#30 0x00007fffec029845 in torch::autograd::generated::MmBackward0_apply_functional(std::vector<at::Tensor, std::allocator<at::Tensor> >&&, std::array<bool, 2ul>, at::Tensor&, c10::Layout&, std::vector<c10::SymInt, std::allocator<c10::SymInt> >&, std::vector<c10::SymInt, std::allocator<c10::SymInt> >&, at::Tensor&, c10::Layout&, std::vector<c10::SymInt, std::allocator<c10::SymInt> >&, std::vector<c10::SymInt, std::allocator<c10::SymInt> >&) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#31 0x00007fffec029a05 in torch::autograd::generated::MmBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#32 0x00007fffed0778d8 in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#33 0x00007fffed070b59 in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#34 0x00007fffed071b72 in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#35 0x00007fffed066364 in torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) ()
   from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_cpu.so
#36 0x00007ffff666e9c6 in torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) () from /opt/venv/lib/python3.13/site-packages/torch/lib/libtorch_python.so
#37 0x00007fffa8dfddb4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#38 0x00007ffff7d11aa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#39 0x00007ffff7d9ec6c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
(gdb)
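
The backtrace above was captured with the gdb pager enabled, so pager prompts can land between frames and even in the middle of a symbol name. Running `set pagination off` in gdb before `bt` avoids this. For a log that has already been captured, the following is a minimal sketch of a cleanup step, assuming the log is available as plain text. The helper names `clean_backtrace` and `libraries_by_frame` are hypothetical and are not part of gdb, PyTorch, or ROCm.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# The exact prompt text that gdb's pager inserts into captured output.
PAGER = "--Type <RET> for more, q to quit, c to continue without paging--"
FRAME_START = re.compile(r"^#\d+\s+")
FROM_LIB = re.compile(r"\bfrom (\S+)$")


def clean_backtrace(text: str) -> list[str]:
    """Drop pager prompts and join each frame onto a single line."""
    # Remove the prompt together with the line break it introduced, so a
    # symbol that was split mid-name is stitched back together.
    text = text.replace(PAGER + "\n", "").replace(PAGER, "")
    frames: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("(gdb)"):
            continue
        if FRAME_START.match(line) or not frames:
            frames.append(line)
        else:
            # Continuation line, e.g. "from /path/to/lib.so".
            frames[-1] += " " + line
    return frames


def libraries_by_frame(frames: list[str]) -> Counter:
    """Count how many frames come from each shared library."""
    counts: Counter = Counter()
    for frame in frames:
        match = FROM_LIB.search(frame)
        name = PurePosixPath(match.group(1)).name if match else "<unknown>"
        counts[name] += 1
    return counts
```

Applied to the full log, the per-library counts give a quick view of how the call passes from `libtorch_cpu.so` and `libtorch_hip.so` through `libhipblaslt.so.1` and `librocprofiler-sdk.so.1` before reaching the HIP and HSA runtimes.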

Versions

PyTorch version: 2.10.0+gitc19ca7a
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 7.12.60610

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.39

Python version: 3.13.12 (main, Feb 4 2026, 09:25:39) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-144-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Radeon Graphics (gfx950)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.12.60610
MIOpen runtime version: 3.5.1
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] mypy==1.16.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.1.2
[pip3] onnx==1.19.1
[pip3] onnx-ir==0.1.12
[pip3] onnxscript==0.5.4
[pip3] optree==0.13.0
[pip3] torch==2.10.0+gitc19ca7a
[pip3] torchaudio==2.10.0+rocm7.12.0rc1
[pip3] torchvision==0.25.0+rocm7.12.0rc1
[pip3] triton==3.6.0+rocm7.12.0rc1
[conda] Could not collect

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93 @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @robieta @chaekit @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise @mwootton @divyanshk @jiannanWang @scotts @ryanzhang22 @chauhang @penguinwu
