pytorch - ✅(Solved) Fix NCCLComm::abort() crashes when deregistering window-registered (symmetric) memory segments [1 pull requests, 1 participants]

pytorch/pytorch#181610 (fetched 2026-04-28 06:24:31)

Error Message

[rank0]:[W427 16:49:49.099958687 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more
info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

[2026-04-27 16:49:49] gb-nvl-059-compute03:29643:29668 [0] register/register.cc:150 NCCL WARN Deregister: Could not find handle

[rank0]:[E427 16:49:49.101881331 ProcessGroupNCCL.cpp:1370] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception thrown when waiting for future ProcessGroup abort: NCCL error in:
/opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:383, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.29.7
Failed to deregister segment handle 0x640000d048 on ncclComm_ 0x8f28f40
Exception raised from abort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:383 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffffce6449d4 in
/usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x141296c (0xffffcf7d296c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::abortCommsFromMap(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >,
std::shared_ptr<c10d::NCCLComm>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const,
std::shared_ptr<c10d::NCCLComm> > > >&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) + 0x84 (0xffffcf7fb004 in
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::abortComms(std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) + 0x1b4 (0xffffcf7fb444 in
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x143b594 (0xffffcf7fb594 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x12182dc (0xffffcf5d82dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x8abdc (0xfffff7cfabdc in /lib/aarch64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x14334c8 (0xffffcf7f34c8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0xe1ae0 (0xffffcd901ae0 in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x8595c (0xfffff7cf595c in /lib/aarch64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0xebb4c (0xfffff7d5bb4c in /lib/aarch64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception thrown when waiting for future ProcessGroup abort: NCCL error in:
/opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:383, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.29.7
Failed to deregister segment handle 0x640000d048 on ncclComm_ 0x8f28f40
Exception raised from abort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:383 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffffce6449d4 in
/usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x141296c (0xffffcf7d296c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::abortCommsFromMap(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >,
std::shared_ptr<c10d::NCCLComm>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const,
std::shared_ptr<c10d::NCCLComm> > > >&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) + 0x84 (0xffffcf7fb004 in
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::abortComms(std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) + 0x1b4 (0xffffcf7fb444 in
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x143b594 (0xffffcf7fb594 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x12182dc (0xffffcf5d82dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x8abdc (0xfffff7cfabdc in /lib/aarch64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x14334c8 (0xffffcf7f34c8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0xe1ae0 (0xffffcd901ae0 in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x8595c (0xfffff7cf595c in /lib/aarch64-linux-gnu/libc.so.6)
frame #10: <unknown function> + 0xebb4c (0xfffff7d5bb4c in /lib/aarch64-linux-gnu/libc.so.6)

Exception raised from waitForFutureOrTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1392 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffffce6449d4 in
/usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::waitForFutureOrTimeout(std::future<bool>&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&, std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&, c10d::C10dLoggingData&, bool) + 0x6d8 (0xffffcf80e9c8 in
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::abort() + 0x29c (0xffffcf81a6dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xd8 (0xffffcf823308 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x14 (0xffffcf823af4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroup::release_resources() + 0xc0 (0xfffff3187ab0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xe62cc8 (0xfffff64a2cc8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x121fb80 (0xfffff685fb80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x6373cc (0xfffff5c773cc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x637d24 (0xfffff5c77d24 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python3() [0x4fb0bc]
frame #11: /usr/bin/python3() [0x5249e0]
frame #12: /usr/bin/python3() [0x4fba1c]
frame #13: /usr/bin/python3() [0x5b3cac]
frame #14: /usr/bin/python3() [0x68c2b8]
frame #15: Py_FinalizeEx + 0xb0 (0x67b0e0 in /usr/bin/python3)
frame #16: Py_Exit + 0x18 (0x67c518 in /usr/bin/python3)
frame #17: /usr/bin/python3() [0x6811d0]
frame #18: /usr/bin/python3() [0x680f04]
frame #19: _PyRun_SimpleFileObject + 0x1d4 (0x680858 in /usr/bin/python3)
frame #20: _PyRun_AnyFileObject + 0x54 (0x6805e4 in /usr/bin/python3)
frame #21: Py_RunMain + 0x2dc (0x68b5dc in /usr/bin/python3)
frame #22: Py_BytesMain + 0x28 (0x68b198 in /usr/bin/python3)
frame #23: <unknown function> + 0x284c4 (0xfffff7c984c4 in /lib/aarch64-linux-gnu/libc.so.6)
frame #24: __libc_start_main + 0x98 (0xfffff7c98598 in /lib/aarch64-linux-gnu/libc.so.6)
frame #25: _start + 0x30 (0x5f66f0 in /usr/bin/python3)

Signal 6 (SIGABRT) received by PID 29096

Root Cause

In torch/csrc/distributed/c10d/NCCLUtils.cpp, NCCLComm::abort():

#ifdef NCCL_HAS_COMM_REGISTER
  // Deregister all registered segments before aborting.
  for (auto& it : registeredSegmentHandles_) {
    void* handle = it.second;
    C10D_NCCL_CHECK(
        ::ncclCommDeregister(ncclComm_, handle),   // BUG: wrong API for window handles
        ...);
  }

registeredSegmentHandles_ stores both window-registered and non-window-registered handles in the same std::unordered_map<void*, void*> with no way to tell them apart. The abort path calls ncclCommDeregister on all of them, but window handles need ncclCommWindowDeregister.

Compare this with the correct path in NCCLComm::deregisterSegment(), which picks the deregister call based on a window flag:

#ifdef NCCL_HAS_COMM_WINDOW_REGISTER
  if (window) {
    ncclCommWindowDeregister(comm, (ncclWindow_t)handle);  // correct for symm=True
  } else {
    ncclCommDeregister(comm, handle);                       // correct for symm=False
  }
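
Why the wrong call fails can be illustrated with a toy model. The sketch below assumes, purely for illustration, that a communicator keeps plain and window registrations in separate tables, which would be consistent with the "Could not find handle" warning. MockComm, mockCommDeregister, and mockCommWindowDeregister are hypothetical stand-ins, not NCCL APIs, and NCCL's real bookkeeping may differ.

```cpp
#include <unordered_set>

// Hypothetical stand-in for a communicator: two independent handle tables.
// This layout is an assumption for illustration, not NCCL's actual design.
struct MockComm {
  std::unordered_set<void*> plainHandles;   // filled by plain registration
  std::unordered_set<void*> windowHandles;  // filled by window registration
};

enum MockResult { kSuccess, kInvalidUsage };

// Mock of the plain deregister call: it searches only the plain table.
MockResult mockCommDeregister(MockComm& comm, void* handle) {
  return comm.plainHandles.erase(handle) == 1 ? kSuccess : kInvalidUsage;
}

// Mock of the window deregister call: it searches only the window table.
MockResult mockCommWindowDeregister(MockComm& comm, void* handle) {
  return comm.windowHandles.erase(handle) == 1 ? kSuccess : kInvalidUsage;
}

// Registers one window handle, then deregisters it through the plain call,
// which is what the abort path in this model effectively does.
MockResult deregisterWindowHandleWithPlainApi() {
  static int segment = 0;  // any stable address works as a fake handle
  MockComm comm;
  comm.windowHandles.insert(&segment);
  return mockCommDeregister(comm, &segment);
}

// Same handle, but released through the matching window call.
MockResult deregisterWindowHandleWithWindowApi() {
  static int segment = 0;
  MockComm comm;
  comm.windowHandles.insert(&segment);
  return mockCommWindowDeregister(comm, &segment);
}
```

In this model the first helper returns kInvalidUsage because the plain call never looks in the window table, while the second returns kSuccess.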

Fix Action

Fix / Workaround

The normal deregistration path, NCCLComm::deregisterSegment(), correctly dispatches between ncclCommDeregister and ncclCommWindowDeregister based on a window flag. The abort path makes no such distinction and treats every handle as a non-window handle, so the fix is to give abort() the same dispatch.

PR fix notes

PR #181626: Fixes NCCLComm::abort() to use correct deregister API for window-registered handles

Description (problem / solution / changelog)

NCCLComm::abort() unconditionally called ncclCommDeregister() on all entries in registeredSegmentHandles_. However, handles registered via ncclCommWindowRegister() (symmetric registration) require ncclCommWindowDeregister() instead. Using the wrong API causes NCCL to fail with "Deregister: Could not find handle" followed by a SIGABRT crash during ProcessGroupNCCL destruction.

The fix tracks whether each handle was window-registered by changing registeredSegmentHandles_ from map<void*, void*> to map<void*, pair<void*, bool>>, and dispatches to the correct deregister call in abort(), matching the existing logic in deregisterSegment().
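
The data-structure change described above can be sketched without NCCL. This is a minimal illustration under stated assumptions, not the code from the PR: SegmentRegistry and MockNccl are hypothetical names, the mock methods only count calls in place of ncclCommDeregister and ncclCommWindowDeregister, and std::unordered_map is used because the root-cause notes describe the original container that way.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the two NCCL deregister calls. They only count
// invocations so the dispatch can be checked without a GPU or NCCL.
struct MockNccl {
  int plainDeregisterCalls = 0;
  int windowDeregisterCalls = 0;
  void commDeregister(void* /*handle*/) { ++plainDeregisterCalls; }
  void commWindowDeregister(void* /*handle*/) { ++windowDeregisterCalls; }
};

// Hypothetical registry mirroring the described change: each entry stores
// the handle together with a flag saying how it was registered.
class SegmentRegistry {
 public:
  void add(void* ptr, void* handle, bool window) {
    handles_[ptr] = std::make_pair(handle, window);
  }

  // Abort-path cleanup: choose the deregister call per entry, then forget
  // all handles so nothing is deregistered twice.
  void deregisterAll(MockNccl& nccl) {
    for (auto& entry : handles_) {
      void* handle = entry.second.first;
      const bool window = entry.second.second;
      if (window) {
        nccl.commWindowDeregister(handle);
      } else {
        nccl.commDeregister(handle);
      }
    }
    handles_.clear();
  }

  std::size_t size() const { return handles_.size(); }

 private:
  // segment pointer -> (registration handle, registered as window?)
  std::unordered_map<void*, std::pair<void*, bool>> handles_;
};
```

Storing the flag next to the handle keeps a single container and lets the abort path reuse the same decision that deregisterSegment() already makes. A separate map for window handles would also work but would require two lookups on every deregistration.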

Fixes: https://github.com/pytorch/pytorch/issues/181610

Assistant used: Claude Opus 4.7 (1M)

Changed files

  • torch/csrc/distributed/c10d/NCCLUtils.cpp (modified, +23/-10)
  • torch/csrc/distributed/c10d/NCCLUtils.hpp (modified, +3/-2)

Code Example

NCCL WARN Deregister: Could not find handle
Failed to deregister segment handle 0x640000d048 on ncclComm_ 0x8f320a0

---

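#!/usr/bin/env python3
# Run: NCCL_DEBUG=WARN torchrun --nproc_per_node 2 repro.py
import os
import sys

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

backend = dist.group.WORLD._get_backend(torch.device("cuda", local_rank))

# Allocate a buffer from a pool backed by the NCCL allocator.
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    buf = torch.zeros(1024, dtype=torch.float32, device="cuda")

# Issue one collective before registering the pool.
dist.all_reduce(torch.ones(1, device="cuda"))
# symm=True registers the pool's segments through the window (symmetric) API.
backend.register_mem_pool(pool, symm=True)

# Exit without destroy_process_group() so the destructor's abort path runs.
sys.exit(0)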

---




Collecting environment information...
PyTorch version: 2.11.0a0+a6c236b9fd.nv26.03.46836102
Is debug build: False
CUDA used to build PyTorch: 13.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (aarch64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39

Python version: 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-1013-nvidia-64k-aarch64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.2.51
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GB200
GPU 1: NVIDIA GB200
GPU 2: NVIDIA GB200
GPU 3: NVIDIA GB200

Nvidia driver version: 580.126.20
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_adv.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_cnn.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_precompiled.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_graph.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_heuristic.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_ops.so.9.20.0
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  144
On-line CPU(s) list:                     0-143
Vendor ID:                               ARM
BIOS Vendor ID:                          NVIDIA
Model name:                              Neoverse-V2
BIOS Model name:                         Grace A02 699-2G548-0201-TS3 CPU @ 3.3GHz
BIOS CPU family:                         258
Model:                                   0
Thread(s) per core:                      1
Core(s) per socket:                      72
Socket(s):                               2
Stepping:                                r0p0
Frequency boost:                         disabled
CPU(s) scaling MHz:                      100%
CPU max MHz:                             3375.0000
CPU min MHz:                             81.0000
BogoMIPS:                                2000.00
L1d cache:                               9 MiB (144 instances)
L1i cache:                               9 MiB (144 instances)
L2 cache:                                144 MiB (144 instances)
L3 cache:                                228 MiB (2 instances)
NUMA node(s):                            34
NUMA node0 CPU(s):                       0-71
NUMA node1 CPU(s):                       72-143

Versions of relevant libraries:
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.1.0
[pip3] nvidia-cuda-runtime-cu13==0.0.0a0
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvtx==0.2.15
[pip3] onnx==1.18.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] optree==0.19.0
[pip3] torch==2.11.0a0+a6c236b9fd.nv26.3.46836102
[pip3] torch_tensorrt==2.11.0a0
[pip3] torchao==0.17.0+gitd9881220
[pip3] torchdata==0.11.0
[pip3] torchtitan==0.2.1+git71517cf6
[pip3] torchvision==0.25.0a0+b7d91027.nv26.3.46836102
[pip3] triton==3.6.0+git5d72932fc5.nv26.3
[pip3] triton_kernels==1.0.0+git5d72932fc5.nv26.3
[conda] Could not collect

🐛 Describe the bug

NCCLComm::abort() in NCCLUtils.cpp:381-390 unconditionally calls ncclCommDeregister() on all entries in registeredSegmentHandles_. However, segments registered via ncclCommWindowRegister() (when symm=True is passed to register_mem_pool) require ncclCommWindowDeregister() instead. Using the wrong deregister API causes NCCL to fail with:

NCCL WARN Deregister: Could not find handle                                                                                                                                      
Failed to deregister segment handle 0x640000d048 on ncclComm_ 0x8f320a0

followed by a SIGABRT crash.

The normal deregistration path (NCCLComm::deregisterSegment()) correctly dispatches between ncclCommDeregister and ncclCommWindowDeregister based on a window flag. The abort path makes no such distinction: it treats every handle as non-window.

Reproducer

#!/usr/bin/env python3                                                                                                                                                           
# Run: NCCL_DEBUG=WARN torchrun --nproc_per_node 2 repro.py                                                                                                                      
import os, sys, torch, torch.distributed as dist                                       
  
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)                                                                                                   
dist.init_process_group(backend="nccl")   
  
backend = dist.group.WORLD._get_backend(torch.device("cuda", local_rank))              

pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):       
    buf = torch.zeros(1024, dtype=torch.float32, device="cuda")                        
  
dist.all_reduce(torch.ones(1, device="cuda"))                                          
backend.register_mem_pool(pool, symm=True)
  
sys.exit(0)

Error trace

[rank0]:[W427 16:49:49.099958687 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more                                                                              
info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())                                                                                                                                                          
                                                                                                                                                                                                                                                          
[2026-04-27 16:49:49] gb-nvl-059-compute03:29643:29668 [0] register/register.cc:150 NCCL WARN Deregister: Could not find handle                                                                                                                           
                                                                                                                                                                                                                                                          
[rank0]:[E427 16:49:49.101881331 ProcessGroupNCCL.cpp:1370] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception thrown when waiting for future ProcessGroup abort: NCCL error in:                                                                            
/opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:383, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.29.7                                                                                                             
Failed to deregister segment handle 0x640000d048 on ncclComm_ 0x8f28f40                                                                                                                                                                                   
Exception raised from abort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:383 (most recent call first):                                                                                                                               
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffffce6449d4 in                                                                                               
/usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)                                                                                                                     
frame #1: <unknown function> + 0x141296c (0xffffcf7d296c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                  
frame #2: c10d::ProcessGroupNCCL::abortCommsFromMap(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >,                          
std::shared_ptr<c10d::NCCLComm>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char,     
std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const,                
std::shared_ptr<c10d::NCCLComm> > > >&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) + 0x84 (0xffffcf7fb004 in        
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                                                                              
frame #3: c10d::ProcessGroupNCCL::abortComms(std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) + 0x1b4 (0xffffcf7fb444 in  
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                                                                              
frame #4: <unknown function> + 0x143b594 (0xffffcf7fb594 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                  
frame #5: <unknown function> + 0x12182dc (0xffffcf5d82dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                  
frame #6: <unknown function> + 0x8abdc (0xfffff7cfabdc in /lib/aarch64-linux-gnu/libc.so.6)                                                                                      
frame #7: <unknown function> + 0x14334c8 (0xffffcf7f34c8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                  
frame #8: <unknown function> + 0xe1ae0 (0xffffcd901ae0 in /lib/aarch64-linux-gnu/libstdc++.so.6)                                                                                 
frame #9: <unknown function> + 0x8595c (0xfffff7cf595c in /lib/aarch64-linux-gnu/libc.so.6)                                                                                      
frame #10: <unknown function> + 0xebb4c (0xfffff7d5bb4c in /lib/aarch64-linux-gnu/libc.so.6)                                                                                     
                                                                                                                                                                                 
terminate called after throwing an instance of 'c10::DistBackendError'                                                                                                           
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception thrown when waiting for future ProcessGroup abort: NCCL error in:                                                    
/opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:383, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.29.7                                    
Failed to deregister segment handle 0x640000d048 on ncclComm_ 0x8f28f40                                                                                                          
Exception raised from abort at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:383 (most recent call first):                                                      
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffffce6449d4 in                      
/usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)                                                                                                                     
frame #1: <unknown function> + 0x141296c (0xffffcf7d296c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                  
frame #2: c10d::ProcessGroupNCCL::abortCommsFromMap(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >,                          
std::shared_ptr<c10d::NCCLComm>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char,     
std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const,                
std::shared_ptr<c10d::NCCLComm> > > >&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) + 0x84 (0xffffcf7fb004 in        
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                                                                              
frame #3: c10d::ProcessGroupNCCL::abortComms(std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&) + 0x1b4 (0xffffcf7fb444 in  
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                                                                              
frame #4: <unknown function> + 0x143b594 (0xffffcf7fb594 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                  
frame #5: <unknown function> + 0x12182dc (0xffffcf5d82dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                  
frame #6: <unknown function> + 0x8abdc (0xfffff7cfabdc in /lib/aarch64-linux-gnu/libc.so.6)                                                                                      
frame #7: <unknown function> + 0x14334c8 (0xffffcf7f34c8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                  
frame #8: <unknown function> + 0xe1ae0 (0xffffcd901ae0 in /lib/aarch64-linux-gnu/libstdc++.so.6)                                                                                 
frame #9: <unknown function> + 0x8595c (0xfffff7cf595c in /lib/aarch64-linux-gnu/libc.so.6)                                                                                      
frame #10: <unknown function> + 0xebb4c (0xfffff7d5bb4c in /lib/aarch64-linux-gnu/libc.so.6)                                                                                     
                                                                                                                                                                                 
Exception raised from waitForFutureOrTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1392 (most recent call first):                             
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffffce6449d4 in                      
/usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)                                                                                                                     
frame #1: c10d::ProcessGroupNCCL::waitForFutureOrTimeout(std::future<bool>&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&, std::__cxx11::basic_string<char,        
std::char_traits<char>, std::allocator<char> > const&, c10d::C10dLoggingData&, bool) + 0x6d8 (0xffffcf80e9c8 in                                                                  
/usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                                                                                              
frame #2: c10d::ProcessGroupNCCL::abort() + 0x29c (0xffffcf81a6dc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                                         
frame #3: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0xd8 (0xffffcf823308 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                              
frame #4: c10d::ProcessGroupNCCL::~ProcessGroupNCCL() + 0x14 (0xffffcf823af4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)                              
frame #5: c10d::ProcessGroup::release_resources() + 0xc0 (0xfffff3187ab0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)                                   
frame #6: <unknown function> + 0xe62cc8 (0xfffff64a2cc8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)                                                 
frame #7: <unknown function> + 0x121fb80 (0xfffff685fb80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)                                                
frame #8: <unknown function> + 0x6373cc (0xfffff5c773cc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)                                                                                 
frame #9: <unknown function> + 0x637d24 (0xfffff5c77d24 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)                                                                                 
frame #10: /usr/bin/python3() [0x4fb0bc]                                                                    
frame #11: /usr/bin/python3() [0x5249e0]          
frame #12: /usr/bin/python3() [0x4fba1c]                                                                                                                                                                                   
frame #13: /usr/bin/python3() [0x5b3cac]                                                                                                                                                                                   
frame #14: /usr/bin/python3() [0x68c2b8]                                                                                                                                                                         
frame #15: Py_FinalizeEx + 0xb0 (0x67b0e0 in /usr/bin/python3)                                                                                                                                                             
frame #16: Py_Exit + 0x18 (0x67c518 in /usr/bin/python3)                                                                                                                                                                   
frame #17: /usr/bin/python3() [0x6811d0]                                                                                                                                                                                   
frame #18: /usr/bin/python3() [0x680f04]                                                                                                                                                                                   
frame #19: _PyRun_SimpleFileObject + 0x1d4 (0x680858 in /usr/bin/python3)                                                                                                                                                  
frame #20: _PyRun_AnyFileObject + 0x54 (0x6805e4 in /usr/bin/python3)                                                                                                                                                      
frame #21: Py_RunMain + 0x2dc (0x68b5dc in /usr/bin/python3)                                                                                                                                                               
frame #22: Py_BytesMain + 0x28 (0x68b198 in /usr/bin/python3)                                                                                                                                                              
frame #23: <unknown function> + 0x284c4 (0xfffff7c984c4 in /lib/aarch64-linux-gnu/libc.so.6)                                                                                                                                                                     
frame #24: __libc_start_main + 0x98 (0xfffff7c98598 in /lib/aarch64-linux-gnu/libc.so.6)                                                                                                                                   
frame #25: _start + 0x30 (0x5f66f0 in /usr/bin/python3)                                                                                                                                                                                                          
                                                     
Signal 6 (SIGABRT) received by PID 29096

Root cause

In torch/csrc/distributed/c10d/NCCLUtils.cpp, NCCLComm::abort():

#ifdef NCCL_HAS_COMM_REGISTER
  // Deregister all registered segments before aborting.
  for (auto& it : registeredSegmentHandles_) {
    void* handle = it.second;
    C10D_NCCL_CHECK(
        ::ncclCommDeregister(ncclComm_, handle),   // BUG: wrong API for window handles
        ...);
  }

registeredSegmentHandles_ stores both window-registered and non-window-registered handles in the same std::unordered_map<void*, void*> with no way to tell them apart. The abort path calls ncclCommDeregister on all of them, but window handles need ncclCommWindowDeregister.

Compare with the correct path in NCCLComm::deregisterSegment(), which dispatches properly:

#ifdef NCCL_HAS_COMM_WINDOW_REGISTER
  if (window) {
    ncclCommWindowDeregister(comm, (ncclWindow_t)handle);  // correct for symm=True
  } else {
    ncclCommDeregister(comm, handle);                       // correct for symm=False
  }
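
To make the failure mode concrete, here is a toy model of the mismatch. It is only a sketch: ToyComm, toyDeregister, toyWindowDeregister and abortStyleCleanup are hypothetical names, and the two separate lookup tables are an assumption made for illustration, not a description of NCCL's internals. It shows how a cleanup loop that always uses the plain call misses handles that came from the window path.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Toy stand-in for a communicator. NOT NCCL's real data structures:
// the split into two tables is an assumption for illustration only.
struct ToyComm {
  std::unordered_set<void*> buffers;  // handles from the plain register call
  std::unordered_set<void*> windows;  // handles from the window register call
};

// Hypothetical analogue of ncclCommDeregister: only knows plain handles.
bool toyDeregister(ToyComm& comm, void* handle) {
  return comm.buffers.erase(handle) == 1;
}

// Hypothetical analogue of ncclCommWindowDeregister: only knows windows.
bool toyWindowDeregister(ToyComm& comm, void* handle) {
  return comm.windows.erase(handle) == 1;
}

// Mirrors the abort-style loop: every handle goes through the plain call.
// Returns how many handles could not be found.
int abortStyleCleanup(ToyComm& comm, const std::vector<void*>& handles) {
  int failures = 0;
  for (void* handle : handles) {
    assert(handle != nullptr);
    if (!toyDeregister(comm, handle)) {
      ++failures;  // analogous to "Deregister: Could not find handle"
    }
  }
  return failures;
}
```

In this model, a handle inserted into windows is reported as a failure by abortStyleCleanup, while toyWindowDeregister on the same handle succeeds.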

Suggested fix

Track whether each handle was window-registered (e.g., change registeredSegmentHandles_ from map<void*, void*> to map<void*, pair<void*, bool>>) and use the correct deregister call in abort().
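
A minimal sketch of that bookkeeping change, under stated assumptions: MockComm, mockDeregister, mockWindowDeregister, recordSegment and deregisterAll are hypothetical names that only count calls, standing in for the real communicator and the two NCCL deregister functions. The map key is assumed to be the segment pointer, as in the existing map. This is not the actual PyTorch patch.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins that only count calls; the real functions are
// ncclCommDeregister and ncclCommWindowDeregister.
struct MockComm {
  int plainCalls = 0;
  int windowCalls = 0;
};
void mockDeregister(MockComm& comm, void* /*handle*/) { ++comm.plainCalls; }
void mockWindowDeregister(MockComm& comm, void* /*handle*/) { ++comm.windowCalls; }

// segment pointer -> (handle, wasWindowRegistered)
using SegmentHandles = std::unordered_map<void*, std::pair<void*, bool>>;

// Records a registration together with the API that produced the handle.
void recordSegment(SegmentHandles& handles, void* segment, void* handle,
                   bool window) {
  assert(handle != nullptr);
  handles[segment] = std::make_pair(handle, window);
}

// Abort-time cleanup that picks the deregister call per entry.
void deregisterAll(MockComm& comm, SegmentHandles& handles) {
  for (auto& entry : handles) {
    void* handle = entry.second.first;
    const bool window = entry.second.second;
    if (window) {
      mockWindowDeregister(comm, handle);
    } else {
      mockDeregister(comm, handle);
    }
  }
  handles.clear();
}
```

Storing the flag next to the handle keeps a single map and lets both the normal and the abort path share the same dispatch. A small struct with named fields would work equally well and may read better than a pair.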

Versions

<details> <summary>collect_env output</summary>
Collecting environment information...
PyTorch version: 2.11.0a0+a6c236b9fd.nv26.03.46836102
Is debug build: False
CUDA used to build PyTorch: 13.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (aarch64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39

Python version: 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-1013-nvidia-64k-aarch64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.2.51
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GB200
GPU 1: NVIDIA GB200
GPU 2: NVIDIA GB200
GPU 3: NVIDIA GB200

Nvidia driver version: 580.126.20
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_adv.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_cnn.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_precompiled.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_graph.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_heuristic.so.9.20.0
/usr/lib/aarch64-linux-gnu/libcudnn_ops.so.9.20.0
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  144
On-line CPU(s) list:                     0-143
Vendor ID:                               ARM
BIOS Vendor ID:                          NVIDIA
Model name:                              Neoverse-V2
BIOS Model name:                         Grace A02 699-2G548-0201-TS3 CPU @ 3.3GHz
BIOS CPU family:                         258
Model:                                   0
Thread(s) per core:                      1
Core(s) per socket:                      72
Socket(s):                               2
Stepping:                                r0p0
Frequency boost:                         disabled
CPU(s) scaling MHz:                      100%
CPU max MHz:                             3375.0000
CPU min MHz:                             81.0000
BogoMIPS:                                2000.00
L1d cache:                               9 MiB (144 instances)
L1i cache:                               9 MiB (144 instances)
L2 cache:                                144 MiB (144 instances)
L3 cache:                                228 MiB (2 instances)
NUMA node(s):                            34
NUMA node0 CPU(s):                       0-71
NUMA node1 CPU(s):                       72-143

Versions of relevant libraries:
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.1.0
[pip3] nvidia-cuda-runtime-cu13==0.0.0a0
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvtx==0.2.15
[pip3] onnx==1.18.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] optree==0.19.0
[pip3] torch==2.11.0a0+a6c236b9fd.nv26.3.46836102
[pip3] torch_tensorrt==2.11.0a0
[pip3] torchao==0.17.0+gitd9881220
[pip3] torchdata==0.11.0
[pip3] torchtitan==0.2.1+git71517cf6
[pip3] torchvision==0.25.0a0+b7d91027.nv26.3.46836102
[pip3] triton==3.6.0+git5d72932fc5.nv26.3
[pip3] triton_kernels==1.0.0+git5d72932fc5.nv26.3
[conda] Could not collect
</details>

Bug reported using: Claude Opus 4.7

cc: @kwen2501

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @weifengpy

extent analysis

TL;DR

The most likely fix is to modify the NCCLComm::abort() function to correctly deregister window-registered handles using ncclCommWindowDeregister() instead of ncclCommDeregister().

Guidance

  • Identify the type of handle (window-registered or non-window-registered) before deregistering it in NCCLComm::abort().
  • Modify the registeredSegmentHandles_ data structure to store additional information about each handle, such as whether it was window-registered.
  • Use the correct deregister function (ncclCommWindowDeregister() or ncclCommDeregister()) based on the type of handle.
  • Verify that the fix works by running the reproducer code and checking for the absence of the NCCL WARN Deregister: Could not find handle error.

Example

// Modified NCCLComm::abort() loop, assuming registeredSegmentHandles_ becomes
// std::unordered_map<void*, std::pair<void*, bool>> holding (handle, window flag)
for (auto& it : registeredSegmentHandles_) {
  void* handle = it.second.first;
  bool isWindowRegistered = it.second.second;
  if (isWindowRegistered) {
    C10D_NCCL_CHECK(::ncclCommWindowDeregister(ncclComm_, (ncclWindow_t)handle), ...);
  } else {
    C10D_NCCL_CHECK(::ncclCommDeregister(ncclComm_, handle), ...);
  }
}

Notes

  • The fix requires modifying the PyTorch source code, specifically the NCCLComm::abort() function.
  • The reproducer code provided can be used to test the fix.

Recommendation

Apply the suggested fix to the NCCLComm::abort() function to correctly deregister window-registered handles. This fix should resolve the NCCL WARN Deregister: Could not find handle error and prevent the SIGABRT crash.
