pytorch - ✅(Solved) Fix Case mismatch in `_new_process_group_helper` prevents default backend type from being set for custom out-of-tree backends [1 pull requests, 1 comments, 2 participants]

GitHub stats
pytorch/pytorch#178756 (fetched 2026-04-08 01:52:26)

Root Cause

Backend._plugins stores keys in uppercase (line 405 of distributed_c10d.py):

# In Backend.register_backend():
Backend._plugins[name.upper()] = Backend._BackendPlugin(func, extended_api)

But BackendConfig normalizes all backend values to lowercase via Backend.__new__:

# In Backend.__new__():
value = getattr(Backend, name.upper(), Backend.UNDEFINED)
if value == Backend.UNDEFINED:
    value = name.lower()
return value

So for a registered backend like "mybackend", Backend._plugins has key "MYBACKEND", while backend_config.device_backend_map has value "mybackend".
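
The mismatch can be reproduced without PyTorch. The sketch below is a toy model, not the real implementation: `BUILTIN`, `plugins`, `register_backend`, and `normalize` are hypothetical names that only mimic the uppercase-key and lowercase-value behavior described above.

```python
# Toy model of the two normalizations; none of these names are PyTorch APIs.
BUILTIN = {"GLOO": "gloo", "NCCL": "nccl"}  # assumed table of built-in names
plugins = {}  # stands in for Backend._plugins


def register_backend(name, func):
    """Store the plugin under an uppercase key, as register_backend() does."""
    plugins[name.upper()] = func


def normalize(name):
    """Mimic Backend.__new__: known names map to their value, others are lowercased."""
    return BUILTIN.get(name.upper(), name.lower())


register_backend("mybackend", object())
device_backend_map = {"cpu": normalize("gloo"), "cuda": normalize("mybackend")}

plugin_key = next(iter(plugins))
print(plugin_key)                                         # MYBACKEND
print(plugin_key in device_backend_map.values())          # False
print(plugin_key.lower() in device_backend_map.values())  # True
```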

In _new_process_group_helper, the multi-backend path (lines 2041–2049) tries to match them:

else:
    if Backend.NCCL in backend_config.device_backend_map.values():
        pg._set_default_backend(ProcessGroup.BackendType.NCCL)
    elif Backend._plugins.keys():
        custom_backend = next(iter(Backend._plugins.keys()))          # "MYBACKEND"
        if custom_backend in backend_config.device_backend_map.values():  # "MYBACKEND" in ["gloo", "mybackend"] → False
            pg._set_default_backend(ProcessGroup.BackendType.CUSTOM)
    else:
        pg._set_default_backend(ProcessGroup.BackendType.GLOO)

Because "MYBACKEND" != "mybackend", the comparison at line 2046 always fails. And because Backend._plugins.keys() is truthy, the elif branch is entered but _set_default_backend is never called — the else (GLOO fallback) is also skipped.
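
The fall-through is easier to see in isolation. The following is a minimal sketch of the same branch structure, assuming plain strings in place of the real Backend and ProcessGroup.BackendType values; `pick_default_backend` is a hypothetical helper, and a return value of `None` stands for "no call to _set_default_backend was made".

```python
def pick_default_backend(plugin_keys, config_values):
    """Return the chosen backend type name, or None if no branch sets one."""
    chosen = None
    if "nccl" in config_values:
        chosen = "NCCL"
    elif plugin_keys:  # truthy whenever any plugin is registered
        custom_backend = next(iter(plugin_keys))
        if custom_backend in config_values:  # case-sensitive, as in the buggy code
            chosen = "CUSTOM"
        # There is no else here: a failed match leaves `chosen` unset.
    else:
        chosen = "GLOO"
    return chosen


print(pick_default_backend(["MYBACKEND"], ["gloo", "mybackend"]))  # None
print(pick_default_backend([], ["gloo"]))                          # GLOO
print(pick_default_backend(["MYBACKEND"], ["gloo", "nccl"]))       # NCCL
```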

Fix Action

Fixed

PR fix notes

PR #178759: [distributed] Fix case mismatch preventing default backend type for custom OOT backends

Description (problem / solution / changelog)

Fixes https://github.com/pytorch/pytorch/issues/178756

Problem

When creating a process group with a multi-backend string that includes a custom out-of-tree backend (e.g., "cpu:gloo,cuda:mybackend"), the default backend type is never set to CUSTOM. This causes hasHooks() to warn about a missing backend type during process group destruction, and getDefaultBackend() to fail entirely.

Root cause: register_backend() stores plugin keys in uppercase (Backend._plugins["MYBACKEND"]), but BackendConfig normalizes all backend values to lowercase via Backend.__new__() — so device_backend_map contains "mybackend". The comparison in the multi-backend path of _new_process_group_helper was case-sensitive, so "MYBACKEND" in ["gloo", "mybackend"] always evaluated to False.

Changes

1. Case-insensitive plugin key comparison (distributed_c10d.py): Normalize the plugin key with .lower() before comparing against backend_config.device_backend_map.values(), so "MYBACKEND".lower() correctly matches "mybackend" in the config.

2. Preserve default backend on ProcessGroup replacement (distributed_c10d.py): For Python-based custom backends (subclasses of ProcessGroup), the loop replaces the original pg object with the backend instance and breaks immediately. This discarded the default backend type that was just set on the original pg. Now we call _set_default_backend(backend_type) on the replacement pg so the backend type is preserved.

3. Regression test (test_c10d_common.py): Adds a test that registers a custom backend, verifies plugin keys match config values after case normalization, and runs a full init_process_group / barrier / destroy_process_group cycle with a custom multi-backend config.
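
Change 2 can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyProcessGroup` and `build_group` are hypothetical stand-ins that track nothing but the default backend type, and the `preserve` flag switches between the behavior before and after the PR.

```python
class ToyProcessGroup:
    """Hypothetical stand-in for ProcessGroup; tracks only the default backend type."""

    def __init__(self):
        self.default_backend_type = "UNDEFINED"

    def _set_default_backend(self, backend_type):
        self.default_backend_type = backend_type


def build_group(backend_instance, preserve):
    """Model the replacement step; `preserve` toggles the behavior added by the PR."""
    pg = ToyProcessGroup()
    backend_type = "CUSTOM"
    pg._set_default_backend(backend_type)
    if isinstance(backend_instance, ToyProcessGroup):
        # A Python-based backend replaces the original group object,
        # which drops the type that was just set on the original.
        pg = backend_instance
        if preserve:
            # Set the type again on the replacement so it is not lost.
            pg._set_default_backend(backend_type)
    return pg


print(build_group(ToyProcessGroup(), preserve=False).default_backend_type)  # UNDEFINED
print(build_group(ToyProcessGroup(), preserve=True).default_backend_type)   # CUSTOM
```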

Testing

Tested on a remote server using Docker (quay.io/aipcc/pytorch:rhel9_6_pytorch_main_git5bfd4be_cuda12_8):

  • Without fix: Warning: No backend of type 0 found (backend type stayed UNDEFINED)
  • With fix: Backend type correctly set to CUSTOM (type 6)

cc @awgu @wanchaol @fegin @wconstab

Changed files

  • test/distributed/test_c10d_common.py (modified, +33/-0)
  • torch/distributed/distributed_c10d.py (modified, +2/-1)


🐛 Describe the bug

Bug Description

When creating a process group with a multi-backend string that includes a custom out-of-tree backend (e.g., "cpu:gloo,cuda:mybackend"), the default backend type is never set to CUSTOM. As a result, hasHooks() cannot find the default backend during process group destruction, so it emits a warning and assumes no hooks are registered.

Impact

The process group's backendType_ is never set to CUSTOM. Later, when hasHooks() is called (e.g., during process group destruction), it looks up backendType_ in backendTypeToBackend_:

// ProcessGroup.hpp, hasHooks():
bool hasHooks() const {
    auto backend_iter = backendTypeToBackend_.find(backendType_);
    if (backend_iter == backendTypeToBackend_.end()) {
        TORCH_WARN(
            "No backend of type ", uint16_t(backendType_),
            " found for Process Group with name ", getBackendName(),
            ". Assuming no hooks are registered.");
        return false;
    }
    return backend_iter->second->hasHooks();
}

Since backendType_ was never updated to CUSTOM, the lookup fails. Similarly, getDefaultBackend() will TORCH_CHECK fail if called on this process group.
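
The lookup behavior can be modeled in a few lines of Python. This is a sketch, not a binding to the C++ class: `ToyGroup` is hypothetical, the values 0 (UNDEFINED) and 6 (CUSTOM) are taken from the PR's test output, and GLOO = 1 is an assumption made for illustration.

```python
import warnings

# Enum values: 0 and 6 appear in the PR's test output; GLOO = 1 is an assumption.
UNDEFINED, GLOO, CUSTOM = 0, 1, 6


class ToyGroup:
    """Hypothetical model of the lookup done by hasHooks() and getDefaultBackend()."""

    def __init__(self, backend_type, backends):
        self.backend_type = backend_type
        self.backends = backends  # backend type -> has-hooks flag

    def has_hooks(self):
        if self.backend_type not in self.backends:
            warnings.warn(
                f"No backend of type {self.backend_type} found. "
                "Assuming no hooks are registered."
            )
            return False
        return self.backends[self.backend_type]

    def get_default_backend(self):
        # Modeled as a hard error, mirroring the TORCH_CHECK failure described above.
        if self.backend_type not in self.backends:
            raise RuntimeError(f"no backend of type {self.backend_type}")
        return self.backend_type


backends = {GLOO: False, CUSTOM: True}
broken = ToyGroup(UNDEFINED, backends)  # type never set, as in the bug
fixed = ToyGroup(CUSTOM, backends)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(broken.has_hooks(), len(caught))  # False 1
print(fixed.has_hooks())  # True
```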

Reproduction

  1. Create and register a custom out-of-tree backend via Backend.register_backend.
  2. Initialize the default process group with a multi-backend string including the custom backend:
    dist.init_process_group(backend="cpu:gloo,cuda:mybackend", ...)
  3. Destruction of this group triggers hasHooks() → warning/failure.

Suggested Fix

Normalize the case when comparing plugin keys against backend config values, e.g.:

if custom_backend.lower() in backend_config.device_backend_map.values():
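
To show the effect of that one-line change, here is a minimal sketch of the branch with the suggested `.lower()` applied. It uses plain strings rather than the real PyTorch types, and `pick_default_backend_fixed` is a hypothetical name.

```python
def pick_default_backend_fixed(plugin_keys, config_values):
    """Same branch structure as the multi-backend path, with the key lowercased."""
    chosen = None
    if "nccl" in config_values:
        chosen = "NCCL"
    elif plugin_keys:
        custom_backend = next(iter(plugin_keys))
        if custom_backend.lower() in config_values:  # the suggested fix
            chosen = "CUSTOM"
    else:
        chosen = "GLOO"
    return chosen


print(pick_default_backend_fixed(["MYBACKEND"], ["gloo", "mybackend"]))           # CUSTOM
print(pick_default_backend_fixed(["OTHER", "MYBACKEND"], ["gloo", "mybackend"]))  # None
```

The second call reflects the snippet as quoted in this issue: only the first registered plugin key is inspected, so a config that uses a later-registered plugin would still leave the type unset in this model.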

Versions

PyTorch version: >=2.7.1.

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan

extent analysis

Fix Plan

To fix the issue, we need to normalize the case when comparing plugin keys against backend config values. Here are the steps:

  • Modify the comparison in _new_process_group_helper to use lowercase for both the custom backend and the backend config values:
if custom_backend.lower() in [value.lower() for value in backend_config.device_backend_map.values()]:
    pg._set_default_backend(ProcessGroup.BackendType.CUSTOM)
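  • The merged PR (#178759) makes one further change: for Python-based custom backends, it calls _set_default_backend(backend_type) on the replacement pg so that the type set above is not discarded.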

Verification

To verify that the fix worked, you can:

  • Create and register a custom out-of-tree backend via Backend.register_backend.
  • Initialize the default process group with a multi-backend string including the custom backend:
dist.init_process_group(backend="cpu:gloo,cuda:mybackend", ...)
  • Destroy the process group and check that hasHooks() no longer warns "No backend of type 0 found", and that the default backend type is CUSTOM.

Extra Tips

  • Make sure to test the fix with different custom backend names and multi-backend strings to ensure that the issue is fully resolved.
  • Consider adding a test case to the PyTorch test suite to prevent similar issues in the future.
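
A quick way to exercise several backend names is a pure-Python check of the two normalizations. This is only a sketch: `matches` is a hypothetical helper, the names in the list are made up, and it models the string comparison alone, not a real process group.

```python
def matches(registered_name, config_name, case_insensitive):
    """Compare a registry key against a config value under either comparison rule."""
    plugin_key = registered_name.upper()  # how the plugin registry stores the name
    config_value = config_name.lower()    # how the backend config stores the name
    if case_insensitive:
        plugin_key = plugin_key.lower()
    return plugin_key == config_value


# Hypothetical backend names; each contains at least one letter.
names = ["mybackend", "MyBackend", "MY_BACKEND", "backend2"]
for name in names:
    assert not matches(name, name, case_insensitive=False)  # old comparison misses
    assert matches(name, name, case_insensitive=True)       # normalized comparison hits
print("all names match only after normalization")
```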
