pytorch - ✅(Solved) Fix Case mismatch in `_new_process_group_helper` prevents default backend type from being set for custom out-of-tree backends [1 pull requests, 1 comments, 2 participants]

pytorch · 2026-03-30 09:04:43


GitHub stats

pytorch/pytorch#178756 • Fetched 2026-04-08 01:52:26

Author: mikethegoblin
Participants: mikethegoblin, subinz1
Timeline (top): mentioned ×12, subscribed ×12, referenced ×3, labeled ×2

Root Cause

Backend._plugins stores keys in uppercase (line 405 of distributed_c10d.py):

```
# In Backend.register_backend():
Backend._plugins[name.upper()] = Backend._BackendPlugin(func, extended_api)
```

But BackendConfig normalizes all backend values to lowercase via Backend.__new__:

```
# In Backend.__new__():
value = getattr(Backend, name.upper(), Backend.UNDEFINED)
if value == Backend.UNDEFINED:
    value = name.lower()
return value
```

So for a registered backend like "mybackend", Backend._plugins has key "MYBACKEND", while backend_config.device_backend_map has value "mybackend".
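The mismatch can be reproduced without PyTorch. The sketch below mimics the two normalization rules quoted above in plain Python. `register_backend`, `normalize_backend`, `plugins`, and `BUILTIN_BACKENDS` are stand-ins invented for illustration, not PyTorch APIs.

```python
# Stand-in for Backend._plugins: keys are stored uppercased on registration.
plugins = {}

# Hypothetical subset of built-in backend names, for illustration only.
BUILTIN_BACKENDS = {"GLOO": "gloo", "NCCL": "nccl"}


def register_backend(name):
    """Mimic Backend.register_backend(): the key is name.upper()."""
    plugins[name.upper()] = object()


def normalize_backend(name):
    """Mimic Backend.__new__(): known names resolve through the uppercase
    lookup, anything else falls back to name.lower()."""
    return BUILTIN_BACKENDS.get(name.upper(), name.lower())


register_backend("mybackend")

# Stand-in for backend_config.device_backend_map after parsing
# "cpu:gloo,cuda:mybackend".
device_backend_map = {
    "cpu": normalize_backend("gloo"),
    "cuda": normalize_backend("mybackend"),
}

plugin_key = next(iter(plugins))
print(plugin_key)                                 # MYBACKEND
print(list(device_backend_map.values()))          # ['gloo', 'mybackend']
print(plugin_key in device_backend_map.values())  # False
```

The two sides agree on the backend's identity but disagree on its spelling, which is all the membership test looks at.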

In _new_process_group_helper, the multi-backend path (lines 2041–2049) tries to match them:

```
else:
    if Backend.NCCL in backend_config.device_backend_map.values():
        pg._set_default_backend(ProcessGroup.BackendType.NCCL)
    elif Backend._plugins.keys():
        custom_backend = next(iter(Backend._plugins.keys()))          # "MYBACKEND"
        if custom_backend in backend_config.device_backend_map.values():  # "MYBACKEND" in ["gloo", "mybackend"] → False
            pg._set_default_backend(ProcessGroup.BackendType.CUSTOM)
    else:
        pg._set_default_backend(ProcessGroup.BackendType.GLOO)
```

Because "MYBACKEND" != "mybackend", the membership test at line 2046 always fails. Since Backend._plugins.keys() is truthy, the elif branch is taken, which also skips the else (GLOO fallback). As a result, _set_default_backend is never called and the default backend type stays UNDEFINED.
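To make the fall-through concrete, here is a minimal sketch that mirrors the branch structure quoted above. `FakeProcessGroup` and `select_default_backend` are hypothetical stand-ins, and backend types are represented as plain strings rather than the real `ProcessGroup.BackendType` enum.

```python
class FakeProcessGroup:
    """Hypothetical stand-in that only records the default backend type."""

    def __init__(self):
        self.default_backend = "UNDEFINED"

    def _set_default_backend(self, backend_type):
        self.default_backend = backend_type


def select_default_backend(pg, plugins, config_values):
    """Mirror the if/elif/else shape of the multi-backend path."""
    if "nccl" in config_values:
        pg._set_default_backend("NCCL")
    elif plugins.keys():
        custom_backend = next(iter(plugins.keys()))
        if custom_backend in config_values:  # case-sensitive comparison
            pg._set_default_backend("CUSTOM")
        # No else here: a failed match leaves the default untouched.
    else:
        pg._set_default_backend("GLOO")


pg = FakeProcessGroup()
select_default_backend(pg, {"MYBACKEND": object()}, ["gloo", "mybackend"])
print(pg.default_backend)  # UNDEFINED

pg_no_plugins = FakeProcessGroup()
select_default_backend(pg_no_plugins, {}, ["gloo"])
print(pg_no_plugins.default_backend)  # GLOO
```

The second call shows that the GLOO fallback is reachable only when no plugin is registered at all, so registering any custom backend is enough to lose it.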

Fix Action

Fix proposed

Proposed in PR: [distributed] Fix case mismatch preventing default backend type for custom OOT backends (https://github.com/pytorch/pytorch/pull/178759). The PR was still open and unmerged when this page was fetched.

PR fix notes

PR #178759: [distributed] Fix case mismatch preventing default backend type for custom OOT backends

Repository: pytorch/pytorch
Author: subinz1
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/178759

Description (problem / solution / changelog)

Fixes https://github.com/pytorch/pytorch/issues/178756

Problem

When creating a process group with a multi-backend string that includes a custom out-of-tree backend (e.g., "cpu:gloo,cuda:mybackend"), the default backend type is never set to CUSTOM. This causes hasHooks() to warn about a missing backend type during process group destruction, and getDefaultBackend() to fail entirely.

Root cause: register_backend() stores plugin keys in uppercase (Backend._plugins["MYBACKEND"]), but BackendConfig normalizes all backend values to lowercase via Backend.__new__() — so device_backend_map contains "mybackend". The comparison in the multi-backend path of _new_process_group_helper was case-sensitive, so "MYBACKEND" in ["gloo", "mybackend"] always evaluated to False.

Changes

1. Case-insensitive plugin key comparison (distributed_c10d.py): Normalize the plugin key with .lower() before comparing against backend_config.device_backend_map.values(), so "MYBACKEND".lower() correctly matches "mybackend" in the config.

2. Preserve default backend on ProcessGroup replacement (distributed_c10d.py): For Python-based custom backends (subclasses of ProcessGroup), the loop replaces the original pg object with the backend instance and breaks immediately. This discarded the default backend type that was just set on the original pg. Now we call _set_default_backend(backend_type) on the replacement pg so the backend type is preserved.

3. Regression test (test_c10d_common.py): Adds a test that registers a custom backend, verifies plugin keys match config values after case normalization, and runs a full init_process_group / barrier / destroy_process_group cycle with a custom multi-backend config.
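Change 1 amounts to lowercasing the plugin key before the membership test. A minimal sketch, using a hypothetical helper `matches_config` rather than the actual PyTorch code:

```python
def matches_config(plugin_key, config_values):
    """Return True if the registered plugin key names a backend in the config.

    plugin_key is uppercase (as stored on registration) and config_values are
    lowercase (as normalized by the backend config), so lowercase the key.
    """
    return plugin_key.lower() in config_values


config_values = ["gloo", "mybackend"]
print("MYBACKEND" in config_values)                # False (old comparison)
print(matches_config("MYBACKEND", config_values))  # True  (fixed comparison)
print(matches_config("OTHER", config_values))      # False
```

Lowercasing only the key is sufficient here because the config side is already normalized. A plugin that is registered but absent from the config still does not match.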

Testing

Tested on a remote server using Docker (quay.io/aipcc/pytorch:rhel9_6_pytorch_main_git5bfd4be_cuda12_8):

Without fix: Warning: No backend of type 0 found (backend type stayed UNDEFINED)
With fix: Backend type correctly set to CUSTOM (type 6)

cc @awgu @wanchaol @fegin @wconstab

Changed files

test/distributed/test_c10d_common.py (modified, +33/-0)
torch/distributed/distributed_c10d.py (modified, +2/-1)


🐛 Describe the bug

Bug Description

When creating a process group with a multi-backend string that includes a custom out-of-tree backend (e.g., "cpu:gloo,cuda:mybackend"), the default backend type is never set to CUSTOM. As a result, hasHooks() cannot find the default backend during process group destruction and emits a warning about a missing backend type.

Impact

The process group's backendType_ is never set to CUSTOM. Later, when hasHooks() is called (e.g., during process group destruction), it looks up backendType_ in backendTypeToBackend_:

```
// ProcessGroup.hpp, hasHooks():
bool hasHooks() const {
    auto backend_iter = backendTypeToBackend_.find(backendType_);
    if (backend_iter == backendTypeToBackend_.end()) {
        TORCH_WARN(
            "No backend of type ", uint16_t(backendType_),
            " found for Process Group with name ", getBackendName(),
            ". Assuming no hooks are registered.");
        return false;
    }
    return backend_iter->second->hasHooks();
}
```

Since backendType_ was never updated to CUSTOM, the lookup misses, the warning is emitted, and hasHooks() returns false. Similarly, getDefaultBackend() fails its TORCH_CHECK if called on this process group.
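The lookup behavior can be modeled in a few lines of Python. This is a rough model of the C++ snippet above, not the real class: `FakeProcessGroup` and `FakeBackend` are hypothetical, the values for UNDEFINED (0) and CUSTOM (6) are the ones reported in the PR's testing notes, and the value for GLOO is an assumption made only so the enum has a second entry.

```python
import warnings
from enum import IntEnum


class BackendType(IntEnum):
    UNDEFINED = 0  # value taken from the "type 0" warning quoted in the PR
    GLOO = 1       # assumed value, for illustration only
    CUSTOM = 6     # value taken from the PR's "type 6" note


class FakeBackend:
    def has_hooks(self):
        return False  # hook support is irrelevant to this sketch


class FakeProcessGroup:
    """Python model of the lookup in hasHooks(); not the real C++ class."""

    def __init__(self):
        self.backend_type = BackendType.UNDEFINED
        self.backend_type_to_backend = {}

    def has_hooks(self):
        backend = self.backend_type_to_backend.get(self.backend_type)
        if backend is None:
            warnings.warn(
                f"No backend of type {int(self.backend_type)} found. "
                "Assuming no hooks are registered."
            )
            return False
        return backend.has_hooks()


pg = FakeProcessGroup()
pg.backend_type_to_backend[BackendType.GLOO] = FakeBackend()
pg.backend_type_to_backend[BackendType.CUSTOM] = FakeBackend()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pg.has_hooks()  # backend_type is still UNDEFINED, so the lookup misses
print(len(caught))  # 1

pg.backend_type = BackendType.CUSTOM  # what the fix achieves
with warnings.catch_warnings(record=True) as caught_after:
    warnings.simplefilter("always")
    pg.has_hooks()
print(len(caught_after))  # 0
```

Both backends are present in the map in either case. Only the default type used as the lookup key differs.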

Reproduction

1. Create and register a custom out-of-tree backend via Backend.register_backend.
2. Initialize the default process group with a multi-backend string including the custom backend:
```
dist.init_process_group(backend="cpu:gloo,cuda:mybackend", ...)
```
3. Destroy the group. This triggers hasHooks() and the resulting warning/failure.

Suggested Fix

Normalize the case when comparing plugin keys against backend config values, e.g.:

```
if custom_backend.lower() in backend_config.device_backend_map.values():
```

Versions

PyTorch version: >=2.7.1.

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan

extent analysis

Fix Plan

To fix the issue, normalize case when comparing plugin keys against backend config values.

In _new_process_group_helper, lowercase the plugin key before the membership test. The config values are already lowercase, so lowercasing them as well is redundant but harmless:

```
if custom_backend.lower() in [value.lower() for value in backend_config.device_backend_map.values()]:
    pg._set_default_backend(ProcessGroup.BackendType.CUSTOM)
```

This is enough to fix the comparison. The PR additionally re-applies the default backend type when a Python-based custom backend replaces the pg object.

Verification

To verify that the fix worked, you can:

1. Create and register a custom out-of-tree backend via Backend.register_backend.
2. Initialize the default process group with a multi-backend string that includes the custom backend:

```
dist.init_process_group(backend="cpu:gloo,cuda:mybackend", ...)
```

3. Destroy the process group and check that hasHooks() no longer emits the "No backend of type 0 found" warning.

Extra Tips

Make sure to test the fix with different custom backend names and multi-backend strings to ensure that the issue is fully resolved.
Consider adding a test case to the PyTorch test suite to prevent similar issues in the future.
