pytorch - ✅(Solved) Fix all_gather_into_tensor in gloo backend does not support stacking [1 pull request, 1 participant]

pytorch · 2026-03-30 18:50:57

GitHub stats

pytorch/pytorch#178798•Fetched 2026-04-08 01:52:08

Author

ppwwyyxx

Participants

ppwwyyxx

Timeline (top)

mentioned ×12, subscribed ×12, referenced ×3, labeled ×2

Error Message

RuntimeError: ProcessGroupGloo::allgather: invalid tensor size at index 0
              (expected (N), got (1, N))

PR fix notes

PR #178844: [c10d] Support stacked output tensors in Gloo all_gather_into_tensor …

Repository: pytorch/pytorch
Author: saifmb0
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/178844

Description (problem / solution / changelog)

Summary
Fixes #178798

ProcessGroupGloo::_allgather_base chunked the output tensor along dim 0 and required every chunk to match the input shape exactly. For stacked outputs (world_size, *input.shape) the chunks retained the leading dimension — e.g. (1, N) vs (N,) — so the downstream allgather() validation rejected them with "invalid tensor size". NCCL never had this restriction because it only validates numel.
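The shape mismatch described above can be reproduced without any process group. The following is a minimal sketch in plain PyTorch, not the C++ code from the PR; the variable names and the choice of world_size = 4 are illustrative assumptions.

```python
import torch

world_size, n = 4, 12  # illustrative values, not taken from the issue
inp = torch.arange(n, dtype=torch.float32)

# Stacked output layout: (world_size, *input.shape).
stacked = torch.empty(world_size, *inp.shape)

# Chunking along dim 0 keeps the leading dimension, so each chunk is (1, N)
# and no longer matches the (N,) input shape.
stacked_chunk_shapes = [tuple(c.shape) for c in stacked.chunk(world_size, dim=0)]

# Viewing the same buffer in concatenated form first yields (N,) chunks.
concat = stacked.view(world_size * n)
concat_chunk_shapes = [tuple(c.shape) for c in concat.chunk(world_size, dim=0)]

# A numel-only comparison is satisfied by either layout.
numel_ok = stacked.numel() == world_size * inp.numel()

print(stacked_chunk_shapes[0], concat_chunk_shapes[0], numel_ok)
```

The per-chunk shape comparison is what rejects the stacked layout; the element count is identical in both cases.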

Fix
View the output tensor as the concatenated form before chunking, so both (world_size * N, ...) and (world_size, N, ...) layouts are accepted. Uses view() (not reshape()) to guarantee a zero-copy alias into the caller's buffer — if the output is non-contiguous in a way that prevents a view, we raise an explicit error instead of silently writing into a temporary copy.
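The difference between view() and reshape() that motivates this choice can be seen in a small Python sketch. It only illustrates the tensor semantics; the PR itself changes C++ code, and the tensors below are made-up examples.

```python
import torch

# A non-contiguous tensor: the transpose of a (3, 4) buffer.
base = torch.zeros(3, 4)
non_contig = base.t()  # shape (4, 3), strides (1, 4)

# reshape() silently falls back to a copy here, so writes never reach `base`.
copied = non_contig.reshape(12)
copied.fill_(1.0)
base_untouched = bool((base == 0).all())

# view() refuses instead of copying, which surfaces the problem to the caller.
try:
    non_contig.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True

# On a contiguous stacked output, view() is a true alias of the caller's buffer.
out = torch.zeros(2, 6)
alias = out.view(12)
alias[6:] = 5.0
writes_visible = bool((out[1] == 5.0).all())
shares_storage = alias.data_ptr() == out.data_ptr()
```

For a collective that writes into the output in place, a silent copy would leave the caller's tensor unfilled, which is why failing loudly is the safer behavior.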

Also adds a numel-based size check consistent with NCCL's validation.
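As a rough Python analogue of a numel-based check, the sketch below accepts any output whose element count is world_size times the input's. The helper name check_allgather_sizes is hypothetical and does not exist in PyTorch; the real validation in the PR lives in ProcessGroupGloo.cpp.

```python
import torch


def check_allgather_sizes(output, inp, world_size):
    """Hypothetical helper: validate an all-gather output by element count only."""
    expected = world_size * inp.numel()
    if output.numel() != expected:
        raise ValueError(
            f"output has {output.numel()} elements, expected {expected} "
            f"(world_size={world_size} x input numel={inp.numel()})"
        )


inp = torch.arange(12, dtype=torch.float32)

# Both the concatenated and the stacked layout pass a numel-only check.
check_allgather_sizes(torch.empty(4 * 12), inp, world_size=4)
check_allgather_sizes(torch.empty(4, 12), inp, world_size=4)

# An output sized for the wrong number of ranks is rejected.
try:
    check_allgather_sizes(torch.empty(3, 12), inp, world_size=4)
    rejected = False
except ValueError:
    rejected = True
```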

Test
test_allgather_into_tensor_stacked — exercises both 1-D→2-D stacking and 2-D→3-D stacking with multi-rank data verification on the Gloo backend.

Changed files

test/distributed/test_c10d_gloo.py (modified, +25/-0)
torch/csrc/distributed/c10d/ProcessGroupGloo.cpp (modified, +31/-1)

🐛 Describe the bug

import sys

import torch
import torch.distributed as dist


def main():
    backend = "nccl" if "--nccl" in sys.argv else "gloo"
    dist.init_process_group(backend=backend)
    world_size = dist.get_world_size()
    device = "cuda" if backend == "nccl" else "cpu"

    buffer = torch.arange(12, dtype=torch.float32, device=device)
    print(f"[{backend}] buffer.shape = {buffer.shape}")  # (12,)

    # ── BUG: 2-D output for 1-D input ────────────────────────────────
    gathered_bad = torch.empty(world_size, *buffer.shape, dtype=buffer.dtype, device=device)
    print(f"[{backend}] gathered_bad.shape = {gathered_bad.shape}")  # (1, 12) when ws=1
    try:
        dist.all_gather_into_tensor(gathered_bad, buffer)
        print(f"[{backend}] 2-D output accepted (NCCL is lenient)")
    except RuntimeError as e:
        print(f"[{backend}] 2-D output rejected: {e!s:.80}")

    # ── FIX: flatten for the collective, reshape after ────────────────
    gathered_flat = torch.empty(world_size * buffer.numel(), dtype=buffer.dtype, device=device)
    dist.all_gather_into_tensor(gathered_flat, buffer.view(-1))
    gathered = gathered_flat.view(world_size, *buffer.shape)
    # Every rank holds the same buffer, so the sum over ranks is world_size * buffer.
    result = gathered.sum(dim=0)
    assert torch.equal(result, world_size * buffer), "Mismatch!"
    print(f"[{backend}] Flat path works correctly.")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Demonstrate why all_gather_into_tensor needs flat tensors with GLOO.

GLOO's all_gather_into_tensor requires the output tensor to have the same number of dimensions as the input. Passing a (world_size, *input.shape) output — which is 2-D when input is 1-D — raises:

RuntimeError: ProcessGroupGloo::allgather: invalid tensor size at index 0
              (expected (N), got (1, N))

NCCL does NOT have this restriction.

Usage:

# GLOO (reproduces the bug):
torchrun --nproc_per_node=1 gloo-allgather-shape.py

# NCCL (no error, for comparison):
torchrun --nproc_per_node=1 gloo-allgather-shape.py --nccl

Versions

2.11

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan

extent analysis

Fix Plan

PR #178844 is still open and unmerged, so to work around the issue in the meantime, pass flat tensors to all_gather_into_tensor with GLOO:

1. Flatten the input tensor before the collective operation.
2. Reshape the output tensor after the collective operation.

Example code:

# Flatten the input tensor
buffer_flat = buffer.view(-1)

# Create an output tensor with the correct shape
gathered_flat = torch.empty(world_size * buffer.numel(), dtype=buffer.dtype, device=device)

# Perform the all_gather_into_tensor operation
dist.all_gather_into_tensor(gathered_flat, buffer_flat)

# Reshape the output tensor
gathered = gathered_flat.view(world_size, *buffer.shape)

Verification

To verify that the fix worked, check that the gathered tensor has the correct shape and contents. You can do this by printing the shape and contents of the gathered tensor:

print(f"[{backend}] gathered.shape = {gathered.shape}")
print(f"[{backend}] gathered = {gathered}")

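You can also check the sum of the gathered tensor along the first dimension. Every rank contributes the same buffer, so the sum equals world_size * buffer; with a single process (--nproc_per_node=1, as in the reproduction) that is the original buffer itself:

result = gathered.sum(dim=0)
assert torch.equal(result, world_size * buffer), "Mismatch!"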
