pytorch - ✅ (Solved) Fix: all_gather_into_tensor in gloo backend does not support stacking [1 pull request, 1 participant]

GitHub stats
pytorch/pytorch#178798 (fetched 2026-04-08 01:52:08)
Comments: 0 · Participants: 1 · Timeline events: 30 · Reactions: 0
Timeline (top): mentioned ×12, subscribed ×12, referenced ×3, labeled ×2

Error Message

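RuntimeError: ProcessGroupGloo::allgather: invalid tensor size at index 0 (expected (N), got (1, N))

With NCCL the same call raises no error.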

PR fix notes

PR #178844: [c10d] Support stacked output tensors in Gloo all_gather_into_tensor …

Description (problem / solution / changelog)

Summary
Fixes #178798

ProcessGroupGloo::_allgather_base chunked the output tensor along dim 0 and required every chunk to match the input shape exactly. For stacked outputs (world_size, *input.shape) the chunks retained the leading dimension — e.g. (1, N) vs (N,) — so the downstream allgather() validation rejected them with "invalid tensor size". NCCL never had this restriction because it only validates numel.
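
The shape mismatch described above can be reproduced without a process group. The following is a minimal sketch in plain PyTorch, assuming illustrative values world_size = 4 and N = 3 (not taken from the issue): chunking a stacked output along dim 0 keeps a leading dimension of size 1, while chunking the concatenated layout yields chunks shaped like the input.

```python
import torch

world_size, n = 4, 3  # illustrative values, not taken from the issue
inp = torch.zeros(n)

stacked = torch.empty(world_size, n)        # (world_size, *input.shape)
concatenated = torch.empty(world_size * n)  # (world_size * N,)

stacked_chunks = stacked.chunk(world_size, dim=0)
concat_chunks = concatenated.chunk(world_size, dim=0)

print(stacked_chunks[0].shape)  # torch.Size([1, 3]): leading dim retained
print(concat_chunks[0].shape)   # torch.Size([3]): matches the input
# Both layouts hold the same number of elements, which is all a
# numel-based check looks at.
print(stacked.numel() == concatenated.numel())  # True
```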

Fix
View the output tensor as the concatenated form before chunking, so both (world_size * N, ...) and (world_size, N, ...) layouts are accepted. Uses view() (not reshape()) to guarantee a zero-copy alias into the caller's buffer — if the output is non-contiguous in a way that prevents a view, we raise an explicit error instead of silently writing into a temporary copy.

Also adds a numel-based size check consistent with NCCL's validation.
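
The actual change lives in C++ (ProcessGroupGloo.cpp). As a rough Python analogue of the logic described above, the sketch below uses a hypothetical helper, output_chunks, which is not part of PyTorch and assumes an input with at least one dimension. It shows the numel check, the view into the concatenated form, and why view() is used: the chunks alias the caller's buffer, and a layout that cannot be viewed raises instead of being copied.

```python
import torch


def output_chunks(output, inp, world_size):
    """Hypothetical Python analogue of the chunking described in the PR."""
    # numel-based size check, in the spirit of the NCCL validation
    if output.numel() != world_size * inp.numel():
        raise ValueError("output numel must equal world_size * input numel")
    # view() never copies; it raises RuntimeError when no alias is possible
    concat = output.view(world_size * inp.shape[0], *inp.shape[1:])
    return concat.chunk(world_size, dim=0)


inp = torch.arange(3.0)
out = torch.zeros(2, 3)                 # stacked layout, world_size = 2
chunks = output_chunks(out, inp, world_size=2)
chunks[1].copy_(inp)                    # writes through to the caller's buffer
print(out[1])                           # tensor([0., 1., 2.])

try:
    # A transposed tensor has the right numel but cannot be viewed flat.
    output_chunks(torch.zeros(3, 2).t(), inp, world_size=2)
except RuntimeError as err:
    print("view refused:", type(err).__name__)
```

Using reshape() in place of view() would accept the transposed output but could hand back a copy, so the gathered data would never reach the tensor the caller passed in.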

Test
test_allgather_into_tensor_stacked — exercises both 1-D→2-D stacking and 2-D→3-D stacking with multi-rank data verification on the Gloo backend.

Changed files

  • test/distributed/test_c10d_gloo.py (modified, +25/-0)
  • torch/csrc/distributed/c10d/ProcessGroupGloo.cpp (modified, +31/-1)


🐛 Describe the bug

import sys

import torch
import torch.distributed as dist


def main():
    backend = "nccl" if "--nccl" in sys.argv else "gloo"
    dist.init_process_group(backend=backend)
    world_size = dist.get_world_size()
    device = "cuda" if backend == "nccl" else "cpu"

    buffer = torch.arange(12, dtype=torch.float32, device=device)
    print(f"[{backend}] buffer.shape = {buffer.shape}")  # (12,)

    # ── BUG: 2-D output for 1-D input ────────────────────────────────
    gathered_bad = torch.empty(world_size, *buffer.shape, dtype=buffer.dtype, device=device)
    print(f"[{backend}] gathered_bad.shape = {gathered_bad.shape}")  # (1, 12) when ws=1
    try:
        dist.all_gather_into_tensor(gathered_bad, buffer)
        print(f"[{backend}] 2-D output accepted (NCCL is lenient)")
    except RuntimeError as e:
        print(f"[{backend}] 2-D output rejected: {e!s:.80}")

    # ── FIX: flatten for the collective, reshape after ────────────────
    gathered_flat = torch.empty(world_size * buffer.numel(), dtype=buffer.dtype, device=device)
    dist.all_gather_into_tensor(gathered_flat, buffer.view(-1))
    gathered = gathered_flat.view(world_size, *buffer.shape)
    result = gathered.sum(dim=0)
    # Every rank contributes the same buffer, so the sum is world_size * buffer.
    assert torch.equal(result, world_size * buffer), "Mismatch!"
    print(f"[{backend}] Flat path works correctly.")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Demonstrate why all_gather_into_tensor needs flat tensors with GLOO.

GLOO's all_gather_into_tensor requires the output tensor to have the same number of dimensions as the input. Passing a (world_size, *input.shape) output — which is 2-D when input is 1-D — raises:

RuntimeError: ProcessGroupGloo::allgather: invalid tensor size at index 0
              (expected (N), got (1, N))

NCCL does NOT have this restriction.

Usage:

# GLOO (reproduces the bug):
torchrun --nproc_per_node=1 gloo-allgather-shape.py

# NCCL (no error, for comparison):
torchrun --nproc_per_node=1 gloo-allgather-shape.py --nccl

Versions

2.11

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan

extent analysis

Fix Plan

On PyTorch builds that do not include PR #178844, work around the Gloo restriction by passing flat tensors to all_gather_into_tensor:

  • Flatten the input tensor before the collective operation
  • Reshape the output tensor after the collective operation

Example code:

# Flatten the input tensor
buffer_flat = buffer.view(-1)

# Create an output tensor with the correct shape
gathered_flat = torch.empty(world_size * buffer.numel(), dtype=buffer.dtype, device=device)

# Perform the all_gather_into_tensor operation
dist.all_gather_into_tensor(gathered_flat, buffer_flat)

# Reshape the output tensor
gathered = gathered_flat.view(world_size, *buffer.shape)

Verification

To verify that the fix worked, check that the gathered tensor has the correct shape and contents. You can do this by printing the shape and contents of the gathered tensor:

print(f"[{backend}] gathered.shape = {gathered.shape}")
print(f"[{backend}] gathered = {gathered}")

You can also check the contents. Every rank contributes the same buffer, so the sum of the gathered tensor along the first dimension equals world_size times the original buffer (and equals the buffer itself in the single-rank repro):

result = gathered.sum(dim=0)
assert torch.equal(result, world_size * buffer), "Mismatch!"
