pytorch - ✅(Solved) Fix FakeProcessGroup all_gather_into_tensor fails when input has requires_grad=True [1 pull request, 1 participant]

pytorch · 2026-04-28 19:30:40

GitHub stats

pytorch/pytorch#181774 • Fetched 2026-04-29 06:11:01

Author: ezyang
Participants: ezyang
Assignees: ezyang
Timeline (top): mentioned ×14, subscribed ×14, labeled ×2, assigned ×1

Root Cause

FakeProcessGroup's all_gather_into_tensor raises a RuntimeError from an autograd in-place safety check when the input tensor has requires_grad=True: the backend splits the output with output.chunk() and calls copy_() on the resulting views, which autograd's multi-output-view check rejects. Real NCCL does not hit this because the collective is implemented as a C++ kernel that writes to the output buffer outside the autograd machinery.
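
The check can be triggered without any process group. The sketch below is a plain-autograd stand-in, assuming (per the PR description on this page) that the fake backend chunks the output and calls copy_() on each chunk. The helper emulate_fake_allgather is hypothetical and is not a PyTorch API.

```python
import torch


def emulate_fake_allgather(output, input_tensor, world_size):
    """Hypothetical stand-in for the fake backend: chunk the output, copy the input into each chunk."""
    for chunk in output.chunk(world_size):  # chunk() returns views of `output`
        chunk.copy_(input_tensor)           # in-place write into a view


world_size = 4
input_tensor = torch.randn(4, requires_grad=True)
output_tensor = torch.empty(4 * world_size)

try:
    emulate_fake_allgather(output_tensor, input_tensor, world_size)
    error_message = None
except RuntimeError as e:
    # Autograd rejects the in-place write because the source requires grad.
    error_message = str(e)

print(error_message)
```

Everything runs on CPU, so the sketch isolates the autograd behaviour from CUDA and from torch.distributed.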

Fix Action

Fixed

Fixed by PR: Fix FakeProcessGroup allgather on tensors that require grad (https://github.com/pytorch/pytorch/pull/181790)
Closed with commit: 9f812c3ca87134aa663128b6037a3289699cd9c2

PR fix notes

PR #181790: Fix FakeProcessGroup allgather on tensors that require grad

Repository: pytorch/pytorch
Author: ezyang
State: closed | merged: False
Link: https://github.com/pytorch/pytorch/pull/181790

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

-> #181790

Fixes https://github.com/pytorch/pytorch/issues/181774

FakeProcessGroup::_allgather_base and allgather_into_tensor_coalesced use output.chunk() to produce views and then copy_() into them. When the input requires grad, autograd's multi-output-view safety check rejects the inplace write and the collective fails with "A view was created in no_grad mode and is being modified inplace with grad mode enabled". Real backends (e.g. NCCL) do not hit this because their C++ kernels are invisible to autograd.

Wrap the chunk+copy in at::AutoDispatchBelowAutograd to match the real backends' behavior.
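
at::AutoDispatchBelowAutograd is a C++ guard, so it cannot be shown directly from Python. As a rough Python-level analogue only (an assumption made for illustration, not what the PR does), running the chunk and copy under torch.no_grad() also keeps autograd from checking the in-place write. The helper gather_without_autograd is hypothetical.

```python
import torch


def gather_without_autograd(output, input_tensor, world_size):
    """Hypothetical helper: chunk + copy_ with autograd recording disabled."""
    with torch.no_grad():
        for chunk in output.chunk(world_size):
            chunk.copy_(input_tensor)
    return output


input_tensor = torch.arange(4.0, requires_grad=True)
output_tensor = gather_without_autograd(torch.empty(8), input_tensor, world_size=2)

print(output_tensor.tolist())        # the input repeated once per rank
print(output_tensor.requires_grad)   # the gathered buffer stays outside the autograd graph
```

As with a real backend, the gathered output does not carry gradient history back to the input.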

Authored with Claude.

Changed files

test/distributed/test_fake_pg.py (modified, +12/-0)
torch/csrc/distributed/c10d/FakeProcessGroup.hpp (modified, +8/-0)

Code Example

"""Minimal reproducer for fake PG all_gather_into_tensor failure.

The fake backend's _allgather_base uses output.chunk() internally,
creating views. When the input tensor has requires_grad=True, the
inplace write into those views triggers an autograd safety check.
Real NCCL doesn't hit this because it writes via a C++ kernel that
bypasses autograd.

Usage:
  # Fake PG (fails):
  python repro_fake_pg_allgather.py --backend fake
  # Real NCCL (works):
  torchrun --nproc_per_node 2 repro_fake_pg_allgather.py --backend nccl
"""
import argparse
import os

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", choices=["fake", "nccl"], required=True)
    args = parser.parse_args()

    # The fake backend runs in a single process on GPU 0; under torchrun each rank uses its LOCAL_RANK.
    local_rank = 0 if args.backend == "fake" else int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    if args.backend == "fake":
        dist.init_process_group("fake", rank=0, world_size=4)
    else:
        dist.init_process_group("nccl")

    world_size = dist.get_world_size()
    input_tensor = torch.randn(4, device="cuda", requires_grad=True)
    output_tensor = torch.empty(4 * world_size, device="cuda")

    try:
        dist.all_gather_into_tensor(output_tensor, input_tensor)
        print(f"backend={args.backend}, requires_grad=True: SUCCESS")
    except RuntimeError as e:
        print(f"backend={args.backend}, requires_grad=True: FAILED: {e}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

🐛 Describe the bug

Description from Claude:

The fake backend's _allgather_base (implemented in C++ in torch/csrc/distributed/c10d/FakeProcessGroup.hpp) calls output.chunk(world_size) to split the output buffer; the chunks are views of that buffer. It then copies the input into each chunk, and an in-place write into one of these views from a source tensor with requires_grad=True trips the safety check.
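
To make the "chunks are views" point concrete, here is a small CPU-only sketch. It uses only plain tensors (nothing requires grad), so it shows the aliasing on its own, without the autograd error.

```python
import torch

world_size = 4
output = torch.zeros(4 * world_size)
chunks = output.chunk(world_size)

# Each chunk is a view: it shares storage with `output` instead of owning a copy.
all_views = all(c._is_view() for c in chunks)
same_storage_start = chunks[0].data_ptr() == output.data_ptr()

# A write through a chunk is therefore visible in the original buffer.
# No tensor here requires grad, so autograd does not object to the in-place op.
chunks[1].fill_(1.0)

print(all_views, same_storage_start, output[4:8].tolist())
```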

Repro

python repro_fake_pg_allgather.py --backend fake → FAILS
torchrun --nproc_per_node 2 repro_fake_pg_allgather.py --backend nccl → SUCCESS

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @weifengpy @H-Huang

extent analysis

TL;DR

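Make FakeProcessGroup's _allgather_base write into the chunked views of the output buffer without tripping autograd when the input has requires_grad=True. The linked PR does this by wrapping the chunk and copy in at::AutoDispatchBelowAutograd.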

Guidance

Identify the source of the inplace write in _allgather_base, specifically the line where output.chunk(world_size) is used to split the output buffer.
Consider creating a temporary buffer to store the results of the all-gather operation, rather than writing directly into the output tensor.
Verify that the requires_grad attribute of the input tensor is the root cause of the issue by testing with requires_grad=False.
Investigate alternative implementations of the _allgather_base function that avoid creating views of the output tensor.
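
The requires_grad check from the guidance above can be run without a GPU or a process group. This is a minimal sketch, assuming the chunk-and-copy behaviour described in the PR: chunk_and_copy is a hypothetical stand-in for the fake backend, and passing a detached input is a caller-side workaround suggested here, not the merged fix.

```python
import torch


def chunk_and_copy(output, input_tensor, world_size):
    """Hypothetical stand-in for the fake backend's chunk + copy_ loop."""
    for chunk in output.chunk(world_size):
        chunk.copy_(input_tensor)


def try_gather(input_tensor, world_size=2):
    """Return None on success, or the RuntimeError text on failure."""
    output = torch.empty(input_tensor.numel() * world_size)
    try:
        chunk_and_copy(output, input_tensor, world_size)
        return None
    except RuntimeError as e:
        return str(e)


results = {
    "requires_grad=False": try_gather(torch.randn(4)),
    "requires_grad=True": try_gather(torch.randn(4, requires_grad=True)),
    "requires_grad=True, detached": try_gather(torch.randn(4, requires_grad=True).detach()),
}
for name, err in results.items():
    print(name, "->", "ok" if err is None else "failed")
```

Only the case where the source tensor itself requires grad fails. Detaching is reasonable on the caller side because, as the issue notes, a real backend's write is invisible to autograd anyway.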

Example

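# Caller-side workaround sketch (not the merged fix); reuses input_tensor and
# output_tensor from the reproducer above.
import torch
import torch.distributed as dist

# Create a temporary buffer to store the results of the all-gather operation
temp_buffer = torch.empty_like(output_tensor)
# Gather a detached input so the fake backend never sees requires_grad=True
dist.all_gather_into_tensor(temp_buffer, input_tensor.detach())
# Copy the results from the temporary buffer to the output tensor
output_tensor.copy_(temp_buffer)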

Notes

The reproducer and the issue description locate the problem in FakeProcessGroup's _allgather_base. The linked PR #181790 confirms this and applies the fix in torch/csrc/distributed/c10d/FakeProcessGroup.hpp.

Recommendation

Use a build that includes the fix from PR #181790, which stops autograd from rejecting the in-place writes into the output views when the input has requires_grad=True.
