pytorch - ✅ (Solved) Fix: FakeProcessGroup all_gather_into_tensor fails when input has requires_grad=True [1 pull request, 1 participant]

GitHub stats
pytorch/pytorch#181774 (fetched 2026-04-29 06:11:01)
Comments: 0
Participants: 1
Timeline: 34
Reactions: 1
Timeline (top): mentioned ×14, subscribed ×14, labeled ×2, assigned ×1

Root Cause

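FakeProcessGroup's all_gather_into_tensor raises a RuntimeError from an autograd in-place safety check when the input tensor has requires_grad=True. The issue report attributes it to the "leaf variable in-place" check; the fix PR identifies it as the multi-output-view check, which fires because the fake backend splits the output with output.chunk() and then calls copy_() on the resulting views. Real NCCL does not hit this because the collective is implemented as a C++ kernel that writes to the output buffer outside the autograd machinery.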

Fix Action

Fixed

PR fix notes

PR #181790: Fix FakeProcessGroup allgather on tensors that require grad

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

  • -> #181790

Fixes https://github.com/pytorch/pytorch/issues/181774

FakeProcessGroup::_allgather_base and allgather_into_tensor_coalesced use output.chunk() to produce views and then copy_() into them. When the input requires grad, autograd's multi-output-view safety check rejects the inplace write and the collective fails with "A view was created in no_grad mode and is being modified inplace with grad mode enabled". Real backends (e.g. NCCL) do not hit this because their C++ kernels are invisible to autograd.
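The check described above can be triggered without any process group. The following is a minimal sketch, assuming plain CPU tensors and a hypothetical helper name (chunk_copy_error); it mirrors the chunk() then copy_() pattern and returns whatever message autograd raises.

```python
import torch


def chunk_copy_error(world_size: int = 4) -> str:
    """Return the RuntimeError message raised by copying a tensor that
    requires grad into chunk() views of a plain output buffer.

    Returns an empty string if no error is raised.
    Hypothetical helper for illustration; not part of PyTorch.
    """
    inp = torch.randn(4, requires_grad=True)
    out = torch.empty(4 * world_size)  # does not require grad
    try:
        # chunk() returns several views of `out`; copy_() writes into them
        # in place while autograd is still watching.
        for view in out.chunk(world_size):
            view.copy_(inp)
    except RuntimeError as e:
        return str(e)
    return ""


if __name__ == "__main__":
    print(chunk_copy_error())
```

On a build that enforces the check, the printed message should resemble the one quoted in the PR description, though the exact wording may vary between PyTorch versions.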

Wrap the chunk+copy in at::AutoDispatchBelowAutograd to match the real backends' behavior.
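at::AutoDispatchBelowAutograd is a C++ guard with no direct Python equivalent. As a rough Python-level analogue only (an assumption for illustration, not what the PR does), the same chunk+copy can be run under torch.no_grad(), which likewise keeps autograd from recording or checking the in-place writes. The helper name below is hypothetical.

```python
import torch


def fake_allgather_no_grad(out: torch.Tensor, inp: torch.Tensor, world_size: int) -> None:
    """Copy `inp` into every chunk of `out` with grad mode disabled.

    Hypothetical helper: a Python analogue of hiding the chunk+copy from
    autograd. The actual fix uses a C++ dispatch guard instead.
    """
    with torch.no_grad():
        for view in out.chunk(world_size):
            view.copy_(inp)


inp = torch.randn(4, requires_grad=True)
out = torch.empty(16)
fake_allgather_no_grad(out, inp, world_size=4)
```

The two mechanisms differ: the C++ guard excludes the autograd dispatch keys for the enclosed calls, while no_grad() only switches grad mode off. In both cases the output buffer stays outside the autograd graph.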

Authored with Claude.

Changed files

  • test/distributed/test_fake_pg.py (modified, +12/-0)
  • torch/csrc/distributed/c10d/FakeProcessGroup.hpp (modified, +8/-0)

Code Example

"""Minimal reproducer for fake PG all_gather_into_tensor failure.

The fake backend's _allgather_base uses output.chunk() internally,
creating views. When the input tensor has requires_grad=True, the
inplace write into those views triggers an autograd safety check.
Real NCCL doesn't hit this because it writes via a C++ kernel that
bypasses autograd.

Usage:
  # Fake PG (fails):
  python repro_fake_pg_allgather.py --backend fake
  # Real NCCL (works):
  torchrun --nproc_per_node 2 repro_fake_pg_allgather.py --backend nccl
"""
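import argparse
import os

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", choices=["fake", "nccl"], required=True)
    args = parser.parse_args()

    # The fake backend runs in a single process, so it always uses device 0;
    # under torchrun, each process picks its device from LOCAL_RANK.
    local_rank = 0 if args.backend == "fake" else int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)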

    if args.backend == "fake":
        dist.init_process_group("fake", rank=0, world_size=4)
    else:
        dist.init_process_group("nccl")

    world_size = dist.get_world_size()
    input_tensor = torch.randn(4, device="cuda", requires_grad=True)
    output_tensor = torch.empty(4 * world_size, device="cuda")

    try:
        dist.all_gather_into_tensor(output_tensor, input_tensor)
        print(f"backend={args.backend}, requires_grad=True: SUCCESS")
    except RuntimeError as e:
        print(f"backend={args.backend}, requires_grad=True: FAILED: {e}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

🐛 Describe the bug

Description from Claude:

FakeProcessGroup's all_gather_into_tensor raises a RuntimeError from the autograd "leaf variable in-place" safety check when the input tensor has requires_grad=True. Real NCCL does not hit this because the collective is implemented as a C++ kernel that writes to the output buffer outside the autograd machinery.

The fake backend's _allgather_base calls output.chunk(world_size) to split the output buffer; the chunks are views of that buffer. It then copies the input into each chunk, and an inplace write of an input with requires_grad=True into such a view trips the safety check.

Repro

  • python repro_fake_pg_allgather.py --backend fake → FAILS
  • torchrun --nproc_per_node 2 repro_fake_pg_allgather.py --backend nccl → SUCCESS

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @weifengpy @H-Huang

extent analysis

TL;DR

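Run the chunk+copy in FakeProcessGroup's _allgather_base (and allgather_into_tensor_coalesced) below the autograd dispatch layer, so that autograd never sees the inplace writes into views of the output buffer. PR #181790 does this with at::AutoDispatchBelowAutograd.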

Guidance

  • Identify the source of the inplace write in _allgather_base, specifically the line where output.chunk(world_size) is used to split the output buffer.
  • Consider creating a temporary buffer to store the results of the all-gather operation, rather than writing directly into the output tensor.
  • Verify that the requires_grad attribute of the input tensor is the root cause of the issue by testing with requires_grad=False.
  • Investigate alternative implementations of the _allgather_base function that avoid creating views of the output tensor.
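The requires_grad check suggested in the guidance can be tried at the tensor level, without a process group. This is a minimal sketch under the assumption that the fake backend's behavior reduces to chunk() followed by copy_(); the helper name is hypothetical.

```python
import torch


def chunk_copy_raises(inp: torch.Tensor, world_size: int = 4) -> bool:
    """Return True if copying `inp` into chunk() views of a fresh output
    buffer raises a RuntimeError. Hypothetical helper for illustration."""
    out = torch.empty(inp.numel() * world_size)
    try:
        for view in out.chunk(world_size):
            view.copy_(inp)
    except RuntimeError:
        return True
    return False


results = {
    "requires_grad=True": chunk_copy_raises(torch.randn(4, requires_grad=True)),
    "requires_grad=False": chunk_copy_raises(torch.randn(4)),
}

if __name__ == "__main__":
    for label, raised in results.items():
        print(f"{label}: {'raises' if raised else 'ok'}")
```

If only the requires_grad=True case raises, the input's requires_grad flag is the trigger, which is consistent with the issue description.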

Example

# Create a temporary buffer to store the results of the all-gather operation
temp_buffer = torch.empty_like(output_tensor)
# Perform the all-gather operation into the temporary buffer
# ...
# Copy the results from the temporary buffer to the output tensor
output_tensor.copy_(temp_buffer)
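One way to fill in the elided step for the fake case is sketched below. It assumes the fake all-gather simply replicates the local input world_size times, and it detaches the input so that the output buffer stays outside the autograd graph. The helper name is hypothetical, and this is a caller-side illustration rather than the fix that was adopted.

```python
import torch


def gather_via_temp_buffer(out: torch.Tensor, inp: torch.Tensor, world_size: int) -> None:
    """Build the gathered result out of place, then copy it into `out`
    with a single write that does not go through chunk() views.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    # detach() drops the autograd link, so copy_() below is not tracked.
    temp_buffer = torch.cat([inp.detach()] * world_size, dim=0)
    out.copy_(temp_buffer)


inp = torch.randn(4, requires_grad=True)
out = torch.empty(16)
gather_via_temp_buffer(out, inp, world_size=4)
```

Without the detach() call, the copy would still succeed, but the output buffer would become part of the autograd graph, which is presumably not what a real backend does.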

Notes

The reproducer and the issue description place the problem in FakeProcessGroup's _allgather_base, and PR #181790 confirms it: the chunk+copy there (and in allgather_into_tensor_coalesced) ran with autograd active.

Recommendation

Prefer the fix from PR #181790, which wraps the chunk+copy in at::AutoDispatchBelowAutograd to match the real backends. On builds without the fix, treat changes that avoid inplace writes into views of the output buffer as a workaround only.
