pytorch - ✅(Solved) Fix Functional Collectives shouldn't assert contiguous inputs [1 pull requests, 2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
pytorch/pytorch#177902Fetched 2026-04-08 01:03:01
View on GitHub
Comments
2
Participants
2
Timeline
44
Reactions
0
Author
Timeline (top)
mentioned ×15subscribed ×15referenced ×7labeled ×3

Fix Action

Fixed

PR fix notes

PR #177965: [c10d] Remove contiguous assertions from functional collectives API

Description (problem / solution / changelog)

Instead of asserting that inputs are contiguous (and crashing), call .contiguous() in the C++ backend implementation. This is a no-op for already-contiguous tensors and produces correct behavior for non-contiguous ones.

Fixes #177902

Changed files

  • torch/csrc/distributed/c10d/Functional.cpp (modified, +15/-19)
  • torch/distributed/_functional_collectives.py (modified, +0/-2)
RAW_BUFFERClick to expand / collapse

Can we remove the contiguous assertions in the python layer functional collectives APIs, and let this be a backend implementation detail?

It turns out we have asymmetry for the eager implementation of funcol apis today.

all_reduce and broadcast are the ops that are actually in-place in the underlying NCCL API (they overwrite their input buffer with the result). So the functional wrapper has to clone the input to preserve functional semantics. The clone happens to make things contiguous as a side effect.

--> for these ops, we can easily remove the input contiguous assertion from the API layer

For all_gather / reduce_scatter, NCCL takes separate input and output buffers natively. The C++ code allocates a fresh output and passes the input directly — no clone needed for functional semantics. But this means the input's memory layout passes through to NCCL unguarded (except for the assertions). And NCCL does require contiguous inputs here, IIUC. cc @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @kwen2501

--> for these ops, we'd have to add a new input-clone to the eager backend impl. This would have a direct performance impact on users today (e.g. fsdp, dtensor), which is probably why it wasn't done in the first place

Thoughts?

cc @fmassa @ezyang

extent analysis

Fix Plan

To address the issue, we will remove contiguous assertions for all_reduce and broadcast operations in the Python layer functional collectives APIs. For all_gather and reduce_scatter, we will add input cloning in the eager backend implementation.

Steps for all_reduce and broadcast:

  • Remove input contiguous assertions from the API layer
  • Example code:
# Before
def all_reduce(input_tensor):
    assert input_tensor.is_contiguous()
    # ...

# After
def all_reduce(input_tensor):
    # Remove assertion
    # ...

Steps for all_gather and reduce_scatter:

  • Add input cloning in the eager backend implementation
  • Example code:
# Before
def all_gather(input_tensor):
    # Pass input directly to NCCL
    # ...

# After
def all_gather(input_tensor):
    cloned_input = input_tensor.contiguous().clone()
    # Pass cloned input to NCCL
    # ...

Verification

To verify the fix, test the functional collectives APIs with non-contiguous input tensors and ensure that the operations complete successfully without assertions.

Extra Tips

  • Monitor performance impact on users, especially for all_gather and reduce_scatter operations.
  • Consider adding documentation to explain the implementation details and performance implications.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING