pytorch: [Inductor][TP] Micro-pipeline TP fusion misses slice/cat collective patterns

Minimal repro

import torch
import torch.distributed as dist
from torch.fx.experimental.proxy_tensor import make_fx
from torch.ops import _c10d_functional as c10d
from torch.distributed._functional_collectives import all_gather_tensor, reduce_scatter_tensor

# Run under a distributed test/process group. The important FX shapes are local;
# the same patterns appear after graph-level chunking around tensor-parallel
# all-gather and reduce-scatter collectives.

def ag_slice_cat_mm(x, w):
    y = all_gather_tensor(x, gather_dim=0, group=dist.group.WORLD)
    y = torch.cat([y.narrow(0, 0, 64), y.narrow(0, 64, 64)], dim=1)
    return y @ w

def mm_slice_cat_rs(x, w):
    y = x @ w
    y = torch.cat([y.narrow(1, 0, 8), y.narrow(1, 8, 8)], dim=0)
    return reduce_scatter_tensor(y, "avg", scatter_dim=0, group=dist.group.WORLD)

# micro_pipeline_tp_pass should recognize these as all-gather+matmul and
# matmul+reduce-scatter fusion opportunities, but today the slice/cat
# reassembly hides the collective from the existing pattern matcher.

Issue

Graph-level chunking can introduce a slice/cat reassembly around a collective's result or input while preserving the same logical tensor value that the tensor-parallel matmul consumes or produces. The micro-pipeline TP pass currently recognizes the direct all-gather/matmul and matmul/reduce-scatter patterns, but it misses these slice/cat forms.
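To make the slice/cat reassembly concrete, here is a small single-process sketch with no collectives involved. It only illustrates what `narrow` followed by `cat` does to a tensor; the toy shape `(4, 6)` and the variable names are assumptions chosen for illustration and are not taken from the issue.

```python
import torch

# Toy stand-in for a collective result; the (4, 6) shape is illustrative only.
y = torch.arange(24.0).reshape(4, 6)

# Slicing along dim 0 and concatenating along the same dim rebuilds y exactly.
same_dim = torch.cat([y.narrow(0, 0, 2), y.narrow(0, 2, 2)], dim=0)

# Slicing along dim 0 and concatenating along dim 1 (the form used in
# ag_slice_cat_mm) keeps every element but lays the two row blocks side by side.
cross_dim = torch.cat([y.narrow(0, 0, 2), y.narrow(0, 2, 2)], dim=1)

print(torch.equal(same_dim, y))
print(tuple(cross_dim.shape))
```

In both cases the extra nodes carry no computation of their own, which is why they can plausibly be treated as part of the collective pattern rather than as a barrier to it.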

Expected behavior

The TP micro-pipeline pass should fuse all-gather+matmul and matmul+reduce-scatter when the extra slice/cat nodes only reassemble the collective value into the layout expected by the matmul/reduce-scatter pattern.
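One way to read "only reassemble" is that the slices must cover the sliced dimension exactly once, in order. The sketch below encodes that reading as a plain Python check; `slices_tile_dimension` is a hypothetical helper name, and the `(start, length)` pairs mirror the `narrow` arguments in the repro. Whether the real pass should use exactly this condition is an assumption, not something the issue states.

```python
def slices_tile_dimension(slices, dim_size):
    """Return True when the (start, length) pairs cover [0, dim_size)
    in order, with no gap and no overlap.

    Hypothetical helper for illustration; not an Inductor API.
    """
    cursor = 0
    for start, length in slices:
        # Each slice must be non-empty and begin where the previous one ended.
        if length <= 0 or start != cursor:
            return False
        cursor += length
    # The last slice must end exactly at the end of the dimension.
    return cursor == dim_size


# narrow(0, 0, 64) and narrow(0, 64, 64) over a dimension of size 128
print(slices_tile_dimension([(0, 64), (64, 64)], 128))
# Overlapping slices do not qualify as a pure reassembly
print(slices_tile_dimension([(0, 64), (32, 64)], 128))
```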

Actual behavior

The pass leaves the all-gather or reduce-scatter unfused, so the generated graph for chunked tensor-parallel workloads regresses relative to the unchunked form.

Suggested fix

Extend the existing pattern detection helpers to recognize the slice/cat collective forms, preserve the same safety checks around matmul users, and erase only the matched reassembly nodes after replacing the fused pattern.
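As a rough sketch of what such a detection helper could look like, the snippet below walks a `torch.fx` graph and finds `cat` nodes whose inputs are all `narrow` calls on one source node. It is not the code in `micro_pipeline_tp_pass`: `find_slice_cat_source` and `reassemble_then_mm` are hypothetical names, the graph is produced with `torch.fx.symbolic_trace` at the `torch.*` level for simplicity (the real pass presumably matches lower-level ops, so the targets would differ), and the coverage and matmul-user safety checks are left out.

```python
import torch
import torch.fx as fx


def find_slice_cat_source(cat_node):
    """Return the single node that every input of a torch.cat call narrows,
    or None if the node does not have that form.

    Hypothetical helper for illustration; not part of Inductor.
    """
    if cat_node.op != "call_function" or cat_node.target is not torch.cat:
        return None
    pieces = cat_node.args[0]
    if not isinstance(pieces, (list, tuple)) or not pieces:
        return None
    sources = set()
    for piece in pieces:
        is_narrow = (
            isinstance(piece, fx.Node)
            and piece.op == "call_function"
            and piece.target is torch.narrow
        )
        if not is_narrow:
            return None
        sources.add(piece.args[0])  # the tensor being sliced
    # All pieces must come from the same producer node.
    return sources.pop() if len(sources) == 1 else None


def reassemble_then_mm(x, w):
    # Same slice/cat form as the repro, with small illustrative sizes.
    y = torch.cat([torch.narrow(x, 0, 0, 2), torch.narrow(x, 0, 2, 2)], dim=1)
    return y @ w


gm = fx.symbolic_trace(reassemble_then_mm)
matches = {}
for node in gm.graph.nodes:
    src = find_slice_cat_source(node)
    if src is not None:
        matches[node.name] = src.name
print(matches)
```

In a real pass, the returned source node would then be checked against the existing all-gather or matmul matchers, and only the matched `narrow` and `cat` nodes would be erased after the fused replacement is in place.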

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
