[Inductor][TP] Micro-pipeline TP fusion misses slice/cat collective patterns

pytorch · 2026-05-22 01:57:22

Minimal repro

import torch
import torch.distributed as dist
from torch.fx.experimental.proxy_tensor import make_fx
from torch.ops import _c10d_functional as c10d
from torch.distributed._functional_collectives import all_gather_tensor, reduce_scatter_tensor

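# Run under a distributed test/process group. The important FX shapes are local;
# the same patterns appear after graph-level chunking around tensor-parallel
# all-gather and reduce-scatter collectives.
# Note: each function reassembles exactly two slices, so the hard-coded
# narrow() sizes (64 and 8) assume a world size of 2 and matching input shapes.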

def ag_slice_cat_mm(x, w):
    y = all_gather_tensor(x, gather_dim=0, group=dist.group.WORLD)
    y = torch.cat([y.narrow(0, 0, 64), y.narrow(0, 64, 64)], dim=1)
    return y @ w

def mm_slice_cat_rs(x, w):
    y = x @ w
    y = torch.cat([y.narrow(1, 0, 8), y.narrow(1, 8, 8)], dim=0)
    return reduce_scatter_tensor(y, "avg", scatter_dim=0, group=dist.group.WORLD)

# micro_pipeline_tp_pass should recognize these as all-gather+matmul and
# matmul+reduce-scatter fusion opportunities, but today the slice/cat
# reassembly hides the collective from the existing pattern matcher.

Issue

Graph-level chunking can introduce a slice/cat reassembly around collective results or inputs while preserving the same logical tensor value consumed by tensor-parallel matmuls. The micro-pipeline TP pass currently recognizes direct all-gather/matmul and matmul/reduce-scatter shapes, but it misses these slice/cat forms.
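To see why the slice/cat form carries the same logical value, the all-gather case can be checked locally without a process group. The following is a minimal sketch, assuming a world size of 2 and using `torch.cat` over per-rank shards as a stand-in for `all_gather_tensor`; the helper name `reassemble_dim0_to_dim1` is hypothetical and not part of PyTorch.

```python
import torch

def reassemble_dim0_to_dim1(gathered, world_size):
    # Hypothetical helper: slice the dim-0 gathered tensor into per-rank
    # blocks and concatenate them along dim 1 (the slice/cat reassembly).
    rows = gathered.shape[0] // world_size
    pieces = [gathered.narrow(0, r * rows, rows) for r in range(world_size)]
    return torch.cat(pieces, dim=1)

world_size = 2  # assumption, matching the two slices in the repro
shards = [torch.randn(64, 16) for _ in range(world_size)]  # one shard per rank

# Local stand-in for all_gather_tensor(x, gather_dim=0).
gathered = torch.cat(shards, dim=0)          # shape (128, 16)
reassembled = reassemble_dim0_to_dim1(gathered, world_size)

# The reassembled value is what a gather along dim 1 would have produced.
direct = torch.cat(shards, dim=1)            # shape (64, 32)
assert reassembled.shape == (64, 32)
assert torch.equal(reassembled, direct)
```

In other words, the matmul consumes a tensor that is element-for-element the all-gather result in a different gather layout, which is what makes the pattern a fusion candidate in principle.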

Expected behavior

The TP micro-pipeline pass should fuse all-gather+matmul and matmul+reduce-scatter when the extra slice/cat nodes only reassemble the collective value into the layout expected by the matmul/reduce-scatter pattern.
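The reduce-scatter side can be checked the same way. This is a rough local sketch, assuming a world size of 2; `simulated_reduce_scatter_avg` is a hypothetical stand-in that averages the per-rank tensors and splits the result, not the real `reduce_scatter_tensor`.

```python
import torch

def simulated_reduce_scatter_avg(per_rank, scatter_dim):
    # Hypothetical local stand-in for a reduce-scatter with an "avg" reduction:
    # average across ranks, then hand each rank one chunk along scatter_dim.
    mean = torch.stack(per_rank).mean(dim=0)
    return list(mean.chunk(len(per_rank), dim=scatter_dim))

world_size = 2  # assumption, matching the two slices in the repro
outs = [torch.randn(4, 16) for _ in range(world_size)]  # per-rank matmul outputs

# The slice/cat reassembly from mm_slice_cat_rs: move column blocks into rows.
reassembled = [
    torch.cat([y.narrow(1, 0, 8), y.narrow(1, 8, 8)], dim=0) for y in outs
]  # each (8, 8)

via_slice_cat = simulated_reduce_scatter_avg(reassembled, scatter_dim=0)
direct = simulated_reduce_scatter_avg(outs, scatter_dim=1)

# Each rank ends up with the same block either way.
for a, b in zip(via_slice_cat, direct):
    assert torch.allclose(a, b)
```

So the reassembly followed by a dim-0 reduce-scatter yields the same per-rank result as a reduce-scatter along dim 1 of the matmul output.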

Actual behavior

The pass leaves the all-gather or reduce-scatter unfused, so chunked tensor-parallel workloads lose the fusion that the direct patterns still receive. This is a regression in the generated graph.

Suggested fix

Extend the existing pattern detection helpers to recognize the slice/cat collective forms, preserve the same safety checks around matmul users, and erase only the matched reassembly nodes after replacing the fused pattern.
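As an illustration of the detection step only, the sketch below walks an FX graph and reports `cat` nodes whose inputs are contiguous `narrow` slices of a single source node. It is not the Inductor implementation: the helper name `reassembly_source` is hypothetical, the graph comes from `torch.fx.symbolic_trace` rather than the post-grad ATen graph the real pass operates on, and it does not verify that the slices cover the whole source dimension (that would need shape metadata).

```python
import torch
import torch.fx as fx

def reassembly_source(cat_node):
    """Return the node that cat_node reassembles from narrow() slices, else None."""
    if cat_node.op != "call_function" or cat_node.target != torch.cat:
        return None
    if not cat_node.args or not isinstance(cat_node.args[0], (list, tuple)):
        return None
    source, slice_dim, next_start = None, None, 0
    for piece in cat_node.args[0]:
        is_narrow = (
            isinstance(piece, fx.Node)
            and piece.op == "call_method"
            and piece.target == "narrow"
            and len(piece.args) == 4
        )
        if not is_narrow:
            return None
        src, dim, start, length = piece.args
        if source is None:
            source, slice_dim = src, dim
        # All slices must come from one source, one dim, back to back from 0.
        if src is not source or dim != slice_dim or start != next_start:
            return None
        next_start = start + length
    return source

def reassemble(y):
    return torch.cat([y.narrow(0, 0, 64), y.narrow(0, 64, 64)], dim=1)

gm = fx.symbolic_trace(reassemble)
matches = [
    (node, reassembly_source(node))
    for node in gm.graph.nodes
    if reassembly_source(node) is not None
]
for cat_node, source in matches:
    print(f"{cat_node.name} reassembles {source.name}")
```

A real pass would additionally confirm that the source is the collective result (or that the `cat` feeds the reduce-scatter), apply the existing safety checks on matmul users, and only then erase the matched `narrow` and `cat` nodes.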

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
