pytorch - 💡 (How to fix) Fix torch.compile doesn't use async all_to_all


🐛 Describe the bug

I am interleaving matmuls with all_to_all collectives. I was hoping that torch.compile would overlap compute and comms given this comment from @xmfan:

In today's graphs, Dynamo rewrites synchronous collectives as asynchronous collectives whenever possible. There's always a stream sync at the end of each graph. So if your async collective usage is fully contained within a single graph (without graph breaks), I expect async_op=False + graph captured to give you similar overlap.

Unfortunately it looks like the all_to_alls are synchronous. Maybe there is a more idiomatic way to express my function which would work better with torch.compile?

Please ignore the reshape kernels, that's a separate issue... I'm not surprised that PyTorch is adding the reshape kernels, but I am surprised that the all_to_all op is synchronous.

<img width="2474" height="212" alt="Image" src="https://github.com/user-attachments/assets/a2c9dcf7-e475-46d2-b1ce-2972e52ee940" />

Here is my code:

import os

import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor, Shard


def matmul_and_reshard_heads(*, xs, ws, device_mesh):
    num_heads = 16
    ys = []
    for x, w in zip(xs, ws):
        y = torch.matmul(x, w)
        batch_size, num_tokens, out_size = y.shape
        # Split the output features into heads: (batch, tokens, heads, head_dim)
        y = y.view(batch_size, num_tokens, num_heads, out_size // num_heads)
        # Treat the local tensor as one shard along the token dimension
        y = DTensor.from_local(
            y, device_mesh=device_mesh, placements=(Shard(1),)
        )
        # Reshard from tokens to heads; this is where the all_to_all happens
        y = y.redistribute(
            # I tried async_op=True and async_op=False, both settings produce the same trace
            device_mesh, placements=(Shard(-2),), async_op=True
        )
        y = y.to_local()
        ys.append(y)
    return ys

# Run this function on 4 GPUs
# (assumes the default process group is already initialized, e.g. under torchrun)
def repro():
    device = torch.device("cuda")
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    global_rank = dist.get_rank()
    num_gpus = dist.get_world_size()

    mesh = dist.init_device_mesh(
        "cuda",
        (1, 4),
        mesh_dim_names=("dp", "sp"),
    )

    xs = [torch.randn((16, 16192, 4096), dtype=torch.bfloat16, device="cuda") for _ in range(8)]
    ws = [torch.randn((16, 4096, 8192), dtype=torch.bfloat16, device="cuda") for _ in range(8)]

    matmul_and_reshard_heads_compiled = torch.compile(matmul_and_reshard_heads, fullgraph=True)

    for i in range(100):
        ys = matmul_and_reshard_heads_compiled(xs=xs, ws=ws, device_mesh=mesh["sp"])
        [y.sum() for y in ys]
    torch.cuda.synchronize()
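
For readers less familiar with the resharding step: going from Shard(1) (tokens) to Shard(-2) (heads) means every rank keeps all tokens but only its own slice of heads, which is the data movement an all_to_all performs. The following is a toy, pure-Python picture of that exchange. It is not DTensor code; the helper name `reshard_tokens_to_heads` is hypothetical, and the batch and head_dim axes are dropped for brevity.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (global token index, head index)


def reshard_tokens_to_heads(
    per_rank: List[List[List[Cell]]],
) -> List[List[List[Cell]]]:
    """Toy all_to_all: input is sharded by token, output is sharded by head.

    per_rank[r] is rank r's local block, indexed as [local_token][head].
    """
    world_size = len(per_rank)
    num_heads = len(per_rank[0][0])
    assert num_heads % world_size == 0, "heads must divide evenly across ranks"
    heads_per_rank = num_heads // world_size

    out = []
    for dst in range(world_size):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        block = []
        # Concatenate along the token axis in source-rank order.
        for src in range(world_size):
            for row in per_rank[src]:
                block.append(row[lo:hi])  # the chunk that src sends to dst
        out.append(block)
    return out
```

With two ranks, four tokens, and two heads, rank 0 starts with tokens 0 and 1 for both heads and ends with all four tokens for head 0 only. In the repro above, the same pattern applies with 4 ranks and 16 heads, so each rank ends up with 4 heads.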

Error logs

No response

Versions

PyTorch version: 2.9.0
CUDA used to build PyTorch: 13.0

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo @azahed98 @tianyu-l @XilunWu @SherlockNoMad @ppwwyyxx

extent analysis

TL;DR

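The issue reports that the all_to_all collectives run synchronously under torch.compile whether redistribute is called with async_op=True or async_op=False. No confirmed fix is given. The suggested direction is to restructure the function so that each collective can overlap with the following matmul, and to verify the result in a profiler trace.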

Guidance

  • Review matmul_and_reshard_heads: each loop iteration runs a matmul followed by a redistribute, so the all_to_all for one step could in principle overlap with the matmul for the next. @xmfan's comment suggests this overlap is expected when the collectives are fully contained in a single graph.
  • Toggling async_op on redistribute is not sufficient on its own: the issue reports the same trace for async_op=True and async_op=False.
  • Consider refactoring so that each collective is launched at one point and waited on at a later point, with the next matmul in between.
  • Investigate calling the collective directly: async_op is a keyword argument of collectives such as torch.distributed.all_to_all_single, which return a work handle to wait on later when async_op=True.
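
The "launch now, wait later" ordering described in the list above can be sketched independently of PyTorch. The helper below is a hypothetical illustration (the name `pipelined` and its callback parameters are not from the issue). In eager PyTorch, `launch_comm` could wrap a call such as `torch.distributed.all_to_all_single(..., async_op=True)` and `wait_comm` could call `.wait()` on the returned handle. Whether torch.compile preserves this overlap is exactly what the issue leaves open.

```python
from typing import Callable, Iterable, List, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")
H = TypeVar("H")


def pipelined(
    inputs: Iterable[X],
    compute: Callable[[X], Y],
    launch_comm: Callable[[Y], H],
    wait_comm: Callable[[H], Y],
) -> List[Y]:
    """Run the compute for step i+1 between launching and waiting on comm i."""
    results: List[Y] = []
    in_flight: List[H] = []  # holds at most one outstanding handle
    for x in inputs:
        y = compute(x)  # e.g. the matmul for this step
        if in_flight:
            # The previous collective had this step's compute to hide behind.
            results.append(wait_comm(in_flight.pop()))
        in_flight.append(launch_comm(y))  # start the collective, do not block
    if in_flight:
        results.append(wait_comm(in_flight.pop()))
    return results
```

For three inputs the call order is compute 0, launch 0, compute 1, wait 0, launch 1, compute 2, wait 1, launch 2, wait 2, so every collective except the last has a matmul to overlap with.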

Example

The issue does not include a rewritten version of the function. Any alternative formulation should be checked against a profiler trace to confirm that the all_to_all actually overlaps with the next matmul.

Notes

@xmfan's comment implies that collectives fully contained in a single graph should already overlap with compute, yet the trace in the issue shows synchronous all_to_alls even with fullgraph=True. Whether restructuring the function resolves this has not been tested here.

Recommendation

As a workaround, try restructuring the function so that each all_to_all is issued asynchronously and waited on only after the next matmul has been launched, then confirm in a profiler trace that compute and communication overlap.
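
One setting that may be worth ruling out before restructuring is Inductor's compute/communication reordering pass. The fragment below assumes the flag `reorder_for_compute_comm_overlap` is present in the installed build and is off by default. The issue does not say whether it was tried, and it is not confirmed to affect the DTensor redistribute path used here.

```python
import torch._inductor.config as inductor_config

# Assumption: this flag exists in the installed PyTorch build. It asks
# Inductor to reorder graph nodes to increase compute/communication overlap.
# Set it before the first call to the compiled function.
inductor_config.reorder_for_compute_comm_overlap = True
```

If the trace is unchanged with the flag enabled, that is useful information to add to the issue.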
