pytorch - 💡(How to fix) Fix torch.compile doesn't use async all_to_all

🐛 Describe the bug

I am interleaving matmuls with all_to_all collectives. I was hoping that torch.compile would overlap compute and comms given this comment from @xmfan:

In today's graphs, Dynamo rewrites synchronous collectives as asynchronous collectives whenever possible. There's always a stream sync at the end of each graph. So if your async collective usage is fully contained within a single graph (without graph breaks), I expect async_op=False + graph captured to give you similar overlap.

Unfortunately it looks like the all_to_alls are synchronous. Maybe there is a more idiomatic way to express my function which would work better with torch.compile?

Please ignore the reshape kernels, that's a separate issue... I'm not surprised that PyTorch is adding the reshape kernels, but I am surprised that the all_to_all op is synchronous.

Here is my code:

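import os

import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor, Shard

def matmul_and_reshard_heads(*, xs, ws, device_mesh):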
    num_heads = 16
    ys = []
    for x, w in zip(xs, ws):
        y = torch.matmul(x, w)
        batch_size, num_tokens, out_size = y.shape
        y = y.view(batch_size, num_tokens, num_heads, out_size // num_heads)
        y = DTensor.from_local(
            y, device_mesh=device_mesh, placements=(Shard(1),)
        )
        y = y.redistribute(
            # I tried async_op=True and async_op=False, both settings produce the same trace
            device_mesh, placements=(Shard(-2),), async_op=True
        )
        y = y.to_local()
        ys.append(y)
    return ys

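# Run this function on 4 GPUs. Assumes dist.init_process_group() has already been
# called, since dist.get_rank() below runs before init_device_mesh().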
def repro():
    device = torch.device("cuda")
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    global_rank = dist.get_rank()
    num_gpus = dist.get_world_size()

    mesh = dist.init_device_mesh(
        "cuda",
        (1, 4),
        mesh_dim_names=("dp", "sp"),
    )

    xs = [torch.randn((16, 16192, 4096), dtype=torch.bfloat16, device="cuda") for _ in range(8)]
    ws = [torch.randn((16, 4096, 8192), dtype=torch.bfloat16, device="cuda") for _ in range(8)]
    
    matmul_and_reshard_heads_compiled = torch.compile(matmul_and_reshard_heads, fullgraph=True)

    for i in range(100):
        ys = matmul_and_reshard_heads_compiled(xs=xs, ws=ws, device_mesh=mesh["sp"])
        [y.sum() for y in ys]
    torch.cuda.synchronize()
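
The shapes that the reshard moves around can be worked out by hand. The sketch below is plain Python bookkeeping, not part of the original report: the helper names `to_global_shape` and `to_local_shape` are hypothetical, and it assumes evenly divisible shards and that `DTensor.from_local` treats each rank's tensor as one shard of the global tensor.

```python
# Illustrative only: pure-Python shape bookkeeping, no torch or GPUs required.
# Helper names are hypothetical and are not part of any PyTorch API.

def to_global_shape(local_shape, shard_dim, mesh_size):
    """Global shape when every rank holds `local_shape`, sharded evenly on `shard_dim`."""
    dim = shard_dim % len(local_shape)  # normalize negative dims such as -2
    shape = list(local_shape)
    shape[dim] *= mesh_size
    return tuple(shape)


def to_local_shape(global_shape, shard_dim, mesh_size):
    """Per-rank shape when `global_shape` is sharded evenly on `shard_dim`."""
    dim = shard_dim % len(global_shape)
    if global_shape[dim] % mesh_size != 0:
        raise ValueError("this sketch only handles evenly divisible shards")
    shape = list(global_shape)
    shape[dim] //= mesh_size
    return tuple(shape)


# Values taken from the repro: matmul output (16, 16192, 8192) viewed as 16 heads,
# on the 4-rank "sp" mesh dimension.
batch_size, num_tokens, out_size, num_heads, mesh_size = 16, 16192, 8192, 16, 4
y_local = (batch_size, num_tokens, num_heads, out_size // num_heads)

# DTensor.from_local(..., placements=(Shard(1),)): tokens are sharded across ranks.
y_global = to_global_shape(y_local, shard_dim=1, mesh_size=mesh_size)
# redistribute(..., placements=(Shard(-2),)): heads are sharded across ranks instead.
y_resharded = to_local_shape(y_global, shard_dim=-2, mesh_size=mesh_size)

print(y_local)      # (16, 16192, 16, 512)
print(y_global)     # (16, 64768, 16, 512)
print(y_resharded)  # (16, 64768, 4, 512)
```

Under these assumptions, each all_to_all trades a token-sharded layout for a head-sharded one: every rank ends up with all tokens but only 4 of the 16 heads.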

Error logs

No response

Versions

PyTorch version: 2.9.0
CUDA used to build PyTorch: 13.0

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo @azahed98 @tianyu-l @XilunWu @SherlockNoMad @ppwwyyxx

extent analysis

TL;DR

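No confirmed fix is given. The function already passes async_op=True, and the report states that async_op=True and async_op=False produce the same trace under torch.compile, so toggling that flag does not by itself make the all_to_all overlap with the matmuls.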

Guidance

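Review the trace of matmul_and_reshard_heads and check whether each all_to_all finishes before the next matmul starts, which is the synchronous behavior the report describes.
Do not rely on async_op=True in redistribute alone: the report states that both settings of async_op produce the same trace under torch.compile.
Check where the compiled graph waits on each collective. According to the quoted comment from @xmfan, Dynamo rewrites synchronous collectives as asynchronous ones and adds a stream sync at the end of each graph, so overlap would require later matmuls to be issued before the wait for an earlier all_to_all.
Note that async_op is a keyword argument of individual collectives and of DTensor.redistribute; there is no standalone torch.distributed.async_op API to enable.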
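
To make concrete what overlap would buy, here is a toy timing model, offered as a sketch rather than a description of how torch.compile schedules work. It assumes one compute stream and one communication stream, that the all_to_all for step i can only start after matmul i finishes, and that collectives run one at a time. The function names and the durations are hypothetical placeholders, not measurements from the report.

```python
# Toy model only: durations are arbitrary units, not measured values.

def serial_time(num_steps, compute, comm):
    """Every all_to_all finishes before the next matmul starts (no overlap)."""
    return num_steps * (compute + comm)


def overlapped_time(num_steps, compute, comm):
    """The all_to_all of step i runs concurrently with the matmul of step i + 1."""
    if num_steps == 0:
        return 0
    # First matmul has nothing to overlap with, last all_to_all has no matmul to
    # hide behind; in between, each step costs whichever stage is slower.
    return compute + (num_steps - 1) * max(compute, comm) + comm


num_steps = 8   # the repro loops over 8 (x, w) pairs
compute = 3     # hypothetical matmul duration
comm = 2        # hypothetical all_to_all duration

print(serial_time(num_steps, compute, comm))      # 40
print(overlapped_time(num_steps, compute, comm))  # 26
```

In this model the benefit is largest when matmul and all_to_all take similar time, which is why a trace showing them strictly back to back is worth investigating.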

Example

No verified fix snippet is available. The issue asks whether there is a more idiomatic way to express the function so that torch.compile overlaps the all_to_all collectives with the matmuls.

Notes

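The function already requests asynchronous collectives with async_op=True, so the open question is why the compiled trace still shows the all_to_all running synchronously, contrary to what the comment from @xmfan led the author to expect. The report was filed against PyTorch 2.9.0 built with CUDA 13.0 and includes no error logs. Without further information or testing, it is uncertain which change would restore the overlap.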

Recommendation

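Treat this as an open question rather than a solved problem. Collect traces for both async_op settings, keep the whole loop inside a single graph as the repro does with fullgraph=True, and ask the maintainers whether a different formulation lets torch.compile overlap each all_to_all with the next matmul.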
