pytorch - ✅(Solved) Fix Add Dynamo support for dist.record_comm (capture_profiler_record_comm) [1 pull requests, 1 comments, 2 participants]

pytorch • 2026-03-30 23:02:44


pytorch/pytorch#178820•Fetched 2026-04-08 01:52:06


Timeline (top)

mentioned ×27, subscribed ×27, labeled ×7, referenced ×3

Root Cause

Problem

dist.record_comm inside torch.compile causes a graph break because its internal calls to torch._C._distributed_c10d._get_comm_profiling_name() and _set_comm_profiling_name() are pybind11 C++ functions with no Python source. Dynamo classifies them as SkipFunctionVariable and graph-breaks.
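
The "no Python source" point can be illustrated without PyTorch. The sketch below uses the C-implemented builtin `len` as a stand-in for the pybind11 bindings named above; `has_python_source` and `python_helper` are hypothetical names introduced only for this illustration. Dynamo's own classification logic is not reproduced here.

```python
import inspect


def has_python_source(fn):
    """Return True if `fn` exposes Python bytecode that a bytecode tracer could follow."""
    return hasattr(fn, "__code__")


def python_helper(name):
    # An ordinary Python function: it carries a __code__ object.
    return name.upper()


# `len` is implemented in C; it stands in for the pybind11 functions here.
print(has_python_source(python_helper))
print(has_python_source(len))

# inspect cannot retrieve source for a C-implemented callable either.
try:
    inspect.getsource(len)
except TypeError as exc:
    print("no source available:", type(exc).__name__)
```

A tracer that works by following Python bytecode has nothing to step into for the second kind of callable, which is why such calls need explicit handling.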

PR fix notes

PR #179093: Make dist.record_comm dynamo traceable

Repository: pytorch/pytorch
Author: aditvenk
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/179093

Description (problem / solution / changelog)

Fixes https://github.com/pytorch/pytorch/issues/178820

Insert nodes in the graph for record comm enter/exit. Use an opaque type to ensure names survive AOTAutograd. Add unit tests and validate, using a profiler trace, that the comm name survives in the compiled region.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo @azahed98

Changed files

test/distributed/test_dynamo_distributed.py (modified, +128/-0)
torch/_dynamo/config.py (modified, +6/-0)
torch/_dynamo/variables/ctx_manager.py (modified, +68/-0)
torch/_dynamo/variables/torch.py (modified, +8/-0)
torch/distributed/_functional_collectives.py (modified, +84/-0)
torch/distributed/distributed_c10d.py (modified, +12/-3)
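
The first sentence of the PR description ("insert nodes in the graph for record comm enter/exit") can be pictured with a toy recorder. This is a torch-free sketch of the general idea only, not the PR's implementation: `ToyGraph` and the node names `record_comm_enter` and `record_comm_exit` are hypothetical.

```python
from contextlib import contextmanager


class ToyGraph:
    """A flat list of (op, payload) nodes; a stand-in for a traced graph."""

    def __init__(self):
        self.nodes = []

    @contextmanager
    def record_comm(self, name):
        # Record enter/exit markers as nodes instead of touching
        # profiler state while tracing.
        self.nodes.append(("record_comm_enter", name))
        try:
            yield
        finally:
            self.nodes.append(("record_comm_exit", name))

    def call(self, op):
        self.nodes.append(("call", op))


graph = ToyGraph()
with graph.record_comm("my_collective"):
    graph.call("all_reduce")
graph.call("wait")

for node in graph.nodes:
    print(node)
```

Because the markers are ordinary nodes, they stay ordered relative to the collective they bracket when the recorded graph is later replayed.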

Code Example

@torch.compile
def fn(x):
    with dist.record_comm("my_collective"):
        work = dist.all_reduce(x, async_op=True)
    work.wait()
    return x


This is the same class of problem that torch.profiler.record_function had before
capture_profiler_record_function was added.

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo

extent analysis

Fix Plan

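The underlying fix is PR #179093, which makes dist.record_comm traceable by Dynamo; that PR is still open. Until it lands, a possible workaround is to keep dist.record_comm outside the compiled region with a small wrapper, so that Dynamo never traces the context manager.

Step-by-Step Solution

1. **Create a custom wrapper function for dist.record_comm**:
   ```python
   import functools

   import torch
   import torch.distributed as dist

   def record_comm_wrapper(name):
       def wrapper(func):
           @functools.wraps(func)
           def inner(*args, **kwargs):
               # Enter the context manager in eager mode, around the call
               with dist.record_comm(name):
                   return func(*args, **kwargs)
           return inner
       return wrapper
   ```

2. **Apply the wrapper outside torch.compile**. With @torch.compile outermost, Dynamo would trace the `with dist.record_comm(...)` block inside the wrapper and hit the same graph break:
   ```python
   @record_comm_wrapper("my_collective")
   @torch.compile
   def fn(x):
       work = dist.all_reduce(x, async_op=True)
       work.wait()
       return x
   ```

3. **Verify the workaround**: check that the function compiles without a graph break from dist.record_comm, and inspect a profiler trace to see whether the collective is recorded under the expected name. The name is set outside the compiled graph here, so this page does not confirm that it reaches collectives issued inside it.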
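
Decorator order decides whether the context manager sits inside or outside the function that torch.compile traces. The torch-free sketch below makes that ordering visible; `fake_record_comm` and `fake_compile` are hypothetical stand-ins for dist.record_comm and torch.compile, and only log when they are entered and left.

```python
import functools
from contextlib import contextmanager

events = []


@contextmanager
def fake_record_comm(name):
    # Hypothetical stand-in for dist.record_comm.
    events.append(f"enter {name}")
    try:
        yield
    finally:
        events.append(f"exit {name}")


def fake_compile(func):
    # Hypothetical stand-in for torch.compile: marks the traced region.
    @functools.wraps(func)
    def compiled(*args, **kwargs):
        events.append("compiled region start")
        try:
            return func(*args, **kwargs)
        finally:
            events.append("compiled region end")
    return compiled


def record_comm_wrapper(name):
    def wrapper(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            with fake_record_comm(name):
                return func(*args, **kwargs)
        return inner
    return wrapper


@record_comm_wrapper("my_collective")  # outermost: context stays outside
@fake_compile
def outside(x):
    return x + 1


@fake_compile  # outermost: context ends up inside the compiled region
@record_comm_wrapper("my_collective")
def inside(x):
    return x + 1


outside(1)
order_outside = list(events)
events.clear()
inside(1)
order_inside = list(events)

print(order_outside)
print(order_inside)
```

In the first ordering the context manager is entered before the compiled region starts, so a tracer never sees it. In the second it is entered after the region starts, which corresponds to the case that graph-breaks.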

Verification

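To check the workaround, run the following under a distributed launcher such as torchrun, with one process per GPU:

import os

import torch
import torch.distributed as dist

# record_comm_wrapper is the decorator defined in step 1 above.

# Bind this process to its GPU, then initialize the distributed backend.
# env:// reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment.
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
dist.init_process_group("nccl", init_method="env://")

# Wrapper outermost, so dist.record_comm stays outside the compiled region
@record_comm_wrapper("my_collective")
@torch.compile
def fn(x):
    work = dist.all_reduce(x, async_op=True)
    work.wait()
    return x

# NCCL collectives require CUDA tensors
x = torch.tensor([1.0], device="cuda")
result = fn(x)
print(result)

dist.destroy_process_group()

This should run without a graph break caused by dist.record_comm. Whether the collective is recorded under the name "my_collective" should be confirmed in a profiler trace, since the name is set outside the compiled graph.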
