pytorch - ✅(Solved) Fix [ROCm][NCCL watchdog] Cross-thread stream-capture mode restrictions in hipEventQuery/hipEventSynchronize cause false watchdog failures [2 pull requests, 3 comments, 4 participants]

pytorch · 2026-03-12 20:35:05


pytorch/pytorch#177309•Fetched 2026-04-08 00:42:13

Timeline (top): mentioned ×25, subscribed ×25, referenced ×4, commented ×3

Error Message

hipEventQuery / hipEventSynchronize return capture-related errors (e.g., hipErrorStreamCaptureUnsupported) in cross-thread scenarios that should not fail this way.

Fix Action

Fix / Workaround

In this path, watchdog polling relies on event query/sync calls issued from a side thread. On affected HIP runtimes, those calls can return capture-related errors (e.g., hipErrorStreamCaptureUnsupported) while another thread is capturing, even when the watchdog thread's capture mode is set to ThreadLocal/Relaxed. The result is either false watchdog failures/timeouts or a conservative framework-side workaround.

This is the runtime issue that forced the PyTorch workaround in ProcessGroupNCCL (skip watchdog event query during active capture and defer timeout checks in that window).

HIP/CUDA APIs expose stream-scoped capture status (StreamIsCapturing, StreamGetCaptureInfo[_v2]) but not a process-wide "is any capture active" query, so framework-side mitigation is necessarily best-effort.

PR fix notes

PR #176251: [ROCm] Avoid watchdog event queries during graph capture

Repository: pytorch/pytorch
Author: chinmaydk99
State: closed | merged: False
Link: https://github.com/pytorch/pytorch/pull/176251

Description (problem / solution / changelog)

This PR introduces a workaround for the HIP runtime bug (https://github.com/pytorch/pytorch/issues/177309) where hipEventQuery from a non-capturing thread invalidates graph captures on other threads, even in THREAD_LOCAL mode (https://github.com/ROCm/rocm-systems/pull/3176). The NCCL/RCCL watchdog's polling queries trigger this bug.

Code Changes

`ProcessGroupNCCL.cpp`

queryEventWithRocmWatchdogCaptureWorkaround() wraps the CUDAEvent::query() logic:
- When the watchdog calls it during an active capture, it skips the query and returns false (not ready).
- Otherwise it queries normally, but catches hipErrorCapturedEvent / hipErrorStreamCaptureUnsupported on the watchdog thread and maps them to "not ready" to cover race conditions.
RocmWatchdogEventQueryContextGuard is a thread-local guard set in runLoop(), so the skip path only activates on the watchdog thread; main-thread wait()/isCompleted() are unchanged.
Timeout checks are gated on !is_graph_capture_active() to avoid false positives while queries are skipped.
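The control flow described above can be sketched in plain C++ without the HIP runtime. This is an illustration under stated assumptions, not the code from the PR: `QueryStatus`, `WatchdogContextGuard`, `tls_in_watchdog`, and `queryWithCaptureWorkaround` are hypothetical names, and the raw event query and the capture-active check are passed in as callables so the logic can be exercised in isolation.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the runtime's return codes (not the HIP enum).
enum class QueryStatus { Ready, NotReady, CaptureError, OtherError };

// True only on the thread that is currently running the watchdog loop.
thread_local bool tls_in_watchdog = false;

// RAII guard that marks the current thread as the watchdog for its lifetime.
class WatchdogContextGuard {
 public:
  WatchdogContextGuard() : prev_(tls_in_watchdog) { tls_in_watchdog = true; }
  ~WatchdogContextGuard() { tls_in_watchdog = prev_; }
  WatchdogContextGuard(const WatchdogContextGuard&) = delete;
  WatchdogContextGuard& operator=(const WatchdogContextGuard&) = delete;

 private:
  bool prev_;
};

// Returns true only when the event is known to be complete.
// rawQuery:      performs the real event query (assumed to be supplied by the caller).
// captureActive: reports whether any graph capture is in flight (assumed).
bool queryWithCaptureWorkaround(const std::function<QueryStatus()>& rawQuery,
                                const std::function<bool()>& captureActive) {
  assert(rawQuery && captureActive);
  // Watchdog thread during a capture: do not touch the runtime at all.
  if (tls_in_watchdog && captureActive()) {
    return false;
  }
  const QueryStatus status = rawQuery();
  if (status == QueryStatus::Ready) return true;
  if (status == QueryStatus::NotReady) return false;
  // A capture may have started between the check and the query; on the
  // watchdog thread this is treated as "not ready" rather than a failure.
  if (status == QueryStatus::CaptureError && tls_in_watchdog) {
    return false;
  }
  throw std::runtime_error("event query failed");
}
```

Because the flag is thread-local, a thread that never constructs the guard always takes the normal query path, which mirrors the statement that main-thread wait()/isCompleted() are unchanged.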

`CUDAGraph.cpp/h`

is_graph_capture_active() reads the existing _currently_capturing_graphs map under its mutex
capture_end() erases the map entry before AT_CUDA_CHECK so the watchdog never sees stale state on error paths
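The ordering point in capture_end() (erase the bookkeeping entry before the call that can fail) is easier to see in a small model. The following is a sketch only: `CaptureRegistry` and its methods are hypothetical, whereas the PR reads the existing _currently_capturing_graphs map; the fallible end-of-capture step is represented by a caller-supplied callable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_set>

// Hypothetical process-wide record of in-flight captures, keyed by capture id.
class CaptureRegistry {
 public:
  void captureBegin(std::uint64_t id) {
    std::lock_guard<std::mutex> lock(mu_);
    active_.insert(id);
  }

  // Erase the entry first, then run the step that may throw, so observers
  // never see a stale "capture active" state on error paths.
  template <typename EndFn>
  void captureEnd(std::uint64_t id, EndFn&& endFn) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      const std::size_t erased = active_.erase(id);
      assert(erased == 1 && "captureEnd without a matching captureBegin");
      (void)erased;  // silence unused-variable warnings when NDEBUG is set
    }
    endFn();  // may throw; the registry is already consistent
  }

  // Safe to call from any thread, e.g. a watchdog.
  bool anyCaptureActive() const {
    std::lock_guard<std::mutex> lock(mu_);
    return !active_.empty();
  }

 private:
  mutable std::mutex mu_;
  std::unordered_set<std::uint64_t> active_;
};
```

If the erase happened after the fallible step instead, an exception would leave the entry behind and a watchdog consulting anyCaptureActive() would keep skipping queries indefinitely.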

All changes are under #ifdef USE_ROCM, with a TODO to remove them once the HIP runtime fix ships.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

Changed files

aten/src/ATen/cuda/CUDAGraph.cpp (modified, +16/-2)
aten/src/ATen/cuda/CUDAGraph.h (modified, +10/-0)
torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (modified, +91/-4)


🐛 Describe the bug

ProcessGroupNCCL watchdog polling can fail in ROCm when CUDA/HIP graph capture is active on another thread, due to hipEventQuery / hipEventSynchronize stream-capture mode restrictions.

This is the runtime issue that forced the PyTorch workaround in ProcessGroupNCCL (skip watchdog event query during active capture and defer timeout checks in that window).

Minimal repro shape

Thread A starts stream capture with GLOBAL mode and keeps capture active.
Thread B (watchdog-like thread) calls hipEventQuery / hipEventSynchronize on an event associated with collective work.
Observe capture-related error returns in scenarios where cross-thread behavior should not fail this way.

Observed behavior

Watchdog-side event polling is not reliably safe during cross-thread capture windows.
Runtime can return capture restriction errors and trigger false failure handling in the framework.

Expected behavior

Cross-thread capture interaction should follow documented stream-capture mode semantics.
Framework watchdog polling should not require extra conservative skipping logic to avoid runtime-induced failures.

Notes

References:

PyTorch workaround context: https://github.com/pytorch/pytorch/pull/176251

Versions

Reproduced on ROCm/HIP runtime 7.2.26015 with PyTorch 2.12.0a0+gitcb798d7

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang

extent analysis

Fix Plan

To address the issue of watchdog polling failures in ROCm when CUDA/HIP graph capture is active, we need to implement a more robust handling of cross-thread capture interactions.

Step-by-Step Solution:

Check for active capture: Before calling hipEventQuery or hipEventSynchronize, check whether a capture is active using hipStreamIsCapturing or hipStreamGetCaptureInfo[_v2]. These queries are stream-scoped, so they only detect a capture on the stream passed in; a capture running on another thread's stream has to be tracked by the framework.
Defer event polling: If a capture is active, defer the event polling to avoid capture-related errors.
Implement a retry mechanism: If a capture-related error occurs during event polling, retry the operation after a short delay to ride out temporary capture windows.

Example Code:

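#include <hip/hip_runtime.h>
#include <unistd.h>

// Retry event polling once after a short delay and return the query result.
static hipError_t deferEventPolling(hipEvent_t event) {
    usleep(1000); // 1 ms delay
    return hipEventQuery(event);
}

// Poll an event, deferring when the given stream is being captured.
// Note: hipStreamIsCapturing only reports on `stream`, not on other streams.
hipError_t pollEvent(hipStream_t stream, hipEvent_t event) {
    hipStreamCaptureStatus captureStatus = hipStreamCaptureStatusNone;
    hipError_t status = hipStreamIsCapturing(stream, &captureStatus);
    if (status != hipSuccess || captureStatus != hipStreamCaptureStatusNone) {
        // Defer event polling if the status check failed or a capture is active
        return deferEventPolling(event);
    }
    // Proceed with event polling
    return hipEventQuery(event);
}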
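The retry step from the plan above is safer when it is bounded, so that a persistent error cannot stall the polling thread. The helper below is a minimal sketch in plain C++; `pollWithRetry` and `PollResult` are hypothetical names, and the query is any callable that returns true once the event is complete (for example, a wrapper around hipEventQuery that maps capture-related errors to false).

```cpp
#include <cassert>
#include <chrono>
#include <thread>

enum class PollResult { Ready, GaveUp };

// Calls `query` up to `maxAttempts` times, sleeping `delay` between attempts.
// Returns Ready as soon as `query` reports completion, otherwise GaveUp.
template <typename QueryFn>
PollResult pollWithRetry(QueryFn&& query, int maxAttempts,
                         std::chrono::microseconds delay) {
  assert(maxAttempts > 0);
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (query()) {
      return PollResult::Ready;
    }
    // Do not sleep after the final attempt; just report the outcome.
    if (attempt + 1 < maxAttempts) {
      std::this_thread::sleep_for(delay);
    }
  }
  return PollResult::GaveUp;
}
```

A caller would treat GaveUp as "still not ready" and leave the timeout decision to the watchdog's own deadline logic rather than failing immediately.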

Verification

To verify the fix, test the watchdog polling functionality with active capture windows and ensure that false failures/timeouts are no longer triggered.

Extra Tips

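Where host-side completion status is not required, consider ordering work with hipStreamWaitEvent instead of polling with hipEventQuery or hipEventSynchronize. Note that hipStreamWaitEvent enqueues a device-side dependency and does not report completion to the host, so it cannot by itself replace a watchdog's completion or timeout check.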
Monitor the ROCm and PyTorch issue trackers for updates on this issue and potential future fixes.
