pytorch - ✅(Solved) Fix [ROCm][NCCL watchdog] Cross-thread stream-capture mode restrictions in hipEventQuery/hipEventSynchronize cause false watchdog failures [2 pull requests, 3 comments, 4 participants]

GitHub stats
pytorch/pytorch#177309 · Fetched 2026-04-08 00:42:13
Comments: 3 · Participants: 4 · Timeline events: 62 · Reactions: 0
Timeline (top): mentioned ×25, subscribed ×25, referenced ×4, commented ×3

Error Message

Capture-related errors (e.g., hipErrorStreamCaptureUnsupported) returned by hipEventQuery / hipEventSynchronize on the watchdog thread while stream capture is active on another thread.

Fix Action

Fix / Workaround

In this path, the watchdog polls by calling event query/sync from a side thread. On affected HIP runtimes, those calls can return capture-related errors (e.g., hipErrorStreamCaptureUnsupported) while capture is active on another thread, even when the watchdog thread's capture mode is set to ThreadLocal/Relaxed. The result is false watchdog failures/timeouts unless the framework applies a conservative workaround.

This is the runtime issue that forced the PyTorch workaround in ProcessGroupNCCL (skip watchdog event query during active capture and defer timeout checks in that window).

HIP/CUDA APIs expose stream-scoped capture status (StreamIsCapturing, StreamGetCaptureInfo[_v2]) but not a process-wide "is any capture active" query, so framework-side mitigation is necessarily best-effort.

PR fix notes

PR #176251: [ROCm] Avoid watchdog event queries during graph capture

Description (problem / solution / changelog)

This PR introduces a workaround for the HIP runtime bug (https://github.com/pytorch/pytorch/issues/177309) where hipEventQuery from a non-capturing thread invalidates graph captures on other threads, even in THREAD_LOCAL mode (https://github.com/ROCm/rocm-systems/pull/3176). The NCCL/RCCL watchdog's polling queries trigger this bug.

Code Changes

ProcessGroupNCCL.cpp

  • queryEventWithRocmWatchdogCaptureWorkaround() wraps CUDAEvent::query() logic:
    • Watchdog calling during active capture: skips the query, returns false (not ready)
    • Otherwise queries normally, but catches hipErrorCapturedEvent / hipErrorStreamCaptureUnsupported from the watchdog and maps them to "not ready" for race conditions
  • RocmWatchdogEventQueryContextGuard thread-local guard set in runLoop() so the skip path only activates on the watchdog — main-thread wait()/isCompleted() unchanged
  • Timeout checks gated on !is_graph_capture_active() to avoid false positives while queries are skipped
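The guard-plus-wrapper pattern described in these bullets can be sketched without any GPU dependency. This is a minimal host-only model, not the PR's code: EventStatus, WatchdogContextGuard, and queryEventWithWorkaround are hypothetical names, the raw event query is passed in as a callable, and the capture-active flag is assumed to come from something like is_graph_capture_active().

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-ins for the HIP status codes named in the PR notes.
enum class EventStatus { Ready, NotReady, CapturedEvent, CaptureUnsupported, OtherError };

// True only on a thread that currently holds the guard (the watchdog).
thread_local bool g_in_watchdog = false;

// RAII guard in the spirit of RocmWatchdogEventQueryContextGuard (assumed shape).
class WatchdogContextGuard {
 public:
  WatchdogContextGuard() : prev_(g_in_watchdog) { g_in_watchdog = true; }
  ~WatchdogContextGuard() { g_in_watchdog = prev_; }
  WatchdogContextGuard(const WatchdogContextGuard&) = delete;
  WatchdogContextGuard& operator=(const WatchdogContextGuard&) = delete;

 private:
  bool prev_;
};

// Returns true only when the event is known to be complete.
bool queryEventWithWorkaround(const std::function<EventStatus()>& rawQuery,
                              bool captureActive) {
  if (g_in_watchdog && captureActive) {
    return false;  // Skip the runtime call entirely and report "not ready".
  }
  const EventStatus s = rawQuery();
  if (s == EventStatus::Ready) return true;
  if (s == EventStatus::NotReady) return false;
  const bool captureError =
      s == EventStatus::CapturedEvent || s == EventStatus::CaptureUnsupported;
  if (g_in_watchdog && captureError) {
    return false;  // A capture began between the check and the query.
  }
  throw std::runtime_error("event query failed");  // Non-watchdog callers still see errors.
}
```

Because the flag is thread-local, a main-thread caller that never installs the guard always reaches rawQuery(), which mirrors the note that wait()/isCompleted() are unchanged.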

CUDAGraph.cpp/h

  • is_graph_capture_active() reads the existing _currently_capturing_graphs map under its mutex
  • capture_end() erases the map entry before AT_CUDA_CHECK so the watchdog never sees stale state on error paths
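The ordering point in capture_end() is easier to see in code. The sketch below is a simplified model under stated assumptions: CaptureRegistry is a hypothetical class, it uses a set rather than the actual _currently_capturing_graphs map, and the endCaptureOk flag stands in for the runtime call that AT_CUDA_CHECK would validate.

```cpp
#include <cstdint>
#include <mutex>
#include <stdexcept>
#include <unordered_set>

// Hypothetical process-wide registry of in-flight captures.
class CaptureRegistry {
 public:
  void captureBegin(std::uint64_t captureId) {
    std::lock_guard<std::mutex> lock(mutex_);
    active_.insert(captureId);
  }

  // endCaptureOk models the result of the runtime's end-capture call.
  void captureEnd(std::uint64_t captureId, bool endCaptureOk) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      active_.erase(captureId);  // Erase first so error paths leave no stale entry.
    }
    if (!endCaptureOk) {
      throw std::runtime_error("end capture failed");
    }
  }

  // Safe to call from the watchdog thread.
  bool isCaptureActive() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return !active_.empty();
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_set<std::uint64_t> active_;
};
```

If the erase came after the error check, a failed end-capture would leave the entry in place and the watchdog would keep skipping queries indefinitely.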

All changes are guarded by #ifdef USE_ROCM, with a TODO to remove them once the HIP runtime fix ships.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

Changed files

  • aten/src/ATen/cuda/CUDAGraph.cpp (modified, +16/-2)
  • aten/src/ATen/cuda/CUDAGraph.h (modified, +10/-0)
  • torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (modified, +91/-4)

🐛 Describe the bug

ProcessGroupNCCL watchdog polling can fail in ROCm when CUDA/HIP graph capture is active on another thread, due to hipEventQuery / hipEventSynchronize stream-capture mode restrictions.

In this path, the watchdog polls by calling event query/sync from a side thread. On affected HIP runtimes, those calls can return capture-related errors (e.g., hipErrorStreamCaptureUnsupported) while capture is active on another thread, even when the watchdog thread's capture mode is set to ThreadLocal/Relaxed. The result is false watchdog failures/timeouts unless the framework applies a conservative workaround.

This is the runtime issue that forced the PyTorch workaround in ProcessGroupNCCL (skip watchdog event query during active capture and defer timeout checks in that window).

Minimal repro shape

  1. Thread A starts stream capture with GLOBAL mode and keeps capture active.
  2. Thread B (watchdog-like thread) calls hipEventQuery / hipEventSynchronize on an event associated with collective work.
  3. Observe capture-related error returns in scenarios where cross-thread behavior should not fail this way.

Observed behavior

  • Watchdog-side event polling is not reliably safe during cross-thread capture windows.
  • Runtime can return capture restriction errors and trigger false failure handling in the framework.

Expected behavior

  • Cross-thread capture interaction should follow documented stream-capture mode semantics.
  • Framework watchdog polling should not require extra conservative skipping logic to avoid runtime-induced failures.

Notes

HIP/CUDA APIs expose stream-scoped capture status (StreamIsCapturing, StreamGetCaptureInfo[_v2]) but not a process-wide "is any capture active" query, so framework-side mitigation is necessarily best-effort.

Versions

Reproduced on ROCm/HIP runtime 7.2.26015 with PyTorch 2.12.0a0+gitcb798d7

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang

extent analysis

Fix Plan

To address the issue of watchdog polling failures in ROCm when CUDA/HIP graph capture is active, we need to implement a more robust handling of cross-thread capture interactions.

Step-by-Step Solution:

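  1. Check for active capture: Before calling hipEventQuery or hipEventSynchronize, check whether a capture is active. hipStreamIsCapturing and hipStreamGetCaptureInfo[_v2] only report on a single stream, so a watchdog that cannot know which stream is capturing needs framework-side tracking, such as the is_graph_capture_active() helper added in PR #176251.
  2. Defer event polling: If a capture is active, skip the query, report the work as not ready, and defer timeout checks until the capture window closes.
  3. Retry on capture errors: If polling still returns a capture-related error (e.g., hipErrorStreamCaptureUnsupported), treat it as not ready and retry on the next polling pass rather than failing the watchdog.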

Example Code:

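#include <hip/hip_runtime.h>

// Returns true only when the event is known to have completed.
// Returns false ("not ready") when polling is deferred or hits a capture error;
// the caller retries on its next polling pass.
bool pollEventIfSafe(hipStream_t stream, hipEvent_t event) {
    // Stream-scoped check: it cannot see captures active on other streams.
    hipStreamCaptureStatus captureStatus = hipStreamCaptureStatusNone;
    hipError_t status = hipStreamIsCapturing(stream, &captureStatus);
    if (status != hipSuccess || captureStatus != hipStreamCaptureStatusNone) {
        return false;  // Defer event polling while capture is active.
    }

    // Proceed with event polling.
    status = hipEventQuery(event);
    if (status == hipErrorCapturedEvent ||
        status == hipErrorStreamCaptureUnsupported) {
        // A capture started between the check and the query: treat as not ready.
        (void)hipGetLastError();  // Clear the error state.
        return false;
    }
    return status == hipSuccess;  // hipErrorNotReady also maps to "not ready".
}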

Verification

To verify the fix, run the repro shape above (capture active on one thread, watchdog-style polling on another) and confirm that no false watchdog failures or timeouts are triggered.
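The timeout-gating half of the workaround can also be checked on the host alone, before involving a GPU. The following is a rough sketch with hypothetical names (WorkState, WatchdogVerdict, checkWork); it assumes the caller supplies the capture-active flag and the event status, and it uses explicit time points so the logic can be exercised with synthetic clocks.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical record of one piece of collective work tracked by the watchdog.
struct WorkState {
  Clock::time_point start;
  std::chrono::milliseconds timeout;
};

enum class WatchdogVerdict { Completed, Pending, TimedOut };

// One watchdog pass over a single work item.
// eventReady is ignored while a capture is active, because the query is skipped.
WatchdogVerdict checkWork(const WorkState& work, Clock::time_point now,
                          bool captureActive, bool eventReady) {
  if (captureActive) {
    // Completion is unknown in this window, so a timeout would be a false positive.
    return WatchdogVerdict::Pending;
  }
  if (eventReady) {
    return WatchdogVerdict::Completed;
  }
  if (now - work.start >= work.timeout) {
    return WatchdogVerdict::TimedOut;
  }
  return WatchdogVerdict::Pending;
}
```

In this sketch the deadline is not extended by the length of the capture window, so work that is already overdue and still not ready is reported as timed out on the first pass after capture ends. Whether to extend the deadline instead is a design choice the source does not specify.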

Extra Tips

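  • hipStreamWaitEvent orders work between streams on the device and does not report completion to the host, so it cannot replace the watchdog's hipEventQuery / hipEventSynchronize polling. Use it only where device-side ordering is what you need.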
  • Monitor the ROCm and PyTorch issue trackers for updates on this issue and potential future fixes.
