pytorch - ✅(Solved) Fix DISABLED test_cudagraph_memory_cleanup (__main__.TestCustomOpAutoTune) [2 pull requests, 1 comment, 2 participants]

GitHub stats
pytorch/pytorch#179945 (fetched 2026-04-11 06:11:31)

  • Comments: 1
  • Participants: 2
  • Timeline events: 51
  • Reactions: 0
  • Timeline (top): mentioned ×20, subscribed ×20, labeled ×10, commented ×1

Root Cause

This test was disabled because it hangs in ROCm CI.

PR fix notes

PR #179892: [ROCm] Resolve timeouts caused due to hipblasLT module creation during graph capture

Description (problem / solution / changelog)

Fixes #179943. Fixes #179945. Fixes #179947.

After #179053, ROCm hipBLASLt handle caching changed from per-device to per-(device, stream). As a result, the first use on a capture stream can now trigger a lazy hipblasLtCreate on that same stream. On ROCm this initialization path performs capture-unsafe internal allocation and setup, which can fail with stream-capture errors (and sometimes hang) if it runs during capture.

This PR fixes that by pre-initializing the hipBLASLt handle for the target capture stream immediately before capture_begin, so handle creation never occurs inside capture.
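The pre-initialization idea described above can be modeled in a few lines of plain Python. This is only a toy sketch under stated assumptions, not the code from the PR: `HandleCache`, `CaptureError`, `capture_begin`, and `capture_end` are hypothetical names invented for illustration, and no real hipBLASLt or PyTorch API is called. It shows why a cache keyed by (device, stream) can hit lazy creation inside capture, and how creating the handle just before capture starts avoids that.

```python
class CaptureError(RuntimeError):
    """Hypothetical error: a capture-unsafe operation ran during capture."""


class HandleCache:
    """Toy model of a handle cache keyed by (device, stream)."""

    def __init__(self):
        self._handles = {}       # (device, stream) -> handle object
        self.capturing = set()   # (device, stream) pairs currently capturing

    def get_handle(self, device, stream):
        key = (device, stream)
        if key not in self._handles:
            # Lazy creation stands in for the capture-unsafe init path.
            if key in self.capturing:
                raise CaptureError(f"handle creation during capture on {key}")
            self._handles[key] = object()
        return self._handles[key]


def capture_begin(cache, device, stream, pre_initialize=True):
    """Start a (simulated) capture, optionally creating the handle first."""
    if pre_initialize:
        # Create the handle while creation is still safe.
        cache.get_handle(device, stream)
    cache.capturing.add((device, stream))


def capture_end(cache, device, stream):
    """End the (simulated) capture."""
    cache.capturing.discard((device, stream))
```

With `pre_initialize=True`, a `get_handle` call made during capture finds the cached handle and never reaches the creation branch. With `pre_initialize=False`, the first `get_handle` call on a fresh (device, stream) pair raises `CaptureError`, which stands in for the stream-capture errors and hangs described in the PR.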

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela @azahed98

Changed files

  • aten/src/ATen/cuda/CUDAGraph.cpp (modified, +10/-0)

PR #180692: [ROCm] Resolve timeouts caused due to hipblasLT module creation during graph capture

Description (problem / solution / changelog)

Fixes #179943. Fixes #179945. Fixes #179947.

After #179053, ROCm hipBLASLt handle caching changed from per-device to per-(device, stream). As a result, the first use on a capture stream can now trigger a lazy hipblasLtCreate on that same stream. On ROCm this initialization path performs capture-unsafe internal allocation and setup, which can fail with stream-capture errors (and sometimes hang) if it runs during capture.

This PR fixes that by pre-initializing the hipBLASLt handle for the target capture stream immediately before capture_begin, so handle creation never occurs inside capture.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela @azahed98

Changed files

  • aten/src/ATen/cuda/CUDAGraph.cpp (modified, +10/-0)
RAW_BUFFER

Platforms: rocm

This test was disabled because it hangs in ROCm CI.

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @mruberry @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @chauhang @bdhirsh @bobrenjc93 @aorenste

Extent analysis

TL;DR

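  • The test can likely be re-enabled on ROCm once the hipBLASLt pre-initialization fix (PR #179892 / PR #180692) is in place; both PRs list only aten/src/ATen/cuda/CUDAGraph.cpp as changed, not the test itself.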

Guidance

  • Confirm that the hang matches the cause given in the PR notes: a lazy hipblasLtCreate running on the capture stream during graph capture.
  • Consider adding timeouts or other failure detection to the test so that a regression fails fast instead of hanging CI.
  • Coordinate with the ROCm contacts cc'd on the issue (for example @ROCmSupport, @jeffdaily) on environment-specific questions.
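The timeout suggestion in the guidance above could look roughly like the following. This is a minimal sketch that uses only the Python standard library; `run_with_timeout` is a hypothetical helper, not part of PyTorch's test infrastructure, and the command passed to it is whatever test invocation applies in a given setup.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run a command and report 'passed', 'failed', or 'timeout'.

    A hang is turned into a bounded failure instead of blocking the caller.
    """
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "passed" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Example: a child process that sleeps longer than the allowed time.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hanging, timeout_s=1))
```

Running the check in a child process means a hang inside native code cannot block the parent, which is the main reason to prefer this over an in-process timer.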

Notes

  • The issue body states only that the test hangs in ROCm CI; the root-cause details come from the linked PR descriptions.
  • Changes to the test or the CI environment can be invasive and should be approached with caution.

Recommendation

  • Apply workaround: add failure detection (such as a timeout) to the test; this is more immediate and potentially less invasive than re-enabling the test unchanged.
