pytorch - ✅(Solved) Fix UNSTABLE trunk / linux-jammy-rocm-py3.10-mi355 / test (default) [1 pull request, 1 comment, 2 participants]

GitHub stats

pytorch/pytorch#179911 (fetched 2026-04-11 06:11:52)

  • Comments: 1
  • Participants: 2
  • Timeline events: 65
  • Reactions: 0
  • Timeline (top): mentioned ×27, subscribed ×27, labeled ×7, added_to_project_v2 ×2

PR fix notes

PR #179930: Use ROCM_VERSION instead of TORCH_HIP_VERSION to gate expandable segments (#179930)

Description (problem / solution / changelog)

Summary:

Use ROCM_VERSION (format: major*10000 + minor*100 + patch) instead of TORCH_HIP_VERSION (format: major*100 + minor) to gate expandable segments on ROCm.

The previous gate TORCH_HIP_VERSION < 702 was accidentally disabling expandable segments on ROCm versions 7.0.x (since TORCH_HIP_VERSION for HIP 7.0 = 700 < 702). The corrected gate ROCM_VERSION < 70000 properly limits the disable to ROCm < 7.0.0.
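The arithmetic behind the two gates can be checked with a small Python sketch. It only reproduces the two encodings and thresholds quoted in the summary; the helper names are hypothetical, and it assumes the HIP major.minor numbers match the ROCm major.minor numbers, as the PR description implies for 7.0. It is not the C++ code from c10/cuda/CUDAAllocatorConfig.h.

```python
def torch_hip_version(major: int, minor: int) -> int:
    """TORCH_HIP_VERSION-style encoding: major*100 + minor (no patch component)."""
    return major * 100 + minor


def rocm_version(major: int, minor: int, patch: int) -> int:
    """ROCM_VERSION-style encoding: major*10000 + minor*100 + patch."""
    return major * 10000 + minor * 100 + patch


def old_gate_disables(major: int, minor: int) -> bool:
    """Previous gate: expandable segments disabled when TORCH_HIP_VERSION < 702."""
    return torch_hip_version(major, minor) < 702


def new_gate_disables(major: int, minor: int, patch: int) -> bool:
    """Corrected gate: expandable segments disabled when ROCM_VERSION < 70000."""
    return rocm_version(major, minor, patch) < 70000


if __name__ == "__main__":
    # Illustrative versions only; (major, minor, patch).
    for major, minor, patch in [(6, 4, 0), (7, 0, 0), (7, 0, 2), (7, 2, 0)]:
        print(
            f"ROCm {major}.{minor}.{patch}: "
            f"old gate disables={old_gate_disables(major, minor)}, "
            f"new gate disables={new_gate_disables(major, minor, patch)}"
        )
```

Under these assumptions, the old gate treats every 7.0.x release as "too old" because the patch level never enters the comparison, while the new gate draws the line exactly at 7.0.0.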

Also adds a test skip for test_graph_rng_after_failed_capture on ROCm when expandable segments are active. This is a pre-existing issue (tracked in pytorch/pytorch#179911) where CUDA graph capture failure recovery doesn't work correctly with expandable segments on ROCm. The skip is conditioned on TEST_WITH_ROCM and self.expandable_segments so the test still runs on ROCm without expandable segments.
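A conditional skip of this shape might look roughly like the following. This is a minimal sketch using only the standard `unittest` module: `TEST_WITH_ROCM` is hard-coded here as a stand-in for the flag named in the PR, the class names and the placeholder test body are hypothetical, and the actual four lines added to test/test_cuda.py are not shown in the source.

```python
import io
import unittest

# Stand-in for the ROCm flag named in the PR description. Hard-coded (assumption)
# so the sketch runs without PyTorch installed.
TEST_WITH_ROCM = True


class WithExpandableSegments(unittest.TestCase):
    # Hypothetical class attribute mirroring `self.expandable_segments` from the PR text.
    expandable_segments = True

    def test_graph_rng_after_failed_capture(self):
        # Skip only when both conditions hold, so the test still runs on
        # ROCm without expandable segments.
        if TEST_WITH_ROCM and self.expandable_segments:
            self.skipTest("graph capture failure recovery with expandable segments on ROCm")
        # Placeholder for the real test body, which is not part of this sketch.
        self.assertTrue(True)


class WithoutExpandableSegments(WithExpandableSegments):
    expandable_segments = False


def run_sketch():
    """Run both variants and return (tests run, tests skipped, overall success)."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (WithExpandableSegments, WithoutExpandableSegments):
        suite.addTests(loader.loadTestsFromTestCase(case))
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return result.testsRun, len(result.skipped), result.wasSuccessful()


if __name__ == "__main__":
    print(run_sketch())
```

Checking the condition inside the test method, rather than with a class-level decorator, lets the skip depend on a per-class attribute such as `expandable_segments`.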

Test Plan:

  • CI (internal + GitHub Actions)
  • Build verification: buck build fbcode//caffe2/c10/cuda:cuda --config caffe2.enable_hip=true passes
  • Test skip verified: test_graph_rng_after_failed_capture will be skipped only when ROCm + expandable segments are both active

Reviewed By: mrajpal, joshuuuasu, banitag1

Differential Revision: D100184839

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

Changed files

  • c10/cuda/CUDAAllocatorConfig.h (modified, +1/-1)
  • test/test_cuda.py (modified, +4/-0)

Issue body

ROCm trunk jobs have had test failures since Mar 11, but the failing exit code wasn't being caught because of a Kineto change that came in via a submodule update.

However, more recently, Kineto had another submodule update which removed the _atexit hack, thus re-exposing the CI failures in ROCm CI runs: https://hud.pytorch.org/hud/pytorch/pytorch/404c3118fe364e66eb0d70a0e6f53b3beedd26e5/1?per_page=50&name_filter=trunk.*rocm&useRegexFilter=true&mergeEphemeralLF=true

Marking the trunk ROCm jobs as unstable while we work to get the signal back to green.

Related issue: https://github.com/pytorch/pytorch/issues/179723

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @seemethere @malfet @pytorch/pytorch-dev-infra @malfet @atalman

extent analysis

TL;DR

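  • ROCm trunk jobs have had test failures since Mar 11. An earlier Kineto submodule update kept the failing exit code from being caught, and a more recent Kineto update that removed the _atexit hack re-exposed the failures. The trunk ROCm jobs are marked unstable until the signal is green again.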

Guidance

  • Treat the Kineto submodule update that removed the _atexit hack as the change that re-exposed the failures, not as their cause; the tests had been failing since Mar 11.
  • Review the related issue https://github.com/pytorch/pytorch/issues/179723 for context.
  • Check the HUD link in the issue body for the failing ROCm trunk runs.
  • See PR #179930, which corrects the expandable segments gate and skips test_graph_rng_after_failed_capture on ROCm when expandable segments are active.

Example

  • The issue itself contains no code. The linked PR #179930 changes one line in c10/cuda/CUDAAllocatorConfig.h and adds four lines to test/test_cuda.py.

Notes

  • The issue is specific to the ROCm trunk jobs and may not affect other areas of the project.
  • The recent Kineto submodule update did not cause the failures; it removed the _atexit hack that had been keeping the failing exit code from being caught.

Recommendation

  • Apply workaround: Keep the trunk ROCm jobs marked unstable until the underlying test failures are fixed and the signal is back to green.
