pytorch - ✅(Solved) Fix UNSTABLE trunk / linux-jammy-rocm-py3.10-mi355 / test (default) [1 pull request, 1 comment, 2 participants]

pytorch 2026-04-10 13:41:13


GitHub stats

pytorch/pytorch#179911 • Fetched 2026-04-11 06:11:52

Author

jithunnair-amd

Participants

jithunnair-amd

pytorch-bot[bot]

Timeline (top)

mentioned ×27, subscribed ×27, labeled ×7, added_to_project_v2 ×2

PR fix notes

PR #179930: Use ROCM_VERSION instead of TORCH_HIP_VERSION to gate expandable segments (#179930)

Repository: pytorch/pytorch
Author: gnuthor
State: closed | merged: False
Link: https://github.com/pytorch/pytorch/pull/179930

Description (problem / solution / changelog)

Summary:

Use ROCM_VERSION (format: major*10000 + minor*100 + patch) instead of TORCH_HIP_VERSION (format: major*100 + minor) to gate expandable segments on ROCm.

The previous gate TORCH_HIP_VERSION < 702 was accidentally disabling expandable segments on ROCm versions 7.0.x (since TORCH_HIP_VERSION for HIP 7.0 = 700 < 702). The corrected gate ROCM_VERSION < 70000 properly limits the disable to ROCm < 7.0.0.
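The arithmetic behind the two gates can be checked with a small Python sketch. It models only the integer encodings quoted above; the real gate is a C++ preprocessor check in c10/cuda/CUDAAllocatorConfig.h, and the helper names and sample versions here are hypothetical, chosen for illustration.

```python
# Sketch only: models the two version encodings described in the PR summary.
# Function names and sample versions are illustrative, not part of PyTorch.

def torch_hip_version(major: int, minor: int) -> int:
    """TORCH_HIP_VERSION-style encoding: major*100 + minor."""
    return major * 100 + minor


def rocm_version(major: int, minor: int, patch: int) -> int:
    """ROCM_VERSION-style encoding: major*10000 + minor*100 + patch."""
    return major * 10000 + minor * 100 + patch


def old_gate_disables(major: int, minor: int) -> bool:
    """Previous gate: expandable segments disabled when TORCH_HIP_VERSION < 702."""
    return torch_hip_version(major, minor) < 702


def new_gate_disables(major: int, minor: int, patch: int) -> bool:
    """Corrected gate: expandable segments disabled when ROCM_VERSION < 70000."""
    return rocm_version(major, minor, patch) < 70000


if __name__ == "__main__":
    # Sample versions picked for illustration.
    for major, minor, patch in [(6, 4, 0), (7, 0, 0), (7, 0, 2), (7, 2, 0)]:
        print(
            f"{major}.{minor}.{patch}: "
            f"old gate disables={old_gate_disables(major, minor)}, "
            f"new gate disables={new_gate_disables(major, minor, patch)}"
        )
```

Under these encodings, 7.0.x is the case where the two gates disagree: the old check disables expandable segments (700 < 702) while the new one leaves them enabled (70000 is not below 70000).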

Also adds a test skip for test_graph_rng_after_failed_capture on ROCm when expandable segments are active. This is a pre-existing issue (tracked in pytorch/pytorch#179911) where CUDA graph capture failure recovery doesn't work correctly with expandable segments on ROCm. The skip is conditioned on TEST_WITH_ROCM and self.expandable_segments so the test still runs on ROCm without expandable segments.
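A conditional skip of this shape might look like the following sketch. It uses only the standard `unittest` module; the factory function, the skip message, and the omitted test body are assumptions made for illustration, and the actual change in test/test_cuda.py may be written differently (for example with a decorator).

```python
import unittest


def make_test_case(test_with_rocm: bool, expandable_segments: bool):
    """Build a stand-in test class (hypothetical helper, not PyTorch code).

    `test_with_rocm` plays the role of TEST_WITH_ROCM and the class attribute
    `expandable_segments` plays the role of self.expandable_segments.
    """

    class GraphRngSketch(unittest.TestCase):
        def test_graph_rng_after_failed_capture(self):
            # Skip only when both conditions hold, so the test still runs
            # on ROCm without expandable segments and on non-ROCm builds.
            if test_with_rocm and self.expandable_segments:
                self.skipTest("failed-capture recovery with expandable segments on ROCm")
            # The real test body is omitted in this sketch.

    GraphRngSketch.expandable_segments = expandable_segments
    return GraphRngSketch


def count_skips(case_cls) -> int:
    """Run the test class and return how many tests were skipped."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case_cls)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.skipped)


if __name__ == "__main__":
    for rocm in (False, True):
        for expandable in (False, True):
            skipped = count_skips(make_test_case(rocm, expandable))
            print(f"rocm={rocm}, expandable_segments={expandable}: skipped={skipped}")
```

Checking the condition inside the test body, rather than in a class-level decorator, lets it depend on a per-instance attribute such as `self.expandable_segments`.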

Test Plan:

CI (internal + GitHub Actions)
Build verification: buck build fbcode//caffe2/c10/cuda:cuda --config caffe2.enable_hip=true passes
Test skip verified: test_graph_rng_after_failed_capture will be skipped only when ROCm + expandable segments are both active

Reviewed By: mrajpal, joshuuuasu, banitag1

Differential Revision: D100184839

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

Changed files

c10/cuda/CUDAAllocatorConfig.h (modified, +1/-1)
test/test_cuda.py (modified, +4/-0)

RAW_BUFFER

ROCm trunk jobs were having test failures since Mar 11, but the exit code wasn't being caught due to a change in Kineto that came via a submodule update.

However, more recently, Kineto had another submodule update which removed the _atexit hack, thus re-exposing the CI failures in ROCm CI runs: https://hud.pytorch.org/hud/pytorch/pytorch/404c3118fe364e66eb0d70a0e6f53b3beedd26e5/1?per_page=50&name_filter=trunk.*rocm&useRegexFilter=true&mergeEphemeralLF=true

Marking the trunk ROCm jobs to unstable while we work to get the signal back to green.

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @seemethere @malfet @pytorch/pytorch-dev-infra @malfet @atalman

extent analysis

TL;DR

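ROCm trunk jobs have had test failures since Mar 11, but a Kineto change pulled in by a submodule update kept the failing exit code from being caught. A later Kineto submodule update removed the _atexit hack and re-exposed the failures, so the trunk ROCm jobs are being marked unstable until the signal is green again.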

Guidance

Investigate the recent submodule update in Kineto that removed the _atexit hack to understand its impact on the ROCm CI runs.
Review the related issue https://github.com/pytorch/pytorch/issues/179723 for potential solutions or workarounds.
Check the HUD links provided for more information on the test failures and CI runs.
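Treat the Kineto update as what revealed the failures, not what caused them: reverting it would only hide the failing exit code again. The mitigation used in the issue is to mark the trunk ROCm jobs unstable while the underlying test failures are fixed.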

Example

The issue itself contains no code. The linked PR #179930 changes one line in c10/cuda/CUDAAllocatorConfig.h and adds four lines to test/test_cuda.py.

Notes

The issue is specific to the ROCm trunk jobs and may not affect other areas of the project.
The recent Kineto submodule update did not introduce the failures; by removing the _atexit hack it re-exposed failures that had been present since Mar 11.

Recommendation

Apply workaround: keep the trunk ROCm jobs marked unstable, as the issue does, until the underlying test failures are fixed and the signal is back to green.
