pytorch - ✅(Solved) Fix [CI] Test failure may not surface as GHA job failure [1 pull request, 1 comment, 2 participants]

GitHub stats
pytorch/pytorch#179723 · Fetched 2026-04-09 07:50:18
Comments: 1
Participants: 2
Timeline: 60
Reactions: 0
Timeline (top): subscribed ×43, mentioned ×10, labeled ×4, added_to_project_v2 ×1

Fix Action

Fixed

PR fix notes

PR #179774: [ROCm] Fix MultiProcessTestCase exit handling

Description (problem / solution / changelog)

Make MultiProcessTestCase preserve true worker exit status by using os._exit() for skip/error/SystemExit paths, improve parent-side error collection when traceback pipes are empty, and require consistent all-rank skip/exit codes. Reset MultiProcContinuousTest class state between subclasses and prioritize real failures over skips when draining completion queues to avoid stale worker state and masked errors.
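As a rough illustration of why the description reaches for os._exit(), the sketch below shows how a worker's exit status can be lost when a wrapper catches SystemExit, and how os._exit() avoids that. The worker scripts, the exit code 10, and the helper name worker_exit_code are hypothetical and are not taken from common_distributed.py; this is a minimal sketch of the general mechanism, not the PR's implementation.

```python
import subprocess
import sys
import textwrap

# Hypothetical worker: a harness-style wrapper catches BaseException, so the
# SystemExit raised by sys.exit(10) never reaches the interpreter and the
# process ends with status 0.
SWALLOWED = textwrap.dedent("""
    import sys

    def run_test():
        sys.exit(10)  # the test tries to signal failure via its exit code

    try:
        run_test()
    except BaseException:
        pass  # the wrapper swallows SystemExit
""")

# Same hypothetical worker, but the failure path calls os._exit(), which ends
# the process immediately with the given status; no exception is raised, so
# the wrapper has nothing to swallow.
PRESERVED = textwrap.dedent("""
    import os

    def run_test():
        os._exit(10)

    try:
        run_test()
    except BaseException:
        pass
""")


def worker_exit_code(source: str) -> int:
    """Run the source in a fresh interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", source]).returncode


swallowed_code = worker_exit_code(SWALLOWED)
preserved_code = worker_exit_code(PRESERVED)
print(swallowed_code, preserved_code)  # prints "0 10"
```

Note that os._exit() skips cleanup handlers and does not flush stdio buffers, so anything the worker needs to report should be written and flushed before it is called.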

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang

Changed files

  • torch/testing/_internal/common_distributed.py (modified, +96/-34)
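
The parent-side rules in the PR description (consistent all-rank skip/exit codes, real failures prioritized over skips) can be pictured with a small aggregation function. Everything here is an assumption made for illustration: the Outcome enum, the SKIP_CODE value, and the aggregate function do not come from the PR, and the actual logic in common_distributed.py may differ.

```python
from enum import Enum
from typing import Sequence


class Outcome(Enum):
    PASS = "pass"
    SKIP = "skip"
    FAIL = "fail"


# Hypothetical exit code a worker uses to report "skipped"; not from the PR.
SKIP_CODE = 77


def classify(exit_code: int) -> Outcome:
    """Map one rank's exit code to an outcome."""
    if exit_code == 0:
        return Outcome.PASS
    if exit_code == SKIP_CODE:
        return Outcome.SKIP
    return Outcome.FAIL


def aggregate(exit_codes: Sequence[int]) -> Outcome:
    """Combine per-rank exit codes into a single test outcome."""
    if not exit_codes:
        raise ValueError("no ranks reported an exit code")
    outcomes = [classify(code) for code in exit_codes]
    # A real failure on any rank wins over skips on other ranks.
    if Outcome.FAIL in outcomes:
        return Outcome.FAIL
    if Outcome.SKIP in outcomes:
        # A skip counts only if every rank agrees; a mix of skip and pass
        # is treated as a failure rather than silently accepted.
        if all(o is Outcome.SKIP for o in outcomes):
            return Outcome.SKIP
        return Outcome.FAIL
    return Outcome.PASS


print(aggregate([0, 0]).value)
print(aggregate([SKIP_CODE, SKIP_CODE]).value)
print(aggregate([SKIP_CODE, 1]).value)
print(aggregate([SKIP_CODE, 0]).value)
```

The design choice in this sketch is to treat disagreement between ranks as a failure, so that one rank skipping cannot hide another rank's result.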
RAW_BUFFER

Test failures in CI can be silently swallowed, so a GHA job can report success even though a test fails consistently. @jithunnair-amd observed this on a ROCm CI job where test_load_njt_weights_only_should_import_False failed on all 3 attempts, yet the job concluded as success.

cc @ezyang @gchanan @kadeng @msaroufim @seemethere @pytorch/pytorch-dev-infra

extent analysis

TL;DR

Investigate and adjust the CI job configuration to ensure test failures are properly reported and not silently swallowed.

Guidance

  • Review the ROCm CI job configuration to identify why test failures are not being reported correctly.
  • Check the job's logging and error handling settings to ensure that failures are being properly captured and reported.
  • Verify that the test test_load_njt_weights_only_should_import_False is correctly configured and executed within the CI job.
  • Consider adding additional logging or debugging statements to help diagnose the issue.
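
One way to check the reporting path described in the guidance above is to confirm that a retry wrapper returns a non-zero status when every attempt fails, since a GitHub Actions run step is marked failed when its command exits non-zero. The run_with_retries helper and the always-failing command are hypothetical stand-ins, not part of PyTorch's CI scripts; this is a sketch under those assumptions.

```python
import subprocess
import sys
from typing import List


def run_with_retries(cmd: List[str], attempts: int = 3) -> int:
    """Run cmd up to `attempts` times; return 0 on the first success,
    otherwise the exit code of the last failed attempt."""
    last_code = 1  # stays non-zero if no attempt is made
    for attempt in range(1, attempts + 1):
        last_code = subprocess.run(cmd).returncode
        if last_code == 0:
            return 0
        print(
            f"attempt {attempt}/{attempts} failed with exit code {last_code}",
            file=sys.stderr,
        )
    return last_code


# Hypothetical stand-in for a test command that fails on every attempt.
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
result = run_with_retries(always_fails)
print(result)  # prints "1"
```

In a CI script the return value would be passed on, for example with sys.exit(run_with_retries(cmd)), so that three failed attempts end in a failed step rather than a successful one.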

Notes

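The issue text does not identify a root cause. The linked PR #179774 changes worker exit handling in MultiProcessTestCase and state handling in MultiProcContinuousTest, which suggests the failure was masked in the test harness rather than in the CI job configuration. Further investigation is needed to confirm whether the job configuration also plays a part.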

Recommendation

Apply workaround: Modify the CI job configuration to improve test failure reporting and handling, such as adding additional logging or error handling, until the root cause can be identified and addressed.
