pytorch - ✅(Solved) Fix [ROCm][Windows] Track proper fix for HIP per-config flag injection bypass (follow-up to #183856) [1 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
pytorch/pytorch#184397Fetched 2026-05-20 03:38:57
View on GitHub
Comments
0
Participants
1
Timeline
64
Reactions
3
Author
Participants
Timeline (top)
mentioned ×28subscribed ×28labeled ×6added_to_project_v2 ×1

Root Cause

Consequences of leaving the root cause unfixed:

Fix Action

Fix / Workaround

#183856 landed a single-config hotfix for the ~99% conv2d FP16 perf regression on Windows + ROCm (gfx1150) tracked in #183853 / ROCm/TheRock#5157. The hotfix moves -O3 -DNDEBUG (and the other per-config flags) out of CMAKE_HIP_FLAGS_<CONFIG> and into CMAKE_HIP_FLAGS, gated on CMAKE_BUILD_TYPE. That restored the optimization level on the Windows ROCm Ninja build that PyTorch CI actually uses, but it does not address the underlying bug. This issue tracks the proper fix.

Why the merged change is a workaround

  1. Stop overriding CMAKE_HIP_COMPILE_OBJECT entirely. Investigate which combination of platform variables (CMAKE_HIP_COMPILER_FRONTEND_VARIANT, include-flag variables, depfile-format variables, MSVC runtime mappings, ...) causes the Windows-Clang-HIP platform module to emit GNU-style rules directly. If that is achievable, the per-config flags flow through CMake's normal pipeline and the multi-config case works too.
  2. Keep the override but route per-config flags through it. Expand ${CMAKE_HIP_FLAGS_${CMAKE_BUILD_TYPE}} (single-config) or the appropriate per-config equivalent into the template. This is still a workaround; it does not fix multi-config without additional plumbing.
  3. File the underlying issue upstream against CMake if (1) turns out to require a CMake-side change. The current override exists because Windows-Clang-HIP.cmake emits MSVC-style rules that clang++ (GNU frontend) for HIP rejects; if that platform module needs to grow GNU-frontend support, that work belongs in upstream CMake.

PR fix notes

PR #183856: [ROCm][Windows] Apply per-config HIP optimization flags via CMAKE_HIP_FLAGS

Description (problem / solution / changelog)

Issue

Fixes ROCm/TheRock#5157. Fixes https://github.com/pytorch/pytorch/issues/183853.

Windows ROCm HIP code was effectively compiled at -O0 after #180485 + #183365 landed, causing a ~99% conv2d FP16 perf regression on gfx1150 (bisected and validated by @astrelsky and @adityas-amd at ROCm/TheRock#5157).

Diagnosis

After #183365 stripped the MSVC-style /O2 /Ob2 /DNDEBUG out of CMAKE_HIP_FLAGS_RELEASE and re-appended GNU -O3 -DNDEBUG, the actual build command still came up unoptimized. Inspecting the generated build.ninja shows why:

  • CMAKE_HIP_FLAGS contents (e.g. --offload-compress -std=c++20 --offload-arch=...) do reach the $FLAGS ninja variable used by every HIP compile.
  • CMAKE_HIP_FLAGS_<CONFIG> contents (e.g. -O3 -DNDEBUG) do not.

That's the actual cause of the -O0 fallback — CMAKE_HIP_FLAGS_RELEASE is dead code on the Windows HIP compile path.

Fix

  • Drop the dead string(APPEND CMAKE_HIP_FLAGS_<CONFIG> ...) lines.
  • Inject the appropriate -O{3,2,0,s} plus -DNDEBUG/-g directly into CMAKE_HIP_FLAGS, gated on CMAKE_BUILD_TYPE. This works for single-config generators (Ninja, Makefiles), which is what the Windows ROCm build uses.
  • Drop the -O3 from the compile rule template so non-Release configs aren't tainted.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @adityas-amd @astrelsky

Changed files

  • cmake/public/LoadHIP.cmake (modified, +23/-6)

Code Example

set(CMAKE_HIP_COMPILE_OBJECT
  "<CMAKE_HIP_COMPILER> <DEFINES> <INCLUDES> <FLAGS> -o <OBJECT> -x hip -c <SOURCE>")
RAW_BUFFERClick to expand / collapse

Background

#183856 landed a single-config hotfix for the ~99% conv2d FP16 perf regression on Windows + ROCm (gfx1150) tracked in #183853 / ROCm/TheRock#5157. The hotfix moves -O3 -DNDEBUG (and the other per-config flags) out of CMAKE_HIP_FLAGS_<CONFIG> and into CMAKE_HIP_FLAGS, gated on CMAKE_BUILD_TYPE. That restored the optimization level on the Windows ROCm Ninja build that PyTorch CI actually uses, but it does not address the underlying bug. This issue tracks the proper fix.

Why the merged change is a workaround

In cmake/public/LoadHIP.cmake the Windows path overrides the HIP compile rule template:

set(CMAKE_HIP_COMPILE_OBJECT
  "<CMAKE_HIP_COMPILER> <DEFINES> <INCLUDES> <FLAGS> -o <OBJECT> -x hip -c <SOURCE>")

CMake's default rule writer concatenates CMAKE_HIP_FLAGS and CMAKE_HIP_FLAGS_<CONFIG> into the FLAGS ninja variable that <FLAGS> expands to. Overriding CMAKE_HIP_COMPILE_OBJECT replaces that rule-generation machinery, and on this code path CMAKE_HIP_FLAGS_<CONFIG> no longer reaches <FLAGS> (verified via the generated build.ninja). The merged fix sidesteps this by writing the per-config flags directly into CMAKE_HIP_FLAGS once CMAKE_BUILD_TYPE is known.

Consequences of leaving the root cause unfixed:

  • Multi-config generators are not fixed. The if(NOT CMAKE_CONFIGURATION_TYPES) guard correctly avoids baking a single config's flags into a multi-config build tree (Visual Studio, Ninja Multi-Config), but those generators still receive no per-config HIP optimization flags. PyTorch's Windows ROCm CI uses single-config Ninja, so this is latent rather than active.
  • Any other CMAKE_HIP_FLAGS_<CONFIG> contribution is silently dropped. Anything downstream that appends to CMAKE_HIP_FLAGS_RELEASE (e.g. a user-supplied toolchain file, a future PyTorch change, or a CMake module that sets per-config defaults) will not take effect on the Windows HIP compile line.
  • The override drifts away from CMake's intended model. As CMake's native HIP support evolves, this divergence will keep accumulating special cases.

Possible proper fixes

In rough order of preference:

  1. Stop overriding CMAKE_HIP_COMPILE_OBJECT entirely. Investigate which combination of platform variables (CMAKE_HIP_COMPILER_FRONTEND_VARIANT, include-flag variables, depfile-format variables, MSVC runtime mappings, ...) causes the Windows-Clang-HIP platform module to emit GNU-style rules directly. If that is achievable, the per-config flags flow through CMake's normal pipeline and the multi-config case works too.
  2. Keep the override but route per-config flags through it. Expand ${CMAKE_HIP_FLAGS_${CMAKE_BUILD_TYPE}} (single-config) or the appropriate per-config equivalent into the template. This is still a workaround; it does not fix multi-config without additional plumbing.
  3. File the underlying issue upstream against CMake if (1) turns out to require a CMake-side change. The current override exists because Windows-Clang-HIP.cmake emits MSVC-style rules that clang++ (GNU frontend) for HIP rejects; if that platform module needs to grow GNU-frontend support, that work belongs in upstream CMake.

Validation

Whatever the fix, it should:

  • Restore conv2d FP16 perf on gfx1150 (the regression in #183853) on a clean Windows ROCm build with no per-config flag injection in CMAKE_HIP_FLAGS.
  • Verify by inspecting the generated build.ninja that -O3 -DNDEBUG (Release) appears in the FLAGS variable used by the HIP compile rule via CMake's standard mechanism, not via direct injection into CMAKE_HIP_FLAGS.
  • Spot-check a multi-config generator if one is reachable in CI or locally (Ninja Multi-Config is the cheapest).

References

cc @malfet @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @nkhasbag-nv @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @adityas-amd @astrelsky @jeffdaily

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING