pytorch - ✅(Solved) Fix RFC: Power (ppc64le) CI and Build Infrastructure Plan [1 pull requests, 1 participants]

GitHub stats
pytorch/pytorch#180362 · Fetched 2026-04-16 06:35:04
Comments: 0 · Participants: 1 · Timeline: 27 · Reactions: 0
Timeline (top): subscribed ×15, labeled ×7, mentioned ×4, added_to_project_v2 ×1

This RFC describes a phased approach for enabling Power (ppc64le) CI testing and build infrastructure in PyTorch. The plan incrementally introduces upstream CI coverage—starting from fork validation, moving to upstream on‑demand pull‑request testing, and eventually enabling nightly builds—while minimizing risk and operational overhead.

The goal is to make Power CI visible upstream while keeping ownership, operations, and infrastructure fully vendor‑managed by IBM.



Fix Action

Fix / Workaround

Risks and Mitigations

Risk | Mitigation
CI flakiness | Start with non‑blocking, opt‑in execution
Increased CI noise | Label‑based, on‑demand triggering
Unclear ownership | Explicit IBM‑only ownership and maintenance

PR fix notes

PR #173519: Add on-demand ppc64le wheel build support

Description (problem / solution / changelog)

Fixes #ISSUE_NUMBER

PR Description

This PR introduces initial support for building ppc64le wheels in the CI/CD pipeline using a self-hosted runner provisioned under the existing s390x account, following guidance from the CI team.

The goal is to enable Power (ppc64le) architecture compatibility for wheel builds while keeping the impact transparent and opt-in for the wider developer community.

Changes Introduced

  • ✅ Added a dedicated on-demand workflow for ppc64le wheel builds
  • ✅ Integrated ppc64le build environment into the reusable Linux build workflow
  • ✅ Added manylinux ppc64le Dockerfile and build scripts
  • ✅ Added ppc64le workflow configuration (ppc64le.yml)
  • ✅ Added scripts to configure an ephemeral self-hosted ppc64le runner, modeled after the existing s390x setup
  • ✅ Registered one ppc64le runner using the s390x credentials, as approved by the CI team

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

Changed files

  • .ci/docker/manywheel/Dockerfile_ppc64le (added, +124/-0)
  • .ci/docker/manywheel/build.sh (modified, +7/-1)
  • .ci/docker/manywheel/build_scripts/build.sh (modified, +6/-3)
  • .ci/docker/manywheel/build_scripts/manylinux1-check.py (modified, +1/-1)
  • .ci/pytorch/build.sh (modified, +3/-3)
  • .github/pytorch-probot.yml (modified, +1/-0)
  • .github/workflows/_linux-build.yml (modified, +41/-25)
  • .github/workflows/ppc64le.yml (added, +22/-0)
  • aten/src/ATen/native/quantized/cpu/qconv.cpp (modified, +7/-2)
  • aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp (modified, +6/-2)

Author: Sandeep Gupta
Maintained By: IBM Power team
Stakeholders: PyTorch CI Infra team (review and approval only)

CC: @afrittoli @malfet @atalman


Summary

This RFC describes a phased approach for enabling Power (ppc64le) CI testing and build infrastructure in PyTorch. The plan incrementally introduces upstream CI coverage—starting from fork validation, moving to upstream on‑demand pull‑request testing, and eventually enabling nightly builds—while minimizing risk and operational overhead.

The goal is to make Power CI visible upstream while keeping ownership, operations, and infrastructure fully vendor‑managed by IBM.


Motivation

PyTorch is widely used across multiple hardware architectures, including Power. Today, CI testing and build validation for Power systems are largely maintained outside of upstream PyTorch CI, typically in downstream or vendor‑managed forks.

Bringing Power CI upstream provides:

  • Earlier detection of architecture‑specific regressions
  • Increased confidence for users running PyTorch on Power systems
  • Reduced long‑term maintenance cost of downstream forks
  • Alignment with PyTorch’s multi‑architecture support goals

At the same time, introducing new architecture coverage into upstream CI must be done carefully to avoid instability, unexpected CI noise, or community maintenance burden. This RFC proposes a measured, incremental plan to reach consensus on scope and execution.


Current Status

As of today:

  • ✅ Power build and test jobs are validated on a fork
  • ✅ CI workflows and ciflow/ppc64le labels have been updated and validated
  • ✅ Jobs are passing in controlled environments
  • ❌ No Power CI coverage exists upstream in pytorch/pytorch
  • ❌ No nightly artifacts are published for Power

Technical feasibility has been established; upstream integration and governance are the next steps.


Proposal Overview

Power CI support will be introduced through incremental CI roles, each with clearly scoped responsibilities, triggers, and expectations.

Guiding Principles

  • Start with non‑blocking, opt‑in CI execution
  • Expand scope only after stability and signal quality are demonstrated
  • Avoid introducing new merge or release gating
  • Keep ownership and operational responsibility explicit

All Power CI jobs discussed below run on IBM‑operated, vendor‑managed runner infrastructure.


Technical Implementation: Runners and Infrastructure

This section summarizes how Power CI runners are implemented and operated. The design follows the same general model used for other vendor‑managed architectures, with IBM retaining full ownership and operational responsibility.

Docker Images Overview

Power CI uses two distinct container images, each serving a separate purpose:

  1. Ephemeral Runner Image
  2. PyTorch Build and Test Image

Ephemeral Runner Image

  • The ephemeral runner image is used to provision GitHub self‑hosted runners dynamically
  • This image contains:
    • GitHub Actions runner binaries
    • Runner registration and teardown logic
    • Minimal system dependencies required for runner initialization
  • The runner image is maintained outside of the pytorch/pytorch repository and is owned and operated by the IBM Power team
  • Runners are dynamically registered with GitHub at job start and deregistered on completion
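
The register-at-start, deregister-on-completion lifecycle described above can be sketched as follows. This is a minimal illustration only: `FakeRunnerRegistry`, `ephemeral_runner`, and `run_job` are hypothetical names standing in for IBM's out-of-repo automation, which the RFC does not show.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class FakeRunnerRegistry:
    """Hypothetical in-memory stand-in for a runner registration service."""
    active: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def register(self, name: str) -> None:
        self.active.add(name)
        self.log.append(("register", name))

    def deregister(self, name: str) -> None:
        self.active.discard(name)
        self.log.append(("deregister", name))


@contextmanager
def ephemeral_runner(registry: FakeRunnerRegistry, name: str):
    """Register a runner for exactly one job, then always deregister it."""
    registry.register(name)
    try:
        yield name
    finally:
        # Teardown runs even when the job raises, so no runner stays registered.
        registry.deregister(name)


def run_job(registry: FakeRunnerRegistry, name: str, job):
    """Run a single job callable inside the lifetime of one ephemeral runner."""
    with ephemeral_runner(registry, name):
        return job()
```

Putting teardown in a `finally` clause is one way to ensure a failed job cannot leave a stale runner registered.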

PyTorch Build and Test Image

  • A separate container image is used for building and testing PyTorch on Power (ppc64le)
  • This image is defined by the Dockerfile located at: .ci/docker/manywheel/Dockerfile_ppc64le
  • The image includes:
    • Required system dependencies for Power
    • PyTorch build toolchains
    • Runtime and test dependencies used by CI
  • The Dockerfile and related configuration are versioned directly in the pytorch/pytorch repository as part of the Power CI changes
  • Keeping the build image definition in‑repo ensures the Power build environment is transparent, reviewable, and aligned with upstream CI expectations
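
As a rough sketch of how tooling might pick the in-repo build image definition for the host architecture, the snippet below maps a machine string to a Dockerfile path. Only the ppc64le path comes from the RFC; the helper name `dockerfile_for` and the lookup table are assumptions for illustration, not part of the PyTorch CI scripts.

```python
import platform
from typing import Optional

# Only the ppc64le entry is taken from the RFC; other architectures are
# deliberately left unmapped in this sketch.
BUILD_DOCKERFILES = {
    "ppc64le": ".ci/docker/manywheel/Dockerfile_ppc64le",
}


def dockerfile_for(machine: Optional[str] = None) -> Optional[str]:
    """Return the build Dockerfile path for a machine type, or None if unmapped.

    When no machine is given, fall back to the host's platform.machine().
    """
    machine = machine or platform.machine()
    return BUILD_DOCKERFILES.get(machine)
```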

Image Maintenance and Updates

  • Both images are maintained by the IBM Power team
  • Updates are performed when:
    • Base OS packages require updates
    • Build or test dependencies change
    • Toolchain updates are needed
  • Changes to the PyTorch build image are validated via fork‑based CI before being used in upstream workflows

Runner Provisioning and Lifecycle

  • Runner provisioning, registration, and teardown are handled by IBM‑managed automation outside of the PyTorch repository
  • Runners are provisioned on demand and destroyed after job completion
  • IBM is responsible for runner capacity planning, scaling, and health

Integration with Upstream PyTorch

  • Runners are currently integrated with pytorch/pytorch using the existing PyTorch GitHub App, consistent with the s390x setup
  • CI jobs are triggered via standard GitHub Actions workflows
  • Power CI is opt‑in and label‑driven (ciflow/ppc64le)
  • As PyTorch CI infrastructure evolves (e.g. GHARTS when available), the Power CI integration may be aligned with those mechanisms without changing ownership or execution semantics
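
The opt-in, label-driven behaviour can be illustrated with a small predicate. This is a sketch rather than the actual trigger logic, which lives in the GitHub Actions workflows; the function name `should_run_power_ci` is hypothetical, and only the label string is taken from the RFC.

```python
from typing import Iterable

POWER_LABEL = "ciflow/ppc64le"  # label named in the RFC


def should_run_power_ci(pr_labels: Iterable[str]) -> bool:
    """Power CI runs only when the opt-in label is present on the PR."""
    return POWER_LABEL in set(pr_labels)
```

A PR without the label never schedules Power jobs, which is what keeps the default CI load unchanged.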

Runner Roles, Triggers, and Jobs

Phase 1: Tester – Fork Validation ✅ (Completed)

Role: Tester
Triggers: Manual execution on forks only

Jobs:

  • Build PyTorch (core)
  • Execute tests as defined by the Power CI workflow

Purpose:
Validate CI workflows, runner configuration, and correctness without impacting upstream PyTorch CI.


Phase 2: Tester – On‑Demand PR Testing (Upstream, Non‑Blocking)

Role: Tester

Triggers:

  • Manual, opt‑in execution on pull requests via the ciflow/ppc64le label
  • Not enabled by default on all PRs

Jobs:

  • Power CI workflow defined in .github/workflows/ppc64le.yml
  • Build PyTorch
  • Execute tests as defined by the workflow

Infrastructure:

  • Jobs run exclusively on IBM‑managed, vendor‑provided runner pools

Characteristics:

  • Non‑blocking
  • Failures are reported for visibility only
  • Intended to validate upstream CI integration and detect regressions
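
To make "non-blocking" concrete, the sketch below separates merge gating from informational reporting. The `JobResult` type, the `summarize` helper, and the job names in the usage note are illustrative assumptions; the RFC does not specify how results are aggregated.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class JobResult:
    name: str
    passed: bool
    blocking: bool  # False for opt-in jobs such as Power CI in Phase 2


def summarize(results: Iterable[JobResult]) -> Tuple[bool, List[str]]:
    """Return (mergeable, informational_failures).

    Only blocking jobs affect mergeability; failures of non-blocking jobs
    are collected for visibility and nothing else.
    """
    results = list(results)
    mergeable = all(r.passed for r in results if r.blocking)
    informational = [r.name for r in results if not r.blocking and not r.passed]
    return mergeable, informational
```

For example, a failing non-blocking `ppc64le` job alongside a passing blocking job yields `(True, ["ppc64le"])`: the failure is surfaced but does not gate the merge.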

Current Implementation:

Purpose:

  • Validate Power CI behavior on real upstream PRs
  • Collect reliability and signal‑quality data before introducing scheduled jobs

Phase 3: Publisher – Nightly Builds

Role: Publisher

Triggers: Scheduled nightly workflows

Jobs:

  • Build and publish Power nightly artifacts
  • Update nightly build workflows to include Power
  • Update common reusable CI workflows in alignment with PyTorch CI design guidelines and patterns

Infrastructure:

  • All jobs run on IBM‑operated, vendor‑managed runner pools

Benefits:

  • Enables broader downstream and user testing on Power systems
  • Provides early signal for architecture‑specific issues
  • Aligns Power CI with existing PyTorch CI workflow design

Prerequisite:

  • Demonstrated stability and reliability from Phase 2 on‑demand PR testing
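
One way the Phase 2 prerequisite could be checked is a simple pass-rate gate over recent on-demand runs. The RFC does not define thresholds (expansion criteria are listed as an open question), so `min_runs` and `min_pass_rate` below are purely hypothetical placeholders.

```python
from typing import Iterable


def pass_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of runs that passed; 0.0 when there is no data yet."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(1 for passed in outcomes if passed) / len(outcomes)


def ready_for_nightly(
    outcomes: Iterable[bool],
    min_runs: int = 50,           # hypothetical threshold, not from the RFC
    min_pass_rate: float = 0.95,  # hypothetical threshold, not from the RFC
) -> bool:
    """Require both enough samples and a high enough pass rate."""
    outcomes = list(outcomes)
    return len(outcomes) >= min_runs and pass_rate(outcomes) >= min_pass_rate
```

Requiring a minimum sample size as well as a pass rate avoids promoting to nightly builds on the strength of a handful of green runs.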

Non‑Goals (Initial Phases)

  • Making Power CI a required merge gate for pull requests
  • Including Power in official release blocking criteria
  • Achieving full test‑matrix parity with tier‑1 architectures

These may be revisited after sustained stability is demonstrated.


Risks and Mitigations

Risk | Mitigation
CI flakiness | Start with non‑blocking, opt‑in execution
Increased CI noise | Label‑based, on‑demand triggering
Unclear ownership | Explicit IBM‑only ownership and maintenance

Ownership and Maintenance

The Power CI and build infrastructure is fully owned and maintained by IBM.

IBM responsibilities include:

  • Runner provisioning and capacity management
  • CI job reliability, debugging, and failure triage
  • Workflow and configuration maintenance
  • Build and test infrastructure upkeep

There is no expectation for the PyTorch community to operate, debug, or maintain the Power CI infrastructure.


Open Questions

  • Appropriate test scope for future nightly Power builds
  • Long‑term criteria for expanding Power CI coverage
  • Conditions for any future release‑tier consideration

These topics are intentionally deferred until upstream signal is available.


Next Steps

  1. Gather feedback on this RFC issue from CI Infra and stakeholders
  2. Iterate on Phase 2 based on upstream PR signal
  3. Introduce nightly build workflows for Power when Phase 2 stabilizes
  4. Reassess scope and coverage based on observed results

cc @malfet @seemethere @pytorch/pytorch-dev-infra

Extended analysis

TL;DR

To integrate Power CI testing and build infrastructure into PyTorch, start by implementing Phase 2, which involves on-demand PR testing using the ciflow/ppc64le label, ensuring non-blocking and opt-in execution to validate upstream CI integration and detect regressions.

Guidance

  • Implement Phase 2 of the Power CI integration plan, focusing on on-demand PR testing to validate CI workflows and detect regressions without impacting upstream PyTorch CI.
  • Ensure that all Power CI jobs run exclusively on IBM-managed, vendor-provided runner pools to maintain ownership and operational responsibility.
  • Monitor the reliability and signal quality of Power CI during Phase 2 to inform the decision to proceed with Phase 3, which involves scheduled nightly builds.
  • Review and iterate on the Power CI workflow defined in .github/workflows/ppc64le.yml to ensure it aligns with PyTorch CI design guidelines and patterns.

Example

No specific code snippet is provided as the issue focuses on the overall strategy and plan for integrating Power CI into PyTorch rather than specific code changes.

Notes

The integration plan is designed to be incremental, starting with non-blocking, opt-in execution to minimize risk and operational overhead. The plan's success depends on the demonstration of stability and reliability in each phase before proceeding to the next.

Recommendation

Start with Phase 2 of the Power CI integration plan, which allows controlled testing and validation of the CI workflows without immediately introducing nightly builds or making Power CI a required merge gate for pull requests. This makes it possible to collect reliability and signal-quality data before expanding the scope of Power CI.
