pytorch - ✅(Solved) Fix RFC: Power (ppc64le) CI and Build Infrastructure Plan [1 pull requests, 1 participants]

sandeepgupta12 · 2026-04-14T18:20:14Z



pytorch/pytorch#180362 • Fetched 2026-04-16 06:35:04


Timeline (top)

subscribed ×15, labeled ×7, mentioned ×4, added_to_project_v2 ×1

This RFC describes a phased approach for enabling Power (ppc64le) CI testing and build infrastructure in PyTorch. The plan incrementally introduces upstream CI coverage—starting from fork validation, moving to upstream on‑demand pull‑request testing, and eventually enabling nightly builds—while minimizing risk and operational overhead.

The goal is to make Power CI visible upstream while keeping ownership, operations, and infrastructure fully vendor‑managed by IBM.


PR fix notes

PR #173519: Add on-demand ppc64le wheel build support

Repository: pytorch/pytorch
Author: sandeepgupta12
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/173519

Description (problem / solution / changelog)

Fixes #ISSUE_NUMBER

PR Description

This PR introduces initial support for building ppc64le wheels in the CI/CD pipeline using a self-hosted runner provisioned under the existing s390x account, following guidance from the CI team.

The goal is to enable Power (ppc64le) architecture compatibility for wheel builds while keeping the impact transparent and opt-in for the wider developer community.

Changes Introduced

✅ Added a dedicated on-demand workflow for ppc64le wheel builds
✅ Integrated ppc64le build environment into the reusable Linux build workflow
✅ Added manylinux ppc64le Dockerfile and build scripts
✅ Added ppc64le workflow configuration (ppc64le.yml)
✅ Added scripts to configure an ephemeral self-hosted ppc64le runner, modeled after the existing s390x setup
✅ Registered one ppc64le runner using the s390x credentials, as approved by the CI team

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01

Changed files

.ci/docker/manywheel/Dockerfile_ppc64le (added, +124/-0)
.ci/docker/manywheel/build.sh (modified, +7/-1)
.ci/docker/manywheel/build_scripts/build.sh (modified, +6/-3)
.ci/docker/manywheel/build_scripts/manylinux1-check.py (modified, +1/-1)
.ci/pytorch/build.sh (modified, +3/-3)
.github/pytorch-probot.yml (modified, +1/-0)
.github/workflows/_linux-build.yml (modified, +41/-25)
.github/workflows/ppc64le.yml (added, +22/-0)
aten/src/ATen/native/quantized/cpu/qconv.cpp (modified, +7/-2)
aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp (modified, +6/-2)


Author: Sandeep Gupta
Maintained By: IBM Power team
Stakeholders: PyTorch CI Infra team (review and approval only)

CC: @afrittoli @malfet @atalman

Summary

The goal is to make Power CI visible upstream while keeping ownership, operations, and infrastructure fully vendor‑managed by IBM.

Motivation

PyTorch is widely used across multiple hardware architectures, including Power. Today, CI testing and build validation for Power systems are largely maintained outside of upstream PyTorch CI, typically in downstream or vendor‑managed forks.

Bringing Power CI upstream provides:

Earlier detection of architecture‑specific regressions
Increased confidence for users running PyTorch on Power systems
Reduced long‑term maintenance cost of downstream forks
Alignment with PyTorch’s multi‑architecture support goals

At the same time, introducing new architecture coverage into upstream CI must be done carefully to avoid instability, unexpected CI noise, or community maintenance burden. This RFC proposes a measured, incremental plan to reach consensus on scope and execution.

Current Status

As of today:

✅ Power build and test jobs are validated on a fork
✅ CI workflows and ciflow/ppc64le labels have been updated and validated
✅ Jobs are passing in controlled environments
❌ No Power CI coverage exists upstream in pytorch/pytorch
❌ No nightly artifacts are published for Power

Technical feasibility has been established; upstream integration and governance are the next steps.

Proposal Overview

Power CI support will be introduced through incremental CI roles, each with clearly scoped responsibilities, triggers, and expectations.

Guiding Principles

Start with non‑blocking, opt‑in CI execution
Expand scope only after stability and signal quality are demonstrated
Avoid introducing new merge or release gating
Keep ownership and operational responsibility explicit

All Power CI jobs discussed below run on IBM‑operated, vendor‑managed runner infrastructure.

Technical Implementation: Runners and Infrastructure

This section summarizes how Power CI runners are implemented and operated. The design follows the same general model used for other vendor‑managed architectures, with IBM retaining full ownership and operational responsibility.

Docker Images Overview

Power CI uses two distinct container images, each serving a separate purpose:

Ephemeral Runner Image
PyTorch Build and Test Image

Ephemeral Runner Image

The ephemeral runner image is used to provision GitHub self‑hosted runners dynamically
This image contains:
- GitHub Actions runner binaries
- Runner registration and teardown logic
- Minimal system dependencies required for runner initialization
The runner image is maintained outside of the pytorch/pytorch repository and is owned and operated by the IBM Power team
Runners are dynamically registered with GitHub at job start and deregistered on completion
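
The register-at-start, deregister-on-completion lifecycle described above can be sketched as a context manager. This is a minimal illustration only: `RunnerRegistry`, `ephemeral_runner`, `run_job`, and the runner names are hypothetical stand-ins, not part of the IBM automation or of any GitHub API.

```python
from contextlib import contextmanager
from typing import Iterator, Set


class RunnerRegistry:
    """Hypothetical in-memory stand-in for the service that tracks registered runners."""

    def __init__(self) -> None:
        self.active: Set[str] = set()

    def register(self, name: str) -> None:
        self.active.add(name)

    def deregister(self, name: str) -> None:
        self.active.discard(name)


@contextmanager
def ephemeral_runner(registry: RunnerRegistry, name: str) -> Iterator[str]:
    """Register a runner for exactly one job and always deregister it afterwards."""
    registry.register(name)
    try:
        yield name
    finally:
        # Teardown runs even if the job raises, so no stale runner is left behind.
        registry.deregister(name)


def run_job(registry: RunnerRegistry, name: str, fail: bool = False) -> bool:
    """Run one simulated job on a fresh runner; return True on success."""
    try:
        with ephemeral_runner(registry, name):
            if fail:
                raise RuntimeError("simulated job failure")
            return True
    except RuntimeError:
        return False
```

Putting the teardown in a `finally` clause is the point of the sketch: a failed job still leaves the registry empty, which matches the "destroyed after job completion" behavior the RFC describes.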

PyTorch Build and Test Image

A separate container image is used for building and testing PyTorch on Power (ppc64le)
This image is defined by the Dockerfile located at: .ci/docker/manywheel/Dockerfile_ppc64le
The image includes:
- Required system dependencies for Power
- PyTorch build toolchains
- Runtime and test dependencies used by CI
The Dockerfile and related configuration are versioned directly in the pytorch/pytorch repository as part of the Power CI changes
Keeping the build image definition in‑repo ensures the Power build environment is transparent, reviewable, and aligned with upstream CI expectations
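
As a rough illustration of how tooling might locate the in-repo build image definition, the sketch below maps a machine type to a Dockerfile path. Only the ppc64le path comes from this RFC; the `DOCKERFILES` table and the `dockerfile_for` helper are assumptions made for the example, not code from pytorch/pytorch.

```python
import platform
from typing import Optional

# The ppc64le path is taken from the RFC; the lookup table itself is illustrative.
DOCKERFILES = {
    "ppc64le": ".ci/docker/manywheel/Dockerfile_ppc64le",
}


def dockerfile_for(machine: Optional[str] = None) -> str:
    """Return the in-repo Dockerfile path for a machine type.

    Defaults to the current host as reported by platform.machine().
    """
    machine = machine or platform.machine()
    try:
        return DOCKERFILES[machine]
    except KeyError:
        raise ValueError(
            f"no build image defined for architecture {machine!r}"
        ) from None
```

Passing the machine type explicitly, as in `dockerfile_for("ppc64le")`, keeps the helper usable on hosts that are not Power systems.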

Image Maintenance and Updates

Both images are maintained by the IBM Power team
Updates are performed when:
- Base OS packages require updates
- Build or test dependencies change
- Toolchain updates are needed
Changes to the PyTorch build image are validated via fork‑based CI before being used in upstream workflows

Runner Provisioning and Lifecycle

Runner provisioning, registration, and teardown are handled by IBM‑managed automation outside of the PyTorch repository
Runners are provisioned on demand and destroyed after job completion
IBM is responsible for runner capacity planning, scaling, and health

Integration with Upstream PyTorch

Runners are currently integrated with pytorch/pytorch using the existing PyTorch GitHub App, consistent with the s390x setup
CI jobs are triggered via standard GitHub Actions workflows
Power CI is opt‑in and label‑driven (ciflow/ppc64le)
As PyTorch CI infrastructure evolves (e.g. GHARTS when available), the Power CI integration may be aligned with those mechanisms without changing ownership or execution semantics
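
The opt-in, label-driven behavior can be reduced to a simple membership check. The sketch below shows only that decision; it is not how PyTorch's CI bot implements ciflow labels, and `should_run_power_ci` is a hypothetical helper name.

```python
from typing import Iterable

POWER_CI_LABEL = "ciflow/ppc64le"  # label name taken from the RFC


def should_run_power_ci(pr_labels: Iterable[str]) -> bool:
    """Opt-in check: run Power CI only when the PR carries the ciflow label."""
    return POWER_CI_LABEL in {label.strip() for label in pr_labels}
```

A PR without the label never schedules Power jobs, which is what keeps the feature invisible to contributors who have not asked for it.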

Runner Roles, Triggers, and Jobs

Phase 1: Tester – Fork Validation ✅ (Completed)

Role: Tester
Triggers: Manual execution on forks only

Jobs:

Build PyTorch (core)
Execute tests as defined by the Power CI workflow

Purpose:
Validate CI workflows, runner configuration, and correctness without impacting upstream PyTorch CI.

Phase 2: Tester – On‑Demand PR Testing (Upstream, Non‑Blocking)

Role: Tester

Triggers:

Manual, opt‑in execution on pull requests via the ciflow/ppc64le label
Not enabled by default on all PRs

Jobs:

Power CI workflow defined in .github/workflows/ppc64le.yml
Build PyTorch
Execute tests as defined by the workflow

Infrastructure:

Jobs run exclusively on IBM‑managed, vendor‑provided runner pools

Characteristics:

Non‑blocking
Failures are reported for visibility only
Intended to validate upstream CI integration and detect regressions
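
To make "non-blocking" concrete, the following sketch separates the merge decision from the visibility report. The `JobResult` type, the `required` flag, and the job names used with it are assumptions for illustration; the RFC does not specify how results are aggregated.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobResult:
    name: str
    passed: bool
    required: bool  # True only for jobs that gate merging


def merge_allowed(results: List[JobResult]) -> bool:
    """A PR is mergeable when every required job passed."""
    return all(r.passed for r in results if r.required)


def visibility_report(results: List[JobResult]) -> List[str]:
    """Names of failed non-required jobs, surfaced for information only."""
    return [r.name for r in results if not r.required and not r.passed]
```

Under this model a failing Power job appears in the report but never changes the result of `merge_allowed`.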

Current Implementation:

Initial upstream enablement implemented in: https://github.com/pytorch/pytorch/pull/173519

Purpose:

Validate Power CI behavior on real upstream PRs
Collect reliability and signal‑quality data before introducing scheduled jobs

Phase 3: Publisher – Nightly Builds

Role: Publisher

Triggers: Scheduled nightly workflows

Jobs:

Build and publish Power nightly artifacts
Update nightly build workflows to include Power
Update common reusable CI workflows in alignment with PyTorch CI design guidelines and patterns

Infrastructure:

All jobs run on IBM‑operated, vendor‑managed runner pools

Benefits:

Enables broader downstream and user testing on Power systems
Provides early signal for architecture‑specific issues
Aligns Power CI with existing PyTorch CI workflow design

Prerequisite:

Demonstrated stability and reliability from Phase 2 on‑demand PR testing
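
One way to turn the Phase 2 prerequisite into a checkable condition is a pass-rate gate over recorded run outcomes. The RFC deliberately leaves the criteria open, so the thresholds below (`min_runs=50`, `min_pass_rate=0.95`) and the function names are purely illustrative assumptions.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of runs that passed; outcomes holds one bool per Phase 2 run."""
    if not outcomes:
        raise ValueError("no runs recorded")
    return sum(outcomes) / len(outcomes)


def ready_for_nightly(
    outcomes: Sequence[bool],
    min_runs: int = 50,          # assumed minimum sample size
    min_pass_rate: float = 0.95,  # assumed stability threshold
) -> bool:
    """Return True once enough runs exist and enough of them passed."""
    if len(outcomes) < min_runs:
        return False
    return pass_rate(outcomes) >= min_pass_rate
```

Requiring a minimum number of runs avoids declaring stability from a handful of green jobs.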

Non‑Goals (Initial Phases)

Making Power CI a required merge gate for pull requests
Including Power in official release blocking criteria
Achieving full test‑matrix parity with tier‑1 architectures

These may be revisited after sustained stability is demonstrated.

Risks and Mitigations

Risk	Mitigation
CI flakiness	Start with non‑blocking, opt‑in execution
Increased CI noise	Label‑based, on‑demand triggering
Unclear ownership	Explicit IBM‑only ownership and maintenance

Ownership and Maintenance

The Power CI and build infrastructure is fully owned and maintained by IBM.

IBM responsibilities include:

Runner provisioning and capacity management
CI job reliability, debugging, and failure triage
Workflow and configuration maintenance
Build and test infrastructure upkeep

There is no expectation for the PyTorch community to operate, debug, or maintain the Power CI infrastructure.

Open Questions

Appropriate test scope for future nightly Power builds
Long‑term criteria for expanding Power CI coverage
Conditions for any future release‑tier consideration

These topics are intentionally deferred until upstream signal is available.

Next Steps

Gather feedback on this RFC issue from CI Infra and stakeholders
Iterate on Phase 2 based on upstream PR signal
Introduce nightly build workflows for Power when Phase 2 stabilizes
Reassess scope and coverage based on observed results

cc @malfet @seemethere @pytorch/pytorch-dev-infra

Extended analysis

TL;DR

With Phase 1 (fork validation) complete, the next step for upstream Power CI is Phase 2: on-demand PR testing triggered by the ciflow/ppc64le label. Runs are opt-in and non-blocking, and they exist to validate upstream CI integration and detect regressions.

Guidance

Implement Phase 2 of the Power CI integration plan, focusing on on-demand PR testing to validate CI workflows and detect regressions without impacting upstream PyTorch CI.
Ensure that all Power CI jobs run exclusively on IBM-managed, vendor-provided runner pools to maintain ownership and operational responsibility.
Monitor the reliability and signal quality of Power CI during Phase 2 to inform the decision to proceed with Phase 3, which involves scheduled nightly builds.
Review and iterate on the Power CI workflow defined in .github/workflows/ppc64le.yml to ensure it aligns with PyTorch CI design guidelines and patterns.

Example

No specific code snippet is provided as the issue focuses on the overall strategy and plan for integrating Power CI into PyTorch rather than specific code changes.

Notes

The integration plan is designed to be incremental, starting with non-blocking, opt-in execution to minimize risk and operational overhead. The plan's success depends on the demonstration of stability and reliability in each phase before proceeding to the next.

Recommendation

Start with Phase 2 of the Power CI integration plan. It allows controlled testing and validation of the CI workflows without introducing nightly builds or making Power CI a required merge gate for pull requests, and it produces the reliability and signal-quality data needed before the scope of Power CI is expanded.
