pytorch - 💡(How to fix) Fix LF Fleet unable to register self-hosted runners with pytorch/pytorch and pytorch/test-infra

NOTE: Remember to label this issue with "ci: sev". If you want autorevert to be disabled, keep the "ci: disable-autorevert" label.

<!-- Add the `merge blocking` label to this PR to prevent PRs from being merged while this issue is open -->

Current Status

Ongoing - LF Fleet is currently set to 0% of runners (effectively disabled) so new jobs shouldn't run on them.

Error looks like

HUD shows many jobs requesting lf.* runners queued but no runners are connecting to GitHub.
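One way to confirm this symptom outside of HUD is to ask GitHub which self-hosted runners are registered and online. The sketch below is an illustration, not the tooling used during this incident. It relies on GitHub's REST endpoint for listing a repository's self-hosted runners, which requires admin-level access to the repository. It assumes that LF Fleet runners carry labels starting with "lf." (inferred from the lf.* job requests above), and the helper names are hypothetical. Only the first page of results is fetched.

```python
from __future__ import annotations

import json
import urllib.request
from typing import Any


def fetch_runners(owner: str, repo: str, token: str) -> dict[str, Any]:
    """Fetch the first page of self-hosted runners registered to a repo."""
    url = f"https://api.github.com/repos/{owner}/{repo}/actions/runners?per_page=100"
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_online_runners(payload: dict[str, Any], label_prefix: str = "lf.") -> int:
    """Count runners that are online and have a label starting with label_prefix.

    The "lf." prefix is an assumption based on the lf.* runner names in this report.
    """
    count = 0
    for runner in payload.get("runners", []):
        label_names = [label.get("name", "") for label in runner.get("labels", [])]
        has_prefix = any(name.startswith(label_prefix) for name in label_names)
        if runner.get("status") == "online" and has_prefix:
            count += 1
    return count
```

Calling count_online_runners(fetch_runners("pytorch", "pytorch", token)) with a suitably privileged token would return 0 in the state described here, where jobs are queued but no lf.* runners are connected.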

Incident timeline (all times Pacific)

  • 7:30 am PDT LF Fleet stopped connecting to GitHub API
  • 2:30 pm PDT Notification of issue to ci-infra channel, triage started
  • 7:15 pm PDT Root cause discovered
  • 7:39 am PDT Config was updated; confirmed that LF Fleet is able to connect to GHA again

User impact

  • Jobs queued before our mitigation are hanging with no runners to pick them up.

Root cause

Repository-level self-hosted runners are disabled in the PyTorch GitHub org. This affects the pytorch/pytorch repo, because the LF Fleet needs repository-level runners to be enabled in order to attach self-hosted runners to pytorch/pytorch.

Mitigation

Set LF Fleet runner determinator to 0% so that it does not queue new jobs.
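To illustrate why a 0% setting stops new jobs from landing on the LF Fleet, here is a minimal sketch of a percentage-based rollout switch. It is not the actual runner determinator; the function names, the hashing scheme, and the "lf." label prefix are assumptions made for illustration only.

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Map a key (for example a username or PR id) to a stable bucket in 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def runner_prefix(key: str, lf_percentage: int, prefix: str = "lf.") -> str:
    """Return the runner label prefix to use for this key.

    A key is routed to the LF Fleet only when its bucket is below
    lf_percentage, so 0 routes nothing and 100 routes everything.
    """
    if not 0 <= lf_percentage <= 100:
        raise ValueError("lf_percentage must be between 0 and 100")
    return prefix if rollout_bucket(key) < lf_percentage else ""
```

With lf_percentage set to 0, runner_prefix returns an empty string for every key, so under this scheme no job would request an lf.* runner. Jobs that were already queued with an lf.* label are unaffected, which matches the user impact described above.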

Prevention/followups

How do we prevent issues like this in the future?

cc @malfet @pytorch/pytorch-dev-infra
