pytorch - ✅(Solved) Fix Rolling out OSDC (ARC) runners on pull & trunk workflows in PyTorch main [1 pull requests, 3 comments, 3 participants]

GitHub stats
pytorch/pytorch#181075 (fetched 2026-04-23 07:22:55)
Comments: 3
Participants: 3
Timeline: 26
Reactions: 1
Timeline (top): subscribed ×15, mentioned ×4, commented ×3, labeled ×3

Fix Action

Mitigation

  1. Dial down traffic to OSDC to reduce load if load issues occur
  2. Opt out pytorch bot to go back to 100% EC2 runners for critical issues
  3. Retry failed jobs to backfill signals

PR fix notes

PR #180636: [dynamo] restore Python dispatch TLS across graph breaks

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

  • -> #180636

Fix a flaky functorch+dynamo failure where Python fallback can be entered without a valid TLS snapshot, or with Python dispatch TLS leaked past the compiled frame. Use MaybeSetTLSOnEntryGuard in Python fallback, and restore the saved dispatch TLS around the compiled call so later eager ops like assertEqual/isclose/module setup do not trip the same failure after a graph break.

Add a regression test covering the graph-break path that previously left Python=True and PythonTLSSnapshot=False in local dispatch TLS.
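The save-and-restore pattern described above can be sketched in plain Python. This is only an illustration of the idea, not PyTorch's implementation: the `_tls` object, the `python` and `snapshot` flag names, and every function below are hypothetical stand-ins for the real dispatch TLS and for `MaybeSetTLSOnEntryGuard`, which live in C++.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for per-thread dispatch state; NOT PyTorch's real TLS.
_tls = threading.local()


def get_flags():
    # Return a copy of this thread's flags, defaulting to both False.
    return dict(getattr(_tls, "flags", {"python": False, "snapshot": False}))


def set_flags(flags):
    # Store a copy so callers cannot mutate the saved state by accident.
    _tls.flags = dict(flags)


@contextmanager
def restore_flags_on_exit():
    # Snapshot the flags on entry and put them back on exit,
    # even if the wrapped call raises.
    saved = get_flags()
    try:
        yield
    finally:
        set_flags(saved)


def compiled_call_with_graph_break():
    # Simulates a compiled frame that leaks inconsistent state:
    # python=True with snapshot=False, the combination named in the PR notes.
    set_flags({"python": True, "snapshot": False})
    return "result"


def run_guarded():
    # Later eager code sees the flags as they were before the compiled call.
    with restore_flags_on_exit():
        return compiled_call_with_graph_break()
```

Calling `compiled_call_with_graph_break()` directly leaves the leaked flags in place, while `run_guarded()` returns the same result with the caller's flags intact.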

This appears related to the flaky test disables in #180299, #180300, #180320, #180321, #180322, #180336, #180337, and is likely the same failure family as #180436, #180555, #180590, #180591, and #180592, though I could not reproduce any of them locally.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo @azahed98

Changed files

  • aten/src/ATen/core/PythonFallbackKernel.cpp (modified, +1/-1)
  • test/dynamo/test_error_messages.py (modified, +35/-4)
  • test/dynamo/test_python_dispatcher.py (modified, +29/-0)
  • torch/_dynamo/eval_frame.py (modified, +46/-39)

Current Status

Ongoing

What's happening

After the initial trial in https://github.com/pytorch/pytorch/issues/179855, we'll begin routing traffic from pull & trunk workflows in PyTorch main to OSDC (ARC) runners today (Apr 21st). Here's what you need to know:

Pull and trunk jobs in main will transparently go to OSDC (ARC) runners instead of EC2. We will dial it up to 50% by end of week and 100% by next Wed. Jobs running on pull requests remain unchanged. If something breaks, you can escalate to this issue and cc @seemethere @malfet @pytorch/pytorch-dev-infra @huydhn

Incident timeline (all times Pacific)

1 week until Wed 29th

User impact

No impact

Mitigation

  1. Dial down traffic to OSDC to reduce load if load issues occur
  2. Opt out pytorch bot to go back to 100% EC2 runners for critical issues
  3. Retry failed jobs to backfill signals

extent analysis

TL;DR

Dial down traffic to OSDC, or opt out the PyTorch bot, to mitigate potential load issues during the transition to ARC runners.

Guidance

  • Monitor the incident timeline and job performance to identify potential issues with the new ARC runners.
  • If issues arise, consider dialing down traffic to OSDC to mitigate load and prevent job failures.
  • For critical issues, opt out the PyTorch bot to switch back to 100% EC2 runners temporarily.
  • Retry failed jobs to backfill signals and ensure continuity of workflow.
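
The dial-down and opt-out steps above can be pictured as a percentage-based routing decision. The sketch below is a minimal illustration under stated assumptions: `pick_runner`, its parameters, and the `"osdc"` / `"ec2"` labels are hypothetical, and the issue does not describe how PyTorch's infrastructure actually selects runners.

```python
import hashlib


def pick_runner(job_id: str, osdc_percent: int, opted_out: bool = False) -> str:
    """Choose a runner pool for a job (hypothetical helper, not PyTorch's logic).

    osdc_percent is the share of jobs, 0 to 100, that should go to OSDC.
    opted_out forces the job back to EC2 regardless of the percentage.
    """
    if not 0 <= osdc_percent <= 100:
        raise ValueError("osdc_percent must be between 0 and 100")
    if opted_out:
        return "ec2"
    # Hash the job id so the same job always lands in the same bucket,
    # which keeps retries on a stable pool for a given percentage.
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "osdc" if bucket < osdc_percent else "ec2"
```

With this shape, dialing traffic down means lowering `osdc_percent`, and the opt-out is a single switch that bypasses the percentage entirely. Hashing the job id instead of drawing a random number is a design choice for reproducibility: a job that moved to EC2 at 50% stays on EC2 at any lower percentage.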

Notes

The provided information does not specify the exact technical cause of potential issues, so these suggestions focus on mitigation strategies mentioned in the issue.

Recommendation

Apply workaround: dial down traffic to OSDC, or opt out the PyTorch bot, to mitigate potential load issues. Either step allows a controlled transition and minimizes disruption to critical workflows.
