pytorch - (How to fix) [CUDA][B200][CI Test] Smoke Test Jobs Timing Out (3.5hrs+) [3 comments, 1 participant]

GitHub stats
pytorch/pytorch#178306 · Fetched 2026-04-08 01:26:11
Comments: 3 · Participants: 1 · Timeline events: 31 · Reactions: 0
Timeline (top): subscribed ×16, mentioned ×6, commented ×3, labeled ×3

The HUD currently shows B200 Smoke Test Periodic Jobs timing out non-deterministically: https://hud.pytorch.org/hud/pytorch/pytorch/main/2?per_page=50&name_filter=B200

Job1 (did not time out): https://github.com/pytorch/pytorch/actions/runs/23409308508/job/68094791890
Job2 (timed out): https://github.com/pytorch/pytorch/actions/runs/23413673218/job/68106488258
Job3 (timed out): https://github.com/pytorch/pytorch/actions/runs/23437133612/job/68181132945

cc @seemethere @malfet @pytorch/pytorch-dev-infra @eqy @drisspg @atalman @tinglvv

extent analysis

Fix Plan

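The proposed fix is to raise the timeout limit for the B200 Smoke Test Periodic Jobs. Because the time-outs are non-deterministic (one of the three linked runs finished within the limit), treat this as a mitigation: a longer limit does not by itself explain why some runs take longer than others.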

Steps to Fix

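  • Increase the timeout limit in the GitHub Actions workflow file:
    • Locate the job's timeout-minutes setting in the workflow file (e.g., .github/workflows/b200_smoke_test.yml; this file name is illustrative)
    • Raise the value above the 3.5 hours at which the jobs currently time out (e.g., timeout-minutes: 240). GitHub Actions expresses the job limit in minutes; a key such as timeout: 4h is not valid workflow syntax.
  • Example code snippet:
jobs:
  b200_smoke_test:
    runs-on: ubuntu-latest  # placeholder; the real job targets a B200 runner
    timeout-minutes: 240  # raised limit in minutes (example value)
    steps:
      # ... rest of the workflow steps ...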
  • Commit and push the changes to the repository to update the workflow configuration.

Verification

  • Verify that the updated workflow runs without timing out by checking the GitHub Actions job logs.
  • Monitor the HUD dashboard to ensure that the B200 Smoke Test Periodic Jobs are completing successfully without time-outs.
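Checking durations by hand across many runs gets tedious, so the verification step could be scripted. The following is a minimal sketch, not the project's tooling: it assumes job records shaped like the entries returned by the GitHub REST API's "list jobs for a workflow run" endpoint (fields `name`, `started_at`, `completed_at`, `conclusion`), and it flags any job whose runtime reaches a chosen fraction of the timeout. The helper name `flag_slow_jobs`, the 90% margin, the 210-minute limit (taken from the "3.5hrs+" in the title), and the sample records are all illustrative assumptions.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp such as '2026-01-01T00:00:00Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def flag_slow_jobs(jobs, limit_minutes, margin=0.9):
    """Return (name, minutes, conclusion) for every finished job whose
    runtime is at least `margin` times the timeout limit."""
    flagged = []
    for job in jobs:
        if not job.get("started_at") or not job.get("completed_at"):
            continue  # job is still queued or running
        elapsed = parse_ts(job["completed_at"]) - parse_ts(job["started_at"])
        minutes = elapsed.total_seconds() / 60
        if minutes >= margin * limit_minutes:
            flagged.append((job["name"], round(minutes, 1), job.get("conclusion")))
    return flagged


# Hypothetical sample records, not taken from the linked runs.
sample_jobs = [
    {"name": "job-a", "started_at": "2026-01-01T00:00:00Z",
     "completed_at": "2026-01-01T01:10:00Z", "conclusion": "success"},
    {"name": "job-b", "started_at": "2026-01-01T00:00:00Z",
     "completed_at": "2026-01-01T03:31:00Z", "conclusion": "cancelled"},
]

print(flag_slow_jobs(sample_jobs, limit_minutes=210))
```

Flagging on duration rather than on the `conclusion` string keeps the sketch independent of how a timed-out job is reported, and it also surfaces runs that pass but finish close to the limit.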

Extra Tips

  • Consider implementing a retry mechanism for the job to handle intermittent failures.
  • Review the job's resource utilization to ensure it's not exceeding the allocated resources, which could contribute to time-outs.
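The retry tip above can be sketched as a small wrapper that gives each attempt its own time budget, so a single hung attempt is killed early instead of consuming the whole job limit. This is an illustration under stated assumptions rather than how the PyTorch CI is wired: `run_with_retries` is a hypothetical helper, the command is a placeholder, and the attempt count and per-attempt timeout are arbitrary example values.

```python
import subprocess
import sys


def run_with_retries(cmd, attempts=3, timeout_s=60.0):
    """Run `cmd`, retrying when it exits non-zero or exceeds `timeout_s`.

    Returns the 1-based number of the attempt that succeeded,
    or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it runs longer than timeout_s seconds.
            subprocess.run(cmd, check=True, timeout=timeout_s)
            return attempt
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: timed out after {timeout_s}s", file=sys.stderr)
        except subprocess.CalledProcessError as exc:
            print(f"attempt {attempt}: exit code {exc.returncode}", file=sys.stderr)
    return None


if __name__ == "__main__":
    # Placeholder command; a real job would call its smoke-test entry point.
    ok = run_with_retries([sys.executable, "-c", "print('smoke ok')"],
                          attempts=2, timeout_s=30)
    print("succeeded on attempt", ok)
```

The total of `attempts * timeout_s` should stay below the job-level limit; otherwise the retries only move the time-out from the step to the job.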
