pytorch - ✅(Solved) Fix UNSTABLE trunk / linux-jammy-rocm-py3.10 / test (distributed) [1 pull request, 4 comments, 4 participants]

GitHub stats
pytorch/pytorch#177301 (fetched 2026-04-08 00:42:17)
Comments: 4
Participants: 4
Timeline: 86
Reactions: 0
Timeline (top): subscribed ×45, mentioned ×25, labeled ×6, commented ×4

Fix Action

Fixed

PR fix notes

PR #177101: bump kineto submodule to 0035505

Description (problem / solution / changelog)

Updates Kineto to hash 00355051f09eef00ba32c366326e73e8057421da from March 10, 2026. See: https://github.com/pytorch/kineto/tree/00355051f09eef00ba32c366326e73e8057421da

Changed files

  • third_party/kineto (modified, +1/-1)
RAW_BUFFER

A lot of timeouts are being observed on the trunk distributed jobs for ROCm: https://hud.pytorch.org/hud/pytorch/pytorch/6775069391cb18f988ad9f5b0676b398071b1fb8/1?per_page=50&name_filter=trunk.*rocm.*distributed&useRegexFilter=true&mergeEphemeralLF=true

Marking it as unstable until we get it back under control.

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry

extent analysis

Fix Plan

The linked PR (#177101) bumps the Kineto submodule. The steps below are generic mitigations for timeouts in distributed jobs on ROCm and are not taken from that PR.

Steps

  • Increase the timeout passed to torch.distributed.init_process_group (it takes a datetime.timedelta, not an integer)
  • Implement a retry mechanism for failed jobs, with a bounded number of attempts
  • Optimize the job execution to reduce overall execution time

Example Code

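import datetime

import torch
import torch.distributed as dist

# Increase the timeout; init_process_group expects a datetime.timedelta
dist.init_process_group(
    backend='nccl',  # on ROCm builds this backend name is served by RCCL
    init_method='env://',
    timeout=datetime.timedelta(minutes=5),
)

# Retry with a bounded number of attempts instead of unbounded recursion
def run_job(job, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            # Job execution code
            return job()
        except Exception:
            # Re-raise once the attempts are exhausted
            if attempt == max_attempts:
                raise

# Optimize job execution
def optimize_job_execution(model):
    # Use torch.cuda.amp for mixed precision training
    scaler = torch.cuda.amp.GradScaler()
    # Use torch.nn.DataParallel for parallelization
    model = torch.nn.DataParallel(model)
    return model, scaler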

Verification

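Verify the fix by comparing how often the trunk ROCm distributed jobs time out before and after the change (for example on the HUD view linked in the issue body) and by confirming that the jobs complete successfully.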
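One way to make "reduced timeouts" concrete is to compare the timeout rate over a window of runs before and after the change. The sketch below is only an illustration: the `summarize` helper, the conclusion strings, and the sample data are all hypothetical and hand-entered, not pulled from the HUD or from any PyTorch CI API.

```python
from collections import Counter


def summarize(conclusions):
    """Return (counts, timeout_rate) for a list of job conclusion strings."""
    counts = Counter(conclusions)
    total = sum(counts.values())
    # Guard against an empty window so the rate is always defined.
    rate = counts["timed_out"] / total if total else 0.0
    return counts, rate


# Hypothetical outcomes for the same job before and after a change.
before = ["timed_out", "success", "timed_out", "failure"]
after = ["success", "success", "timed_out", "success"]

counts_before, rate_before = summarize(before)
counts_after, rate_after = summarize(after)
print(f"timeout rate before: {rate_before:.0%}, after: {rate_after:.0%}")
```

A larger window of runs gives a more trustworthy comparison than a handful of samples, since the timeouts described in the issue are intermittent.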

Extra Tips

  • Monitor job execution metrics to identify bottlenecks
  • Adjust the timeout value and the number of retry attempts as needed
  • Consider launching jobs with torchrun, which supersedes the deprecated torch.distributed.launch
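When tuning the retry behaviour, a bounded retry with exponential backoff is a common pattern. The following is a minimal sketch under stated assumptions: `retry`, `flaky_job`, and the parameter names are hypothetical helpers written for this example, not part of torch.distributed or of the PyTorch CI tooling.

```python
import time


def retry(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call job() until it succeeds or max_attempts is reached.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The sleep function is injectable so the helper can be tested quickly.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical job that fails twice and then succeeds.
calls = {"n": 0}


def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient failure")
    return "ok"


delays = []  # record the requested waits instead of actually sleeping
result = retry(flaky_job, max_attempts=4, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Capping the number of attempts keeps a persistent failure visible instead of hiding it behind endless retries, which matters when the goal is to diagnose why jobs time out.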
