pytorch - ✅(Solved) Fix UNSTABLE trunk / linux-jammy-rocm-py3.10 / test (distributed) [1 pull requests, 4 comments, 4 participants]

pytorch 2026-03-12 19:39:17

GitHub stats

pytorch/pytorch#177301 • Fetched 2026-04-08 00:42:17

Timeline (top)

subscribed ×45, mentioned ×25, labeled ×6, commented ×4

Fix Action

Fixed

Fixed by PR: bump kineto submodule to 0035505 (https://github.com/pytorch/pytorch/pull/177101)

PR fix notes

PR #177101: bump kineto submodule to 0035505

Repository: pytorch/pytorch
Author: scotts
State: closed | merged: False
Link: https://github.com/pytorch/pytorch/pull/177101

Description (problem / solution / changelog)

Updates Kineto to hash 00355051f09eef00ba32c366326e73e8057421da from March 10, 2026. See: https://github.com/pytorch/kineto/tree/00355051f09eef00ba32c366326e73e8057421da

Changed files

third_party/kineto (modified, +1/-1)

RAW_BUFFER

A lot of timeouts are being observed on the trunk distributed jobs for ROCm: https://hud.pytorch.org/hud/pytorch/pytorch/6775069391cb18f988ad9f5b0676b398071b1fb8/1?per_page=50&name_filter=trunk.*rocm.*distributed&useRegexFilter=true&mergeEphemeralLF=true

Marking it as unstable until we get it back under control.

cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry

extent analysis

Fix Plan

The linked PR addresses the issue by bumping the Kineto submodule. The steps below are general mitigations for timeouts in distributed jobs on ROCm, not the change made in that PR.

Steps

Increase the timeout passed to torch.distributed.init_process_group
Implement a bounded retry mechanism for failed jobs
Reduce overall execution time, for example with mixed precision and data parallelism

Example Code

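import datetime

import torch
import torch.distributed as dist

# Increase the timeout; `timeout` expects a datetime.timedelta, not an integer.
# ROCm builds of PyTorch also use the 'nccl' backend name (backed by RCCL).
dist.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(minutes=5),
)

# Retry mechanism with a bounded number of attempts
def run_job(max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            # Job execution code
            pass
            return
        except Exception:
            # Re-raise once the attempts are exhausted instead of looping forever
            if attempt == max_attempts:
                raise

# Optimize job execution
def optimize_job_execution(model):
    # Use torch.cuda.amp for mixed precision training
    scaler = torch.cuda.amp.GradScaler()
    # Use torch.nn.DataParallel for parallelization
    model = torch.nn.DataParallel(model)
    return model, scaler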
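
The retry step can also be sketched on its own, without any PyTorch dependency. The helper below is a minimal illustration, not code from the PR or from PyTorch: the name `retry` and the parameters `max_attempts`, `base_delay`, and `sleep` are hypothetical. It retries a callable a fixed number of times and waits longer after each failure.

```python
import time


def retry(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call job() until it succeeds or max_attempts is reached.

    Hypothetical helper for illustration; `sleep` is injectable so the
    backoff can be exercised without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            # Give up and surface the last error once attempts are exhausted
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Bounding the attempts matters here: an unbounded retry would hide a persistent timeout rather than report it.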

Verification

Verify the fix by checking the trunk ROCm distributed jobs on the HUD view linked in the issue: timeouts should drop and the jobs should complete successfully across several consecutive commits.
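
Checking logs for timeouts can be partly automated. The sketch below counts log lines that mention a timeout. The regular expression is an assumption about how timeouts are worded, not the actual format of the CI logs, and `count_timeouts` is a hypothetical helper name.

```python
import re

# Assumed wording of timeout messages; adjust to match the real logs.
TIMEOUT_PATTERN = re.compile(r"timed out|timeout", re.IGNORECASE)


def count_timeouts(log_lines, pattern=TIMEOUT_PATTERN):
    """Return how many lines in log_lines match the timeout pattern."""
    return sum(1 for line in log_lines if pattern.search(line))
```

Comparing the count for runs before and after the fix gives a rough signal of whether timeouts have decreased.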

Extra Tips

Monitor job execution metrics to identify bottlenecks
Adjust the timeout value and retry mechanism as needed
Consider launching jobs with torchrun, which replaces the deprecated torch.distributed.launch and can restart failed workers
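
Launchers such as torchrun pass each worker its position in the job through the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The sketch below reads them with single-process defaults. The function name `read_launch_env` and the returned dictionary layout are hypothetical and only illustrate the idea.

```python
import os


def read_launch_env(env=os.environ):
    """Read the rank information a launcher places in the environment.

    Falls back to single-process defaults when the variables are absent,
    for example when the script is run directly without a launcher.
    """
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }
```

Reading these values in one place keeps the rest of the script independent of how it was launched.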
