pytorch - ✅(Solved) Fix nvshmem symm memory backend causes stuck sometimes for helion distributed kernel autotuning [1 pull requests, 4 comments, 3 participants]

pytorch · 2026-03-19 22:10:34


pytorch/pytorch#177910 • Fetched 2026-04-08 01:02:58

Timeline: 208

Timeline (top): mentioned ×94, subscribed ×94, labeled ×11, commented ×4

PR fix notes

PR #1744: add kernel-filter to select kernel for allreduce-rmsnorm

Repository: pytorch/helion
Author: shunting314
State: closed | merged: True
Link: https://github.com/pytorch/helion/pull/1744

Description (problem / solution / changelog)

Stacked PRs:

#1800
#1799
#1797
#1770
#1791
#1772
#1771
#1753
#1750
#1532
->#1744

add kernel-filter to select kernel for allreduce-rmsnorm

It's useful to pick a specific kernel to run when debugging a stuck job.

Changed files

examples/distributed/allreduce_bias_rmsnorm.py (modified, +7/-4)

Code Example

for itr in `seq 1 100`; do
    echo "===== itr $itr ========"
    HELION_DEBUG_DISTRIBUTED=1 HELION_AUTOTUNE_FOR_DISTRIBUTED_KERNEL=1 CUDA_LAUNCH_BLOCKING=1 KERNEL_FILTER=one_shot HELION_AUTOTUNE_EFFORT=quick HELION_FORCE_AUTOTUNE=1 torchrun --nproc-per-node=4 examples/distributed/allreduce_bias_rmsnorm.py
done


🐛 Describe the bug

We are trying to build a distributed kernel autotuner in Helion. One thing we found is that the NVSHMEM symmetric memory backend can sometimes cause the job to get stuck.

Repro:

On top of https://github.com/pytorch/helion/pull/1744, run

for itr in `seq 1 100`; do
    echo "===== itr $itr ========"
    HELION_DEBUG_DISTRIBUTED=1 HELION_AUTOTUNE_FOR_DISTRIBUTED_KERNEL=1 CUDA_LAUNCH_BLOCKING=1 KERNEL_FILTER=one_shot HELION_AUTOTUNE_EFFORT=quick HELION_FORCE_AUTOTUNE=1 torchrun --nproc-per-node=4 examples/distributed/allreduce_bias_rmsnorm.py
done

under the helion repo.

The issue does not trigger every time. It reproduces about once in every 5 runs.

When a job got stuck, I attached cuda-gdb to each rank. It shows:

rank 0 is stuck in the barrier_on_stream_kernel_threadgroup kernel

ranks 1, 2, and 3 are stuck in the Helion kernel for symm_mem_sync
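The per-rank state above (one rank waiting at one synchronization point, the rest waiting at another) can be mimicked with a toy model. The sketch below is only an illustration of the symptom, not the confirmed root cause: it uses Python threads in place of GPU ranks, and the two `threading.Barrier` objects are stand-ins named after the kernels in the report. A timeout is added so the model terminates instead of hanging.

```python
import threading

NUM_RANKS = 4  # matches --nproc-per-node=4 in the repro


def simulate(timeout_s: float = 0.2) -> dict:
    """Toy model: rank 0 waits at one barrier, ranks 1-3 at another.

    Each barrier expects all four ranks, so neither can ever complete.
    Returns {rank_id: name of the sync point where that rank got stuck}.
    """
    # Stand-ins for the two sync points seen in cuda-gdb (not real kernels).
    nvshmem_barrier = threading.Barrier(NUM_RANKS)
    symm_mem_sync = threading.Barrier(NUM_RANKS)
    stuck = {}

    def rank(rank_id: int) -> None:
        if rank_id == 0:
            barrier, name = nvshmem_barrier, "barrier_on_stream_kernel_threadgroup"
        else:
            barrier, name = symm_mem_sync, "symm_mem_sync"
        try:
            barrier.wait(timeout=timeout_s)
        except threading.BrokenBarrierError:
            # On a GPU there is no timeout: the rank would simply wait forever.
            stuck[rank_id] = name

    threads = [threading.Thread(target=rank, args=(r,)) for r in range(NUM_RANKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stuck


for rank_id, where in sorted(simulate().items()):
    print(f"rank {rank_id} stuck at {where}")
```

In this model every rank times out, because no barrier ever sees all four participants. Whether the real hang has the same shape is an open question in the report.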

This post shows the relationship between the barrier_on_stream_kernel_threadgroup kernel and NVSHMEM.

If I change the kernel to use the default CUDA backend, the issue goes away (30+ successful runs in a row).
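A quick calculation shows why 30 clean runs in a row is meaningful evidence. The sketch below assumes each run hangs independently with the roughly 1-in-5 rate reported above; both the independence assumption and the exact rate are simplifications, and `prob_all_pass` is a helper name introduced only for this example.

```python
def prob_all_pass(hang_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all finish without hanging."""
    return (1.0 - hang_rate) ** runs


# Assumed hang rate with the NVSHMEM backend: about 1 in 5 runs.
p30 = prob_all_pass(hang_rate=0.2, runs=30)
print(f"Chance of 30 clean runs if the bug were still present: {p30:.2%}")
```

Under these assumptions the chance of seeing 30 clean runs by luck is roughly 0.12%, so the clean streak with the CUDA backend is unlikely to be a coincidence.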

Versions

trunk: c5065b607ce2b150cd3cbad84b1bf4944dd53c1b

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @chauhang @penguinwu @oulgen @jansel @yf225 @Sibylau @choijon5

extent analysis

Fix Plan

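The issue does not pin down a root cause, so treat this plan as a hypothesis: the hang looks like a synchronization mismatch between the NVSHMEM barrier and Helion's symm_mem_sync step.

Check why rank 0 waits in barrier_on_stream_kernel_threadgroup (the NVSHMEM-related barrier kernel) while the other ranks have not entered it.
Check why ranks 1, 2, and 3 wait in the symm_mem_sync kernel, and make sure all ranks issue their synchronization steps in the same order.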

Example Code Changes

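# Illustrative pseudocode only. The `nvshmem` module, `barrier_all()`, and
# `NvshmemError` are hypothetical names, not a confirmed Python API.

# Around the barrier_on_stream_kernel_threadgroup kernel
import nvshmem  # hypothetical binding

# ...

# Every rank must reach this barrier; if one does not, the others wait forever
nvshmem.barrier_all()

# ...

# Around the symm_mem_sync kernel

# ...

try:
    # Perform synchronization
    nvshmem.barrier_all()
except nvshmem.NvshmemError as e:
    # Only reported errors land here. A deadlocked barrier raises nothing,
    # so a hang has to be detected from outside, for example with a timeout.
    print(f"Error: {e}")
    # Retry or exit as needed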

Verification

To verify the fix, run the same test command multiple times:

for itr in `seq 1 100`; do
    echo "===== itr $itr ========"
    HELION_DEBUG_DISTRIBUTED=1 HELION_AUTOTUNE_FOR_DISTRIBUTED_KERNEL=1 CUDA_LAUNCH_BLOCKING=1 KERNEL_FILTER=one_shot HELION_AUTOTUNE_EFFORT=quick HELION_FORCE_AUTOTUNE=1 torchrun --nproc-per-node=4 examples/distributed/allreduce_bias_rmsnorm.py
done

If the issue is resolved, the test should complete successfully without any stuck jobs.
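Because a stuck job never returns, the shell loop above has to be watched by hand. A minimal sketch of an automated variant follows: it runs a command repeatedly and records which iterations exceed a time limit. The environment variables and the `torchrun` command are copied from the repro; the `count_hangs` helper and the 600-second limit in the comment are assumptions made for this example.

```python
import os
import subprocess

# Environment and command taken from the repro in this issue.
REPRO_ENV = {
    "HELION_DEBUG_DISTRIBUTED": "1",
    "HELION_AUTOTUNE_FOR_DISTRIBUTED_KERNEL": "1",
    "CUDA_LAUNCH_BLOCKING": "1",
    "KERNEL_FILTER": "one_shot",
    "HELION_AUTOTUNE_EFFORT": "quick",
    "HELION_FORCE_AUTOTUNE": "1",
}
REPRO_CMD = [
    "torchrun",
    "--nproc-per-node=4",
    "examples/distributed/allreduce_bias_rmsnorm.py",
]


def count_hangs(cmd, runs, timeout_s, extra_env=None):
    """Run `cmd` `runs` times; return the 1-based iterations that timed out."""
    env = {**os.environ, **(extra_env or {})}
    hung = []
    for itr in range(1, runs + 1):
        print(f"===== itr {itr} ========")
        try:
            # subprocess.run kills the direct child when the timeout expires.
            subprocess.run(cmd, env=env, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            hung.append(itr)
    return hung


# Example call (run from the helion repo; the time limit is a guess):
# hung = count_hangs(REPRO_CMD, runs=100, timeout_s=600, extra_env=REPRO_ENV)
# print(f"{len(hung)} of 100 runs got stuck: {hung}")
```

One caveat: the timeout only terminates the launcher process it started, so worker processes spawned by `torchrun` may need separate cleanup after a hang.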

Extra Tips

Ensure that the nvshmem library is properly installed and configured.
Review the nvshmem documentation for best practices on synchronization and error handling.
Consider adding additional logging or debugging statements to help diagnose any future issues.
