pytorch - ✅(Solved) Fix nvshmem symm memory backend causes stuck sometimes for helion distributed kernel autotuning [1 pull requests, 4 comments, 3 participants]

GitHub stats
pytorch/pytorch#177910 (fetched 2026-04-08 01:02:58)
Comments: 4 · Participants: 3 · Timeline: 208 · Reactions: 0
Timeline (top): mentioned ×94, subscribed ×94, labeled ×11, commented ×4

PR fix notes

PR #1744: add kernel-filter to select kernel for allreduce-rmsnorm

Description (problem / solution / changelog)

Stacked PRs:

  • #1800
  • #1799
  • #1797
  • #1770
  • #1791
  • #1772
  • #1771
  • #1753
  • #1750
  • #1532
  • ->#1744

add kernel-filter to select kernel for allreduce-rmsnorm

It's useful to pick a specific kernel to run when debugging a stuck job.

Changed files

  • examples/distributed/allreduce_bias_rmsnorm.py (modified, +7/-4)

Code Example

for itr in `seq 1 100`; do
    echo "===== itr $itr ========"
    HELION_DEBUG_DISTRIBUTED=1 HELION_AUTOTUNE_FOR_DISTRIBUTED_KERNEL=1 CUDA_LAUNCH_BLOCKING=1 KERNEL_FILTER=one_shot HELION_AUTOTUNE_EFFORT=quick HELION_FORCE_AUTOTUNE=1 torchrun --nproc-per-node=4 examples/distributed/allreduce_bias_rmsnorm.py
done

🐛 Describe the bug

We are trying to build a distributed kernel autotuner in Helion. One thing we found is that the NVSHMEM symmetric memory backend sometimes causes the job to hang.

Repro:

  1. based on https://github.com/pytorch/helion/pull/1744
  2. run
for itr in `seq 1 100`; do
    echo "===== itr $itr ========"
    HELION_DEBUG_DISTRIBUTED=1 HELION_AUTOTUNE_FOR_DISTRIBUTED_KERNEL=1 CUDA_LAUNCH_BLOCKING=1 KERNEL_FILTER=one_shot HELION_AUTOTUNE_EFFORT=quick HELION_FORCE_AUTOTUNE=1 torchrun --nproc-per-node=4 examples/distributed/allreduce_bias_rmsnorm.py
done

under the helion repo.
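
Because a stuck run blocks the shell loop indefinitely, it can help to wrap each iteration in a timeout so hangs are counted instead of waited on. The following is a minimal sketch, not part of Helion or PyTorch: the helper names `run_once` and `repro_loop` and the 600-second default timeout are assumptions chosen for illustration, and the process-group kill relies on POSIX.

```python
import os
import signal
import subprocess
import sys


def run_once(cmd, extra_env=None, timeout_s=600.0):
    """Run cmd once and return 'ok', 'failed', or 'hang' (timeout exceeded)."""
    env = dict(os.environ, **(extra_env or {}))
    # Start a new session so the launcher and all of its worker processes
    # share one process group that can be killed together (POSIX only).
    proc = subprocess.Popen(cmd, env=env, start_new_session=True)
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        proc.wait()
        return "hang"
    return "ok" if returncode == 0 else "failed"


def repro_loop(cmd, extra_env=None, iterations=100, timeout_s=600.0):
    """Repeat run_once and tally the outcomes."""
    counts = {"ok": 0, "failed": 0, "hang": 0}
    for itr in range(1, iterations + 1):
        outcome = run_once(cmd, extra_env, timeout_s)
        counts[outcome] += 1
        print(f"===== itr {itr}: {outcome} =====")
    return counts
```

To mirror the shell loop, `cmd` would be `["torchrun", "--nproc-per-node=4", "examples/distributed/allreduce_bias_rmsnorm.py"]` and `extra_env` a dict holding the same variables (`HELION_DEBUG_DISTRIBUTED`, `HELION_AUTOTUNE_FOR_DISTRIBUTED_KERNEL`, `CUDA_LAUNCH_BLOCKING`, `KERNEL_FILTER`, `HELION_AUTOTUNE_EFFORT`, `HELION_FORCE_AUTOTUNE`). Note that killing a hung run removes the chance to attach a debugger to it, so the timeout is best used for measuring the hang rate rather than for diagnosis.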

The issue does not trigger every time. It reproduces in roughly one out of every 5 runs.

When a job gets stuck, I attached cuda-gdb to each rank, and it shows:

  1. rank0 is stuck in the barrier_on_stream_kernel_threadgroup kernel
<img width="1374" height="185" alt="Image" src="https://github.com/user-attachments/assets/da4a2512-29de-45b8-b732-b9cd0c275e5a" />
  2. rank1, rank2, and rank3 are stuck in the helion kernel for symm_mem_sync
<img width="1234" height="141" alt="Image" src="https://github.com/user-attachments/assets/1834da4c-8a8e-4b86-9633-bf4862cd16c0" />
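
The per-rank observations can be summarized by grouping ranks according to the kernel they are blocked in, which makes the odd rank out obvious. This is only an illustrative sketch: the helper `group_ranks_by_kernel` is hypothetical, the kernel names are typed in by hand from the debugger output, and "symm_mem_sync" stands in for the actual Helion kernel name, which the report does not give.

```python
from collections import defaultdict


def group_ranks_by_kernel(stuck_kernel_by_rank):
    """Map each kernel name to the sorted list of ranks blocked inside it."""
    groups = defaultdict(list)
    for rank, kernel in sorted(stuck_kernel_by_rank.items()):
        groups[kernel].append(rank)
    return dict(groups)


# Observations from the report, entered manually (kernel names abbreviated).
observed = {
    0: "barrier_on_stream_kernel_threadgroup",
    1: "symm_mem_sync",
    2: "symm_mem_sync",
    3: "symm_mem_sync",
}

for kernel, ranks in group_ranks_by_kernel(observed).items():
    print(f"{kernel}: ranks {ranks}")
```

When the ranks split across two different synchronization points like this, each group is waiting for peers that never arrive at the same point, which is consistent with the hang described above.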

This post shows the relationship between the barrier_on_stream_kernel_threadgroup kernel and NVSHMEM.

If I change the kernel to use the default CUDA backend, the issue goes away (30+ successful runs in a row).

Versions

trunk: c5065b607ce2b150cd3cbad84b1bf4944dd53c1b

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @chauhang @penguinwu @oulgen @jansel @yf225 @Sibylau @choijon5

extent analysis

Fix Plan

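The issue does not confirm a root cause, so the plan below is a proposed direction rather than a verified fix. It focuses on how the Helion kernel's synchronization interacts with NVSHMEM.

  • Check that every rank reaches the barrier_on_stream_kernel_threadgroup barrier, which is associated with NVSHMEM, in the same order relative to other synchronization.
  • Modify the Helion symm_mem_sync kernel so that it cannot deadlock against that barrier when the NVSHMEM backend is in use.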

Example Code Changes

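# Pseudocode only: the `nvshmem` module, `barrier_all()`, and `NvshmemError`
# below are illustrative placeholders, not a confirmed Python API.

# Around the barrier_on_stream_kernel_threadgroup kernel
import nvshmem

# ...

# Every rank must reach the same barrier, in the same order
nvshmem.barrier_all()

# ...

# Around the symm_mem_sync kernel

# ...

# A device-side hang does not raise a host exception, so this handler only
# covers errors the library reports; it cannot break a deadlock by itself.
try:
    # Perform synchronization
    nvshmem.barrier_all()
except nvshmem.NvshmemError as e:
    # Handle error and retry if necessary
    print(f"Error: {e}")
    # Retry or exit as needed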

Verification

To verify the fix, run the same test command multiple times:

for itr in `seq 1 100`; do
    echo "===== itr $itr ========"
    HELION_DEBUG_DISTRIBUTED=1 HELION_AUTOTUNE_FOR_DISTRIBUTED_KERNEL=1 CUDA_LAUNCH_BLOCKING=1 KERNEL_FILTER=one_shot HELION_AUTOTUNE_EFFORT=quick HELION_FORCE_AUTOTUNE=1 torchrun --nproc-per-node=4 examples/distributed/allreduce_bias_rmsnorm.py
done

If the issue is resolved, the test should complete successfully without any stuck jobs.
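
Since the hang is intermittent, a clean streak is only convincing if it is long enough. The sketch below estimates this under two assumptions that the report does not establish: that runs are independent, and that the hang rate before the fix is about one in five, as reported. The function names are hypothetical.

```python
import math


def prob_all_clean(hang_rate, runs):
    """Probability that `runs` independent runs all pass if the bug is still present."""
    return (1.0 - hang_rate) ** runs


def runs_needed(hang_rate, alpha):
    """Smallest number of consecutive clean runs whose chance probability is below alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - hang_rate))


print(prob_all_clean(0.2, 30))
print(runs_needed(0.2, 0.01))
```

With a hang rate of 0.2, about 21 consecutive clean runs bring the chance of a lucky streak below 1%, and 30 clean runs bring it to roughly 0.1%, so the 100-iteration loop gives a wide margin.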

Extra Tips

  • Ensure that the nvshmem library is properly installed and configured.
  • Review the nvshmem documentation for best practices on synchronization and error handling.
  • Consider adding additional logging or debugging statements to help diagnose any future issues.
