pytorch - 💡(How to fix) Fix `torch.compile` + `torch.distributed` communication can cause random segmentation fault at process exit [3 comments, 3 participants]

GitHub stats
pytorch/pytorch#178859 · Fetched 2026-04-08 01:57:33
Comments: 3
Participants: 3
Timeline: 217
Reactions: 0
Timeline (top): mentioned ×103, subscribed ×103, labeled ×7, commented ×3

I found a likely PyTorch bug where using torch.compile(create_block_mask) together with distributed communication (e.g. dist.broadcast) can cause a random segmentation fault when the program exits.
The failure is intermittent but happens with relatively high probability across repeated runs.

Error Message

SIGSEGV(11), PID: 3163834, Thread 3163834:
frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8f (0x7fa99c6d5a9f in /opt/conda/envs/torch-base/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x12990 (0x7fa9e0cf1990 in /usr/lib64/libpthread.so.0)
frame #2: waitpid + 0x52 (0x7fa9e0cf1312 in /usr/lib64/libpthread.so.0)
frame #3: /opt/conda/envs/torch-base/bin/python3() [0x6455e7]
frame #4: _PyEval_EvalFrameDefault + 0x15ed (0x521f7d in /opt/conda/envs/torch-base/bin/python3)
frame #5: /opt/conda/envs/torch-base/bin/python3() [0x617f90]
frame #6: Py_FinalizeEx + 0x77 (0x5fed57 in /opt/conda/envs/torch-base/bin/python3)
frame #7: Py_Exit + 0x8 (0x619728 in /opt/conda/envs/torch-base/bin/python3)
frame #8: /opt/conda/envs/torch-base/bin/python3() [0x616eeb]
frame #9: /opt/conda/envs/torch-base/bin/python3() [0x616b41]
frame #10: PyRun_SimpleStringFlags + 0x5c (0x60fc7c in /opt/conda/envs/torch-base/bin/python3)
frame #11: Py_RunMain + 0x4e1 (0x60d5f1 in /opt/conda/envs/torch-base/bin/python3)
frame #12: Py_BytesMain + 0x39 (0x5c4809 in /opt/conda/envs/torch-base/bin/python3)
frame #13: __libc_start_main + 0xe5 (0x7fa9e01d77e5 in /usr/lib64/libc.so.6)
frame #14: /opt/conda/envs/torch-base/bin/python3() [0x5c4635]

SIGSEGV(11), PID: 3163834, Thread 3164280:
frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x8f (0x7fa99c6d5a9f in /opt/conda/envs/torch-base/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x12990 (0x7fa9e0cf1990 in /usr/lib64/libpthread.so.0)
frame #2: pthread_cond_wait + 0x161 (0x7fa9e0ced371 in /usr/lib64/libpthread.so.0)
frame #3: <unknown function> + 0x3506fb (0x7fa8ee2f66fb in /opt/conda/envs/torch-base/lib/python3.12/site-packages/numpy/core/../../numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so)
frame #4: <unknown function> + 0x81ca (0x7fa9e0ce71ca in /usr/lib64/libpthread.so.0)
frame #5: clone + 0x43 (0x7fa9e01d68d3 in /usr/lib64/libc.so.6)
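
When reading traces like these, it can help to list which shared libraries each thread passes through. The following is a minimal sketch, not part of the original report: the helper name `libraries_in_trace` is hypothetical, and it assumes symbolized frames end with a `(0x<address> in <path>)` suffix, as they do in the output above.

```python
import os
import re

# Matches the "(0x<address> in <path>)" suffix that symbolized frames carry.
_LIB_RE = re.compile(r"\(0x[0-9a-f]+ in ([^)]+)\)")


def libraries_in_trace(trace):
    """Return the distinct shared-object basenames in a trace, in first-seen order."""
    seen = []
    for path in _LIB_RE.findall(trace):
        name = os.path.basename(path)
        if name not in seen:
            seen.append(name)
    return seen


# Two frames copied from the second thread's trace above.
sample = (
    "frame #2: pthread_cond_wait + 0x161 (0x7fa9e0ced371 in /usr/lib64/libpthread.so.0) "
    "frame #3: <unknown function> + 0x3506fb (0x7fa8ee2f66fb in /opt/conda/envs/torch-base/"
    "lib/python3.12/site-packages/numpy/core/../../numpy.libs/"
    "libopenblas64_p-r0-0cf96a72.3.23.dev.so)"
)
print(libraries_in_trace(sample))
```

Frames without a resolved library, such as the bare `python3() [0x...]` entries, are skipped by this pattern, so the result only covers frames the handler could symbolize.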

Code Example

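# Launch with: torchrun --nproc_per_node=8 --master_port=12345 reproduce.py
import os

import torch
from torch import distributed as dist
from torch.nn.attention.flex_attention import create_block_mask

# The compiled wrapper is created but never called or stored.
torch.compile(create_block_mask)

# Bind this rank to its GPU, then join the NCCL process group.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
torch.distributed.init_process_group("nccl")

# One collective, followed by an explicit process-group shutdown.
dist.broadcast(torch.tensor([1, 2, 3]).cuda(), src=0)
dist.destroy_process_group()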

Versions

PyTorch version: 2.7.1
Is debug build: False
CUDA used to build PyTorch: 12.9
ROCM used to build PyTorch: N/A

GCC version: (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2)
Clang version: 19.1.7 ( 19.1.7-2.module+el8.10.0+750+2c34988e)
CMake version: version 3.28.3
Libc version: glibc-2.28

Python version: 3.12.11 | packaged by Anaconda, Inc. | (main, Jun 5 2025, 13:09:17) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.241-1-tlinux4-0017.7-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 12.9.86
CUDA_MODULE_LOADING set to: LAZY

Nvidia driver version: 535.247.01
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.9.10.2
/usr/lib64/libcudnn_adv.so.9.10.2
/usr/lib64/libcudnn_cnn.so.9.10.2
/usr/lib64/libcudnn_engines_precompiled.so.9.10.2
/usr/lib64/libcudnn_engines_runtime_compiled.so.9.10.2
/usr/lib64/libcudnn_graph.so.9.10.2
/usr/lib64/libcudnn_heuristic.so.9.10.2
/usr/lib64/libcudnn_ops.so.9.10.2
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cudnn-frontend==1.16.0
[pip3] onnx==1.17.0
[pip3] onnx-ir==0.1.9
[pip3] onnxscript==0.3.1
[pip3] torch==2.7.1
[pip3] triton==3.3.1
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cudnn-frontend 1.16.0 pypi_0 pypi
[conda] torch 2.7.1 pypi_0 pypi
[conda] torchtitan 0.1.0 pypi_0 pypi
[conda] triton 3.3.1 pypi_0 pypi

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @xmfan @chauhang @penguinwu @Chillee @drisspg @yanboliang @BoyuanFeng @liangel-02 @howardzhang-cv

extent analysis

TL;DR

The issue does not give a confirmed fix for the intermittent segmentation fault seen when torch.compile(create_block_mask) is combined with distributed communication. The most direct workaround is to drop the torch.compile call if compilation is not needed. Alternatively, upgrade from PyTorch 2.7.1 to a newer release and check whether the fault still reproduces.

Guidance

  • Run the minimal repro repeatedly in your own environment. The fault is intermittent, so a single clean run does not rule it out, while a consistently clean series suggests the problem may be environment-specific.
  • Check the PyTorch documentation and issue tracker for known limitations when combining torch.compile, create_block_mask, and distributed communication.
  • Remove the torch.compile call for create_block_mask and rerun. If the fault disappears and compilation is not required for your use case, this can serve as a workaround.
  • Review the stack trace for other libraries that may be involved. In the reported traces, the main thread is in waitpid under Py_FinalizeEx, and a second thread is waiting inside the OpenBLAS library bundled with NumPy, so version mismatches between these components are worth ruling out.
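
Because the fault is intermittent, the first step is easier to judge with a count than with a single run. The sketch below is one possible way to do that, under stated assumptions: the helper names `tally_exit_codes` and `failure_rate` are hypothetical, and it assumes the launcher returns a nonzero exit code when a worker dies from SIGSEGV.

```python
import subprocess
from collections import Counter


def tally_exit_codes(cmd, runs=20, runner=subprocess.run):
    """Run `cmd` `runs` times and count how often each exit code occurs."""
    tally = Counter()
    for _ in range(runs):
        result = runner(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        tally[result.returncode] += 1
    return tally


def failure_rate(tally):
    """Fraction of runs that ended with a nonzero exit code."""
    total = sum(tally.values())
    return (total - tally[0]) / total if total else 0.0


# Example (not executed here): the launch command from the minimal repro.
# tally = tally_exit_codes(
#     ["torchrun", "--nproc_per_node=8", "--master_port=12345", "reproduce.py"]
# )
# print(tally, failure_rate(tally))
```

Running the same tally with and without the torch.compile line gives a direct comparison of the two failure rates, which is more informative than a single pass or fail.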

Example

The issue does not demonstrate a fix. The fault appears to come from the interaction between torch.compile and distributed communication at process exit, so a code-level fix depends on a root cause that the issue does not establish.

Notes

The issue does not establish a root cause. It shows only that the fault occurs at process exit when torch.compile and an NCCL collective are used in the same program. The suggestions in this analysis are troubleshooting steps meant to narrow down or sidestep the problem, not a confirmed fix.

Recommendation

Apply workaround: Remove the torch.compile call for the create_block_mask function to see if the issue persists, as this could be a temporary solution until a more permanent fix is available or the root cause is fully understood.
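
One way to apply this workaround without deleting the call outright is to gate compilation behind a switch, so it can be turned back on once a fix lands. This is a minimal sketch, not something proposed in the issue: the helper `maybe_compile` and the environment variable `REPRO_COMPILE_BLOCK_MASK` are hypothetical names, and whether skipping the call avoids the fault in a given setup still needs to be verified.

```python
import os


def maybe_compile(fn, enabled=None, compiler=None):
    """Return `fn` compiled if enabled, otherwise `fn` unchanged.

    `enabled` defaults to the hypothetical REPRO_COMPILE_BLOCK_MASK env var.
    `compiler` defaults to torch.compile, imported lazily so that the
    disabled path never calls torch.compile.
    """
    if enabled is None:
        enabled = os.environ.get("REPRO_COMPILE_BLOCK_MASK", "0") == "1"
    if not enabled:
        return fn
    if compiler is None:
        import torch  # deferred: only needed when compilation is requested

        compiler = torch.compile
    return compiler(fn)


# In the repro, the bare torch.compile(create_block_mask) line would become:
# create_block_mask = maybe_compile(create_block_mask)
```

With the variable unset, the function is returned as-is and the program follows the uncompiled path. Setting it to `1` restores the original behavior for comparison runs.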
