pytorch - ✅(Solved) Fix Compiling CppExtensions in parallel causes race condition [1 pull request, 1 participant]

GitHub stats
pytorch/pytorch#177265 · Fetched 2026-04-08 00:42:24
Comments: 0 · Participants: 1 · Timeline: 20 · Reactions: 0
Timeline (top): labeled ×6, mentioned ×6, subscribed ×6, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #178082: Fix ninja workdir collisions in BuildExtension

Description (problem / solution / changelog)

Summary

  • isolate ninja manifests/workdirs for BuildExtension object compilation
  • derive a per-compile ninja subdirectory from the object paths instead of writing directly into the shared setuptools output_dir
  • keep object output paths unchanged so distutils/setuptools behavior stays intact
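
The second bullet above can be illustrated with a minimal sketch. The PR's actual helper is not shown on this page, so the function name `ninja_workdir_for`, the `ninja-` prefix, and the use of a SHA-256 digest are assumptions made purely for illustration; only the idea of deriving a subdirectory from the object paths comes from the summary.

```python
import hashlib
import os


def ninja_workdir_for(output_dir, objects):
    """Return a per-compile ninja directory under output_dir (hypothetical helper).

    The name is derived from the sorted object paths, so the same object set
    always maps to the same directory and different sets map to different ones.
    """
    digest = hashlib.sha256()
    for path in sorted(os.path.normpath(p) for p in objects):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return os.path.join(output_dir, "ninja-" + digest.hexdigest()[:16])


if __name__ == "__main__":
    tmp = "build/temp.linux-x86_64-cpython-312"
    print(ninja_workdir_for(tmp, ["a/op1.o", "a/op2.o"]))
    print(ninja_workdir_for(tmp, ["b/op3.o"]))
```

Because only the ninja manifest location changes in this sketch, the object paths themselves stay where setuptools expects them, which matches the third bullet.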

Problem

When multiple CppExtensions are compiled in parallel with ninja, setuptools can hand them the same temporary output_dir. The current implementation writes build.ninja directly into that shared directory and runs ninja from there, so concurrent extension builds can overwrite each other's manifests.

Testing

  • python3 -m py_compile torch/utils/cpp_extension.py
  • verified the helper yields stable workdirs for the same object set and distinct workdirs for different object sets

Fixes #177265

Changed files

  • torch/utils/cpp_extension.py (modified, +9/-2)

🐛 Describe the bug

DeepSpeed uses custom operators that can be compiled during installation.

It uses some logic in setup.py but ultimately just adds multiple instances of CppExtension to ext_modules which are created at: https://github.com/deepspeedai/DeepSpeed/blob/f88d0f8d4d10d7da913cfb601a0d83323a4fc25e/op_builder/builder.py#L516-L523

When you try compiling those in parallel, e.g. with pip install --config-setting='--build-option=build_ext' --config-setting='--build-option=-j8' ..., setuptools will call the compiler.compile methods of all extensions in parallel.

This method (through e.g. unix_wrap_ninja_compile) eventually ends up writing a ninja file directly into the passed output_dir: https://github.com/pytorch/pytorch/blob/4cce831a21940c74b4ed504532bc09b44c3e95bb/torch/utils/cpp_extension.py#L2332

As that directory is the same for all extensions (a platform-specific folder determined by setuptools, e.g. build/temp.linux-x86_64-cpython-312), multiple processes/threads will a) overwrite each other's Ninja file and b) step on each other's toes during compilation, as results from one compilation process overwrite those of another.
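
The overwrite in (a) can be reproduced in miniature without PyTorch. The sketch below is not PyTorch code: the function names and manifest contents are made up for illustration, and two sequential writes stand in for the concurrent builds. It contrasts a shared directory, where the last writer wins, with per-extension subdirectories, where both manifests survive.

```python
import os
import tempfile


def write_manifest(workdir, ext_name):
    """Write a stand-in build.ninja for ext_name into workdir."""
    os.makedirs(workdir, exist_ok=True)
    path = os.path.join(workdir, "build.ninja")
    with open(path, "w") as f:
        f.write(f"# manifest for {ext_name}\n")
    return path


def read(path):
    with open(path) as f:
        return f.read()


def shared_dir_demo(root):
    """Both extensions write into the same directory: the last writer wins."""
    shared = os.path.join(root, "shared")
    first = write_manifest(shared, "ext_a")
    second = write_manifest(shared, "ext_b")
    return read(first), read(second)


def isolated_dir_demo(root):
    """Each extension gets its own subdirectory: both manifests survive."""
    first = write_manifest(os.path.join(root, "ext_a"), "ext_a")
    second = write_manifest(os.path.join(root, "ext_b"), "ext_b")
    return read(first), read(second)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(shared_dir_demo(root))
        print(isolated_dir_demo(root))
```

In the shared case both paths point at the same file, so the manifest for ext_a is gone by the time anything reads it.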

So to me it looks like PyTorch should create unique subdirectories for extensions.

Versions

Pretty independent of any version.

I was using Python 3.12, PyTorch 2.3, 2.6 and 2.9 with setuptools 70.0.0

cc @malfet @seemethere @janeyx99

extent analysis

Fix Plan

To resolve the issue of concurrent compilation overwriting ninja files, we need to ensure that each extension is compiled in a unique directory.

Here are the steps to achieve this:

  • Modify the torch/utils/cpp_extension.py file to create a unique subdirectory for each extension.
  • Update the unix_wrap_ninja_compile function to use the new unique directory.

Example Code

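import os
import tempfile

# Simplified signature for illustration; the real helper takes the same
# arguments as the distutils compiler's compile() method.
def unix_wrap_ninja_compile(sources, output_dir, **kwargs):
    # Create a unique subdirectory for this extension.
    # mkdtemp picks a fresh name on every call, unlike the stable
    # per-object-set directory described in the PR notes.
    os.makedirs(output_dir, exist_ok=True)
    ext_dir = tempfile.mkdtemp(dir=output_dir)
    # ...
    # Use the new directory to write the ninja file
    with open(os.path.join(ext_dir, 'build.ninja'), 'w') as f:
        f.write('# ninja rules for this extension\n')  # placeholder content
    # ...
    return ext_dir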

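Alternatively, a unique directory could be passed to the compiler.compile method for each extension. In PyTorch, CppExtension is a factory function that returns a setuptools Extension, not a class with a build method, so the wrapper class below is hypothetical and only illustrates the idea. Note that changing output_dir also moves the object files, which the PR deliberately avoids:

import os

class PerExtensionBuilder:  # hypothetical helper, not a PyTorch API
    def __init__(self, compiler, ext):
        self.compiler = compiler  # distutils-style compiler object
        self.ext = ext            # setuptools.Extension instance

    def build(self, build_dir):
        # Create a unique subdirectory for this extension
        ext_dir = os.path.join(build_dir, self.ext.name)
        os.makedirs(ext_dir, exist_ok=True)
        # Compile this extension's sources into its own directory
        return self.compiler.compile(self.ext.sources, output_dir=ext_dir)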

Verification

To verify the fix, compile the extensions in parallel again with the pip install command from the report (including --config-setting='--build-option=-j8'), starting from a clean build directory. If the fix is successful, the compilation should complete without errors.
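
Because a race condition may not trigger on every run, a single successful build is weak evidence. One way to strengthen the check is to repeat the build and count failures. The sketch below is an assumption about how that could be scripted; `count_failures` is a made-up helper and the command shown is a placeholder, not the real build invocation.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__":
    # Placeholder command; substitute the real parallel build invocation
    # and clean the build directory between runs.
    cmd = [sys.executable, "-c", "print('build ok')"]
    print(count_failures(cmd, runs=5))
```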

Extra Tips

  • Make sure to clean up the temporary directories created during compilation to avoid leaving behind unnecessary files.
  • PR #178082, described in the PR fix notes above, already proposes this fix for the PyTorch repository; check its status before maintaining a local patch.
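
The cleanup tip can be sketched with tempfile.TemporaryDirectory, which removes the directory when its context exits. The helper name `build_in_scratch_dir` and the `run_build` callback are hypothetical and stand in for whatever writes the manifest and invokes ninja.

```python
import os
import tempfile


def build_in_scratch_dir(output_dir, run_build):
    """Call run_build(workdir) in a scratch directory that is removed afterwards."""
    os.makedirs(output_dir, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=output_dir) as workdir:
        run_build(workdir)
        created = workdir
    # The context manager has deleted workdir and its contents by this point.
    return created
```

This only suits files that are not needed after the build step, since anything left in the scratch directory is deleted with it.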
