pytorch - 💡(How to fix) Fix Inductor compile-worker subprocess+fork pool initializes CUDA in parent before fork, breaking with upstream Triton 3.7.0+ [1 pull requests]

pytorch · 2026-05-21 02:51:52



🐛 Describe the bug

Inductor's default compile-worker pool initializes CUDA in the manager process (via pre_fork_setup()) and then fork()s the workers. CUDA contexts are process-local, so the forked workers inherit invalid state. In upstream Triton nightly 3.7.0+, the new _cuda_driver_is_active() probe calls cuInit(0) in the worker and gets CUDA_ERROR_NOT_INITIALIZED, so Triton's driver discovery raises RuntimeError: 0 active drivers ([]). The failure reproduces with default PyTorch settings as soon as a user installs the upstream triton wheel instead of pytorch-triton.
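To make the failure mode concrete without a GPU, here is a minimal sketch that simulates process-local driver state. FakeDriver and usable_in_forked_child are hypothetical names for illustration, not CUDA, Triton, or PyTorch APIs: the fake driver records the PID that initialized it and reports itself unusable in any other process, which is roughly how a forked worker sees state that was set up in the manager.

```python
import os


class FakeDriver:
    """Hypothetical stand-in for process-local driver state (not a real CUDA API)."""

    def __init__(self):
        # Remember which process performed the initialization.
        self.owner_pid = os.getpid()

    def is_usable(self):
        # State created in another process is treated as invalid.
        return self.owner_pid == os.getpid()


def usable_in_forked_child(driver):
    """Fork once and report whether `driver` is still usable in the child."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: send a one-byte verdict to the parent and exit immediately.
        os.close(read_fd)
        os.write(write_fd, b"1" if driver.is_usable() else b"0")
        os._exit(0)
    # Parent: collect the verdict and reap the child.
    os.close(write_fd)
    verdict = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return verdict == b"1"


if __name__ == "__main__":
    driver = FakeDriver()  # "initialized in the manager" before the fork
    print("usable in parent:", driver.is_usable())
    print("usable in forked child:", usable_in_forked_child(driver))
```

The sketch only models the ordering problem (initialize, then fork). It says nothing about how the real CUDA driver tracks its state internally.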

To Reproduce

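Self-contained reproducer. It needs only the upstream Triton wheel and the default Inductor worker pool; the script sets TORCHINDUCTOR_COMPILE_THREADS explicitly just to guarantee a pool size above 1, which is already the default on most machines: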

"""repro_inductor_triton_fork_cuda.py"""
import os

# Force inductor to use its compile-worker pool (>1).
# On most machines TORCHINDUCTOR_COMPILE_THREADS is already > 1 by default.
os.environ["TORCHINDUCTOR_COMPILE_THREADS"] = "4"

import torch
import triton


def _a(x): return (x * 2 + 1).relu().sum()
def _b(x): return (x.sin() + x.cos()).pow(2).mean()
def _c(x): return torch.softmax(x @ x.T, dim=-1).sum()
def _d(x): return (x.tanh() * x.exp().clamp(max=10.0)).log1p().mean()


def main():
    print(f"torch  = {torch.__version__}")
    print(f"triton = {triton.__version__}")
    x = torch.randn(2048, 2048, device="cuda")
    for fn in (_a, _b, _c, _d):
        torch.compile(fn, dynamic=False, fullgraph=True)(x)
    torch.cuda.synchronize()
    print("ALL OK")


if __name__ == "__main__":
    main()

Setup

pip install --pre torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/nightly/cu130
pip uninstall -y triton pytorch-triton
pip install --index-url \
    https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ \
    "triton==3.7.0+git40e899b0"

rm -rf ~/.triton/cache /tmp/torchinductor_$USER
python repro_inductor_triton_fork_cuda.py

Necessary conditions (all required)

Upstream NVIDIA Triton wheel (confirmed with 3.7.0+git40e899b0; any version in which _cuda_driver_is_active() calls cuInit is suspect)
TORCHINDUCTOR_COMPILE_THREADS > 1 (the default on most machines)
Cold ~/.triton/cache (with a warm cache, the generated kernel is loaded directly and the worker never invokes set_driver_to_gpu)
Cold /tmp/torchinductor_$USER (same reason)
At least 3 distinct torch.compile units (the failure is a race; with 4 or more it reproduces 100% of the time)
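The locally checkable conditions above can be verified before running the reproducer. The following is a minimal sketch under stated assumptions: check_repro_preconditions and cache_is_cold are hypothetical helper names, "cold" is taken to mean "missing or empty", and the two cache paths are the ones named in this report rather than values read from PyTorch or Triton configuration.

```python
import os
from pathlib import Path


def cache_is_cold(path):
    """Treat a cache directory as cold if it is missing or empty (an assumption)."""
    p = Path(path)
    return not p.exists() or not any(p.iterdir())


def check_repro_preconditions(triton_cache, inductor_cache, env=None):
    """Return a dict mapping condition name to bool for the locally checkable conditions."""
    env = os.environ if env is None else env
    threads = env.get("TORCHINDUCTOR_COMPILE_THREADS")
    return {
        # Unset means Inductor picks its default, which the report says is usually > 1.
        "compile_threads_gt_1": threads is None or int(threads) > 1,
        "triton_cache_cold": cache_is_cold(triton_cache),
        "inductor_cache_cold": cache_is_cold(inductor_cache),
    }


if __name__ == "__main__":
    # Paths taken from the report: ~/.triton/cache and /tmp/torchinductor_$USER.
    status = check_repro_preconditions(
        Path.home() / ".triton" / "cache",
        Path("/tmp") / f"torchinductor_{os.environ.get('USER', '')}",
    )
    for name, ok in status.items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

The Triton version and the number of compile units are not checked here; those depend on the installed wheel and on the script being run.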

Expected behavior

The script prints ALL OK.

Actual behavior

torch._inductor.exc.InductorError: SubprocException: An exception occurred in a subprocess:
...
  File ".../triton/runtime/driver.py", line 20, in _create_driver
    raise RuntimeError(f"{len(active_drivers)} active drivers ({active_drivers}). There should only be one.")
RuntimeError: 0 active drivers ([]). There should only be one.

Root cause

parent (user script)
  └─ subprocess.Popen(...) → manager        # SubprocPool, default TORCHINDUCTOR_WORKER_START=subprocess
       torch/_inductor/compile_worker/__main__.py:
         line 30:  import triton            # preload in the manager (parent of the forked workers)
         line 68:  pre_fork_setup()
                       → caching_device_properties()
                       → Worker.get_device_properties()
                       → INITIALIZES CUDA DRIVER STATE in manager     ← the anti-pattern
       then SubprocPool fork()s N workers
       (--kind=fork is the hard-coded default SubprocKind.FORK in
        torch/_inductor/compile_worker/subproc_pool.py:135)
       └─ forked worker
            inherits broken-after-fork CUDA state
            first line of inductor-generated kernel module is:
                triton_helpers.set_driver_to_gpu()
            → DriverConfig.active → DriverConfig.default → _create_driver()
              → for backend in backends.values(): backend.driver.is_active()
                → NVIDIA backend: _cuda_driver_is_active()
                   ctypes.CDLL("libcuda.so.1")        OK
                   cuInit(0)                           → rc=3  (NOT_INITIALIZED)
                   return False
              → active_drivers == []
            → RuntimeError("0 active drivers ([])")
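The last steps of the trace (probe every backend, then require exactly one active driver) can be sketched in isolation. This is a simplified model, not Triton's source: FakeBackendDriver and create_driver are hypothetical names, the boolean flag stands in for a probe such as cuInit(0) returning 0, and only the error text is copied from the traceback in this report.

```python
class FakeBackendDriver:
    """Hypothetical backend driver; `active` stands in for a real probe result."""

    def __init__(self, name, active):
        self.name = name
        self.active = active

    def is_active(self):
        return self.active

    def __repr__(self):
        return f"<{self.name}>"


def create_driver(backend_drivers):
    """Pick the single active driver, using the error text from the traceback."""
    active_drivers = [d for d in backend_drivers if d.is_active()]
    if len(active_drivers) != 1:
        raise RuntimeError(
            f"{len(active_drivers)} active drivers ({active_drivers}). There should only be one."
        )
    return active_drivers[0]


if __name__ == "__main__":
    # Manager-like case: the probe succeeds, so one driver is selected.
    print(create_driver([FakeBackendDriver("nvidia", True)]))
    # Worker-like case: the probe fails, so discovery finds nothing.
    try:
        create_driver([FakeBackendDriver("nvidia", False)])
    except RuntimeError as exc:
        print("worker:", exc)
```

The model shows why a single failed probe is fatal: with only one candidate backend, a False from is_active() leaves the list empty and the selection step has nothing to fall back on.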

Verified by patching _cuda_driver_is_active to log:

[pid=PARENT  argv=[repro.py]]                                   cuInit(0) rc=0  RESULT=True   # parent OK
[pid=MANAGER argv=[.../compile_worker/__main__.py, --kind=fork]] PATCH installed              # patch loaded in manager
[pid=WORKER  ppid=MANAGER ...]                                   cuInit(0) rc=3  RESULT=False # forked worker fails
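A standalone probe in the same spirit as the logging patch might look like the sketch below. It is not the patch used for the log above: probe_cuinit and format_probe_line are hypothetical helpers, and the only real API assumed is the CUDA driver entry point cuInit(unsigned int), loaded from libcuda.so.1 via ctypes. Run on its own it only reports the state of the current process; to observe the failing case it would have to execute inside a forked compile worker.

```python
import ctypes
import os
import sys


def probe_cuinit(libname="libcuda.so.1"):
    """Call cuInit(0) via ctypes; return the CUresult code, or None if the library is missing."""
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


def format_probe_line(rc):
    """Render one log line in roughly the shape of the log above."""
    result = rc == 0  # 0 is CUDA_SUCCESS; 3 is CUDA_ERROR_NOT_INITIALIZED
    return (
        f"[pid={os.getpid()} ppid={os.getppid()} argv={sys.argv}] "
        f"cuInit(0) rc={rc} RESULT={result}"
    )


if __name__ == "__main__":
    # stderr keeps the probe output separate from the script's normal output.
    print(format_probe_line(probe_cuinit()), file=sys.stderr)
```

Logging the PID and parent PID alongside the return code is what lets the three processes (user script, manager, worker) be told apart in the output.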

Versions

python  3.14
torch   2.13.0.dev20260519+cu130       (PyTorch nightly cu130)
triton  3.7.0+git40e899b0              (upstream NVIDIA Triton nightly, from
                                        https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/)
CUDA    13.0
GPU     NVIDIA A10G (reproduces in CI and locally)
OS      Linux x86_64

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo @bertmaher @int3 @davidberard98 @nmacchioni @embg @peterbell10 @iupaikov-amd
