pytorch - Fix CUDA extension build failure on glibc 2.38+ and GCC 14+ due to math header noexcept mismatch - proposed clean fix


🐛 Describe the bug

Solving the CUDA 12.x + Modern glibc (2.38+) noexcept Redeclaration Failure

Status. Verified end-to-end on Fedora 44, CUDA 12.8 (nvcc), clang 22, GCC 15, Python 3.14 + PyTorch — the extension compiles and the .so loads. The compiler-detection and version-parsing logic is written to generalize across distros (Debian/Ubuntu, Arch, openSUSE, Alpine), but those paths are derived from the same mechanism rather than separately tested; treat them as "should work," not "verified."

Try the cheapest fix first. These two problems are independent — you may only need one:

  • If a CUDA-supported GCC is installable on your system (≤13 for CUDA 12.0–12.3, ≤14 for 12.4+), and your glibc is < 2.38, you likely need nothing in this guide beyond -ccbin g++-13 (or CUDAHOSTCXX=g++-13). Stop there.
  • You need the full workaround below only when you are forced onto GCC ≥ 14 (the EDG built-ins problem, §4 Layers 1–2) and/or you are on glibc ≥ 2.38 (the noexcept math conflict, §4 Layer 3). The two are orthogonal: a supported GCC does not avoid the glibc math conflict, and an old glibc does not avoid the GCC built-ins problem.
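
As a rough aid for the triage above, the following sketch maps a glibc version and a host GCC major version to the §4 layers. The names `layers_needed` and `detect_versions` are hypothetical helpers, not part of the setup.py in §5, and the thresholds (glibc 2.38, GCC 14) are simply the ones this guide states.

```python
import platform
import shutil
import subprocess


def layers_needed(glibc: tuple, gcc_major: int) -> list:
    """Return the section 4 layers this guide suggests for the given versions.

    Thresholds come from this guide: GCC >= 14 -> Layers 1-2, glibc >= 2.38 -> Layer 3.
    """
    layers = []
    if gcc_major >= 14:
        layers += [1, 2]
    if glibc >= (2, 38):
        layers.append(3)
    return layers


def detect_versions(cxx: str = "g++"):
    """Best-effort detection; returns (glibc_tuple, gcc_major) or None."""
    lib, ver = platform.libc_ver()  # e.g. ("glibc", "2.38") on glibc systems
    path = shutil.which(cxx)
    if lib != "glibc" or not ver or path is None:
        return None
    try:
        out = subprocess.run(
            [path, "-dumpversion"], capture_output=True, text=True, check=True
        ).stdout.strip()
        glibc = tuple(int(p) for p in ver.split("."))
        return glibc, int(out.split(".")[0])
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    found = detect_versions()
    print("could not detect versions" if found is None else layers_needed(*found))
```

An empty list means neither problem should apply, so a plain -ccbin selection is likely enough.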

1. The Problem

CUDA 12.x combined with modern glibc (2.38+) and GCC 14+ introduces four interacting problems, labeled Layers 1–4 below, that prevent compilation of CUDA kernels. These are distinct from the three workaround layers in §4:

Layer 1: EDG Parser Limitation
NVCC's EDG front-end (the C++ parser CUDA uses internally) does not recognize the compiler built-ins that GCC 14+ libstdc++ uses to implement standard traits — __is_pointer, __is_volatile, __array_rank, __is_invocable, __builtin_operator_new, __builtin_is_virtual_base_of, and others. Parsing <type_traits>/<functional> from a GCC 14+ installation then fails with a burst of identifier "__is_pointer" is undefined / type name is not allowed errors. Note this is independent of the host compiler version check: --allow-unsupported-compiler silences the check but cannot make EDG understand the built-ins.

Layer 2: glibc math Function Declarations
Starting with glibc 2.38, the file bits/mathcalls.h declares standard math functions like cospi, sinpi, and rsqrt with exception specifications (noexcept(true)), enabled when compiling in C++17 mode. This is correct modern C++ practice.

Layer 3: CUDA crt/math_functions.h Mismatch
CUDA's crt/math_functions.h declares the same functions without exception specifications, so they are treated as potentially throwing. When the compiler sees both declarations in the same translation unit, it rejects the build: C++ does not allow a function to be redeclared with a different exception specification.

Layer 4: Macro Cascade
When glibc's preprocessing machinery renames the conflicting function symbols (via __DECL_SIMD macros), those renames propagate through CUDA's header stack, creating additional unresolved symbol conflicts and circular dependencies.

2. Symptoms

Users will encounter one or more of these errors:

error: declaration of 'double cospi(double)' has a different exception specifier
       than previous declaration 'double cospi(double) noexcept'
error: too many arguments for option -- 'Xcudafe'
/usr/include/x86_64-linux-gnu/bits/mathcalls.h: In function 'XXX':
   error: 'XXX' was not declared in this scope
error: unknown type name '__is_pointer'

These errors often appear even in minimal CUDA files and prevent any CUDA extension from building on systems with glibc 2.38+.

3. Root Cause

The upstream fix NVIDIA should apply is to add exception specifications to crt/math_functions.h to match modern glibc declarations. Until CUDA toolkits are updated, a build-time workaround is required.

The workaround exploits the C++ preprocessor's include-path ordering and macro scoping to prevent the conflicting declarations from being evaluated by the compiler simultaneously.

4. The Workaround: Three-Layer Solution

Which layers do you actually need? Match your failing errors to the layers:

  • identifier "__is_pointer" is undefined, __builtin_operator_new, type name is not allowed from type_traits/functional → GCC built-ins problem → Layers 1 & 2. Only triggered when the host compiler is GCC ≥ 14. If you can install GCC ≤ 13 (or ≤ 14 on CUDA 12.4+), prefer that and skip Layers 1–2 entirely.
  • exception specification is incompatible ... cospi/sinpi/rsqrt ... crt/math_functions.h → glibc noexcept math conflict → Layer 3. Triggered by glibc ≥ 2.38 + CUDA 12.x regardless of compiler.

The setup.py in §5 applies all three automatically and is a no-op for layers your system doesn't trip, so it is safe to use even if you only need one.
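
The matching described above can be sketched as a small log classifier. This is only an illustration: `classify_build_log` is a hypothetical helper, and the patterns are limited to the error fragments quoted in this guide, so real build logs may need more of them.

```python
import re

# Fragments quoted in this guide for the GCC built-ins problem (Layers 1 & 2).
GCC_BUILTINS = re.compile(
    r"__is_pointer|__builtin_operator_new|type name is not allowed"
)
# Math functions named in this guide for the noexcept conflict (Layer 3).
MATH_FUNCS = ("cospi", "sinpi", "rsqrt")


def _is_noexcept_conflict(line: str) -> bool:
    """True if the line mentions an exception specifier and one of MATH_FUNCS."""
    if "exception specif" not in line:
        return False
    return any(re.search(rf"\b{name}\b", line) for name in MATH_FUNCS)


def classify_build_log(log: str) -> set:
    """Return the set of section 4 layers suggested by the errors in a build log."""
    layers = set()
    for line in log.splitlines():
        if GCC_BUILTINS.search(line):
            layers |= {1, 2}
        if _is_noexcept_conflict(line):
            layers.add(3)
    return layers
```

Feeding it the captured stderr of a failed build returns a set such as {3}, which indicates that only the mathcalls.h shim is relevant.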

Layer 1: Use clang++ as NVCC Host Compiler
NVCC always parses device translation units with its own EDG-based front-end; the host compiler is not replaced. What the host compiler does control is the emulation mode EDG runs in — EDG adopts the host compiler's predefined macros, built-ins, and default include paths. When the host is GCC ≥ 14, EDG emulates GCC but does not implement the newer built-ins (__is_pointer, __builtin_operator_new, etc.) that GCC's libstdc++ now uses, so parsing fails. When the host is clang, EDG emulates clang, and clang provides those same built-ins, so the headers parse. Selecting clang++ via -ccbin is therefore what eliminates the __is_pointer family of errors.

Layer 2: Bind clang++ to a GCC ≤15 Toolchain
clang still pulls libstdc++ headers from a GCC installation, and by default it selects the newest GCC present — which may be a version (e.g. GCC 16) that introduces still-newer built-ins such as __builtin_is_virtual_base_of that even clang's emulation does not cover. We pin clang to a compatible GCC toolchain (version 15 or below) via --gcc-install-dir, and we feed that same pinned toolchain's C++ include directories to NVCC's front-end with -isystem so EDG and the host pass read identical headers.
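
To check which GCC installation clang actually picked, one option is to read the "Selected GCC installation:" line that clang prints in verbose mode. The sketch below assumes that output format; `selected_gcc_install` and `query_clang` are hypothetical helper names, and the sample path in the usage note is illustrative only.

```python
import re
import shutil
import subprocess

# clang -v reports the libstdc++ toolchain it chose on a line of this form.
_SELECTED = re.compile(r"^Selected GCC installation:\s*(\S.*)$", re.MULTILINE)


def selected_gcc_install(verbose_output: str):
    """Extract the selected GCC installation path from `clang++ -v` output."""
    match = _SELECTED.search(verbose_output)
    return match.group(1).strip() if match else None


def query_clang(clang: str = "clang++", extra_args=()):
    """Run clang in verbose mode and return the GCC installation it selects.

    Pass the same pin used at build time (e.g. a --gcc-install-dir=... flag)
    in extra_args so the answer reflects the pinned toolchain.
    """
    path = shutil.which(clang)
    if path is None:
        return None
    proc = subprocess.run([path, *extra_args, "-v"], capture_output=True, text=True)
    return selected_gcc_install(proc.stderr)
```

Calling query_clang() with and without the pin shows whether the pin changes the selection; for example, it might report /usr/lib/gcc/x86_64-redhat-linux/15 once pinned.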

Layer 3: Intercept bits/mathcalls.h
Before glibc's bits/mathcalls.h is processed, we insert a compatibility shim that:

  • Renames conflicting function names (e.g., cospi → __compat_cospi__)
  • Suppresses the __DECL_SIMD macro cascade by defining empty __DECL_SIMD_* macros for the renamed symbols
  • Includes the real glibc header via #include_next, which now operates on safe (renamed) symbols
  • Undefines the temporary __DECL_SIMD_* macros and restores the original macro definitions
  • As a result, the renamed symbols never reach CUDA's header declarations, avoiding the conflict entirely
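
The steps above can be sketched as a small generator that produces the shim text from a list of function names and type suffixes. This is a compact alternative to writing the header out by hand, not the exact code used in §5; `make_shim`, `MATH_FUNCS`, and `SUFFIXES` are hypothetical names, while the macro names mirror those in the §5 shim.

```python
MATH_FUNCS = ("cospi", "sinpi", "rsqrt")
# glibc type suffixes covered by the shim: double, float, long double, _FloatN(x).
SUFFIXES = ("", "f", "l", "f32", "f64", "f128", "f32x", "f64x")


def make_shim(funcs=MATH_FUNCS, suffixes=SUFFIXES) -> str:
    """Build the text of the bits/mathcalls.h interception shim."""
    renamed = {f: f"__compat_{f}__" for f in funcs}
    simd = [f"__DECL_SIMD_{new}{s}" for new in renamed.values() for s in suffixes]

    lines = ["/* CUDA/glibc noexcept compatibility shim (auto-generated) */"]
    # 1. Save any existing macros, then rename the conflicting functions.
    lines += [f'#pragma push_macro("{f}")' for f in funcs]
    lines += [f"#define {f}  {new}" for f, new in renamed.items()]
    lines.append("")
    # 2. Define empty SIMD declaration macros for the renamed symbols.
    lines += [f"#define {m}" for m in simd]
    # 3. Pull in the real glibc header from the next include directory.
    lines += ["", "#include_next <bits/mathcalls.h>", ""]
    # 4. Clean up the temporary macros and restore the originals.
    lines += [f"#undef {m}" for m in simd]
    lines.append("")
    lines += [f'#pragma pop_macro("{f}")' for f in reversed(funcs)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_shim())
```

Keeping the function list in one tuple makes it easier to extend the shim if another math function turns out to conflict.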

5. Implementation

The following setup.py applies all three layers dynamically, with no manual configuration required beyond setting an environment variable.

import os
import sys
import shutil
import subprocess
import tempfile
from pathlib import Path
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension, CUDA_HOME

ROOT = Path(__file__).resolve().parent

def _parse_ver(s: str) -> "tuple[int, ...] | None":
    """Parse a version string like '12.3.0' into a tuple of integers.
    
    Returns a tuple of integers for comparison, or None if parsing fails.
    """
    try:
        parts = s.split(".")
        if not parts or not parts[0].isdigit():
            return None
        return tuple(int(p) for p in parts)
    except ValueError:
        return None

def _find_gcc_base() -> "Path | None":
    """Locate the GCC installation directory across all major Linux distributions.
    
    Tries common installation paths for Debian/Ubuntu, Fedora/RHEL, Arch, openSUSE,
    Alpine, and musl-based distros. Falls back to glob matching if none are found.
    
    Returns the Path to the GCC base directory, or None if no installation is found.
    """
    candidates = [
        "/usr/lib/gcc/x86_64-linux-gnu",          # Debian/Ubuntu
        "/usr/lib/gcc/x86_64-redhat-linux",       # Fedora/RHEL
        "/usr/lib/gcc/x86_64-pc-linux-gnu",       # Arch
        "/usr/lib/gcc/x86_64-suse-linux",         # openSUSE
        "/usr/lib/gcc/x86_64-alpine-linux-musl",  # Alpine
        "/usr/lib/gcc/x86_64-linux-musl",         # Void/musl
    ]
    
    for c in candidates:
        p = Path(c)
        if p.is_dir():
            return p
    
    # Fallback: glob for any x86_64 GCC directory
    import glob
    matches = sorted(glob.glob("/usr/lib/gcc/x86_64-*/"))
    return Path(matches[0]) if matches else None

def _get_compiler_search_paths(compiler_path: str, extra_args: "list[str] | None" = None) -> list[str]:
    """Query the host compiler to find its default system include directories.
    
    Runs the compiler preprocessor in verbose mode and parses its output to extract
    standard include paths. This ensures headers are sourced from the correct
    toolchain version without any hardcoding.
    
    IMPORTANT: pass the *same* toolchain-selection flags here that you pass to the
    host compiler at build time (e.g. --gcc-install-dir=...). Clang otherwise
    defaults to the newest GCC installed, which on a mixed system is exactly the
    version whose libstdc++ intrinsics broke the build. The reported include paths
    must match the GCC the host compiler is actually pinned to.
    
    Args:
        compiler_path: Path to the compiler binary (e.g., '/usr/bin/clang++')
        extra_args: Toolchain-selection flags to forward (e.g. the --gcc-install-dir
            pin), so the reported paths correspond to the pinned GCC.
    
    Returns:
        A list of absolute paths to system include directories, or an empty list if
        the compiler cannot be queried.
    """
    try:
        cmd = [compiler_path, *(extra_args or []), "-E", "-x", "c++", "-", "-v"]
        result = subprocess.run(
            cmd,
            input="",
            capture_output=True,
            text=True,
            check=True
        )
        
        paths = []
        in_search_list = False
        for line in result.stderr.splitlines():
            line = line.strip()
            if line == "#include <...> search starts here:":
                in_search_list = True
                continue
            if line == "End of search list.":
                break
            if in_search_list and os.path.isdir(line):
                paths.append(os.path.abspath(line))
        return paths
    except Exception:
        return []

def _find_compatible_host_compiler() -> "tuple[str | None, str | None]":
    """Find a compatible host compiler and GCC toolchain pair.
    
    Prefers GCC 13 or older if available (these don't trigger EDG parser issues).
    Falls back to Clang with an explicitly chosen GCC ≤15 toolchain to avoid
    modern GCC intrinsics that NVCC's EDG front-end cannot parse.
    
    Returns:
        A tuple (compiler_path, gcc_install_dir) where:
          - compiler_path is the path to the C++ compiler binary
          - gcc_install_dir is the path to a compatible GCC toolchain (or None if
            the compiler_path is already GCC ≤15)
    """
    # 1. User override via environment variable
    env_cxx = os.environ.get("CUDAHOSTCXX")
    if env_cxx and shutil.which(env_cxx):
        return env_cxx, None

    # 2. Prefer older GCC versions (11 to 13) which work natively with NVCC
    for ver in ("13", "12", "11"):
        gcc_bin = shutil.which(f"g++-{ver}")
        if gcc_bin:
            return gcc_bin, None

    # 3. Fall back to Clang with a companion GCC ≤15 toolchain
    clang_bin = shutil.which("clang++")
    if clang_bin:
        gcc_base = _find_gcc_base()
        if gcc_base:
            # List all installed GCC versions and find the highest ≤15
            installed_versions = sorted(
                [p.name for p in gcc_base.iterdir() if p.is_dir()],
                key=lambda v: _parse_ver(v) or (0,)
            )
            compatible_versions = [
                v for v in installed_versions
                if _parse_ver(v) and _parse_ver(v)[0] <= 15
            ]
            if compatible_versions:
                target_ver = compatible_versions[-1]
                return clang_bin, str(gcc_base / target_ver)

    # 4. Final fallback: use system default compiler
    default_cxx = shutil.which("g++")
    return default_cxx, None

def build_extensions():
    """Build CUDA extensions with automatic compatibility handling.
    
    Checks if CUDA_EXT_ENABLE environment variable is set to "1". If not,
    returns an empty extension list (CUDA support is optional).
    
    Creates a temporary compatibility shim directory, generates the bits/mathcalls.h
    wrapper, configures the NVCC compiler with compatibility flags, and returns
    a CUDAExtension configured to use these settings.
    """
    use_cuda = os.environ.get("CUDA_EXT_ENABLE", "0") == "1"
    if not (use_cuda and CUDA_HOME):
        return []

    host_cxx, gcc_install_dir = _find_compatible_host_compiler()

    # Create a directory for generated files under the system temp dir (usually
    # /tmp, whose path contains no spaces). Use the PID to avoid collisions in
    # multi-process builds.
    compat_root = Path(tempfile.gettempdir()) / f"nvcc_compat_{os.getpid()}"
    compat_bits = compat_root / "bits"
    compat_bits.mkdir(parents=True, exist_ok=True)
    
    # Write the mathcalls.h interception shim
    # This renames conflicting math functions and suppresses __DECL_SIMD macros
    # to prevent redeclaration errors when glibc headers meet CUDA headers
    shim_content = (
        "/* CUDA/glibc noexcept compatibility shim — auto-generated */\n"
        "#pragma push_macro(\"cospi\")\n"
        "#pragma push_macro(\"sinpi\")\n"
        "#pragma push_macro(\"rsqrt\")\n"
        "#define cospi  __compat_cospi__\n"
        "#define sinpi  __compat_sinpi__\n"
        "#define rsqrt  __compat_rsqrt__\n"
        "\n"
        "#define __DECL_SIMD___compat_cospi__\n"
        "#define __DECL_SIMD___compat_cospi__f\n"
        "#define __DECL_SIMD___compat_cospi__l\n"
        "#define __DECL_SIMD___compat_cospi__f32\n"
        "#define __DECL_SIMD___compat_cospi__f64\n"
        "#define __DECL_SIMD___compat_cospi__f128\n"
        "#define __DECL_SIMD___compat_cospi__f32x\n"
        "#define __DECL_SIMD___compat_cospi__f64x\n"
        "#define __DECL_SIMD___compat_sinpi__\n"
        "#define __DECL_SIMD___compat_sinpi__f\n"
        "#define __DECL_SIMD___compat_sinpi__l\n"
        "#define __DECL_SIMD___compat_sinpi__f32\n"
        "#define __DECL_SIMD___compat_sinpi__f64\n"
        "#define __DECL_SIMD___compat_sinpi__f128\n"
        "#define __DECL_SIMD___compat_sinpi__f32x\n"
        "#define __DECL_SIMD___compat_sinpi__f64x\n"
        "#define __DECL_SIMD___compat_rsqrt__\n"
        "#define __DECL_SIMD___compat_rsqrt__f\n"
        "#define __DECL_SIMD___compat_rsqrt__l\n"
        "#define __DECL_SIMD___compat_rsqrt__f32\n"
        "#define __DECL_SIMD___compat_rsqrt__f64\n"
        "#define __DECL_SIMD___compat_rsqrt__f128\n"
        "#define __DECL_SIMD___compat_rsqrt__f32x\n"
        "#define __DECL_SIMD___compat_rsqrt__f64x\n"
        "\n"
        "#include_next <bits/mathcalls.h>\n"
        "\n"
        "#undef __DECL_SIMD___compat_cospi__\n"
        "#undef __DECL_SIMD___compat_cospi__f\n"
        "#undef __DECL_SIMD___compat_cospi__l\n"
        "#undef __DECL_SIMD___compat_cospi__f32\n"
        "#undef __DECL_SIMD___compat_cospi__f64\n"
        "#undef __DECL_SIMD___compat_cospi__f128\n"
        "#undef __DECL_SIMD___compat_cospi__f32x\n"
        "#undef __DECL_SIMD___compat_cospi__f64x\n"
        "#undef __DECL_SIMD___compat_sinpi__\n"
        "#undef __DECL_SIMD___compat_sinpi__f\n"
        "#undef __DECL_SIMD___compat_sinpi__l\n"
        "#undef __DECL_SIMD___compat_sinpi__f32\n"
        "#undef __DECL_SIMD___compat_sinpi__f64\n"
        "#undef __DECL_SIMD___compat_sinpi__f128\n"
        "#undef __DECL_SIMD___compat_sinpi__f32x\n"
        "#undef __DECL_SIMD___compat_sinpi__f64x\n"
        "#undef __DECL_SIMD___compat_rsqrt__\n"
        "#undef __DECL_SIMD___compat_rsqrt__f\n"
        "#undef __DECL_SIMD___compat_rsqrt__l\n"
        "#undef __DECL_SIMD___compat_rsqrt__f32\n"
        "#undef __DECL_SIMD___compat_rsqrt__f64\n"
        "#undef __DECL_SIMD___compat_rsqrt__f128\n"
        "#undef __DECL_SIMD___compat_rsqrt__f32x\n"
        "#undef __DECL_SIMD___compat_rsqrt__f64x\n"
        "\n"
        "#pragma pop_macro(\"rsqrt\")\n"
        "#pragma pop_macro(\"sinpi\")\n"
        "#pragma pop_macro(\"cospi\")\n"
    )
    (compat_bits / "mathcalls.h").write_text(shim_content)

    nvcc_args = [
        "-O3",
        "--use_fast_math",
        "-std=c++17",
        "--expt-relaxed-constexpr",
        "--allow-unsupported-compiler",
        f"-I{compat_root}"
    ]

    # Specify target architectures
    # Adjust these based on your target GPUs (sm_80 for A100, sm_90 for H100, etc.)
    nvcc_args += [
        "-gencode", "arch=compute_80,code=sm_80",
        "-gencode", "arch=compute_89,code=sm_89",
        "-gencode", "arch=compute_90,code=sm_90",
    ]

    # If Clang is selected with a companion GCC toolchain, create a wrapper script
    if host_cxx and "clang" in host_cxx and gcc_install_dir:
        wrapper_path = compat_root / "clang_host_wrapper.sh"
        wrapper_script = (
            "#!/bin/sh\n"
            f'exec "{host_cxx}" --gcc-install-dir="{gcc_install_dir}" "$@"\n'
        )
        wrapper_path.write_text(wrapper_script)
        wrapper_path.chmod(0o755)
        nvcc_args += ["-ccbin", str(wrapper_path)]
        
        # Inject the companion GCC toolchain's C++ include paths into NVCC's own
        # frontend. These MUST be queried with the same --gcc-install-dir pin the
        # wrapper applies, otherwise clang reports its default (newest) GCC headers
        # and NVCC's frontend parses libstdc++ intrinsics from the wrong GCC version.
        gcc_dir_flag = [f"--gcc-install-dir={gcc_install_dir}"]
        for sys_path in _get_compiler_search_paths(host_cxx, gcc_dir_flag):
            if "c++" in sys_path:
                nvcc_args += ["-isystem", sys_path]
    elif host_cxx:
        nvcc_args += ["-ccbin", host_cxx]

    # Build the CUDA extension
    # <PLACEHOLDER>: Update the name, sources, and include_dirs to match your project
    return [
        CUDAExtension(
            name="<my_package>.cuda_ext._ops",  # e.g., "my_project.cuda_ext._ops"
            sources=[
                str(ROOT / "bindings.cpp"),      # C++ bindings to CUDA kernels
                str(ROOT / "kernel.cu"),         # CUDA kernel source
            ],
            extra_compile_args={
                "cxx": ["-O3", "-std=c++17"],
                "nvcc": nvcc_args,
            },
            include_dirs=[str(ROOT)],
        )
    ]

setup(
    name="<my_package_cuda_ext>",  # e.g., "my_project_cuda_ext"
    version="0.0.0",
    ext_modules=build_extensions(),
    cmdclass={"build_ext": BuildExtension.with_options(use_ninja=False)},
)

6. Build Command

To enable CUDA support and build the extension:

export CUDA_EXT_ENABLE=1
python setup.py build_ext --inplace

To disable CUDA support (builds without CUDA code):

export CUDA_EXT_ENABLE=0
python setup.py build_ext --inplace

Or simply omit the environment variable entirely; it defaults to "0".
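
For illustration, the opt-in switch described above could be read like this. This is a minimal sketch: the helper name cuda_ext_enabled is hypothetical, and treating only the exact string "1" as enabled is an assumption, since the script's own check is not shown in this section.

```python
import os


def cuda_ext_enabled(environ=os.environ):
    """Return True only when CUDA_EXT_ENABLE is set to the string "1".

    Hypothetical helper: an unset variable falls back to "0", which
    matches the documented default of building without CUDA code.
    """
    return environ.get("CUDA_EXT_ENABLE", "0") == "1"


if __name__ == "__main__":
    print("CUDA extension enabled:", cuda_ext_enabled())
```

With a gate like this, build_extensions() can return an empty list when the variable is unset or "0", so setup() runs without invoking nvcc at all.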

7. Why This Works (Technical Detail)

The #include_next Mechanism
The C++ preprocessor processes #include <bits/mathcalls.h> by searching the include path in order. When we inject our compatibility shim directory first (via -I<compat_root>), the preprocessor finds our bits/mathcalls.h wrapper instead of the system one.

Our wrapper:

  1. Renames conflicting symbols using macros (cospi → __compat_cospi__)
  2. Stubs out the __DECL_SIMD_* macros to prevent automatic vectorization declarations that would otherwise propagate renamed symbols downstream
  3. Calls #include_next <bits/mathcalls.h>, which causes the preprocessor to resume searching after our wrapper and find the real glibc header
  4. The real header's declarations and macro expansions now operate on the renamed symbols (which glibc never heard of), so no redeclaration conflict occurs
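  5. The #pragma push_macro/pop_macro pair restores whatever definition cospi, sinpi, and rsqrt had before the shim (normally none), so CUDA code can still use the plain names; they now resolve to CUDA's own declarations, while glibc's copies exist only under the renamed identifiers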
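
The rename, include_next, and restore steps can be sketched as a small generator for the wrapper text. This is a simplified illustration, not the script's actual shim: the function name make_shim is hypothetical, and the __DECL_SIMD_* stubs from step 2 are deliberately left out to keep the shape visible.

```python
def make_shim(symbols=("cospi", "sinpi", "rsqrt")):
    """Build illustrative text for a bits/mathcalls.h wrapper.

    Hypothetical helper covering steps 1, 3 and 5 only; the real shim
    also defines and later undefines the __DECL_SIMD_* stubs.
    """
    lines = []
    for name in symbols:
        # Save any existing macro state, then rename the symbol.
        lines.append(f'#pragma push_macro("{name}")')
        lines.append(f"#undef {name}")
        lines.append(f"#define {name} __compat_{name}__")
    # Resume the include search after this directory to reach glibc's header.
    lines.append("#include_next <bits/mathcalls.h>")
    for name in reversed(symbols):
        # Restore the saved macro state so the plain name is usable again.
        lines.append(f'#pragma pop_macro("{name}")')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_shim())
```

Because glibc builds the float and long double names by pasting a suffix onto the already expanded identifier, one macro per base name also covers variants such as __compat_cospi__f.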

Clang + GCC Toolchain
NVCC's EDG front-end emulates whichever host compiler is passed via -ccbin. With clang as that host, EDG inherits clang's built-ins and accepts the GCC 14+ libstdc++ intrinsics (__is_pointer and friends) that a GCC-emulated EDG rejects. Pinning clang to a GCC ≤15 toolchain with --gcc-install-dir, and passing that toolchain's C++ headers to NVCC via -isystem, keeps still-newer GCC 16 intrinsics (e.g. __builtin_is_virtual_base_of) out of both the front-end parse and the host compile. The -isystem paths must be derived from clang with the same --gcc-install-dir pin applied — querying an unpinned clang reports its default (newest) GCC headers and silently reintroduces the very version you are trying to avoid.
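
To make the "same pin" requirement concrete, here is a hedged sketch of how include search paths can be read from a compiler's verbose output. The names parse_search_paths and query_search_paths are hypothetical stand-ins; the script's own _get_compiler_search_paths is not shown in this section and may differ. The sketch assumes the GCC/clang convention of listing directories between "#include <...> search starts here:" and "End of search list." on stderr.

```python
import subprocess


def parse_search_paths(verbose_output):
    """Extract the <...> include directories from `-E -v` driver output."""
    paths, collecting = [], False
    for line in verbose_output.splitlines():
        if line.startswith("#include <...> search starts here:"):
            collecting = True
        elif line.startswith("End of search list."):
            break
        elif collecting:
            paths.append(line.strip())
    return paths


def query_search_paths(compiler, extra_flags=()):
    """Run the compiler on empty C++ input and parse its search list.

    extra_flags must carry the same --gcc-install-dir pin that the
    wrapper script applies, or the default toolchain is reported.
    """
    proc = subprocess.run(
        [compiler, *extra_flags, "-E", "-x", "c++", "-", "-v"],
        input="", capture_output=True, text=True, check=False,
    )
    return parse_search_paths(proc.stderr)
```

Calling query_search_paths with and without the pin flag on a machine with two GCC installations is a quick way to see which libstdc++ headers NVCC's front-end would actually receive.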

Temporary Directory Isolation
Using /tmp/nvcc_compat_<pid> places the generated wrapper script and shim in a location that is normally writable and free of spaces, even when the project root contains special characters or sits on a restricted mount (Docker volume, NFS, etc.). The PID suffix keeps concurrent builds from sharing a directory.
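
As an alternative to a hand-built /tmp/nvcc_compat_<pid> path, the standard library's tempfile.mkdtemp can create the directory. This is a swapped-in technique, not what the script above does, and the helper name make_compat_root is hypothetical.

```python
import tempfile
from pathlib import Path


def make_compat_root(prefix="nvcc_compat_"):
    """Create a private scratch directory for the shim and wrapper script.

    mkdtemp picks a unique name under the directory that tempfile
    selects (TMPDIR if set) and creates it accessible only to the
    current user.
    """
    root = Path(tempfile.mkdtemp(prefix=prefix))
    (root / "bits").mkdir()  # mirrors the bits/mathcalls.h layout of the shim
    return root


if __name__ == "__main__":
    print("Shim directory:", make_compat_root())
```

The directory is not removed automatically. It has to outlive the nvcc invocations, so any cleanup belongs after the build finishes.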

Version Parsing
The _parse_ver() function handles dotted numeric version strings like "12.3.0" or "15.0.1" by splitting on dots and converting each component to an integer. Comparing the resulting integer sequences orders GCC versions correctly, whereas a plain string comparison would sort "9.5.0" after "13.2.0".
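
A minimal sketch of this kind of parsing follows. The name parse_version_tuple is hypothetical and this is not the script's _parse_ver; dropping non-numeric suffixes such as "-rc1" is an assumption made for the example.

```python
def parse_version_tuple(text):
    """Turn a dotted version string into a tuple of ints for comparison.

    Hypothetical helper: each component keeps only its leading digits,
    and parsing stops at the first component with no leading digit.
    """
    parts = []
    for piece in text.strip().split("."):
        digits = ""
        for ch in piece:
            if ch not in "0123456789":
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


if __name__ == "__main__":
    # Tuples compare numerically, component by component.
    print(parse_version_tuple("13.2.0") > parse_version_tuple("9.5.0"))
    # Strings compare character by character, which misorders these.
    print("13.2.0" > "9.5.0")
```

Tuples also compare sensibly when lengths differ, so a bare major version such as (14,) can be checked against a full (14, 2, 1) without padding.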


Appendix: Troubleshooting

Error: "clang: error: unsupported option '--gcc-install-dir'"
Your Clang predates the option, which was added in Clang 16. Upgrade Clang or set CUDAHOSTCXX=g++-13 to force GCC.

Error: "no GCC installation found"
Ensure GCC is installed. On Debian/Ubuntu: sudo apt install g++-13. On Fedora: sudo dnf install gcc-c++. The script will auto-detect it.

Error: "too many arguments for option -- 'Xcudafe'"
You are likely running an older version of this script that includes the -Xcudafe flags. Update to the latest version above, which has removed those flags.

Error: "declaration ... has a different exception specifier"
The shim was not injected, most likely because the -I<compat_root> flag is missing from the nvcc arguments. Confirm that build_extensions() is being called. To check, add print(f"Shim at: {compat_root}") before the function returns the extension list and verify that the path appears in your build output.

Permission Denied on Wrapper Script
The wrapper_path.chmod(0o755) call sets the execute bit, but the script still cannot run if the temp directory's filesystem is mounted noexec. Point TMPDIR at a directory on a filesystem that permits execution: export TMPDIR=$HOME/.tmp; mkdir -p $TMPDIR; python setup.py build_ext --inplace. This helps only if the script derives compat_root from TMPDIR rather than from a hard-coded /tmp path.

Versions

Collecting environment information...
PyTorch version: 2.10.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Fedora Linux 44 (Workstation Edition) (x86_64)
GCC version: (GCC) 16.1.1 20260515 (Red Hat 16.1.1-2)
Clang version: 22.1.6 (Fedora 22.1.6-1.fc44)
CMake version: version 4.3.0
Libc version: glibc-2.43

Python version: 3.14.4 (main, Apr 16 2026, 00:00:00) [GCC 16.0.1 20260321 (Red Hat 16.0.1-0)] (64-bit runtime)
Python platform: Linux-7.0.9-205.fc44.x86_64-x86_64-with-glibc2.43
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5060 Ti
Nvidia driver version: 595.71.05
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 3
CPU(s) scaling MHz: 93%
CPU max MHz: 4300,0000
CPU min MHz: 800,0000
BogoMIPS: 5799,77
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 1,5 MiB (6 instances)
L3 cache: 12 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Ghostwrite: Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Old microcode: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Mitigation; IBPB before exit to userspace

Versions of relevant libraries:
[pip3] numpy==2.4.5
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx==13.0.85
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] torch==2.10.0
[pip3] torchao==0.17.0
[pip3] torchvision==0.25.0+cu128
[pip3] triton==3.6.0
[conda] Could not collect

cc @malfet @janeyx99 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia
