pytorch - ✅(Solved) Fix torch.utils.cpp_extension.load should support CUDA device linking (-dlink) similar to CUDAExtension [1 pull request, 1 participant]

GitHub stats

pytorch/pytorch#180762 · Fetched 2026-04-19 15:03:50

  • Comments: 0
  • Participants: 1
  • Timeline: 63
  • Reactions: 0
  • Timeline (top): mentioned ×28, subscribed ×28, labeled ×6, cross-referenced ×1

Root Cause

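torch.utils.cpp_extension.CUDAExtension already supports relocatable device code and device linking, but torch.utils.cpp_extension.load() exposes no equivalent option. This leaves a feature gap between the AOT/setuptools path and the JIT path, so exposing similar functionality in load() is the most natural solution.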

Fix Action

Fix / Workaround

Today, users who want the convenience of torch.utils.cpp_extension.load() cannot express the device-linking workflow without one of the following workarounds:

  1. Patch torch/utils/cpp_extension.py locally.
    • This is fragile and not maintainable across PyTorch upgrades.
  2. Move to a different build path, such as setuptools with CUDAExtension.
    • This loses the convenience of JIT-style workflows.

PR fix notes

PR #180764: cpp_extension: add CUDA dlink flags to JIT load APIs

Description (problem / solution / changelog)

Summary

This PR adds an opt-in CUDA device-link path for JIT cpp extension builds.

Today, torch.utils.cpp_extension.CUDAExtension supports RDC/device-link workflows, but torch.utils.cpp_extension.load() and load_inline() do not expose an equivalent capability. This change adds a JIT-side API for forwarding CUDA device-link flags through the ninja generation path.

Changes

  • add extra_cuda_dlink_cflags to torch.utils.cpp_extension.load()
  • add extra_cuda_dlink_cflags to torch.utils.cpp_extension.load_inline()
  • include the new argument in JIT extension versioning
  • plumb CUDA device-link flags through the JIT ninja build path
  • add a JIT test for a multi-CU CUDA extension that requires a device-link step

Notes

  • this is intentionally opt-in, so existing JIT extension behavior remains unchanged by default
  • the device-link arch flags follow extra_cuda_cflags so compile and dlink steps use consistent CUDA arch semantics
  • this is a lower-level flags-based API, analogous to the existing nvcc_dlink pathway in CUDAExtension
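
The opt-in behavior described above can be sketched as follows. This is only an illustration under stated assumptions: extra_cuda_dlink_cflags is the argument name proposed in this PR and is not part of a released torch.utils.cpp_extension.load(); the helper names jit_load_kwargs and load_with_dlink are hypothetical; and the PR description does not say which flags the list must contain, so the flags are left to the caller.

```python
from typing import Any, Dict, List, Optional


def jit_load_kwargs(
    name: str,
    sources: List[str],
    dlink_flags: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for torch.utils.cpp_extension.load().

    Hypothetical helper: the device-link key is added only when the
    caller opts in, mirroring the opt-in design described in the PR.
    """
    kwargs: Dict[str, Any] = {
        "name": name,
        "sources": sources,
        # -rdc=true asks nvcc for relocatable device code, which is
        # what makes a separate device-link step necessary.
        "extra_cuda_cflags": ["-rdc=true"],
    }
    if dlink_flags is not None:
        # Assumption: argument name as proposed in PR #180764.
        kwargs["extra_cuda_dlink_cflags"] = list(dlink_flags)
    return kwargs


def load_with_dlink(name: str, sources: List[str], dlink_flags: List[str]):
    """Call load() with the proposed argument (needs a PyTorch build with the PR)."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    from torch.utils.cpp_extension import load

    return load(**jit_load_kwargs(name, sources, dlink_flags))
```

Leaving dlink_flags as None produces exactly the arguments an existing load() call would use, which is how the default behavior stays unchanged.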

Why draft

I wanted feedback on whether the JIT-side API should stay as a low-level flags entry point (extra_cuda_dlink_cflags) or be reshaped into a higher-level dlink=True style API in a follow-up.

Related

Refs pytorch/pytorch#180762

Changed files

  • test/test_cpp_extensions_jit.py (modified, +21/-0)
  • torch/utils/cpp_extension.py (modified, +54/-2)

🚀 The feature, motivation and pitch

torch.utils.cpp_extension.CUDAExtension already supports relocatable device code / device linking workflows (for example, via dlink=True, dlink_libraries, and nvcc_dlink), but torch.utils.cpp_extension.load() does not expose an equivalent capability.

This creates a feature gap between the AOT/setuptools path and the JIT path.

Motivation:

  • Some CUDA extensions require an extra device-link step (-dlink) before the final host link step.
  • Typical examples include:
    • CUDA dynamic parallelism
    • device symbol references that span multiple compilation units (multi-CU)
    • linking against static CUDA libraries that contain RDC-compiled objects
  • These use cases are already recognized by PyTorch and documented for CUDAExtension, but they are not available through load() / JIT extensions.
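
To make the extra step concrete, the sketch below builds the nvcc command sequence for a small RDC project: compile each source with relocatable device code, run the device link, then do the final host link. The function name rdc_build_commands and the file names are hypothetical, and this is not how PyTorch generates its ninja rules; it only shows where -dlink sits in the sequence.

```python
from typing import List


def rdc_build_commands(sources: List[str], output: str) -> List[List[str]]:
    """Return nvcc commands for an RDC build ending in a shared library."""
    objects = [src.rsplit(".", 1)[0] + ".o" for src in sources]
    commands: List[List[str]] = []

    # Step 1: compile each .cu file to an object with relocatable
    # device code (-dc is nvcc's --device-c).
    for src, obj in zip(sources, objects):
        commands.append(["nvcc", "-dc", "-Xcompiler", "-fPIC", src, "-o", obj])

    # Step 2: device link. nvcc resolves device symbols across all
    # objects and emits one extra object holding the linked device code.
    dlink_obj = "dlink.o"
    commands.append(
        ["nvcc", "-dlink", "-Xcompiler", "-fPIC", *objects, "-o", dlink_obj]
    )

    # Step 3: host link of the original objects plus the device-link object.
    commands.append(["nvcc", "-shared", *objects, dlink_obj, "-o", output])
    return commands


if __name__ == "__main__":
    for cmd in rdc_build_commands(["kernel_a.cu", "kernel_b.cu"], "ext.so"):
        print(" ".join(cmd))
```

Without step 2, a device function defined in one source and called from another is left unresolved at the host link, which is the situation the listed use cases run into.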

Today, users who want the convenience of torch.utils.cpp_extension.load() cannot express the same device-linking workflow without patching cpp_extension.py locally or moving to a different build path.

I would like PyTorch to add support for CUDA device linking in torch.utils.cpp_extension.load() (and potentially load_inline() if appropriate), ideally with an API that is consistent with or analogous to CUDAExtension.

Related context:

  • #57543 requested CUDA dynamic parallelism / device linking support for custom PyTorch CUDA extensions
  • #78225 added relocatable device code linking support for CUDAExtension
  • #44279 shows user confusion around how to achieve this in cpp extensions, especially for JIT-like workflows

Expected outcome:

  • JIT cpp extensions should be able to trigger the extra CUDA device-link step when needed
  • The behavior should remain opt-in and preserve current defaults for existing users

Alternatives

Current alternatives are not ideal:

  1. Switch from torch.utils.cpp_extension.load() to the setuptools / CUDAExtension path.

    • This loses the convenience of JIT-style workflows.
    • It adds packaging/build-system overhead for users who only want runtime compilation.
  2. Patch torch/utils/cpp_extension.py locally.

    • This is fragile and not maintainable across PyTorch upgrades.
  3. Manually reimplement the build flow with custom ninja / cmake logic.

    • This defeats the purpose of using PyTorch's extension helpers.
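
For comparison, alternative 1 looks roughly like the sketch below, using the dlink=True option that CUDAExtension already provides. The package name, source file names, and the helper functions cuda_extension_kwargs and run_setup are hypothetical; a real setup.py would call run_setup() at module level.

```python
from typing import Any, Dict


def cuda_extension_kwargs() -> Dict[str, Any]:
    """Arguments for CUDAExtension with device linking enabled."""
    return {
        "name": "my_ext",  # hypothetical extension name
        "sources": ["binding.cpp", "kernel_a.cu", "kernel_b.cu"],
        # Ask CUDAExtension to run the extra device-link step.
        "dlink": True,
        "extra_compile_args": {
            "cxx": ["-O2"],
            # Relocatable device code is required for device linking.
            "nvcc": ["-O2", "-rdc=true"],
        },
    }


def run_setup() -> None:
    """Invoke setuptools; requires PyTorch and a CUDA toolchain."""
    # Imported lazily so the kwargs above can be inspected without PyTorch.
    from setuptools import setup
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    setup(
        name="my_ext",
        ext_modules=[CUDAExtension(**cuda_extension_kwargs())],
        cmdclass={"build_ext": BuildExtension},
    )
```

This works today, but it needs a setup.py and an install or build step, which is the packaging overhead the issue wants to avoid for runtime compilation.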

Because CUDAExtension already supports this class of workflows, exposing similar functionality in load() seems like the most natural solution.

Additional context

This request is not asking for RDC/device-link support in general from scratch, because that already exists for CUDAExtension. Instead, this issue is specifically about feature parity for the JIT path:

  • torch.utils.cpp_extension.load()
  • potentially torch.utils.cpp_extension.load_inline()

A possible implementation might expose:

  • a higher-level option such as dlink=True
  • or a lower-level way to pass nvcc device-link flags

I am intentionally not prescribing the exact API here; the main request is to support the extra CUDA device-link step in the JIT extension flow.

cc @janeyx99 @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia

extent analysis

TL;DR

To address the feature gap, PyTorch should add support for CUDA device linking in torch.utils.cpp_extension.load() with an API consistent with CUDAExtension.

Guidance

  • Investigate adding a dlink parameter to torch.utils.cpp_extension.load() to enable device linking, similar to CUDAExtension.
  • Consider exposing lower-level options to pass custom nvcc device-link flags for more fine-grained control.
  • Review the implementation of CUDAExtension to ensure consistency and feature parity with the proposed load() changes.
  • Evaluate the need to extend this support to torch.utils.cpp_extension.load_inline() based on user requirements and use cases.

Notes

The proposed solution builds upon existing support for relocatable device code and device linking in CUDAExtension, aiming to provide a consistent API for both AOT/setuptools and JIT paths.

Recommendation

Until the feature lands in torch.utils.cpp_extension.load() (draft PR #180764 proposes it), work around the gap by patching cpp_extension.py locally or by switching to the setuptools/CUDAExtension path. Both options let users build device-linked extensions today, at the cost of maintainability in the first case and JIT convenience in the second.
