pytorch - ✅(Solved) Fix torch.utils.cpp_extension.load should support CUDA device linking (-dlink) similar to CUDAExtension [1 pull requests, 1 participants]

pytorch 2026-04-19 06:47:54


pytorch/pytorch#180762•Fetched 2026-04-19 15:03:50


Author

Thinkin999

Participants

Thinkin999

Timeline (top)

mentioned ×28, subscribed ×28, labeled ×6, cross-referenced ×1

Root Cause

Because CUDAExtension already supports this class of workflows, exposing similar functionality in load() seems like the most natural solution.

Fix Action

Fix / Workaround

Today, users who want the convenience of torch.utils.cpp_extension.load() cannot express the same device-linking workflow without patching cpp_extension.py locally or moving to a different build path.

Patch torch/utils/cpp_extension.py locally.
- This is fragile and not maintainable across PyTorch upgrades.

PR fix notes

PR #180764: cpp_extension: add CUDA dlink flags to JIT load APIs

Repository: pytorch/pytorch
Author: Thinkin999
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/180764

Description (problem / solution / changelog)

Summary

This PR adds an opt-in CUDA device-link path for JIT cpp extension builds.

Today, torch.utils.cpp_extension.CUDAExtension supports RDC/device-link workflows, but torch.utils.cpp_extension.load() and load_inline() do not expose an equivalent capability. This change adds a JIT-side API for forwarding CUDA device-link flags through the ninja generation path.

Changes

add extra_cuda_dlink_cflags to torch.utils.cpp_extension.load()
add extra_cuda_dlink_cflags to torch.utils.cpp_extension.load_inline()
include the new argument in JIT extension versioning
plumb CUDA device-link flags through the JIT ninja build path
add a JIT test for a multi-CU CUDA extension that requires a device-link step
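Including the new argument in JIT extension versioning matters because a change in device-link flags should trigger a rebuild rather than reuse a cached module. The sketch below illustrates that idea only; `build_signature` is a hypothetical helper, not the function used in torch/utils/cpp_extension.py, and PyTorch's actual versioning logic may differ in detail.

```python
import hashlib
import json


def build_signature(sources, extra_cuda_cflags=None, extra_cuda_dlink_cflags=None):
    """Return a stable hex digest over the build inputs (hypothetical helper)."""
    payload = {
        # Source order does not affect the signature in this sketch.
        "sources": sorted(sources),
        # Flag order is preserved because it can change the build.
        "extra_cuda_cflags": list(extra_cuda_cflags or []),
        # None and [] hash identically, so the default stays unchanged.
        "extra_cuda_dlink_cflags": list(extra_cuda_dlink_cflags or []),
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

With this shape, passing device-link flags yields a different signature than omitting them, while callers who never pass the argument keep the same signature as before.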

Notes

this is intentionally opt-in, so existing JIT extension behavior remains unchanged by default
the device-link arch flags follow extra_cuda_cflags so compile and dlink steps use consistent CUDA arch semantics
this is a lower-level flags-based API, analogous to the existing nvcc_dlink pathway in CUDAExtension
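One way to read "the device-link arch flags follow extra_cuda_cflags" is that any architecture selection given at compile time is repeated for the device-link step. The following is a minimal sketch under that assumption; `arch_flags_for_dlink` is a hypothetical name, and the PR may implement this differently.

```python
# nvcc accepts arch selection either as "-flag=value" or as "-flag value".
_ARCH_FLAG_NAMES = ("-gencode", "--generate-code", "-arch", "--gpu-architecture")


def arch_flags_for_dlink(extra_cuda_cflags):
    """Collect arch-selection flags from the compile flags (hypothetical helper)."""
    flags = list(extra_cuda_cflags)
    arch_flags = []
    i = 0
    while i < len(flags):
        flag = flags[i]
        if any(flag.startswith(name + "=") for name in _ARCH_FLAG_NAMES):
            # Joined form, e.g. -gencode=arch=compute_80,code=sm_80
            arch_flags.append(flag)
        elif flag in _ARCH_FLAG_NAMES and i + 1 < len(flags):
            # Split form, e.g. -gencode arch=compute_80,code=sm_80
            arch_flags.extend([flag, flags[i + 1]])
            i += 1
        i += 1
    return arch_flags
```

Flags unrelated to architecture, such as optimization levels, are deliberately left out so that only the arch semantics are shared between the two steps.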

Why draft

I wanted feedback on whether the JIT-side API should stay as a low-level flags entry point (extra_cuda_dlink_cflags) or be reshaped into a higher-level dlink=True style API in a follow-up.

Refs pytorch/pytorch#180762

Changed files

test/test_cpp_extensions_jit.py (modified, +21/-0)
torch/utils/cpp_extension.py (modified, +54/-2)

🚀 The feature, motivation and pitch

torch.utils.cpp_extension.CUDAExtension already supports relocatable device code / device linking workflows (for example, via dlink=True, dlink_libraries, and nvcc_dlink), but torch.utils.cpp_extension.load() does not expose an equivalent capability.

This creates a feature gap between the AOT/setuptools path and the JIT path.

Motivation:

Some CUDA extensions require an extra device-link step (-dlink) before the final host link step.
Typical examples include:
- CUDA dynamic parallelism
- multi-CU device symbol references
- linking against static CUDA libraries that contain RDC-compiled objects
These use cases are already recognized by PyTorch and documented for CUDAExtension, but they are not available through load() / JIT extensions.
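To make the extra step concrete, the sketch below assembles the three command lines of a relocatable-device-code build: compile each CUDA source with `-rdc=true`, run `nvcc -dlink` over the resulting objects, then do the host link. It only builds argument lists and does not invoke any tool. The function name, the `dlink.o` object name, the default arch, and the CUDA library path are all assumptions for illustration; this is not the ninja file PyTorch generates.

```python
def rdc_build_commands(cuda_sources, output, arch="sm_80",
                       cuda_lib_dir="/usr/local/cuda/lib64"):
    """Return (compile_cmds, dlink_cmd, link_cmd) as argument lists (hypothetical helper)."""
    objects = [src.rsplit(".", 1)[0] + ".o" for src in cuda_sources]

    # Step 1: compile each translation unit as relocatable device code.
    compile_cmds = [
        ["nvcc", f"-arch={arch}", "-rdc=true", "-Xcompiler", "-fPIC",
         "-c", src, "-o", obj]
        for src, obj in zip(cuda_sources, objects)
    ]

    # Step 2: device-link all objects so cross-file device symbols resolve.
    dlink_obj = "dlink.o"  # assumed name for the device-link output
    dlink_cmd = ["nvcc", f"-arch={arch}", "-Xcompiler", "-fPIC",
                 "-dlink", *objects, "-o", dlink_obj]

    # Step 3: host link, including the device-link object.
    link_cmd = ["c++", "-shared", *objects, dlink_obj,
                f"-L{cuda_lib_dir}", "-lcudart", "-o", output]
    return compile_cmds, dlink_cmd, link_cmd
```

Without step 2, the host linker sees unresolved device symbols whenever one `.cu` file references device code defined in another, which is the multi-CU case listed above.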

I would like PyTorch to add support for CUDA device linking in torch.utils.cpp_extension.load() (and potentially load_inline() if appropriate), ideally with an API that is consistent with or analogous to CUDAExtension.

Related context:

#57543 requested CUDA dynamic parallelism / device linking support for custom PyTorch CUDA extensions
#78225 added relocatable device code linking support for CUDAExtension
#44279 shows user confusion around how to achieve this in cpp extensions, especially for JIT-like workflows

Expected outcome:

JIT cpp extensions should be able to trigger the extra CUDA device-link step when needed
The behavior should remain opt-in and preserve current defaults for existing users

Alternatives

Current alternatives are not ideal:

Switch from torch.utils.cpp_extension.load() to the setuptools / CUDAExtension path.
- This loses the convenience of JIT-style workflows.
- It adds packaging/build-system overhead for users who only want runtime compilation.
Patch torch/utils/cpp_extension.py locally.
- This is fragile and not maintainable across PyTorch upgrades.
Manually reimplement the build flow with custom ninja / cmake logic.
- This defeats the purpose of using PyTorch's extension helpers.

Because CUDAExtension already supports this class of workflows, exposing similar functionality in load() seems like the most natural solution.

Additional context

This request is not asking for RDC/device-link support in general from scratch, because that already exists for CUDAExtension. Instead, this issue is specifically about feature parity for the JIT path:

- torch.utils.cpp_extension.load()
- potentially torch.utils.cpp_extension.load_inline()

A possible implementation might expose:

- a higher-level option such as dlink=True
- or a lower-level way to pass nvcc device-link flags

I am intentionally not prescribing the exact API here; the main request is to support the extra CUDA device-link step in the JIT extension flow.
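The two API shapes are not mutually exclusive. As a rough sketch of how they could coexist, the hypothetical helper below maps a boolean `dlink` switch and an optional flag list onto a single list of nvcc device-link flags. Treating a non-empty flag list as an implicit request for the device-link step is an assumption of this sketch, not something the issue specifies.

```python
def resolve_dlink_flags(dlink=False, extra_cuda_dlink_cflags=None):
    """Return nvcc device-link flags, or None when no device-link step is requested.

    Hypothetical helper: the parameter names mirror those discussed in the
    issue, but the resolution rules here are illustrative assumptions.
    """
    flags = list(extra_cuda_dlink_cflags or [])
    if not dlink and not flags:
        return None  # opt-in: the default build has no device-link step
    # Make sure the step is actually a device link, in short or long form.
    if "-dlink" not in flags and "--device-link" not in flags:
        flags.insert(0, "-dlink")
    return flags
```

A design like this keeps the default behavior unchanged while letting advanced users pass extra flags without also having to set the boolean.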

cc @janeyx99 @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia

extent analysis

TL;DR

To address the feature gap, PyTorch should add support for CUDA device linking in torch.utils.cpp_extension.load() with an API consistent with CUDAExtension.

Guidance

Investigate adding a dlink parameter to torch.utils.cpp_extension.load() to enable device linking, similar to CUDAExtension.
Consider exposing lower-level options to pass custom nvcc device-link flags for more fine-grained control.
Review the implementation of CUDAExtension to ensure consistency and feature parity with the proposed load() changes.
Evaluate the need to extend this support to torch.utils.cpp_extension.load_inline() based on user requirements and use cases.

Notes

The proposed solution builds upon existing support for relocatable device code and device linking in CUDAExtension, aiming to provide a consistent API for both AOT/setuptools and JIT paths.

Recommendation

Apply a workaround by patching cpp_extension.py locally or switching to the setuptools/CUDAExtension path until the feature is officially added to torch.utils.cpp_extension.load(). This recommendation is chosen because it allows users to achieve their goals, albeit with some inconvenience, while waiting for the official implementation of the requested feature.
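For the setuptools route, the relevant CUDAExtension arguments can be collected as in the sketch below. The helper name `rdc_extension_kwargs`, the extension name, and the file names are placeholders; `dlink=True` and the `-rdc=true` nvcc flag follow the CUDAExtension options mentioned in the issue. The helper only builds a dictionary, so it can be inspected without torch or a CUDA toolchain installed.

```python
def rdc_extension_kwargs(name, sources):
    """Build keyword arguments for a CUDAExtension that needs device linking."""
    return {
        "name": name,
        "sources": list(sources),
        # Ask the setuptools path to run the extra device-link step.
        "dlink": True,
        "extra_compile_args": {
            "cxx": ["-O2"],
            # Relocatable device code is required for device linking.
            "nvcc": ["-O2", "-rdc=true"],
        },
    }


# In a real setup.py (requires torch and a CUDA toolchain):
#   from setuptools import setup
#   from torch.utils.cpp_extension import BuildExtension, CUDAExtension
#   setup(
#       name="my_ext",
#       ext_modules=[CUDAExtension(**rdc_extension_kwargs(
#           "my_ext", ["a.cu", "b.cu", "bind.cpp"]))],
#       cmdclass={"build_ext": BuildExtension},
#   )
```

This trades the convenience of runtime compilation for an ahead-of-time build, which is the overhead the issue describes.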
