pytorch - 💡(How to fix) Fix `cudaErrorInvalidDeviceFunction` error due to use of `__CUDA_ARCH__` in host code [1 comments, 1 participants]

pytorch · 2026-03-10 09:06:23

pytorch/pytorch#177014•Fetched 2026-04-08 00:22:51

Error Message

Calling e.g. torch._convert_weight_to_int4pack fails with cudaErrorInvalidDeviceFunction. The error only becomes visible in a subsequent check, such as trying to print a CUDA tensor. When PyTorch is not compiled for SM 8.0 or higher, the host code (where __CUDA_ARCH__ is undefined) calls the kernel without any further checks, so the TORCH_CHECK that should report the unsupported architecture is never triggered.

Root Cause

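The cause is a potentially unsupported use of __CUDA_ARCH__: whether the kernel exists depends on the value of __CUDA_ARCH__, while the host code calls the kernel regardless.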

🐛 Describe the bug

Calling e.g. torch._convert_weight_to_int4pack fails with cudaErrorInvalidDeviceFunction (only visible in subsequent checks like trying to print a CUDA tensor)

Reason is a potentially unsupported use of __CUDA_ARCH__

In this case the relevant code is https://github.com/pytorch/pytorch/blob/abd22e1775fc57acd41821d1b82ed3c216e68433/aten/src/ATen/native/cuda/int4mm.cu#L1325-L1346

When PyTorch is not compiled for SM 8.0 or higher this will clearly cause an error as the host code (__CUDA_ARCH__ is undefined) calls the kernel without any further checks.
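To illustrate why a host-side guard cannot see the target architecture, here is a minimal sketch. nvcc defines `__CUDA_ARCH__` only during device compilation passes, so in host code (or with a plain host compiler, as assumed here) a guard of the form "macro undefined, or at least 800" is always taken. The guard shape and the constant names are illustrative assumptions and are not copied from int4mm.cu.

```cpp
#include <cstdio>

// True only when this translation unit is being compiled for a device
// architecture, i.e. when the compiler has defined __CUDA_ARCH__.
#if defined(__CUDA_ARCH__)
constexpr bool kSeesCudaArch = true;
#else
constexpr bool kSeesCudaArch = false;
#endif

// Hypothetical host-side launch guard: "no arch macro, or arch >= 8.0".
// In host code the first alternative always holds, whatever the
// architectures in TORCH_CUDA_ARCH_LIST are.
#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 800)
constexpr bool kHostGuardPasses = true;
#else
constexpr bool kHostGuardPasses = false;
#endif

// Prints both values so the behaviour can be inspected by hand.
inline void report_guard_state() {
    std::printf("__CUDA_ARCH__ visible: %d, launch guard passes: %d\n",
                kSeesCudaArch ? 1 : 0, kHostGuardPasses ? 1 : 0);
}
```

Because the guard is evaluated at preprocessing time in the host pass, it carries no information about the GPU the program later runs on.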

More subtly, whether it fails or works depends on the order of device architectures in TORCH_CUDA_ARCH_LIST. On A100 GPUs it fails with TORCH_CUDA_ARCH_LIST='8.0;7.0' and works with TORCH_CUDA_ARCH_LIST='7.0;8.0'.

I have seen this issue in DeepSpeed too, which used a similar approach. You can find the discussion at https://github.com/deepspeedai/DeepSpeed/issues/7863

I have contacted NVIDIA support on that (NVIDIA Bug ID 5922576) and was referred to the documentation which has a passage on that, e.g.

If a __global__ function template is instantiated and launched from the host, then the function template must be instantiated with the same template arguments irrespective of whether __CUDA_ARCH__ is defined and regardless of the value of __CUDA_ARCH__.

In separate compilation mode, the presence or absence of a definition of a function or variable with external linkage shall not depend on whether __CUDA_ARCH__ is defined or on a particular value of __CUDA_ARCH__.

However, it seems separate compilation is only used for some projects in third_party not for this particular function which is compiled with nvcc ... -gencode arch=compute_80,code=sm_80 -gencode arch=compute_70,code=sm_70 -x cu ...
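For reference, the value of `__CUDA_ARCH__` seen in each device pass follows from the `compute_XY` target: it is 100 × major + 10 × minor, so the two -gencode options quoted above give 800 and 700. The sketch below uses hypothetical helper names to show which of those passes would compile a kernel body guarded by `__CUDA_ARCH__ >= 800`. It models the preprocessor condition only and makes no claim about how the resulting binary is loaded.

```cpp
// Value of __CUDA_ARCH__ during the device pass for compute capability
// major.minor (e.g. 8.0 -> 800, 7.0 -> 700).
constexpr int cuda_arch_value(int major, int minor) {
    return major * 100 + minor * 10;
}

// Hypothetical model of a body guarded by "#if __CUDA_ARCH__ >= 800":
// returns whether the body is compiled in for the given target.
constexpr bool body_compiled_for(int major, int minor) {
    return cuda_arch_value(major, minor) >= 800;
}

// With TORCH_CUDA_ARCH_LIST='8.0;7.0' one pass keeps the body and one drops it.
static_assert(body_compiled_for(8, 0), "compute_80 pass keeps the body");
static_assert(!body_compiled_for(7, 0), "compute_70 pass drops the body");
```

Both passes compile the same source, so the same kernel name ends up with a full body in one image and an empty one in the other.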

However, the error does exist, although it is still being looked into by NVIDIA.

Independently of this, the existence of the kernel depends on the value of __CUDA_ARCH__, which may be invalid. The host code still calls the kernel even though it is not compiled for SM < 8.0, so the call fails at run time at the latest, and the TORCH_CHECK macro that is meant to show an error is not triggered.

Versions

This code has existed since PyTorch 2.2.0 and is still present in 2.10.0.

It can be triggered e.g. on A100 with TORCH_CUDA_ARCH_LIST='8.0;7.0', but I assume it also happens when "8.0" is replaced by any other GPU arch.

cc @malfet @seemethere @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel

extent analysis

Fix Plan

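1. Update `TORCH_CUDA_ARCH_LIST` so that SM 8.0 is listed last

export TORCH_CUDA_ARCH_LIST='7.0;8.0'

Set this before building PyTorch and rebuild. According to the report above, '8.0;7.0' fails on A100 while '7.0;8.0' works, so this is a workaround rather than a fix. TORCH_CUDA_ARCH_LIST is read at build time; changing it for an already built binary has no effect.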

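2. Check `__CUDA_ARCH__` only in device code, not before calling the kernel

// __CUDA_ARCH__ is undefined in host code, so a host-side test such as
// `if (__CUDA_ARCH__ >= 800)` cannot work. Keep the kernel defined for every
// architecture and guard only its body. `pack_kernel` is a placeholder name.
__global__ void pack_kernel(/* ... */) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    // SM 8.0+ implementation
#else
    // Built for an older architecture: abort so the host sees an error
    __trap();
#endif
}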

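3. Consider updating the host code to use a more robust way of checking the CUDA architecture

// Instead of relying on __CUDA_ARCH__, query the current device with the CUDA runtime API.
// This only catches devices older than SM 8.0; it does not detect a binary
// that lacks the kernel for the current device.
int device = 0;
cudaGetDevice(&device);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);
if (prop.major >= 8) {
    // Launch the kernel here (the launch itself is unchanged)
} else {
    // Handle the case where the CUDA architecture is not supported
    TORCH_CHECK(false, "_convert_weight_to_int4pack requires a GPU with SM 8.0 or newer");
}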
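The decision in step 3 can be kept separate from the CUDA call so that it can be exercised without a GPU. The following is a minimal sketch under that assumption: `Capability`, `supports_int4_pack` and `require_int4_pack_support` are hypothetical names, and in a real build the two integers would come from `cudaDeviceProp::major` and `cudaDeviceProp::minor`.

```cpp
#include <stdexcept>
#include <string>

// Compute capability of a device, e.g. {8, 0} for an A100.
// Hypothetical type; the values are assumed to be filled from cudaDeviceProp.
struct Capability {
    int major;
    int minor;
};

// The kernel discussed here needs SM 8.0 or newer.
constexpr bool supports_int4_pack(Capability cap) {
    return cap.major >= 8;
}

// Throws before any launch is attempted if the device is too old.
// A PyTorch build would use TORCH_CHECK here instead of a plain exception.
inline void require_int4_pack_support(Capability cap) {
    if (!supports_int4_pack(cap)) {
        throw std::runtime_error(
            "_convert_weight_to_int4pack requires SM 8.0 or newer, got SM " +
            std::to_string(cap.major) + "." + std::to_string(cap.minor));
    }
}
```

Calling `require_int4_pack_support` right before the launch turns a late cudaErrorInvalidDeviceFunction into an immediate, descriptive error for devices older than SM 8.0. It does not help when the device is new enough but the loaded binary lacks a usable kernel, which is the arch-list-order case described in the report.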

Verification

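Rebuild PyTorch with the updated TORCH_CUDA_ARCH_LIST, run your code and check whether the error is resolved. Because the error only becomes visible in a subsequent check, follow the call with an operation that surfaces it, such as printing a CUDA tensor.
Verify that the kernel is called correctly for supported CUDA architectures.
Test with different CUDA architectures and with both orders of the architecture list ('8.0;7.0' and '7.0;8.0') to ensure it works as expected.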

Extra Tips

Make sure TORCH_CUDA_ARCH_LIST is set consistently in every environment where PyTorch is built, since it only takes effect at build time.
Consider adding more robust error handling and logging for cases where the CUDA architecture is not supported.
According to the report, the affected code has existed since PyTorch 2.2.0, so versions older than 2.2.0 should not contain this code path.
