pytorch - 💡(How to fix) Fix `cudaErrorInvalidDeviceFunction` error due to use of `__CUDA_ARCH__` in host code [1 comments, 1 participants]

pytorch/pytorch#177014, fetched 2026-04-08 00:22:51

Error Message

When PyTorch is not compiled for SM 8.0 or higher, the call fails because the host code, in which __CUDA_ARCH__ is undefined, launches the kernel without any further checks. The error also occurs in some builds that do include SM 8.0, and NVIDIA is still looking into that case. Independently of this, the existence of the kernel depends on the value of __CUDA_ARCH__, which may not be valid usage, while the host code launches it unconditionally. For SM < 8.0 the kernel is not compiled at all, so the call fails at runtime at the latest, and the TORCH_CHECK that was meant to report an error is never triggered.

Root Cause

The cause is a potentially unsupported use of __CUDA_ARCH__ in host code.


🐛 Describe the bug

Calling e.g. torch._convert_weight_to_int4pack fails with cudaErrorInvalidDeviceFunction (only visible in subsequent checks like trying to print a CUDA tensor)

The reason is a potentially unsupported use of __CUDA_ARCH__.

In this case the relevant code is https://github.com/pytorch/pytorch/blob/abd22e1775fc57acd41821d1b82ed3c216e68433/aten/src/ATen/native/cuda/int4mm.cu#L1325-L1346

When PyTorch is not compiled for SM 8.0 or higher, this will clearly cause an error, as the host code (where __CUDA_ARCH__ is undefined) calls the kernel without any further checks.

More subtly, whether the call fails or works depends on the order of the device architectures in TORCH_CUDA_ARCH_LIST. On A100 GPUs it fails with TORCH_CUDA_ARCH_LIST='8.0;7.0' and works with TORCH_CUDA_ARCH_LIST='7.0;8.0'.
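An architecture list can be inspected before a build with a small helper. The following is a minimal Python sketch, not part of PyTorch: all function names are hypothetical, only numeric entries such as 7.0 or 8.0+PTX are handled, and the "older architecture listed after 8.0" test is an assumption that generalizes the single failing list reported here.

```python
def parse_arch_list(value):
    """Parse a TORCH_CUDA_ARCH_LIST-style string into (major, minor) tuples."""
    archs = []
    # Entries may be separated by ';' or by spaces.
    for entry in value.replace(" ", ";").split(";"):
        if not entry:
            continue
        # Drop an optional '+PTX' suffix; named architectures are not handled here.
        version = entry.split("+")[0]
        major, minor = version.split(".")
        archs.append((int(major), int(minor)))
    return archs


def has_sm80_or_newer(value):
    """True if at least one listed architecture is SM 8.0 or newer."""
    return any(arch >= (8, 0) for arch in parse_arch_list(value))


def older_arch_after_sm80(value):
    """True if an entry below 8.0 follows an entry at or above 8.0.

    Assumption: this is the shape of the reported failing list '8.0;7.0'.
    """
    seen_sm80 = False
    for arch in parse_arch_list(value):
        if arch >= (8, 0):
            seen_sm80 = True
        elif seen_sm80:
            return True
    return False


for example in ("8.0;7.0", "7.0;8.0"):
    print(example, "-> older arch after SM 8.0+:", older_arch_after_sm80(example))
```

A list for which has_sm80_or_newer returns False corresponds to the first failure mode (no SM 8.0 build at all); older_arch_after_sm80 only flags lists that resemble the reported failing order.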

I have seen this issue in DeepSpeed too, which used a similar approach. You can find the discussion at https://github.com/deepspeedai/DeepSpeed/issues/7863

I have contacted NVIDIA support on that (NVIDIA Bug ID 5922576) and was referred to the documentation which has a passage on that, e.g.

If a __global__ function template is instantiated and launched from the host, then the function template must be instantiated with the same template arguments irrespective of whether __CUDA_ARCH__ is defined and regardless of the value of __CUDA_ARCH__.

In separate compilation mode, the presence or absence of a definition of a function or variable with external linkage shall not depend on whether __CUDA_ARCH__ is defined or on a particular value of __CUDA_ARCH__.

However, it seems separate compilation is only used for some projects in third_party, not for this particular function, which is compiled with nvcc ... -gencode arch=compute_80,code=sm_80 -gencode arch=compute_70,code=sm_70 -x cu ...
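The -gencode options in that command line mirror the architecture list: each entry gets its own option, and nvcc compiles the device code once per option with a different value of __CUDA_ARCH__. A minimal Python sketch of the mapping, assuming only plain numeric entries (gencode_flags is a hypothetical helper, not the PyTorch build logic):

```python
def gencode_flags(arch_list):
    """Build one '-gencode' flag per numeric entry of a TORCH_CUDA_ARCH_LIST-style string."""
    flags = []
    for entry in arch_list.replace(" ", ";").split(";"):
        if not entry:
            continue
        # '8.0' -> '80'; suffixes such as '+PTX' are out of scope for this sketch.
        number = entry.replace(".", "")
        flags.append(f"-gencode arch=compute_{number},code=sm_{number}")
    return flags


# Reproduces the flags quoted in the report for the list '8.0;7.0'.
print(" ".join(gencode_flags("8.0;7.0")))
```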

The error nevertheless occurs, and NVIDIA is still looking into it.

Independently of this, the existence of the kernel depends on the value of __CUDA_ARCH__, which may not be valid usage. The host code still launches the kernel even though it is not compiled for SM < 8.0, so the call fails at runtime at the latest, and the TORCH_CHECK that was meant to report an error is never triggered.
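The mismatch between the host pass and the device passes can be made concrete with a toy model. This Python sketch assumes the guard has the form `!defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800`; the actual condition in int4mm.cu may contain further terms, and the function names are hypothetical. It represents an undefined __CUDA_ARCH__ as None.

```python
def cuda_arch_macro(major, minor):
    """Value of __CUDA_ARCH__ in the device pass for compute capability major.minor."""
    return major * 100 + minor * 10


def kernel_is_compiled(cuda_arch):
    """Device pass: under the assumed guard, the kernel exists only for __CUDA_ARCH__ >= 800."""
    return cuda_arch >= 800


def host_launches_kernel(cuda_arch=None):
    """Host pass: __CUDA_ARCH__ is undefined (None), so the assumed guard lets the launch through."""
    return cuda_arch is None or cuda_arch >= 800


for major, minor in [(7, 0), (8, 0)]:
    arch = cuda_arch_macro(major, minor)
    print(f"sm_{major}{minor}: __CUDA_ARCH__={arch}, kernel compiled: {kernel_is_compiled(arch)}")
print("host pass launches kernel:", host_launches_kernel())
```

In this model the host pass always emits the launch, while the sm_70 pass emits no kernel, which is the inconsistency the quoted documentation warns about.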

Versions

This code has existed since PyTorch 2.2.0 and is still present in 2.10.0.

It can be triggered e.g. on A100 with TORCH_CUDA_ARCH_LIST='8.0;7.0', but I assume the same happens when replacing the "8.0" with any other GPU arch.

cc @malfet @seemethere @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel

extent analysis

Fix Plan

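1. As a workaround, reorder TORCH_CUDA_ARCH_LIST so that 8.0 comes after the older architecture, which is the order the report found to work on A100

export TORCH_CUDA_ARCH_LIST='7.0;8.0'

Set the variable before building PyTorch. It controls which architectures are compiled, so changing it has no effect on an already built installation.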

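2. Do not make the existence of the kernel depend on __CUDA_ARCH__; guard only the kernel body

// __CUDA_ARCH__ is undefined in the host pass, so it cannot guard the launch.
// Keep the kernel defined for every architecture and guard only its body.
// my_int4_kernel is a placeholder name, not the actual PyTorch kernel.
__global__ void my_int4_kernel(/* args */) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    // SM 8.0+ implementation goes here
#else
    // Built for an unsupported architecture: fail loudly instead of silently
    CUDA_KERNEL_ASSERT(false);
#endif
}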

3. Check the device architecture on the host at runtime before launching the kernel

// Instead of relying on __CUDA_ARCH__, query the current device on the host
#include <ATen/cuda/CUDAContext.h>

const cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
TORCH_CHECK(prop->major >= 8, "_convert_weight_to_int4pack requires SM 8.0 or newer");
// Launch the kernel (my_int4_kernel, grid and block are placeholders)
my_int4_kernel<<<grid, block, 0, at::cuda::getCurrentCUDAStream()>>>(/* args */);
// Check for launch errors right after the launch
C10_CUDA_KERNEL_LAUNCH_CHECK();

Verification

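  1. Rebuild PyTorch with the updated TORCH_CUDA_ARCH_LIST, rerun your code, and force a subsequent check (for example, print the resulting CUDA tensor), since the error only becomes visible there.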
  2. Verify that the kernel is called correctly for supported CUDA architectures.
  3. Test the code with different CUDA architectures to ensure it works as expected.
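Part of this verification can be scripted as a pre-flight check that compares the architectures a build contains with the capability of the device. The sketch below is an assumption-laden illustration: the helper names are hypothetical, it considers only plain sm_XY entries, and it assumes that a binary for sm_XY runs on devices with the same major version and an equal or higher minor version. It detects only the "not compiled for SM 8.0 or higher" case, not the order-dependent failure.

```python
def parse_sm(arch):
    """'sm_80' -> (8, 0); returns None for other entries such as 'compute_80'."""
    if not arch.startswith("sm_"):
        return None
    digits = arch[len("sm_"):]
    if not digits.isdigit():
        return None  # suffixed names such as 'sm_90a' are not handled
    return int(digits[:-1]), int(digits[-1])


def has_sm80_binary_for(device_capability, compiled_archs):
    """True if some compiled SM >= 8.0 binary is assumed to run on the device."""
    dev_major, dev_minor = device_capability
    for arch in compiled_archs:
        sm = parse_sm(arch)
        if sm is None or sm < (8, 0):
            continue
        major, minor = sm
        if major == dev_major and minor <= dev_minor:
            return True
    return False


# With PyTorch installed, the inputs could come from:
#   torch.cuda.get_device_capability()  -> e.g. (8, 0)
#   torch.cuda.get_arch_list()          -> e.g. ['sm_70', 'sm_80']
print(has_sm80_binary_for((8, 0), ["sm_70", "sm_80"]))
```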

Extra Tips

  • Make sure TORCH_CUDA_ARCH_LIST is set consistently across all your build environments, since it takes effect when PyTorch is compiled.
  • Consider adding more robust error handling and logging to handle cases where the CUDA architecture is not supported.
  • According to the report, this code was introduced in PyTorch 2.2.0, so older versions should not contain this particular code path.
