pytorch - 💡(How to fix) Fix torch.any and torch.all with CPU input and CUDA out tensor raise CUDA INTERNAL ASSERT FAILED instead of a device mismatch error [1 comments, 2 participants]

pytorch/pytorch#178733 · Fetched 2026-04-08 01:52:39

Error Message

PyTorch raises: RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch.


🐛 Describe the bug

torch.any and torch.all appear to hit an internal CUDA assert when the input tensor is on CPU and the out tensor is on CUDA.

This looks like an error-handling / device-validation bug. The input combination is invalid, but instead of raising a regular user-facing device mismatch error, PyTorch raises:

CUDA INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch.

Interestingly, the opposite direction (CUDA input with CPU out) raises a normal and much more reasonable error:

Expected out tensor to have device cuda:0, but got cpu instead

So the behavior seems asymmetric.

Minimal repro

import torch

x = torch.randint(0, 2, (3, 4), dtype=torch.bool)  # CPU tensor
out = torch.empty((), dtype=torch.bool, device="cuda")

torch.any(x, out=out)
# torch.all(x, out=out) shows the same issue

Observed behavior

PyTorch raises: RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch.

Expected behavior

PyTorch should raise a normal user-facing error indicating that the input tensor and out tensor are on different devices, similar to the error raised for the reverse direction (CUDA input, CPU out). For example, something along the lines of: Expected out tensor to have the same device as the input tensor

Additional notes

I also confirmed:

  • torch.any(cpu_tensor, out=cpu_out) works
  • torch.any(cuda_tensor, out=cuda_out) works
  • torch.any(cuda_tensor, out=cpu_out) raises a normal device mismatch error
  • torch.any(cpu_tensor, out=cuda_out) raises the internal assert above
  • torch.all shows the same problematic behavior as torch.any

This does not appear to be a segfault or memory corruption issue, but it looks like an internal assertion is being exposed where a regular validation error should be returned.
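The combinations listed above can be checked with a short script. This is a minimal sketch, not part of the original report: `try_reduction` and `device_matrix` are hypothetical helper names, and the CUDA pairings are only exercised when `torch.cuda.is_available()` returns True. The exact error text may differ between PyTorch versions.

```python
import torch


def try_reduction(op, x, out):
    """Run op(x, out=out) and summarize the outcome as a short string."""
    try:
        op(x, out=out)
        return "ok"
    except Exception as exc:  # the report above shows RuntimeError for mismatches
        first_line = (str(exc).splitlines() or [""])[0]
        return f"{type(exc).__name__}: {first_line}"


def device_matrix(op):
    """Try every input/out device pairing available on this machine."""
    devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
    results = {}
    for in_dev in devices:
        for out_dev in devices:
            x = torch.randint(0, 2, (3, 4), dtype=torch.bool, device=in_dev)
            out = torch.empty((), dtype=torch.bool, device=out_dev)
            results[(in_dev, out_dev)] = try_reduction(op, x, out)
    return results


if __name__ == "__main__":
    for op in (torch.any, torch.all):
        for (in_dev, out_dev), outcome in device_matrix(op).items():
            print(f"{op.__name__}: input={in_dev} out={out_dev} -> {outcome}")
```

On a machine without a GPU only the CPU/CPU pairing runs, so the asymmetry described in the report will not show up there.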

Versions

PyTorch version: 2.6.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 | packaged by conda-forge | (main, Mar 5 2026, 16:42:22) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-6.17.0-19-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 2080 Ti

Nvidia driver version: 580.126.09
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
Stepping: 7
BogoMIPS: 6199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat vnmi umip pku ospke avx512_vnni md_clear arch_capabilities
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 128 MiB (32 instances)
L3 cache: 16 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Ghostwrite: Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Old microcode: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Vulnerability Vmscape: Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] torch==2.6.0+cu126
[pip3] torchaudio==2.6.0+cu126
[pip3] torchvision==0.21.0+cu126
[pip3] triton==3.2.0
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
[conda] torch 2.6.0+cu126 pypi_0 pypi
[conda] torchaudio 2.6.0+cu126 pypi_0 pypi
[conda] torchvision 0.21.0+cu126 pypi_0 pypi
[conda] triton 3.2.0 pypi_0 pypi

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @malfet

extent analysis

Fix Plan

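The underlying problem is missing device validation inside PyTorch, so the steps below are a user-side workaround rather than a fix for the bug itself: make sure the input tensor and the out tensor are on the same device before the call.

  • Compare the device of the input tensor and the out tensor before calling torch.any or torch.all.
  • If the devices differ, either move the input tensor to the device of the out tensor, or allocate the out tensor on the input tensor's device.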

Example Code

import torch

# Create a CPU tensor
x = torch.randint(0, 2, (3, 4), dtype=torch.bool)

# Create an output tensor on CUDA
out = torch.empty((), dtype=torch.bool, device="cuda")

# Move the input tensor to the CUDA device
x = x.to(out.device)

# Now torch.any should work without errors
torch.any(x, out=out)
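If silently moving the input is not desirable, the device check can instead raise a readable error before the call reaches PyTorch internals. The following is a minimal sketch under that assumption; `checked_reduce` is a hypothetical helper, not a PyTorch API, and its message simply mirrors the wording of the error quoted for the reverse direction.

```python
import torch


def checked_reduce(op, x, out):
    """Call op(x, out=out) only when both tensors live on the same device.

    Hypothetical user-side guard; it is not part of PyTorch.
    """
    if x.device != out.device:
        raise ValueError(
            f"Expected out tensor to have device {x.device}, "
            f"but got {out.device} instead"
        )
    return op(x, out=out)


# Usage: CPU input with a CPU out tensor passes the check.
x = torch.tensor([[True, False], [False, False]])
out = torch.empty((), dtype=torch.bool)
checked_reduce(torch.any, x, out)
```

Raising on a mismatch keeps accidental host-to-device copies visible, at the cost of requiring the caller to move tensors explicitly.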

Verification

After applying the workaround, confirm that torch.any and torch.all complete without the internal CUDA assert and that the out tensor holds the same value as a plain x.any() or x.all() computed on the original input.
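One way to run that check is sketched below. It assumes the workaround of moving the input to the device of the out tensor; the fallback to CPU when no GPU is present is an addition so the script runs anywhere, and the variable names are illustrative only.

```python
import torch

# Use CUDA when present; otherwise fall back to CPU so the sketch still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randint(0, 2, (3, 4), dtype=torch.bool)  # CPU input, as in the repro
out_any = torch.empty((), dtype=torch.bool, device=device)
out_all = torch.empty((), dtype=torch.bool, device=device)

# Workaround: move the input to the device of the out tensors first.
x_dev = x.to(device)
torch.any(x_dev, out=out_any)
torch.all(x_dev, out=out_all)

# Compare against the reductions computed directly on the CPU input.
assert out_any.item() == x.any().item()
assert out_all.item() == x.all().item()
```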

Extra Tips

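  • Check tensor.device on every tensor involved, including out, before calling an operation that takes more than one tensor.
  • Use Tensor.to(device) to move a tensor, or pass device=x.device when allocating the out tensor so it starts on the right device.
  • Keep in mind that tensors created without a device argument land on the CPU by default, which is a common source of device mismatch errors.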
