pytorch - ✅(Solved) Fix [CUDA] When my data size is between 32 and 128 (exclusive), the argsort doesn't go through the WarpMergeSort branch. [1 pull request, 1 participant]

pytorch · 2026-05-13 02:43:18

GitHub stats

pytorch/pytorch#183499•Fetched 2026-05-14 03:28:36

Author

sun-ds-ai

Participants

sun-ds-ai

Timeline (top)

closed ×2, cross-referenced ×1, referenced ×1, reopened ×1

Fix Action

Fixed

Fixed by PR: Fix CUDA version check for warp merge sort (https://github.com/pytorch/pytorch/pull/183527)

PR fix notes

PR #183527: Fix CUDA version check for warp merge sort

Repository: pytorch/pytorch
Author: HU-qingqing
State: closed | merged: False
Link: https://github.com/pytorch/pytorch/pull/183527

Description (problem / solution / changelog)

As referenced in the original code and PR #96223, CUDA 11.6 is explicitly required. The correct value of the CUDA_VERSION macro for CUDA 11.6 is 11060, following the standard encoding rule: CUDA_VERSION = major * 1000 + minor * 10. This matches the usage already present in pytorch/third_party/kineto/libkineto/src/CuptiCbidRegistry.cpp, which defines 11060 for CUDA 11.6.
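To make the encoding concrete, here is a minimal Python sketch of the comparison. The helper name `cuda_version_macro` is hypothetical and exists only for illustration. It encodes major and minor only, which is enough to reproduce the two values quoted on this page (11060 and 12020). The thresholds are the ones named in the issue and the PR.

```python
def cuda_version_macro(major: int, minor: int) -> int:
    """Encode a CUDA release the way the CUDA_VERSION macro does (illustrative helper)."""
    return major * 1000 + minor * 10


OLD_THRESHOLD = 110600  # value the check compared against before the fix (from the issue)
NEW_THRESHOLD = 11060   # CUDA 11.6, the value the PR switches to

for major, minor in [(11, 6), (11, 8), (12, 2)]:
    version = cuda_version_macro(major, minor)
    print(
        f"CUDA {major}.{minor}: CUDA_VERSION={version} "
        f"old_check={version >= OLD_THRESHOLD} new_check={version >= NEW_THRESHOLD}"
    )
```

Under this encoding, the old threshold would only be met by a CUDA major version of 110 or higher, which is why the check failed on every real toolkit.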

This PR fixes the incorrect version check that existed previously. Performance benchmarks were conducted on the issue case before and after the fix to verify the performance improvement.

Before fix: sortCommon(MediumRadixSort{}, key, value, dim, descending) 8.544us
After fix: sortCommon(WarpMergeSort<128>{}, key, value, dim, descending) 4.096us

The sort kernel is about 2x faster. This is consistent with the conclusion stated in PR #96223: using WarpMergeSort under eligible conditions results in up to a 2x speedup for unstable sorts and up to a 15x speedup for stable sorts, depending on the input geometry.

Fixes #183499

before

Name	Self CPU %	Self CPU	CPU total %	CPU total	CPU time avg	Self CUDA	Self CUDA %	CUDA total	CUDA time avg	# of Calls
void at::native::radixSortKVInPlace<-2, -1, 32, 4, ...>	0.00%	0.000us	0.00%	0.000us	0.000us	8.544us	63.42%	8.544us	8.544us	1
Memcpy DtoD (Device -> Device)	0.00%	0.000us	0.00%	0.000us	0.000us	3.713us	27.56%	3.713us	1.238us	3
void (anonymous namespace)::elementwise_kernel_with_...	0.00%	0.000us	0.00%	0.000us	0.000us	1.216us	9.03%	1.216us	1.216us	1
cudaMemcpyAsync	3.00%	63.241us	98.03%	2.068ms	689.455us	0.000us	0.00%	0.000us	0.000us	3
Activity Buffer Request	95.03%	2.005ms	95.03%	2.005ms	2.005ms	0.000us	0.00%	0.000us	0.000us	1
cudaLaunchKernel	1.41%	29.713us	1.41%	29.713us	14.857us	0.000us	0.00%	0.000us	0.000us	2
cudaDeviceSynchronize	0.56%	11.893us	0.56%	11.893us	5.947us	0.000us	0.00%	0.000us	0.000us	2

Self CPU time total: 2.110ms
Self CUDA time total: 13.473us

after

Name	Self CPU %	Self CPU	CPU total %	CPU total	CPU time avg	Self CUDA	Self CUDA %	CUDA total	CUDA time avg	# of Calls
void at::native::warpMergeSortKVInPlace<-2, -1, 128,...>	0.00%	0.000us	0.00%	0.000us	0.000us	4.096us	45.88%	4.096us	4.096us	1
Memcpy DtoD (Device -> Device)	0.00%	0.000us	0.00%	0.000us	0.000us	3.615us	40.50%	3.615us	1.205us	3
void (anonymous namespace)::elementwise_kernel_with_...	0.00%	0.000us	0.00%	0.000us	0.000us	1.216us	13.62%	1.216us	1.216us	1
cudaMemcpyAsync	3.39%	72.069us	96.96%	2.062ms	687.416us	0.000us	0.00%	0.000us	0.000us	3
Activity Buffer Request	93.57%	1.990ms	93.57%	1.990ms	1.990ms	0.000us	0.00%	0.000us	0.000us	1
cudaLaunchKernel	1.42%	30.218us	1.42%	30.218us	15.109us	0.000us	0.00%	0.000us	0.000us	2
cudaDeviceGetAttribute	0.10%	2.151us	0.10%	2.151us	0.538us	0.000us	0.00%	0.000us	0.000us	4
cudaFuncGetAttributes	0.64%	13.692us	0.64%	13.692us	13.692us	0.000us	0.00%	0.000us	0.000us	1
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags	0.30%	6.370us	0.30%	6.370us	0.398us	0.000us	0.00%	0.000us	0.000us	16
cudaDeviceSynchronize	0.57%	12.169us	0.57%	12.169us	6.085us	0.000us	0.00%	0.000us	0.000us	2

Self CPU time total: 2.127ms
Self CUDA time total: 8.927us
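The speedup can be checked directly from the two profiler tables. The short Python sketch below uses only the Self CUDA times shown above. The dictionary keys are shorthand labels chosen here, not profiler names.

```python
# Self CUDA times in microseconds, copied from the "before" and "after" tables.
before = {"sort_kernel": 8.544, "memcpy_dtod": 3.713, "elementwise": 1.216}
after = {"sort_kernel": 4.096, "memcpy_dtod": 3.615, "elementwise": 1.216}

# Speedup of the sort kernel alone (radixSortKVInPlace vs. warpMergeSortKVInPlace).
kernel_speedup = before["sort_kernel"] / after["sort_kernel"]

# Speedup of the total Self CUDA time, which also includes the copies
# and the elementwise kernel that the fix does not touch.
total_speedup = sum(before.values()) / sum(after.values())

print(f"sort kernel speedup: {kernel_speedup:.2f}x")
print(f"total CUDA time speedup: {total_speedup:.2f}x")
```

The 2x figure applies to the sort kernel itself. Measured over the whole argsort call, the gain in total CUDA time is smaller, because the device-to-device copies and the elementwise kernel are unchanged. These are single-run numbers from one profile, so they are best read as indicative.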

Changed files

aten/src/ATen/native/cuda/SortUtils.cuh (modified, +1/-1)

Code Example

void sortKeyValueInplace(
    const TensorBase& key,
    const TensorBase& value,
    int64_t dim,
    bool descending,
    bool stable) {
  const auto sort_size = key.size(dim);
  if (sort_size <= 1) {
    return; // Already sorted
  } else if (!stable && sort_size <= 32) {
    // NOTE: Bitonic sort is unstable
    sortCommon(SmallBitonicSort{}, key, value, dim, descending);
#if HAS_WARP_MERGE_SORT()
  } else if (sort_size <= 128) {
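    // Debug print added by the reporter; never reached while the macro evaluates to 0
    printf("HAS_WARP_MERGE_SORT\n");
    sortCommon(WarpMergeSort<128, C10_WARP_SIZE>{}, key, value, dim, descending);
#endif
  } else {
    // Debug prints added by the reporter to show why the branch above is skipped
    printf("HAS_WARP_MERGE_SORT:%d\n", HAS_WARP_MERGE_SORT());
    printf("CUDA_VERSION:%d\n", CUDA_VERSION);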
    sortCommon(MediumRadixSort{}, key, value, dim, descending);
  }
}

---

import torch

device = "cuda"
# 5 x 8 int64 tensor; flattened it has 40 elements, so 32 < sort_size <= 128
x = torch.tensor([[0, 6, 7, 8, 9, 10, 11, 1]] * 5, dtype=torch.int64, device=device)
x.view(-1).argsort()
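The branch order in sortKeyValueInplace can be modelled in a few lines of plain Python, which makes it easier to see which kernel a given input selects. This is a sketch of the control flow quoted above, not PyTorch code. The function `pick_sort_kernel` and its `has_warp_merge_sort` flag are hypothetical stand-ins for the C++ function and the HAS_WARP_MERGE_SORT() macro.

```python
def pick_sort_kernel(sort_size: int, stable: bool, has_warp_merge_sort: bool) -> str:
    """Return the name of the branch taken, mirroring the quoted C++ branch order."""
    if sort_size <= 1:
        return "none"  # already sorted
    if not stable and sort_size <= 32:
        return "SmallBitonicSort"  # bitonic sort is unstable
    if has_warp_merge_sort and sort_size <= 128:
        return "WarpMergeSort"  # compiled in only when the macro is true
    return "MediumRadixSort"


# The reproducer flattens a 5 x 8 tensor, so sort_size is 40.
print(pick_sort_kernel(40, stable=False, has_warp_merge_sort=False))
print(pick_sort_kernel(40, stable=False, has_warp_merge_sort=True))
```

With the flag off, every size above 32 (and every stable sort above size 1) falls through to the radix sort, which matches the behaviour reported in the issue.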

🐛 Describe the bug

In sortKeyValueInplace in ATen/native/cuda/Sort.cu, the WarpMergeSort branch is never taken. My CUDA_VERSION is 12020 (CUDA 12.2). The macro HAS_WARP_MERGE_SORT() checks CUDA_VERSION >= 110600. Because CUDA_VERSION is a five-digit value such as 12020, that check fails on every CUDA release, so HAS_WARP_MERGE_SORT() always evaluates to 0. My torch version is 2.7.1.


Result printed by the instrumented build: HAS_WARP_MERGE_SORT:0 CUDA_VERSION 12020

Versions

Around line 383 of aten/src/ATen/native/cuda/Sort.cu, add a printf, compile, and run argsort (to see the printed output)
