pytorch - Fix CUDA linalg: replace hard LU M >= 512 cuSOLVER cutoff with batch-aware heuristic

🐛 Describe the bug

The CUDA LU factorization heuristic for the cuBLAS/cuSOLVER path uses a hard matrix-size cutoff for batched square inputs:

https://github.com/pytorch/pytorch/blob/37bc32b8736cca7afb41d4794e32ed65bbc83521/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp#L877-L882

if (m != n || (batch_size == 1 || m >= 512)) {
  lu_factor_looped_cusolver(input, pivots, infos, compute_pivots);
} else {
  lu_factor_batched_cublas(input, pivots, infos, compute_pivots);
}

For batched square M x M inputs this means:

M < 512: cuBLAS getrfBatched
M >= 512: looped cuSOLVER getrf
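
To make the dispatch rule easy to probe, the quoted C++ condition can be mirrored in a few lines of Python. This is only an illustrative sketch: `current_lu_backend` is a hypothetical helper name, and the returned strings are labels, not real dispatch targets.

```python
def current_lu_backend(m: int, n: int, batch_size: int) -> str:
    """Mirror the quoted C++ condition and return a label for the chosen path."""
    # Non-square inputs, single matrices, and M >= 512 all take the looped path.
    if m != n or batch_size == 1 or m >= 512:
        return "looped cuSOLVER getrf"
    return "cuBLAS getrfBatched"


# Probe both sides of the cutoff for a large batch, plus two edge cases.
for batch_size, m in [(128, 511), (128, 512), (1, 64), (8, 512)]:
    print(batch_size, m, current_lu_backend(m, m, batch_size))
```

Note that the batch size only enters through the `batch_size == 1` test, so a batch of 2 and a batch of 128 are treated identically at every M.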

This cutoff is too coarse. It works well for small batches, but for larger batches the batched cuBLAS path remains substantially faster well beyond M=512. The result is a visible performance cliff in torch.linalg.lu_factor_ex and in callers such as torch.linalg.solve_ex.

This is not a cuBLAS API limit. NVIDIA cuBLAS docs have long documented that cublas<t>getrfBatched supports arbitrary dimension, while describing it as intended for small matrices. The right cutoff is a performance heuristic, and it depends strongly on batch size. It may also depend on dtype, CUDA version, and GPU generation.

Repro

Build without MAGMA, or otherwise exercise the cuBLAS/cuSOLVER CUDA linalg path, then run:

import torch

torch.cuda.set_device(0)
for batch in (8, 16, 32, 64, 128):
    for n in (511, 512):
        A = torch.randn((batch, n, n), device="cuda", dtype=torch.float32)
        B = torch.randn((batch, n, 1), device="cuda", dtype=torch.float32)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        torch.linalg.solve_ex(A, B, check_errors=False)
        end.record()
        torch.cuda.synchronize()
        print(batch, n, start.elapsed_time(end), "ms")

Local results

Environment:

PyTorch 2.13.0a0+git37bc32b
Commit 37bc32b8736cca7afb41d4794e32ed65bbc83521
CUDA runtime 13.2
GPU: NVIDIA RTX 6000 Ada Generation
Build option relevant to repro: USE_MAGMA=0

torch.linalg.solve_ex(A, B, check_errors=False), B.shape == (batch, M, 1):

batch	M=511	M=512
8	7.1 ms	6.7 ms
16	7.1 ms	12.9 ms
32	7.1 ms	25.4 ms
64	7.5 ms	50.2 ms
128	9.5 ms	100.1 ms

The cliff starts at M=512 because the LU factorization switches from cuBLAS batched to looped cuSOLVER. For larger batch sizes this is the wrong backend choice.
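
The size of the cliff can be quantified with simple arithmetic on the solve_ex timings above. The snippet below is a small sketch that only reuses the numbers from that table; the variable names are illustrative.

```python
# Timings in ms copied from the solve_ex table: batch -> (M=511, M=512).
timings_ms = {
    8: (7.1, 6.7),
    16: (7.1, 12.9),
    32: (7.1, 25.4),
    64: (7.5, 50.2),
    128: (9.5, 100.1),
}

# Ratio of the M=512 time to the M=511 time for each batch size.
slowdown = {batch: t512 / t511 for batch, (t511, t512) in timings_ms.items()}
# Time per matrix at M=512, where the looped cuSOLVER path is used.
per_matrix_512_ms = {batch: t512 / batch for batch, (_, t512) in timings_ms.items()}

for batch in timings_ms:
    print(f"batch={batch:3d}  slowdown={slowdown[batch]:5.2f}x  "
          f"M=512 per matrix={per_matrix_512_ms[batch]:.2f} ms")
```

On these numbers the M=512 time per matrix stays at about 0.8 ms for every batch size, which is what a per-matrix loop would produce, while the M=511 time barely moves. The slowdown therefore grows roughly linearly with batch size and exceeds 10x at batch 128.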

A standalone CUDA float32 GETRF sweep on the same machine produced these approximate crossover regions:

batch	best near M=512	observed crossover
1	cuSOLVER	cuBLAS <=64, cuSOLVER >=80
2	cuSOLVER	cuBLAS <=160, cuSOLVER >=176
4	cuSOLVER	cuBLAS <=320, cuSOLVER >=352
6	cuSOLVER	cuBLAS <=416, cuSOLVER >=448
8	roughly tied/noisy	around 504-513
12	cuBLAS	cuBLAS <=608, cuSOLVER >=640
16	cuBLAS	cuBLAS <=704, cuSOLVER >=736
24	cuBLAS	cuBLAS <=832, cuSOLVER >=896
32	cuBLAS	cuBLAS <=960, cuSOLVER >=1024
48	cuBLAS	cuBLAS <=1088, cuSOLVER >=1152
64	cuBLAS	cuBLAS <=1216, cuSOLVER >=1280
96	cuBLAS	cuBLAS <=1536, cuSOLVER >=1664
128	cuBLAS	cuBLAS <=1664, cuSOLVER >=1792

The exact table above should not be blindly hard-coded from one Ada GPU and one dtype, but it shows that the current scalar cutoff is not robust. A simple batch-aware heuristic, or a small benchmark-derived table, would avoid large regressions for common batched sizes.
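
For illustration only, a benchmark-derived table could be consulted along the following lines. This is a sketch, not a proposed implementation: the thresholds are copied from the single-GPU float32 sweep above, `pick_lu_backend` and `cublas_max_m` are hypothetical names, and the existing rules for non-square inputs and `batch_size == 1` are kept unchanged as an assumption.

```python
from bisect import bisect_right

# (batch size, largest M at which cuBLAS was still observed to win).
# Values come from the float32 sweep on one RTX 6000 Ada and are NOT general.
# Batch 8 is omitted because the sweep reports it as tied/noisy around 504-513.
OBSERVED_CUBLAS_MAX_M = [
    (2, 160), (4, 320), (6, 416), (12, 608), (16, 704), (24, 832),
    (32, 960), (48, 1088), (64, 1216), (96, 1536), (128, 1664),
]
_BATCHES = [batch for batch, _ in OBSERVED_CUBLAS_MAX_M]


def cublas_max_m(batch_size: int) -> int:
    """Return the threshold of the largest tabulated batch <= batch_size."""
    idx = bisect_right(_BATCHES, batch_size) - 1
    return OBSERVED_CUBLAS_MAX_M[idx][1]


def pick_lu_backend(m: int, n: int, batch_size: int) -> str:
    """Batch-aware variant of the current rule; returns a label only."""
    # Assumption: keep today's behavior for non-square and single-matrix inputs.
    if m != n or batch_size <= 1:
        return "looped cuSOLVER getrf"
    if m <= cublas_max_m(batch_size):
        return "cuBLAS getrfBatched"
    return "looped cuSOLVER getrf"


print(pick_lu_backend(512, 512, 128))
print(pick_lu_backend(512, 512, 4))
```

Rounding down to the nearest tabulated batch is deliberately conservative: because the observed thresholds grow with batch size, an untabulated batch size gets a threshold no larger than the sweep supports.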

Expected behavior

The cuBLAS/cuSOLVER LU heuristic should account for batch size when choosing between cuBLAS getrfBatched and looped cuSOLVER getrf for batched square matrices.

Before changing the heuristic, we should benchmark at least:

dtypes: float32, float64, complex64, complex128
batch sizes: small and large, e.g. 1, 2, 4, 8, 16, 32, 64, 128
matrix sizes around the crossover bands, especially 511, 512, 513
GPU generations: at least one Ampere/Hopper/Blackwell if available, plus the Ada result above
CUDA versions currently supported in CI/binaries, not only CUDA 13.2
lu_factor_ex directly and a representative caller such as solve_ex
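
The sweep described in the list above can be enumerated mechanically, which helps gauge how many measurements each GPU and CUDA version would need. The following sketch only builds the case list; the names are illustrative, and GPU generation and CUDA version are assumed to be varied outside the script by running it on each machine.

```python
from itertools import product

DTYPES = ("float32", "float64", "complex64", "complex128")
BATCH_SIZES = (1, 2, 4, 8, 16, 32, 64, 128)
MATRIX_SIZES = (511, 512, 513)  # extend with sizes around other crossover bands
FUNCTIONS = ("lu_factor_ex", "solve_ex")


def benchmark_cases():
    """Yield one dict per (dtype, batch, M, function) combination."""
    for dtype, batch, m, fn in product(DTYPES, BATCH_SIZES, MATRIX_SIZES, FUNCTIONS):
        yield {"dtype": dtype, "batch": batch, "m": m, "function": fn}


cases = list(benchmark_cases())
print(len(cases), "cases per GPU / CUDA version")
```

With only the three sizes around 512, this already gives 4 x 8 x 3 x 2 = 192 cases per machine, so any additional matrix sizes are best limited to the crossover bands.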

The looped cuSOLVER path also allocates GETRF workspace once per matrix through the PyTorch wrapper, which is tracked separately in:

https://github.com/pytorch/pytorch/issues/181997

That allocator issue is independent but makes the hard M >= 512 switch more visible.

cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @csarofeen @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

extent analysis

TL;DR

The CUDA LU factorization heuristic should be modified to account for batch size when choosing between cuBLAS and cuSOLVER for batched square matrices.

Guidance

Identify the current cutoff value (M >= 512) and understand its impact on performance for different batch sizes.
Develop a batch-aware heuristic or a small benchmark-derived table to determine the optimal cutoff value based on batch size, dtype, CUDA version, and GPU generation.
Benchmark the performance of cuBLAS and cuSOLVER for various batch sizes, dtypes, and matrix sizes to inform the heuristic.
Consider implementing a dynamic cutoff value that adapts to the specific use case and hardware configuration.

Example

No code snippet is provided as the issue is related to the heuristic used in the CUDA LU factorization, and the solution requires a more comprehensive approach.

Notes

The exact table provided in the issue should not be blindly hard-coded, as it is specific to one Ada GPU and one dtype. A more robust solution would involve benchmarking and testing on various hardware configurations and use cases.

Recommendation

Replace the hard M >= 512 cutoff with a batch-aware heuristic or a small benchmark-derived table, validated across the dtypes, GPU generations, and CUDA versions listed under Expected behavior. This avoids large regressions for common batched sizes and improves the overall performance of the CUDA LU factorization.
