pytorch - Fix CUDA linalg: replace hard LU M >= 512 cuSOLVER cutoff with batch-aware heuristic

pytorch/pytorch#181999 (fetched 2026-05-01 05:33:01)
🐛 Describe the bug

The CUDA LU factorization heuristic for the cuBLAS/cuSOLVER path uses a hard matrix-size cutoff for batched square inputs:

https://github.com/pytorch/pytorch/blob/37bc32b8736cca7afb41d4794e32ed65bbc83521/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp#L877-L882

if (m != n || (batch_size == 1 || m >= 512)) {
  lu_factor_looped_cusolver(input, pivots, infos, compute_pivots);
} else {
  lu_factor_batched_cublas(input, pivots, infos, compute_pivots);
}

For batched square M x M inputs this means:

  • M < 512: cuBLAS getrfBatched
  • M >= 512: looped cuSOLVER getrf
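To make the current dispatch easy to reason about, here is a minimal Python mirror of the C++ condition quoted above. The function name and the returned labels are hypothetical and only for illustration; the real selection happens in C++ inside ATen.

```python
def current_lu_backend(m: int, n: int, batch_size: int) -> str:
    """Mirror of the quoted C++ condition. Returns a label, not a real kernel."""
    # Non-square inputs, single matrices, and M >= 512 all go to looped cuSOLVER.
    if m != n or batch_size == 1 or m >= 512:
        return "looped cuSOLVER getrf"
    # Everything else (batched, square, M < 512) goes to cuBLAS batched.
    return "cuBLAS getrfBatched"


for m in (511, 512):
    # batch_size does not influence the result once it is greater than 1
    print(m, current_lu_backend(m, m, batch_size=128))
```

Note that batch_size only matters through the batch_size == 1 check, so a batch of 2 and a batch of 128 are treated identically.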

This cutoff is too coarse. It works for small batches, but for larger batches cuBLAS getrfBatched remains substantially faster well beyond M=512. The result is a visible performance cliff in torch.linalg.lu_factor_ex and in callers such as torch.linalg.solve_ex.

This is not a cuBLAS API limit. NVIDIA cuBLAS docs have long documented that cublas<t>getrfBatched supports arbitrary dimension, while describing it as intended for small matrices. The right cutoff is a performance heuristic, and it depends strongly on batch size. It may also depend on dtype, CUDA version, and GPU generation.

Repro

Build without MAGMA, or otherwise exercise the cuBLAS/cuSOLVER CUDA linalg path, then run:

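import torch

torch.cuda.set_device(0)
for batch in (8, 16, 32, 64, 128):
    # 511 takes the cuBLAS batched path, 512 takes the looped cuSOLVER path
    for n in (511, 512):
        A = torch.randn((batch, n, n), device="cuda", dtype=torch.float32)
        B = torch.randn((batch, n, 1), device="cuda", dtype=torch.float32)
        # Make sure input generation has finished before timing starts
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        # Single timed call, no warm-up: the first iteration may include
        # one-time library initialization
        start.record()
        torch.linalg.solve_ex(A, B, check_errors=False)
        end.record()
        # Wait for the end event before reading the elapsed time
        torch.cuda.synchronize()
        print(batch, n, start.elapsed_time(end), "ms")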

Local results

Environment:

  • PyTorch 2.13.0a0+git37bc32b
  • Commit 37bc32b8736cca7afb41d4794e32ed65bbc83521
  • CUDA runtime 13.2
  • GPU: NVIDIA RTX 6000 Ada Generation
  • Build option relevant to repro: USE_MAGMA=0

torch.linalg.solve_ex(A, B, check_errors=False), B.shape == (batch, M, 1):

batch   M=511     M=512
8       7.1 ms    6.7 ms
16      7.1 ms    12.9 ms
32      7.1 ms    25.4 ms
64      7.5 ms    50.2 ms
128     9.5 ms    100.1 ms
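The size of the cliff is easier to see as a ratio. The short script below only does arithmetic on the timings reported above (batch, time at M=511, time at M=512); it does not run anything on a GPU, and the variable and function names are illustrative.

```python
# (batch, ms at M=511, ms at M=512), copied from the reported results
results = [
    (8, 7.1, 6.7),
    (16, 7.1, 12.9),
    (32, 7.1, 25.4),
    (64, 7.5, 50.2),
    (128, 9.5, 100.1),
]


def slowdown(row):
    """Ratio of the M=512 time to the M=511 time for one batch size."""
    _, t511, t512 = row
    return t512 / t511


for row in results:
    batch, _, t512 = row
    print(
        f"batch={batch:>3}  M=512 vs M=511: {slowdown(row):.2f}x  "
        f"per-matrix time at M=512: {t512 / batch:.2f} ms"
    )
```

At batch 8 the two paths are about even, while at batch 128 growing M by one makes the call roughly 10x slower. The per-matrix time at M=512 stays near 0.8 ms for every batch size, which is what a loop over single-matrix factorizations would look like.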

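The cliff starts at M=512 because the LU factorization switches from cuBLAS getrfBatched to looped cuSOLVER getrf. At M=511 the time stays near 7 ms up to batch 64, while at M=512 it grows roughly linearly with batch size, so for larger batches this is the wrong backend choice.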

A standalone CUDA float32 GETRF sweep on the same machine produced these approximate crossover regions:

batch   best near M=512       observed crossover
1       cuSOLVER              cuBLAS <=64, cuSOLVER >=80
2       cuSOLVER              cuBLAS <=160, cuSOLVER >=176
4       cuSOLVER              cuBLAS <=320, cuSOLVER >=352
6       cuSOLVER              cuBLAS <=416, cuSOLVER >=448
8       roughly tied/noisy    around 504-513
12      cuBLAS                cuBLAS <=608, cuSOLVER >=640
16      cuBLAS                cuBLAS <=704, cuSOLVER >=736
24      cuBLAS                cuBLAS <=832, cuSOLVER >=896
32      cuBLAS                cuBLAS <=960, cuSOLVER >=1024
48      cuBLAS                cuBLAS <=1088, cuSOLVER >=1152
64      cuBLAS                cuBLAS <=1216, cuSOLVER >=1280
96      cuBLAS                cuBLAS <=1536, cuSOLVER >=1664
128     cuBLAS                cuBLAS <=1664, cuSOLVER >=1792

The exact table above should not be blindly hard-coded from one Ada GPU and one dtype, but it shows that the current scalar cutoff is not robust. A simple batch-aware heuristic, or a small benchmark-derived table, would avoid large regressions for common batched sizes.
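As a rough illustration of what a benchmark-derived table could look like, here is a minimal Python sketch. It is not a proposed patch: the thresholds are the "cuBLAS <=" values from the single Ada/float32 sweep above, the batch 8 entry uses the lower end of the noisy 504-513 band, and the function name, the labels, and the rounding policy for unmeasured batch sizes are all assumptions.

```python
from bisect import bisect_right

# (batch size, largest M where cuBLAS still won) from one Ada GPU, float32 only.
# Sizes in the unmeasured gap between the two reported bounds fall to cuSOLVER.
_CROSSOVER = [
    (1, 64), (2, 160), (4, 320), (6, 416), (8, 504), (12, 608), (16, 704),
    (24, 832), (32, 960), (48, 1088), (64, 1216), (96, 1536), (128, 1664),
]
_BATCHES = [b for b, _ in _CROSSOVER]


def choose_lu_backend(m: int, n: int, batch_size: int) -> str:
    """Hypothetical batch-aware selector; returns 'cublas' or 'cusolver'."""
    if m != n:
        return "cusolver"  # same as the current code for non-square inputs
    # Round the batch size down to the nearest measured row (conservative:
    # smaller batches have lower cuBLAS thresholds). Batches above 128 reuse
    # the last row.
    idx = bisect_right(_BATCHES, batch_size) - 1
    if idx < 0:
        return "cusolver"
    max_cublas_m = _CROSSOVER[idx][1]
    return "cublas" if m <= max_cublas_m else "cusolver"


for batch in (4, 8, 32, 128):
    print(batch, choose_lu_backend(512, 512, batch))
```

A real heuristic would likely need separate thresholds per dtype and possibly per GPU generation, or a closed-form fit instead of a table.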

Expected behavior

The cuBLAS/cuSOLVER LU heuristic should account for batch size when choosing between cuBLAS getrfBatched and looped cuSOLVER getrf for batched square matrices.

Before changing the heuristic, we should benchmark at least:

  • dtypes: float32, float64, complex64, complex128
  • batch sizes: small and large, e.g. 1, 2, 4, 8, 16, 32, 64, 128
  • matrix sizes around the crossover bands, especially 511, 512, 513
  • GPU generations: at least one Ampere/Hopper/Blackwell if available, plus the Ada result above
  • CUDA versions currently supported in CI/binaries, not only CUDA 13.2
  • lu_factor_ex directly and a representative caller such as solve_ex
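One way to keep the sweep manageable is to enumerate the cases up front. The sketch below only builds the case list for the axes named above; the dictionary keys and the function name are illustrative, and GPU generation and CUDA version are left out because they are properties of the machine rather than loop variables.

```python
from itertools import product

DTYPES = ("float32", "float64", "complex64", "complex128")
BATCHES = (1, 2, 4, 8, 16, 32, 64, 128)
SIZES = (511, 512, 513)  # extend with more sizes around each crossover band
OPS = ("lu_factor_ex", "solve_ex")


def sweep_cases():
    """Yield one dict per benchmark configuration."""
    for dtype, batch, m, op in product(DTYPES, BATCHES, SIZES, OPS):
        yield {"dtype": dtype, "batch": batch, "m": m, "op": op}


cases = list(sweep_cases())
print(len(cases), "cases per GPU/CUDA combination")
```

Each case would then be timed with warm-up iterations and several repeats, once per GPU and CUDA version under test.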

Related

The looped cuSOLVER path also allocates GETRF workspace once per matrix through the PyTorch wrapper, which is tracked separately in:

https://github.com/pytorch/pytorch/issues/181997

That allocator issue is independent but makes the hard M >= 512 switch more visible.

cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @csarofeen @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

extent analysis

TL;DR

The CUDA LU factorization heuristic should be modified to account for batch size when choosing between cuBLAS and cuSOLVER for batched square matrices.

Guidance

  • Identify the current cutoff value (M >= 512) and understand its impact on performance for different batch sizes.
  • Develop a batch-aware heuristic or a small benchmark-derived table to determine the optimal cutoff value based on batch size, dtype, CUDA version, and GPU generation.
  • Benchmark the performance of cuBLAS and cuSOLVER for various batch sizes, dtypes, and matrix sizes to inform the heuristic.
  • Consider making the cutoff a function of batch size (and possibly dtype and GPU generation) instead of a single constant.

Example

The issue does not propose a concrete implementation. Choosing the new heuristic requires benchmark data across dtypes, batch sizes, GPU generations, and CUDA versions.

Notes

The exact table provided in the issue should not be blindly hard-coded, as it is specific to one Ada GPU and one dtype. A more robust solution would involve benchmarking and testing on various hardware configurations and use cases.

Recommendation

Replace the scalar M >= 512 cutoff with a batch-aware heuristic or a small benchmark-derived table. This avoids large regressions for common batched sizes and improves the performance of CUDA LU factorization for large batches.
