pytorch - ✅(Solved) Fix CUDA linalg: looped cuSOLVER LU allocates GETRF workspace per batch item [1 pull request, 1 participant]

GitHub stats
pytorch/pytorch#181997 · Fetched 2026-05-01 05:33:03
Comments: 0 · Participants: 1 · Timeline: 37 · Reactions: 1
Timeline (top): mentioned ×12, subscribed ×12, labeled ×7, cross-referenced ×3

Root Cause

The allocator behavior is especially visible because the no-MAGMA / cuSOLVER-preferred LU heuristic currently switches batched square matrices from cuBLAS getrfBatched to looped cuSOLVER at M >= 512.

Fix / Workaround

In the looped cuSOLVER path, a batch of size B performs B allocate/free cycles for the cuSOLVER workspace. With CUDA caching enabled these usually hit the cache after warmup, but cache misses or PYTORCH_NO_CUDA_MEMORY_CACHING=1 turn them into real cudaMalloc / cudaFree calls. The overhead becomes visible at the dispatch boundary where PyTorch switches from cuBLAS batched LU to looped cuSOLVER LU. The fix (PR #181998) queries the workspace size once, allocates one buffer, and reuses it across the per-matrix GETRF loop.

case                      | dispatch path         | observed CUDA allocation calls
A.shape == (16, 511, 511) | cuBLAS getrfBatched   | 7 cudaMalloc, 5 cudaFree
A.shape == (16, 512, 512) | looped cuSOLVER getrf | 20 cudaMalloc, 20 cudaFree

PR fix notes

PR #181998: CUDA linalg: hoist cuSOLVER GETRF workspace allocation out of lu_factor_looped_cusolver batch loop

Description (problem / solution / changelog)

Summary

lu_factor_looped_cusolver used to call at::cuda::solver::getrf once per batch item, and the getrf wrapper queried cuSOLVER workspace size and allocated workspace internally on every call. For a batch of size B, that produced B workspace queries and B allocator round-trips.

This PR makes the workspace explicit for the internal GETRF wrapper:

  • Adds at::cuda::solver::getrf_bufferSize<T> as a thin wrapper over cusolverDn*getrf_bufferSize.
  • Updates at::cuda::solver::getrf<T> to take a caller-provided workspace pointer.
  • Updates lu_factor_looped_cusolver to query lwork once, allocate one workspace buffer, and reuse it across the per-matrix GETRF loop.

The workspace size is fixed for a given (dtype, m, n, lda) tuple, so reusing it across sequential GETRF calls on the same stream is safe.
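
The effect of hoisting can be illustrated with a small pure-Python model. This is a minimal sketch, not the PyTorch C++ code: CountingAllocator, query_lwork, factor_per_item, and factor_hoisted are hypothetical stand-ins for the device allocator, the buffer-size query, and the loop in lu_factor_looped_cusolver, and the workspace size formula is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class CountingAllocator:
    """Hypothetical stand-in for the CUDA allocator; it only counts calls."""
    allocs: int = 0
    frees: int = 0

    def allocate(self, nbytes: int) -> bytearray:
        self.allocs += 1
        return bytearray(nbytes)

    def free(self, buf: bytearray) -> None:
        self.frees += 1


def query_lwork(m: int, n: int, itemsize: int) -> int:
    # Placeholder size; the real value would come from the cuSOLVER
    # buffer-size query for a fixed (dtype, m, n, lda).
    return max(m, n) * itemsize


def factor_per_item(alloc: CountingAllocator, batch: int, m: int, n: int,
                    itemsize: int = 4) -> None:
    """Old pattern: query and allocate the workspace inside the batch loop."""
    for _ in range(batch):
        work = alloc.allocate(query_lwork(m, n, itemsize))
        # ... GETRF on one matrix would run here, using `work` ...
        alloc.free(work)


def factor_hoisted(alloc: CountingAllocator, batch: int, m: int, n: int,
                   itemsize: int = 4) -> None:
    """New pattern: query once, allocate once, reuse across the loop."""
    work = alloc.allocate(query_lwork(m, n, itemsize))
    for _ in range(batch):
        pass  # ... GETRF on one matrix would run here, reusing `work` ...
    alloc.free(work)


old, new = CountingAllocator(), CountingAllocator()
factor_per_item(old, batch=16, m=512, n=512)
factor_hoisted(new, batch=16, m=512, n=512)
print(old.allocs, new.allocs)  # 16 1
```

In this model the per-item pattern scales with the batch size while the hoisted pattern does not. Real measurements also include allocations unrelated to the workspace, so absolute counts will differ.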

Validation

Local environment:

  • PyTorch source build with USE_MAGMA=0
  • CUDA 13.2
  • NVIDIA RTX 6000 Ada Generation

Commands/checks run locally:

  • USE_MAGMA=0 MAX_JOBS=8 bash ~/dev/scripts-pytorch/run_build.sh
  • CUDA smoke test for torch.linalg.lu_factor_ex and torch.linalg.solve_ex
  • git diff --check

Allocator validation with PyTorch CUDA caching disabled:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 LD_PRELOAD=libcuda_alloc_counter.so, torch.backends.cuda.preferred_linalg_library("cusolver"), float32 torch.linalg.lu_factor_ex.

shape          | local pre-patch baseline     | after this PR
[16, 512, 512] | 20 cudaMalloc / 20 cudaFree  | 5 cudaMalloc / 5 cudaFree
[64, 512, 512] | not rerun                    | 5 cudaMalloc / 5 cudaFree

This confirms the looped cuSOLVER GETRF path is now O(1) in allocator calls with respect to batch size.

BC

No BC break. The changed APIs are internal at::cuda::solver helpers, not public Python or C++ APIs.

Fixes #181997.

cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

Changed files

  • aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp (modified, +9/-0)
  • aten/src/ATen/native/cuda/linalg/CUDASolver.cpp (modified, +36/-26)
  • aten/src/ATen/native/cuda/linalg/CUDASolver.h (modified, +19/-1)

Code Example

import torch

torch.cuda.set_device(0)
A = torch.randn((16, 512, 512), device="cuda")
torch.cuda.synchronize()
LU, pivots, info = torch.linalg.lu_factor_ex(A, check_errors=False)
torch.cuda.synchronize()
print(LU.shape, pivots.shape, info[:3].tolist())

🐛 Describe the bug

In the CUDA linalg.lu_factor_ex / linalg.solve_ex path, the looped cuSOLVER LU backend allocates cuSOLVER GETRF workspace once per matrix in a batch.

Current flow: lu_factor_looped_cusolver calls at::cuda::solver::getrf once per batch item, and the getrf wrapper queries the cuSOLVER workspace size and allocates the workspace internally on every call.

That means a batch of size B performs B allocate/free cycles for the cuSOLVER workspace. With CUDA caching enabled these usually hit the cache after warmup, but cache misses or PYTORCH_NO_CUDA_MEMORY_CACHING=1 turn them into real cudaMalloc / cudaFree calls. The overhead also becomes visible at the dispatch boundary where PyTorch switches from cuBLAS batched LU to looped cuSOLVER LU.

This does not appear to be a cuSOLVER-internal allocation requirement: cusolverDn<t>getrf takes a caller-provided workspace pointer, and a standalone cuSOLVER benchmark that reused one workspace across the loop did not show hidden CUDA allocation calls.

Repro

Build without MAGMA, or otherwise force the cuSOLVER/cuBLAS CUDA linalg path, and run:

import torch

torch.cuda.set_device(0)
A = torch.randn((16, 512, 512), device="cuda")
torch.cuda.synchronize()
LU, pivots, info = torch.linalg.lu_factor_ex(A, check_errors=False)
torch.cuda.synchronize()
print(LU.shape, pivots.shape, info[:3].tolist())

To make allocator calls explicit, run with PYTORCH_NO_CUDA_MEMORY_CACHING=1 and trace CUDA API calls, or use an LD_PRELOAD interposer that counts cudaMalloc / cudaFree.
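
One way to set this up is to launch the repro in a child process with the relevant environment variables. The following is a minimal sketch under stated assumptions: build_repro_env and build_repro_command are hypothetical helpers, repro.py is assumed to contain the script above, and the interposer library path is a placeholder for a counter you would have to build yourself.

```python
import os
import sys


def build_repro_env(preload_lib=None, base_env=None):
    """Return an environment for running the repro with CUDA caching disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
    if preload_lib is not None:
        # Put the interposer first so it is loaded ahead of other preloads.
        existing = env.get("LD_PRELOAD", "")
        env["LD_PRELOAD"] = f"{preload_lib} {existing}".strip()
    return env


def build_repro_command(script="repro.py"):
    """Command line that runs the repro with the current interpreter."""
    return [sys.executable, script]


if __name__ == "__main__":
    # Hypothetical interposer path; replace it with your own counter library.
    env = build_repro_env(preload_lib="./libcuda_alloc_counter.so")
    print(build_repro_command(), env["LD_PRELOAD"])
    # To actually run it:
    #   import subprocess
    #   subprocess.run(build_repro_command(), env=env, check=True)
```

Running the repro in a separate process keeps the environment variables from leaking into the parent session, and both variables need to be set before the CUDA libraries are loaded.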

Observed locally on main commit 37bc32b8736cca7afb41d4794e32ed65bbc83521, CUDA 13.2, RTX 6000 Ada, USE_MAGMA=0:

case                      | dispatch path         | observed CUDA allocation calls
A.shape == (16, 511, 511) | cuBLAS getrfBatched   | 7 cudaMalloc, 5 cudaFree
A.shape == (16, 512, 512) | looped cuSOLVER getrf | 20 cudaMalloc, 20 cudaFree

The 512 case includes one cuSOLVER workspace allocation per batch item from the PyTorch wrapper.

Expected behavior

lu_factor_looped_cusolver should allocate GETRF workspace once per lu_factor call and reuse it across all matrices in the batch.

One possible implementation direction:

  • expose a cuSOLVER GETRF buffer-size helper, or add a getrf wrapper overload that accepts caller-owned workspace;
  • query lwork once for the fixed (m, n, dtype) call;
  • allocate the workspace once in lu_factor_looped_cusolver;
  • pass the same workspace to each per-matrix cusolverDn<t>getrf call.

The legacy cuSOLVER buffer-size API takes the matrix pointer, so the implementation can conservatively query using the first matrix pointer. For a fixed dtype and (m, n, lda), the required workspace should not vary across batch items.

Additional context

The allocator behavior is especially visible because the no-MAGMA / cuSOLVER-preferred LU heuristic currently switches batched square matrices from cuBLAS getrfBatched to looped cuSOLVER at M >= 512:

https://github.com/pytorch/pytorch/blob/37bc32b8736cca7afb41d4794e32ed65bbc83521/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp#L877-L882

That threshold is a separate performance heuristic issue, but hoisting/reusing cuSOLVER workspace is independently useful and should reduce allocator traffic for all batched looped-cuSOLVER LU calls.
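
The boundary described above can be modeled in a few lines. This is a simplified sketch that assumes only what the issue text states (batched square matrices switch at M >= 512). lu_dispatch_path is a hypothetical name, and the real heuristic at the linked lines may take other conditions into account.

```python
CUSOLVER_LOOP_THRESHOLD = 512  # from the issue text: switch at M >= 512


def lu_dispatch_path(batch: int, m: int, n: int) -> str:
    """Simplified model of the no-MAGMA LU dispatch for batched square inputs."""
    if batch < 2 or m != n:
        # Other input shapes are outside what the issue describes.
        raise ValueError("this sketch only models batched square inputs")
    if m >= CUSOLVER_LOOP_THRESHOLD:
        return "looped cuSOLVER getrf"
    return "cuBLAS getrfBatched"


for shape in [(16, 511, 511), (16, 512, 512)]:
    print(shape, "->", lu_dispatch_path(*shape))
```

The two shapes printed here are the ones from the table of observed allocation calls, which is why the jump in allocation counts appears exactly between 511 and 512.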

Versions

  • PyTorch: 2.13.0a0+git37bc32b
  • Git commit: 37bc32b8736cca7afb41d4794e32ed65bbc83521
  • CUDA runtime: 13.2
  • GPU: NVIDIA RTX 6000 Ada Generation
  • Build option relevant to repro: USE_MAGMA=0

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

extent analysis

TL;DR

Allocate the cuSOLVER GETRF workspace once per lu_factor call and reuse it across all matrices in the batch to reduce allocator traffic.

Guidance

  • Expose a cuSOLVER GETRF buffer-size helper or add a getrf wrapper overload that accepts caller-owned workspace.
  • Query lwork once for the fixed (m, n, dtype) call and allocate the workspace once in lu_factor_looped_cusolver.
  • Pass the same workspace to each per-matrix cusolverDn<t>getrf call to avoid repeated allocations.
  • Consider using the first matrix pointer to conservatively query the required workspace size, as the legacy cuSOLVER buffer-size API takes the matrix pointer.

Example

No code snippet is provided as the issue requires changes to the PyTorch library.

Notes

The proposed solution assumes that the cuSOLVER GETRF workspace size does not vary across batch items for a fixed (m, n, dtype). If this assumption is incorrect, additional modifications may be necessary.

Recommendation

Apply the proposed workaround to allocate and reuse the cuSOLVER GETRF workspace, as it should reduce allocator traffic and improve performance.
