pytorch - ✅(Solved) Fix CUDA linalg: looped cuSOLVER LU allocates GETRF workspace per batch item [1 pull requests, 1 participants]

pytorch · 2026-04-30 08:13:59

GitHub stats

pytorch/pytorch#181997 • Fetched 2026-05-01 05:33:03

Timeline (top): mentioned ×12, subscribed ×12, labeled ×7, cross-referenced ×3

Root Cause

The allocator behavior is especially visible because the no-MAGMA / cuSOLVER-preferred LU heuristic currently switches batched square matrices from cuBLAS getrfBatched to looped cuSOLVER at M >= 512 (the heuristic source is linked under "Additional context").

Fix Action

Fix / Workaround

Because the getrf wrapper queries and allocates its workspace on every call, a batch of size B performs B allocator allocate/free cycles for cuSOLVER workspace. With CUDA caching enabled, these usually hit the cache after warmup, but cache misses or PYTORCH_NO_CUDA_MEMORY_CACHING=1 turn them into real cudaMalloc / cudaFree calls. The overhead is also visible at the dispatch boundary where PyTorch switches from cuBLAS batched LU to looped cuSOLVER LU.

| case | dispatch path | observed CUDA allocation calls |
| --- | --- | --- |
| `A.shape == (16, 511, 511)` | cuBLAS `getrfBatched` | 7 `cudaMalloc`, 5 `cudaFree` |
| `A.shape == (16, 512, 512)` | looped cuSOLVER `getrf` | 20 `cudaMalloc`, 20 `cudaFree` |
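The difference between the cached and uncached cases can be illustrated with a toy model. The sketch below is not PyTorch's CUDACachingAllocator; `ToyAllocator`, `run_batch`, and the `lwork` value are hypothetical names and numbers chosen only to show why B allocate/free cycles of one size cost a single real allocation when blocks are cached, and B real allocations when caching is disabled.

```python
from collections import defaultdict


class ToyAllocator:
    """Toy stand-in for a caching device allocator (hypothetical, not PyTorch's)."""

    def __init__(self, caching: bool):
        self.caching = caching
        self.free_blocks = defaultdict(int)  # size -> number of cached free blocks
        self.real_mallocs = 0  # models cudaMalloc calls
        self.real_frees = 0    # models cudaFree calls

    def allocate(self, size: int) -> int:
        if self.caching and self.free_blocks[size] > 0:
            self.free_blocks[size] -= 1  # cache hit: no driver call
        else:
            self.real_mallocs += 1       # cache miss, or caching disabled
        return size

    def free(self, size: int) -> None:
        if self.caching:
            self.free_blocks[size] += 1  # keep the block for later reuse
        else:
            self.real_frees += 1         # return the block immediately


def run_batch(alloc: ToyAllocator, batch: int, lwork: int) -> None:
    # One allocate/free cycle per matrix, as in the per-call workspace pattern.
    for _ in range(batch):
        block = alloc.allocate(lwork)
        alloc.free(block)


cached = ToyAllocator(caching=True)
uncached = ToyAllocator(caching=False)
run_batch(cached, batch=16, lwork=1024)
run_batch(uncached, batch=16, lwork=1024)
print(cached.real_mallocs, cached.real_frees)      # 1 0
print(uncached.real_mallocs, uncached.real_frees)  # 16 16
```

In this model the cached run still pays for 16 allocator round-trips; only the driver-level calls are avoided, which matches the description of the overhead being hidden rather than removed.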

PR fix notes

PR #181998: CUDA linalg: hoist cuSOLVER GETRF workspace allocation out of lu_factor_looped_cusolver batch loop

Repository: pytorch/pytorch
Author: Copilot
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/181998

Description (problem / solution / changelog)

Summary

lu_factor_looped_cusolver used to call at::cuda::solver::getrf once per batch item, and the getrf wrapper queried cuSOLVER workspace size and allocated workspace internally on every call. For a batch of size B, that produced B workspace queries and B allocator round-trips.

This PR makes the workspace explicit for the internal GETRF wrapper:

- Adds at::cuda::solver::getrf_bufferSize<T> as a thin wrapper over cusolverDn*getrf_bufferSize.
- Updates at::cuda::solver::getrf<T> to take a caller-provided workspace pointer.
- Updates lu_factor_looped_cusolver to query lwork once, allocate one workspace buffer, and reuse it across the per-matrix GETRF loop.

The workspace size is fixed for a given (dtype, m, n, lda) tuple, so reusing it across sequential GETRF calls on the same stream is safe.
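The shape of the change can be sketched in plain Python, without cuSOLVER. Everything here is an illustrative assumption: `CountingAllocator`, `getrf_buffer_size`, `fake_getrf`, and the size formula are hypothetical stand-ins, not the at::cuda::solver API. The only property carried over from the PR description is that the workspace size depends on (dtype, m, n, lda) and not on the batch item.

```python
class CountingAllocator:
    """Counts allocation requests; a stand-in for the real allocator."""

    def __init__(self):
        self.allocations = 0

    def allocate(self, nbytes: int) -> bytearray:
        self.allocations += 1
        return bytearray(nbytes)


def getrf_buffer_size(m: int, n: int, lda: int, itemsize: int) -> int:
    # Hypothetical formula. The real lwork comes from cuSOLVER; what matters
    # here is that it depends only on (m, n, lda, dtype).
    return lda * n * itemsize


def fake_getrf(workspace: bytearray) -> None:
    # Placeholder for one per-matrix factorization that scribbles in workspace.
    workspace[0] = 1


def factor_per_call(alloc, batch, m, n, lda, itemsize):
    # Old pattern: query and allocate inside every per-matrix call.
    for _ in range(batch):
        lwork = getrf_buffer_size(m, n, lda, itemsize)
        fake_getrf(alloc.allocate(lwork))


def factor_hoisted(alloc, batch, m, n, lda, itemsize):
    # New pattern: query once, allocate once, reuse across the loop.
    lwork = getrf_buffer_size(m, n, lda, itemsize)
    workspace = alloc.allocate(lwork)
    for _ in range(batch):
        fake_getrf(workspace)


before, after = CountingAllocator(), CountingAllocator()
factor_per_call(before, batch=16, m=512, n=512, lda=512, itemsize=4)
factor_hoisted(after, batch=16, m=512, n=512, lda=512, itemsize=4)
print(before.allocations, after.allocations)  # 16 1
```

Reuse is safe in this sketch because the calls run one after another; the PR makes the corresponding argument for sequential GETRF calls on the same stream.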

Validation

Local environment:

- PyTorch source build with USE_MAGMA=0
- CUDA 13.2
- NVIDIA RTX 6000 Ada Generation

Commands/checks run locally:

- USE_MAGMA=0 MAX_JOBS=8 bash ~/dev/scripts-pytorch/run_build.sh
- CUDA smoke test for torch.linalg.lu_factor_ex and torch.linalg.solve_ex
- git diff --check

Allocator validation with PyTorch CUDA caching disabled:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 LD_PRELOAD=libcuda_alloc_counter.so, torch.backends.cuda.preferred_linalg_library("cusolver"), float32 torch.linalg.lu_factor_ex.

| shape | local pre-patch baseline | after this PR |
| --- | --- | --- |
| `[16, 512, 512]` | 20 `cudaMalloc` / 20 `cudaFree` | 5 `cudaMalloc` / 5 `cudaFree` |
| `[64, 512, 512]` | not rerun | 5 `cudaMalloc` / 5 `cudaFree` |

This confirms the looped cuSOLVER GETRF path is now O(1) in allocator calls with respect to batch size.
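As a quick arithmetic check, the reported counts fit a simple model: a fixed number of allocations unrelated to the workspace, plus B workspace allocations before the patch and one after. The fixed term (4) is inferred here from the numbers above and is an assumption, not something the PR states; the function names are hypothetical.

```python
# cudaMalloc counts reported in the validation table, keyed by (state, batch).
observed = {
    ("pre", 16): 20,
    ("post", 16): 5,
    ("post", 64): 5,
}


def mallocs_pre(batch: int, fixed: int) -> int:
    # Assumed model: fixed overhead plus one workspace allocation per matrix.
    return fixed + batch


def mallocs_post(batch: int, fixed: int) -> int:
    # Assumed model: fixed overhead plus one shared workspace allocation.
    return fixed + 1


# Infer the fixed overhead from the single pre-patch measurement.
fixed = observed[("pre", 16)] - 16
print(fixed)  # 4

# The model reproduces every measured value.
for (state, batch), count in observed.items():
    model = mallocs_pre if state == "pre" else mallocs_post
    assert model(batch, fixed) == count
```

The pre-patch count for batch size 64 was not rerun, so the model is checked only against the three measured values.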

BC

No BC break. The changed APIs are internal at::cuda::solver helpers, not public Python or C++ APIs.

Fixes #181997.

cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

Changed files

- aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp (modified, +9/-0)
- aten/src/ATen/native/cuda/linalg/CUDASolver.cpp (modified, +36/-26)
- aten/src/ATen/native/cuda/linalg/CUDASolver.h (modified, +19/-1)

Code Example

import torch

torch.cuda.set_device(0)
A = torch.randn((16, 512, 512), device="cuda")
torch.cuda.synchronize()
LU, pivots, info = torch.linalg.lu_factor_ex(A, check_errors=False)
torch.cuda.synchronize()
print(LU.shape, pivots.shape, info[:3].tolist())


🐛 Describe the bug

In the CUDA linalg.lu_factor_ex / linalg.solve_ex path, the looped cuSOLVER LU backend allocates cuSOLVER GETRF workspace once per matrix in a batch.

Current flow:

- lu_factor_looped_cusolver loops over batch items and calls at::cuda::solver::getrf(...) for each matrix: https://github.com/pytorch/pytorch/blob/37bc32b8736cca7afb41d4794e32ed65bbc83521/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp#L1647-L1675
- The getrf wrapper queries lwork and allocates workspace via CUDACachingAllocator::allocate(...) inside each call: https://github.com/pytorch/pytorch/blob/37bc32b8736cca7afb41d4794e32ed65bbc83521/aten/src/ATen/native/cuda/linalg/CUDASolver.cpp#L23-L31

This does not appear to be a cuSOLVER-internal allocation requirement: cusolverDn<t>getrf takes a caller-provided workspace pointer, and a standalone cuSOLVER benchmark that reused one workspace across the loop did not show hidden CUDA allocation calls.

Repro

Build without MAGMA, or otherwise force the cuSOLVER/cuBLAS CUDA linalg path, and run:

import torch

torch.cuda.set_device(0)
A = torch.randn((16, 512, 512), device="cuda")
torch.cuda.synchronize()
LU, pivots, info = torch.linalg.lu_factor_ex(A, check_errors=False)
torch.cuda.synchronize()
print(LU.shape, pivots.shape, info[:3].tolist())

To make allocator calls explicit, run with PYTORCH_NO_CUDA_MEMORY_CACHING=1 and trace CUDA API calls, or use an LD_PRELOAD interposer that counts cudaMalloc / cudaFree.

Observed locally on main commit 37bc32b8736cca7afb41d4794e32ed65bbc83521, CUDA 13.2, RTX 6000 Ada, USE_MAGMA=0:

| case | dispatch path | observed CUDA allocation calls |
| --- | --- | --- |
| `A.shape == (16, 511, 511)` | cuBLAS `getrfBatched` | 7 `cudaMalloc`, 5 `cudaFree` |
| `A.shape == (16, 512, 512)` | looped cuSOLVER `getrf` | 20 `cudaMalloc`, 20 `cudaFree` |

The 512 case includes one cuSOLVER workspace allocation per batch item from the PyTorch wrapper.

Expected behavior

lu_factor_looped_cusolver should allocate GETRF workspace once per lu_factor call and reuse it across all matrices in the batch.

One possible implementation direction:

- expose a cuSOLVER GETRF buffer-size helper, or add a getrf wrapper overload that accepts caller-owned workspace;
- query lwork once for the fixed (m, n, dtype) call;
- allocate the workspace once in lu_factor_looped_cusolver;
- pass the same workspace to each per-matrix cusolverDn<t>getrf call.

The legacy cuSOLVER buffer-size API takes the matrix pointer, so the implementation can conservatively query using the first matrix pointer. For a fixed dtype and (m, n, lda), the required workspace should not vary across batch items.

Additional context

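The allocator behavior is especially visible because the no-MAGMA / cuSOLVER-preferred LU heuristic currently switches batched square matrices from cuBLAS getrfBatched to looped cuSOLVER at M >= 512:

https://github.com/pytorch/pytorch/blob/37bc32b8736cca7afb41d4794e32ed65bbc83521/aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp#L877-L882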

That threshold is a separate performance heuristic issue, but hoisting/reusing cuSOLVER workspace is independently useful and should reduce allocator traffic for all batched looped-cuSOLVER LU calls.

Versions

PyTorch: 2.13.0a0+git37bc32b
Git commit: 37bc32b8736cca7afb41d4794e32ed65bbc83521
CUDA runtime: 13.2
GPU: NVIDIA RTX 6000 Ada Generation
Build option relevant to repro: USE_MAGMA=0

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

extent analysis

TL;DR

Allocate the cuSOLVER GETRF workspace once per lu_factor call and reuse it across all matrices in the batch to reduce allocator traffic.

Guidance

- Expose a cuSOLVER GETRF buffer-size helper or add a getrf wrapper overload that accepts caller-owned workspace.
- Query lwork once for the fixed (m, n, dtype) call and allocate the workspace once in lu_factor_looped_cusolver.
- Pass the same workspace to each per-matrix cusolverDn<t>getrf call to avoid repeated allocations.
- Consider using the first matrix pointer to conservatively query the required workspace size, as the legacy cuSOLVER buffer-size API takes the matrix pointer.

Example

No code snippet is provided as the issue requires changes to the PyTorch library.

Notes

The proposed solution assumes that the cuSOLVER GETRF workspace size does not vary across batch items for a fixed (m, n, dtype). If this assumption is incorrect, additional modifications may be necessary.

Recommendation

Apply the proposed workaround to allocate and reuse the cuSOLVER GETRF workspace, as it should reduce allocator traffic and improve performance.
