pytorch - ✅(Solved) Fix CUDA linalg: use auxiliary streams for independent looped cuSOLVER batch calls [1 pull requests, 1 participants]

pytorch · 2026-04-30 08:56:26

GitHub stats

pytorch/pytorch#182001 • Fetched 2026-05-01 05:32:57

Author

IvanYashchuk

Participants

IvanYashchuk

Timeline (top)

mentioned ×12, subscribed ×12, labeled ×6, cross-referenced ×1

Error Message

Error reporting through info tensors and check_errors behavior must remain unchanged.

Root Cause

Some looped cuSOLVER paths also allocate workspace inside the batch loop. Workspace lifetime and reuse matter for any auxiliary-stream version: concurrent calls can share a workspace only if their lifetimes do not overlap.
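One way to respect this lifetime rule is to give each auxiliary stream its own workspace slot. Calls queued on the same stream run in order and can reuse a slot, while calls on different streams never share one. The sketch below is a plain-Python model, not ATen code; the function names and the round-robin assignment are illustrative assumptions and do not come from the PyTorch source.

```python
from collections import defaultdict


def assign_streams_and_workspaces(batch_size: int, num_streams: int):
    """Hypothetical round-robin plan: item i runs on stream i % num_streams
    and uses the workspace slot owned by that stream."""
    if batch_size < 0 or num_streams < 1:
        raise ValueError("batch_size must be >= 0 and num_streams >= 1")
    plan = []
    for item in range(batch_size):
        stream = item % num_streams
        workspace = stream  # assumption: one workspace slot per stream
        plan.append((item, stream, workspace))
    return plan


def workspaces_are_stream_private(plan) -> bool:
    """True if every workspace slot is only ever used from a single stream."""
    streams_per_workspace = defaultdict(set)
    for _item, stream, workspace in plan:
        streams_per_workspace[workspace].add(stream)
    return all(len(streams) == 1 for streams in streams_per_workspace.values())


plan = assign_streams_and_workspaces(batch_size=8, num_streams=3)
print(plan[:4])
print(workspaces_are_stream_private(plan))
```

With this layout the number of workspaces scales with the number of streams rather than with the batch size.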

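Fix / Workaround

Candidate looped cuSOLVER paths for auxiliary-stream execution, from aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp: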

lu_factor_looped_cusolver / getrf
lu_solve_looped_cusolver / getrs
apply_cholesky_cusolver_potrf_looped / potrf and xpotrf
QR paths: geqrf, orgqr, ormqr
SVD fallback paths: gesvd, gesvdj where implemented as per-batch loops
Hermitian/general eigensolver paths: syevd, xsyevd, xgeev
LDL paths that are currently dispatched as looped cuSOLVER

PR fix notes

PR #181998: CUDA linalg: hoist cuSOLVER GETRF workspace allocation out of lu_factor_looped_cusolver batch loop

Repository: pytorch/pytorch
Author: Copilot
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/181998

Description (problem / solution / changelog)

Summary

lu_factor_looped_cusolver used to call at::cuda::solver::getrf once per batch item, and the getrf wrapper queried cuSOLVER workspace size and allocated workspace internally on every call. For a batch of size B, that produced B workspace queries and B allocator round-trips.

This PR makes the workspace explicit for the internal GETRF wrapper:

Adds at::cuda::solver::getrf_bufferSize<T> as a thin wrapper over cusolverDn*getrf_bufferSize.
Updates at::cuda::solver::getrf<T> to take a caller-provided workspace pointer.
Updates lu_factor_looped_cusolver to query lwork once, allocate one workspace buffer, and reuse it across the per-matrix GETRF loop.

The workspace size is fixed for a given (dtype, m, n, lda) tuple, so reusing it across sequential GETRF calls on the same stream is safe.
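The effect of hoisting the allocation can be illustrated with a small Python model. It is only a sketch: `CountingAllocator`, `factor_one`, and both loop functions are hypothetical stand-ins for the allocator, the per-matrix GETRF call, and the pre-patch and post-patch loop shapes. None of them are PyTorch or cuSOLVER APIs.

```python
class CountingAllocator:
    """Hypothetical stand-in for a device allocator that counts requests."""

    def __init__(self) -> None:
        self.allocations = 0

    def allocate(self, nbytes: int) -> bytearray:
        self.allocations += 1
        return bytearray(nbytes)


def factor_one(matrix_id: int, workspace: bytearray) -> int:
    """Placeholder for one per-matrix GETRF call; no factorization happens here."""
    return matrix_id


def looped_alloc_inside(batch_size: int, lwork: int, alloc: CountingAllocator) -> None:
    # Pre-patch shape: every iteration requests its own workspace.
    for i in range(batch_size):
        workspace = alloc.allocate(lwork)
        factor_one(i, workspace)


def looped_alloc_hoisted(batch_size: int, lwork: int, alloc: CountingAllocator) -> None:
    # Post-patch shape: one buffer, reused by the sequential calls.
    workspace = alloc.allocate(lwork)
    for i in range(batch_size):
        factor_one(i, workspace)


def count_allocations(loop, batch_size: int, lwork: int = 1024) -> int:
    alloc = CountingAllocator()
    loop(batch_size, lwork, alloc)
    return alloc.allocations


for batch in (16, 64):
    inside = count_allocations(looped_alloc_inside, batch)
    hoisted = count_allocations(looped_alloc_hoisted, batch)
    print(f"batch={batch}: inside-loop={inside}, hoisted={hoisted}")
```

In this model the inside-loop count grows with the batch size while the hoisted count stays at one. Reuse is safe here only because the calls are sequential; concurrent calls would each need their own buffer.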

Validation

Local environment:

PyTorch source build with USE_MAGMA=0
CUDA 13.2
NVIDIA RTX 6000 Ada Generation

Commands/checks run locally:

USE_MAGMA=0 MAX_JOBS=8 bash ~/dev/scripts-pytorch/run_build.sh
CUDA smoke test for torch.linalg.lu_factor_ex and torch.linalg.solve_ex
git diff --check

Allocator validation with PyTorch CUDA caching disabled:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 LD_PRELOAD=libcuda_alloc_counter.so, torch.backends.cuda.preferred_linalg_library("cusolver"), float32 torch.linalg.lu_factor_ex.

shape	local pre-patch baseline	after this PR
`[16, 512, 512]`	20 `cudaMalloc` / 20 `cudaFree`	5 `cudaMalloc` / 5 `cudaFree`
`[64, 512, 512]`	not rerun	5 `cudaMalloc` / 5 `cudaFree`

This confirms the looped cuSOLVER GETRF path is now O(1) in allocator calls with respect to batch size.

BC

No BC break. The changed APIs are internal at::cuda::solver helpers, not public Python or C++ APIs.

Fixes #181997.

cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

Changed files

aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp (modified, +9/-0)
aten/src/ATen/native/cuda/linalg/CUDASolver.cpp (modified, +36/-26)
aten/src/ATen/native/cuda/linalg/CUDASolver.h (modified, +19/-1)

Issue body

Motivation

Several CUDA linalg paths use cuSOLVER by looping over independent batch items on the current CUDA stream. This is correct, but it can underutilize the GPU for medium-sized matrices when there is no suitable batched cuSOLVER API, or when PyTorch's heuristic selects a looped cuSOLVER path.

The recent LU investigation showed that these independent calls can overlap well when issued onto multiple nonblocking CUDA streams. This issue is not just for getrf; the desired outcome is to investigate and improve looped cuSOLVER batch execution across CUDA linalg where auxiliary-stream execution makes sense.

Local evidence

Standalone C++ benchmark on:

GPU: NVIDIA RTX 6000 Ada Generation
CUDA runtime: 13.2
cuSOLVER reported version: 12200
dtype: float32
operation: independent cusolverDnSgetrf calls over a batch

M	batch	streams	sequential cuSOLVER (time)	one-handle multistream (time, speedup vs sequential)	multi-handle multistream (time, speedup vs sequential)
512	64	8	49.37 ms	20.99 ms, 2.35x	20.98 ms, 2.35x
768	64	8	72.32 ms	18.98 ms, 3.81x	19.14 ms, 3.78x
1024	64	8	105.04 ms	38.21 ms, 2.75x	38.65 ms, 2.72x

In this standalone probe, changing the stream on one cuSOLVER handle before each call performed about the same as using separate handles. That still needs validation against cuSOLVER's intended handle/stream usage and PyTorch's handle pool assumptions.

This does not replace the need for a batch-aware cuBLAS/cuSOLVER heuristic. For example, in the same investigation cublasSgetrfBatched for M=512, batch=64 was still around 6.3 ms, much faster than the multistream looped cuSOLVER result. The multistream approach looks more relevant once PyTorch has already selected a looped cuSOLVER path, especially for larger matrices or operations without a fast batched alternative.
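The speedup figures in the table are the sequential time divided by the multistream time. The short Python check below recomputes them from the quoted timings and compares the M=512 multistream result with the approximate cublasSgetrfBatched figure. The dictionary layout and names are illustrative; only the numbers come from the text above.

```python
# Timings in milliseconds copied from the benchmark table (float32, batch=64, 8 streams).
TIMINGS_MS = {
    512: {"sequential": 49.37, "one_handle": 20.99, "multi_handle": 20.98},
    768: {"sequential": 72.32, "one_handle": 18.98, "multi_handle": 19.14},
    1024: {"sequential": 105.04, "one_handle": 38.21, "multi_handle": 38.65},
}
# Approximate figure quoted for cublasSgetrfBatched at M=512, batch=64.
CUBLAS_BATCHED_MS_M512 = 6.3


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


for m, row in TIMINGS_MS.items():
    one = speedup(row["sequential"], row["one_handle"])
    multi = speedup(row["sequential"], row["multi_handle"])
    print(f"M={m}: one-handle {one:.2f}x, multi-handle {multi:.2f}x")

gap = speedup(TIMINGS_MS[512]["one_handle"], CUBLAS_BATCHED_MS_M512)
print(f"M=512: batched cuBLAS is roughly {gap:.1f}x faster than one-handle multistream")
```

For M=512 the batched cuBLAS call is roughly 3.3x ahead of the multistream loop, which is why the heuristic that chooses between batched and looped paths still matters.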

Scope

Examples in aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp include:

lu_factor_looped_cusolver / getrf
lu_solve_looped_cusolver / getrs
apply_cholesky_cusolver_potrf_looped / potrf and xpotrf
QR paths: geqrf, orgqr, ormqr
SVD fallback paths: gesvd, gesvdj where implemented as per-batch loops
Hermitian/general eigensolver paths: syevd, xsyevd, xgeev
LDL paths that are currently dispatched as looped cuSOLVER

Constraints

Any implementation should preserve normal PyTorch stream semantics and existing behavior:

Auxiliary streams should wait on the current stream before reading inputs or outputs.
The current stream should wait on auxiliary streams before the op returns to the user.
Inputs, outputs, pivots, infos, and workspaces used on auxiliary streams need correct caching allocator stream recording so storage is not reused while still in use.
Operations that already have efficient batched cuSOLVER/cuBLAS paths should keep using those when faster.
Error reporting through info tensors and check_errors behavior must remain unchanged.
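The first two constraints describe a fork/join ordering around the auxiliary streams. The plain-Python model below only emits the order of steps and runs nothing on a GPU. The function name, the step labels, and the round-robin assignment are assumptions made for illustration and are not part of any proposed implementation.

```python
def fork_join_schedule(batch_size: int, num_aux_streams: int):
    """Return the ordered steps of a hypothetical fork/join over auxiliary streams."""
    if batch_size < 0 or num_aux_streams < 1:
        raise ValueError("batch_size must be >= 0 and num_aux_streams >= 1")
    used = min(batch_size, num_aux_streams)  # never involve more streams than items
    steps = []
    # Fork: each auxiliary stream waits for work already queued on the current stream.
    for stream in range(used):
        steps.append(("aux_waits_for_current", stream))
    # Launch: independent batch items are spread round-robin over the auxiliary streams.
    for item in range(batch_size):
        steps.append(("launch", item, item % used))
    # Join: the current stream waits for every auxiliary stream before the op returns.
    for stream in range(used):
        steps.append(("current_waits_for_aux", stream))
    return steps


for step in fork_join_schedule(batch_size=5, num_aux_streams=2):
    print(step)
```

In PyTorch's Python API the two waits correspond to `torch.cuda.Stream.wait_stream`, and `Tensor.record_stream` tells the caching allocator that a tensor is in use on another stream. A C++ implementation would use the corresponding C++ facilities.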

#181997
#181998
#181999

cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

extent analysis

TL;DR

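In a standalone benchmark, issuing independent cusolverDnSgetrf calls on 8 auxiliary streams ran 2.35x to 3.81x faster than the sequential loop for M = 512 to 1024 at batch 64. This suggests that multistream execution can improve GPU utilization for looped cuSOLVER batch operations on medium-sized matrices.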

Guidance

Investigate using auxiliary streams for independent batch items to overlap cuSOLVER calls, as shown in the standalone C++ benchmark.
Validate the approach against cuSOLVER's intended handle/stream usage and PyTorch's handle pool assumptions to ensure correctness.
Identify and prioritize CUDA linalg paths that can benefit from multistream execution, such as lu_factor_looped_cusolver and lu_solve_looped_cusolver.
Ensure that any implementation preserves normal PyTorch stream semantics, including correct caching allocator stream recording and error reporting.

Example

No explicit code example is provided, but the aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp file can be referenced for examples of looped cuSOLVER paths.

Notes

The multistream approach may not replace the need for a batch-aware cuBLAS/cuSOLVER heuristic, and its effectiveness may vary depending on the specific operation and matrix size.

Recommendation

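Consider multistream execution for paths where PyTorch has already selected looped cuSOLVER. The standalone benchmark showed 2.35x to 3.81x speedups, but the approach still needs validation against cuSOLVER's handle/stream usage and PyTorch's handle pool assumptions, and batched cuBLAS/cuSOLVER paths should remain preferred where they are faster.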
