pytorch - ✅(Solved) Fix CUDA linalg: use auxiliary streams for independent looped cuSOLVER batch calls [1 pull requests, 1 participants]

GitHub stats
pytorch/pytorch#182001, fetched 2026-05-01 05:32:57

PR fix notes

PR #181998: CUDA linalg: hoist cuSOLVER GETRF workspace allocation out of lu_factor_looped_cusolver batch loop

Description (problem / solution / changelog)

Summary

lu_factor_looped_cusolver used to call at::cuda::solver::getrf once per batch item, and the getrf wrapper queried cuSOLVER workspace size and allocated workspace internally on every call. For a batch of size B, that produced B workspace queries and B allocator round-trips.

This PR makes the workspace explicit for the internal GETRF wrapper:

  • Adds at::cuda::solver::getrf_bufferSize<T> as a thin wrapper over cusolverDn*getrf_bufferSize.
  • Updates at::cuda::solver::getrf<T> to take a caller-provided workspace pointer.
  • Updates lu_factor_looped_cusolver to query lwork once, allocate one workspace buffer, and reuse it across the per-matrix GETRF loop.

The workspace size is fixed for a given (dtype, m, n, lda) tuple, so reusing it across sequential GETRF calls on the same stream is safe.
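
The effect of hoisting can be pictured with a small stand-alone model. This is a minimal Python sketch, not the ATen code: `CountingAllocator`, `query_lwork`, `getrf_per_call`, and `getrf_hoisted` are hypothetical names, and the size rule is a placeholder rather than what cuSOLVER returns. It only illustrates how moving the query and allocation out of the loop changes the workspace allocation count from B to 1.

```python
class CountingAllocator:
    """Stand-in for a device allocator that only counts allocations."""

    def __init__(self):
        self.allocations = 0

    def allocate(self, nbytes):
        self.allocations += 1
        return bytearray(nbytes)


def query_lwork(m, n, lda):
    # Placeholder size rule (assumption); the real value comes from
    # cusolverDn*getrf_bufferSize and depends only on (dtype, m, n, lda).
    return max(1, lda * min(m, n))


def getrf_per_call(alloc, batch, m, n, lda):
    # Old shape of the loop: query and allocate for every batch item.
    for _ in range(batch):
        workspace = alloc.allocate(query_lwork(m, n, lda))
        _ = workspace  # the factorization of one matrix would run here


def getrf_hoisted(alloc, batch, m, n, lda):
    # New shape of the loop: one query, one buffer, reused sequentially.
    workspace = alloc.allocate(query_lwork(m, n, lda))
    for _ in range(batch):
        _ = workspace  # the factorization of one matrix would run here


old, new = CountingAllocator(), CountingAllocator()
getrf_per_call(old, batch=16, m=512, n=512, lda=512)
getrf_hoisted(new, batch=16, m=512, n=512, lda=512)
print(old.allocations, new.allocations)
```

The single shared buffer is only valid in this model because the calls run one after another, which mirrors the same-stream condition stated above.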

Validation

Local environment:

  • PyTorch source build with USE_MAGMA=0
  • CUDA 13.2
  • NVIDIA RTX 6000 Ada Generation

Commands/checks run locally:

  • USE_MAGMA=0 MAX_JOBS=8 bash ~/dev/scripts-pytorch/run_build.sh
  • CUDA smoke test for torch.linalg.lu_factor_ex and torch.linalg.solve_ex
  • git diff --check

Allocator validation with PyTorch CUDA caching disabled:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 LD_PRELOAD=libcuda_alloc_counter.so, torch.backends.cuda.preferred_linalg_library("cusolver"), float32 torch.linalg.lu_factor_ex.

| shape | local pre-patch baseline | after this PR |
|---|---|---|
| [16, 512, 512] | 20 cudaMalloc / 20 cudaFree | 5 cudaMalloc / 5 cudaFree |
| [64, 512, 512] | not rerun | 5 cudaMalloc / 5 cudaFree |

This confirms the looped cuSOLVER GETRF path is now O(1) in allocator calls with respect to batch size.

BC

No BC break. The changed APIs are internal at::cuda::solver helpers, not public Python or C++ APIs.

Fixes #181997.

cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

Changed files

  • aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp (modified, +9/-0)
  • aten/src/ATen/native/cuda/linalg/CUDASolver.cpp (modified, +36/-26)
  • aten/src/ATen/native/cuda/linalg/CUDASolver.h (modified, +19/-1)

Motivation

Several CUDA linalg paths use cuSOLVER by looping over independent batch items on the current CUDA stream. This is correct, but it can underutilize the GPU for medium-sized matrices when there is no suitable batched cuSOLVER API, or when PyTorch's heuristic selects a looped cuSOLVER path.

The recent LU investigation showed that these independent calls can overlap well when issued onto multiple nonblocking CUDA streams. This issue is not just for getrf; the desired outcome is to investigate and improve looped cuSOLVER batch execution across CUDA linalg where auxiliary-stream execution makes sense.

Local evidence

Standalone C++ benchmark on:

  • GPU: NVIDIA RTX 6000 Ada Generation
  • CUDA runtime: 13.2
  • cuSOLVER reported version: 12200
  • dtype: float32
  • operation: independent cusolverDnSgetrf calls over a batch

| M | batch | streams | sequential cuSOLVER | one-handle multistream | multi-handle multistream |
|---|---|---|---|---|---|
| 512 | 64 | 8 | 49.37 ms | 20.99 ms, 2.35x | 20.98 ms, 2.35x |
| 768 | 64 | 8 | 72.32 ms | 18.98 ms, 3.81x | 19.14 ms, 3.78x |
| 1024 | 64 | 8 | 105.04 ms | 38.21 ms, 2.75x | 38.65 ms, 2.72x |

In this standalone probe, changing the stream on one cuSOLVER handle before each call performed about the same as using separate handles. That still needs validation against cuSOLVER's intended handle/stream usage and PyTorch's handle pool assumptions.

This does not replace the need for a batch-aware cuBLAS/cuSOLVER heuristic. For example, in the same investigation cublasSgetrfBatched for M=512, batch=64 was still around 6.3 ms, much faster than the multistream looped cuSOLVER result. The multistream approach looks more relevant once PyTorch has already selected a looped cuSOLVER path, especially for larger matrices or operations without a fast batched alternative.
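
The speedup columns and the cuBLAS comparison can be rechecked from the reported timings. The short Python sketch below uses only the figures quoted in this section; the variable names are illustrative, and the cuBLAS time is the approximate 6.3 ms mentioned above, so the last two ratios are rough.

```python
# (M, sequential ms, one-handle multistream ms, multi-handle multistream ms)
rows = [
    (512, 49.37, 20.99, 20.98),
    (768, 72.32, 18.98, 19.14),
    (1024, 105.04, 38.21, 38.65),
]

# Speedup of each multistream variant over sequential cuSOLVER.
speedups = {
    m: (round(seq / one, 2), round(seq / multi, 2))
    for m, seq, one, multi in rows
}
for m, (one_x, multi_x) in speedups.items():
    print(f"M={m}: one-handle {one_x}x, multi-handle {multi_x}x")

# Approximate cublasSgetrfBatched time for M=512, batch=64.
cublas_ms = 6.3
vs_multistream = round(20.99 / cublas_ms, 1)
vs_sequential = round(49.37 / cublas_ms, 1)
print(f"cuBLAS batched: ~{vs_multistream}x faster than multistream, "
      f"~{vs_sequential}x faster than sequential")
```

For M=512 the batched cuBLAS path stays roughly 3x ahead of the multistream loop, which is why the heuristic question remains separate from the multistream one.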

Scope

Examples in aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp include:

  • lu_factor_looped_cusolver / getrf
  • lu_solve_looped_cusolver / getrs
  • apply_cholesky_cusolver_potrf_looped / potrf and xpotrf
  • QR paths: geqrf, orgqr, ormqr
  • SVD fallback paths: gesvd, gesvdj where implemented as per-batch loops
  • Hermitian/general eigensolver paths: syevd, xsyevd, xgeev
  • LDL paths that are currently dispatched as looped cuSOLVER

Some of these paths also allocate workspace inside the batch loop. Workspace lifetime and reuse matter for any auxiliary-stream version because concurrent calls cannot share the same workspace unless their lifetimes do not overlap.
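
One way to satisfy the workspace constraint is to give each auxiliary stream its own buffer and assign batch items round-robin, so that calls sharing a workspace are also serialized on the same stream. The Python sketch below models only that bookkeeping; `assign_round_robin` is a hypothetical helper, and nothing here reflects how PyTorch implements or will implement it.

```python
from collections import defaultdict


def assign_round_robin(batch, num_streams):
    """Map each batch index to a (stream_id, workspace_id) pair.

    Assumption: one workspace per stream, so workspace_id == stream_id.
    Calls on the same stream run in order, so their workspace lifetimes
    never overlap; calls on different streams never share a buffer.
    """
    return [(i % num_streams, i % num_streams) for i in range(batch)]


plan = assign_round_robin(batch=64, num_streams=8)

# Group batch items by the workspace they use.
users = defaultdict(list)
for item, (stream_id, workspace_id) in enumerate(plan):
    users[workspace_id].append((item, stream_id))

# Every workspace is touched from exactly one stream.
for workspace_id, entries in users.items():
    assert len({stream_id for _, stream_id in entries}) == 1

print(len(users), "workspaces for", len(plan), "batch items")
```

Under this scheme the number of workspaces scales with the number of streams rather than with the batch size.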

Constraints

Any implementation should preserve normal PyTorch stream semantics and existing behavior:

  • Auxiliary streams should wait on the current stream before reading inputs or outputs.
  • The current stream should wait on auxiliary streams before the op returns to the user.
  • Inputs, outputs, pivots, infos, and workspaces used on auxiliary streams need correct caching allocator stream recording so storage is not reused while still in use.
  • Operations that already have efficient batched cuSOLVER/cuBLAS paths should keep using those when faster.
  • Error reporting through info tensors and check_errors behavior must remain unchanged.
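
The first two constraints describe a fork/join pattern around the batch loop. The Python sketch below only writes out the intended ordering as a list of steps; `fork_join_schedule` and the step names are hypothetical. A real implementation would use CUDA events and stream waits, and would also need the allocator stream recording from the third bullet, which this model does not cover.

```python
def fork_join_schedule(batch, num_streams):
    """Return the ordered steps for running `batch` items on aux streams."""
    used = min(batch, num_streams)
    steps = [("record_event", "current")]
    # Fork: each auxiliary stream waits for work already queued on the
    # current stream, so inputs and outputs are ready before it starts.
    steps += [("wait_event", f"aux{s}", "current") for s in range(used)]
    # Independent per-matrix calls, spread round-robin over the streams.
    steps += [("launch", f"aux{i % used}", i) for i in range(batch)]
    # Join: the current stream waits for every auxiliary stream before
    # the op returns, so later work on it sees finished results.
    for s in range(used):
        steps.append(("record_event", f"aux{s}"))
        steps.append(("wait_event", "current", f"aux{s}"))
    return steps


for step in fork_join_schedule(batch=4, num_streams=2):
    print(step)
```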

Related:

  • #181997
  • #181998
  • #181999

cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

extent analysis

TL;DR

Issuing independent looped cuSOLVER calls on multiple CUDA streams can improve GPU utilization for medium-sized matrices. In the standalone getrf benchmark it gave 2.35x to 3.81x speedups over sequential execution (M = 512 to 1024, batch 64, 8 streams).

Guidance

  • Investigate using auxiliary streams for independent batch items to overlap cuSOLVER calls, as shown in the standalone C++ benchmark.
  • Validate the approach against cuSOLVER's intended handle/stream usage and PyTorch's handle pool assumptions to ensure correctness.
  • Identify and prioritize CUDA linalg paths that can benefit from multistream execution, such as lu_factor_looped_cusolver and lu_solve_looped_cusolver.
  • Ensure that any implementation preserves normal PyTorch stream semantics, including correct caching allocator stream recording and error reporting.

Example

No explicit code example is provided, but the aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp file can be referenced for examples of looped cuSOLVER paths.

Notes

The multistream approach may not replace the need for a batch-aware cuBLAS/cuSOLVER heuristic, and its effectiveness may vary depending on the specific operation and matrix size.

Recommendation

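Investigate multistream execution for looped cuSOLVER batch operations. The standalone benchmark showed 2.35x to 3.81x speedups over sequential calls, but the approach still needs validation against cuSOLVER handle/stream usage and PyTorch stream semantics before it is adopted.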
