pytorch - ✅(Solved) Fix Performance improvement: updated backend selection for linalg.eigh on CUDA [2 pull requests, 1 comment, 2 participants]

GitHub stats
pytorch/pytorch#178979 (fetched 2026-04-08 02:22:04)
Comments: 1
Participants: 2
Timeline: 80
Reactions: 0
Timeline (top): mentioned ×36, subscribed ×36, labeled ×7, commented ×1

PR fix notes

PR #175403: Update eigh CUDA heuristics

Description (problem / solution / changelog)

Motivation

As described by @nikitaved in #174674, torch.linalg.eigh is around 100x slower than CuPy for batched inputs. This was also described by @alexshtf in #174601. The backend selection heuristics developed in #53040 therefore seem to be suboptimal with recent updates to cuSOLVER.

Solution

Update heuristics to select the fastest available backend for the input matrix (batched and single matrix).

The code I used to switch the backend for eigh can be seen in #174674. Fortunately the results are very clear:

<img width="1896" height="455" alt="image" src="https://github.com/user-attachments/assets/bf0f7f21-c189-415f-b22f-85daf58367de" />

linalg_eigh_cusolver_syevj_batched seems to be the fastest for nearly all matrices. I took a closer look at the cases where it is outperformed by linalg_eigh_cusolver_syevd, and there the difference is at most 0.05 ms.

A more detailed view for the parameters used in #174674:

<img width="571" height="455" alt="image" src="https://github.com/user-attachments/assets/e728db3d-3f16-4142-96ef-a49fc43348f6" />

Therefore I propose the solution of just dispatching to linalg_eigh_cusolver_syevj_batched unconditionally.

With this change the code from #174674 is over 100x faster than current nightly (outperforming CuPy by ~8x, exact numbers in the issue.)
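To get a rough feel for the effect on your own hardware, a minimal timing sketch along the following lines may help. It is not the benchmark from #174674: the helper names (`random_symmetric_batch`, `time_eigh`, `max_residual`) and the batch and matrix sizes are assumptions chosen for illustration, and the script falls back to the CPU when no CUDA device is present. It only calls the public `torch.linalg.eigh` API, so it measures whichever backend your installed build selects.

```python
import time

import torch


def random_symmetric_batch(batch, n, device, dtype=torch.float32):
    """Return a (batch, n, n) stack of real symmetric matrices."""
    a = torch.randn(batch, n, n, device=device, dtype=dtype)
    return (a + a.mT) / 2


def time_eigh(a, warmup=3, iters=10):
    """Average wall-clock seconds per torch.linalg.eigh call on `a`."""

    def sync():
        # CUDA kernels launch asynchronously, so wait before reading the clock.
        if a.is_cuda:
            torch.cuda.synchronize()

    for _ in range(warmup):
        torch.linalg.eigh(a)
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        torch.linalg.eigh(a)
    sync()
    return (time.perf_counter() - start) / iters


def max_residual(a):
    """Largest absolute entry of A @ V - V @ diag(w) over the batch."""
    w, v = torch.linalg.eigh(a)
    # Scaling column j of V by w[j] is the same as V @ diag(w).
    return (a @ v - v * w.unsqueeze(-2)).abs().max().item()


# Hypothetical sizes; the issue used its own parameter grid.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = random_symmetric_batch(batch=256, n=32, device=device)
print(f"{device}: {time_eigh(a) * 1e3:.3f} ms per call, "
      f"max residual {max_residual(a):.2e}")
```

Running the same script on a build from before and after the change gives a like-for-like comparison; the residual is printed so that a faster result is not mistaken for a correct one without checking.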

After this change, syevj is no longer selected by any code path. Therefore I removed it from CUDASolver.cpp/h.

Tested using test/test_linalg.py. Observing failure on TestLinalgCUDA.test_tensorinv_cuda_float32. Failure is also present on current nightly (2.12.0.dev20260219+cu128), so I guess it is unrelated.

Fixes https://github.com/pytorch/pytorch/issues/175585

CC: @nikitaved @lezcano

cc @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

Changed files

  • aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp (modified, +3/-74)
  • aten/src/ATen/native/cuda/linalg/CUDASolver.cpp (modified, +0/-164)
  • aten/src/ATen/native/cuda/linalg/CUDASolver.h (modified, +0/-45)

New Feature for Release

This issue tracks the updated selection of cuSOLVER APIs for solving symmetric/Hermitian eigenvalue problems in the PyTorch 2.12 release notes.

Point(s) of contact

johannesz-codes, also available on pytorch-slack

Release Mode (pytorch/pytorch features only)

In-tree

Out-Of-Tree Repo

No response

Description and value to the user

The backend selection for solving symmetric/Hermitian eigenvalue problems on CUDA devices has been updated. For batched inputs this leads to substantial performance gains (up to 100x) over the existing backend selection. This addresses the performance gap relative to CuPy.

Link to design doc, GitHub issues, past submissions, etc

The performance regression in comparison to CuPy was brought up in:

Changes have landed in #175403 (already merged).

What feedback adopters have provided

No response

Plan for documentations / tutorials

Tutorial exists

Additional context for tutorials

No change to user facing behaviour, covered by existing materials

Marketing/Blog Coverage

Yes

Are you requesting other marketing assistance with this feature?

No response

Release Version

PyTorch 2.12

OS / Platform / Compute Coverage

GPU only, CUDA only

Testing Support (CI, test cases, etc..)

Covered by existing tests in https://github.com/pytorch/pytorch/blob/main/test/test_linalg.py. Extensive performance testing has been conducted; see #175403 and #174674.

cc @jerryzh168 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia @jianyuh @nikitaved @mruberry @walterddr @xwang233 @Lezcano

extent analysis

TL;DR

To benefit from the updated cuSOLVER backend selection for symmetric/Hermitian eigenvalue problems, use PyTorch 2.12 or later.

Guidance

  • Verify that your PyTorch version is 2.12 or newer to take advantage of the performance improvements for symmetric/Hermitian eigenvalue problems on CUDA devices.
  • Review the changes and testing in pull request #175403 for details on the update and its impact.
  • If torch.linalg.eigh is slow on CUDA, particularly for batched inputs, check whether updating to PyTorch 2.12 resolves it; the relevant reports are #174674 and #174601.
  • Use the existing tests in test_linalg.py as a reference for correctness; the performance measurements are in #175403 and #174674.
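The version check in the first bullet can be sketched as below. The helper name `at_least` is hypothetical, and the check looks only at the major and minor components of the version string. That is a deliberate limitation: a 2.12 development build such as the nightly quoted in the PR (2.12.0.dev20260219+cu128) reports 2.12 even though it predates the change, so treat a passing check on a dev build as inconclusive.

```python
import re


def at_least(version: str, minimum: tuple[int, int]) -> bool:
    """True if `version` has (major, minor) >= `minimum`.

    Suffixes such as ".dev20260219" or "+cu128" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(at_least("2.12.0.dev20260219+cu128", (2, 12)))  # True
print(at_least("2.11.0+cu128", (2, 12)))              # False
```

In a real environment you would pass the installed version instead of a literal, for example `at_least(torch.__version__, (2, 12))`.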

Notes

The largest improvements are for batched inputs on CUDA devices, where the issue reports gains of up to 100x over the previous backend selection.

Recommendation

Upgrade to PyTorch 2.12 or later. This is the fix itself rather than a workaround: the updated cuSOLVER backend selection ships with that release, and no user code changes are needed.
