pytorch - 💡(How to fix) Fix Torch installation error in GH200. [9 comments, 3 participants]

pytorch/pytorch#180713, fetched 2026-04-18 05:51:24

Root Cause

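A raw CUDA matmul fails for float16, bfloat16, and float32, so the environment is broken at the PyTorch/cuBLAS/runtime layer on the GH200 node rather than in the training code. The issue reports that nvidia-cublas-cu12 12.8.4.1 does not work with torch 2.10.0+cu128 on GH200.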

Code Example

import torch

# Report the PyTorch build and the visible GPU.
print("torch:", torch.__version__)
print("cuda build:", torch.version.cuda)
print("gpu:", torch.cuda.get_device_name(0))

# Run one small matmul per dtype; a failure here is below the model code.
for dtype in [torch.float16, torch.bfloat16, torch.float32]:
    try:
        a = torch.randn(128, 64, device="cuda", dtype=dtype)
        b = torch.randn(64, 32, device="cuda", dtype=dtype)
        c = a @ b
        torch.cuda.synchronize()  # force the kernel to finish so errors surface here
        print(dtype, "OK", c.shape)
    except Exception as e:
        print(dtype, "FAILED", repr(e))

🐛 Describe the bug

The failure occurs with torch 2.10.0+cu128 on a GH200 node.

First, check which CUDA/cuBLAS packages are actually installed:

python -m pip list | egrep 'torch|nvidia-cublas|nvidia-cuda|triton|flash-attn'

Most likely you will find some variant of:

torch               2.10.0+cu128
nvidia-cublas-cu12  12.8.4.1
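The same inventory can be taken from inside the interpreter that will run the job, which avoids querying a different environment's pip by mistake. The sketch below uses only the standard library. The helper name `installed_versions` and the `PACKAGES` list are illustrative and not part of the original report.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names that the egrep pattern above would match; adjust as needed.
PACKAGES = [
    "torch",
    "nvidia-cublas-cu12",
    "nvidia-cuda-runtime-cu12",
    "triton",
    "flash-attn",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:28s} {ver or 'not installed'}")
```

Run it with the same `python` binary used for training so that the versions reported are the ones actually loaded.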

If so, the recovery path is to upgrade cuBLAS. The installed nvidia-cublas-cu12 version is 12.8.4.1, which does not work on GH200. It should be:

pip install -U nvidia-cublas-cu12==12.9.1.4
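As a rough guard, the version pair named in this report can be checked before launching a job. This is a minimal sketch, assuming that only the exact combination described here is known to fail. The constants and the function name `cublas_advice` are hypothetical, and the check is not an official compatibility matrix.

```python
# Versions taken from this issue report; nothing here is an official support matrix.
KNOWN_BAD_TORCH = "2.10.0+cu128"
KNOWN_BAD_CUBLAS = "12.8.4.1"
REPORTED_GOOD_CUBLAS = "12.9.1.4"


def cublas_advice(torch_version, cublas_version):
    """Return a short hint based only on the versions named in this report."""
    if torch_version == KNOWN_BAD_TORCH and cublas_version == KNOWN_BAD_CUBLAS:
        return f"pip install -U nvidia-cublas-cu12=={REPORTED_GOOD_CUBLAS}"
    return "no action suggested by this report"


print(cublas_advice("2.10.0+cu128", "12.8.4.1"))
```

Other version pairs are deliberately left undecided, since the report says nothing about them.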

The reason is as follows. A raw CUDA matmul fails for float16, bfloat16, and float32, so the problem is below Transformers and below the model. Your current environment is broken at the PyTorch/cuBLAS/runtime layer on this GH200 node, not in your training code. The earlier model failures were just the first place that the bad GEMM surfaced.

There is a recent PyTorch issue describing CUBLAS_STATUS_INVALID_VALUE on PyTorch 2.10 where updating nvidia-cublas-cu12 from 12.8.4.1 to 12.9.1.4 resolved the error, and a related vLLM issue points to the same fix path. https://github.com/pytorch/pytorch/issues/174949

However, installing 12.9.1.4 alongside this torch build raises a warning. The cuBLAS version shipped for this version of torch should therefore be 12.9.1.4, not 12.8.4.1.
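The issue does not quote the warning text. One plausible reading, stated here as an assumption, is that pip reports a mismatch between the installed cuBLAS and the version the torch wheel declares as a dependency. The sketch below lists what the installed torch distribution declares, using the standard library's `importlib.metadata.requires`. The helper names `pins_for` and `torch_cublas_pins` are illustrative.

```python
from importlib.metadata import PackageNotFoundError, requires


def pins_for(requirements, package):
    """Return the requirement strings that start with `package`."""
    wanted = package.lower()
    return [
        req
        for req in (requirements or [])  # requires() may return None
        if req.replace("_", "-").lower().startswith(wanted)
    ]


def torch_cublas_pins():
    """Requirement strings torch declares for nvidia-cublas-cu12 (empty if torch is absent)."""
    try:
        return pins_for(requires("torch"), "nvidia-cublas-cu12")
    except PackageNotFoundError:
        return []


if __name__ == "__main__":
    print(torch_cublas_pins())
```

If the printed requirement names 12.8.4.1 exactly, that would be consistent with a dependency-conflict warning after upgrading to 12.9.1.4.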

The minimal reproduction is:

import torch

print("torch:", torch.__version__)
print("cuda build:", torch.version.cuda)
print("gpu:", torch.cuda.get_device_name(0))

for dtype in [torch.float16, torch.bfloat16, torch.float32]:
    try:
        a = torch.randn(128, 64, device="cuda", dtype=dtype)
        b = torch.randn(64, 32, device="cuda", dtype=dtype)
        c = a @ b
        torch.cuda.synchronize()
        print(dtype, "OK", c.shape)
    except Exception as e:
        print(dtype, "FAILED", repr(e))
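The reasoning earlier in this report (a raw matmul failing in every dtype points below the model code) can be captured as a small helper that turns the reproduction's outcome into a coarse diagnosis. This is a sketch only. The function names `run_matmul_probe` and `diagnose` are hypothetical, and the messages are illustrative rather than anything PyTorch emits.

```python
def run_matmul_probe():
    """Run the reproduction above and return {dtype_name: passed}."""
    import torch  # imported lazily so diagnose() is usable without torch

    results = {}
    for dtype in [torch.float16, torch.bfloat16, torch.float32]:
        try:
            a = torch.randn(128, 64, device="cuda", dtype=dtype)
            b = torch.randn(64, 32, device="cuda", dtype=dtype)
            _ = a @ b
            torch.cuda.synchronize()
            results[str(dtype)] = True
        except Exception:
            results[str(dtype)] = False
    return results


def diagnose(results):
    """Summarize probe results following the reasoning in this report."""
    if not results:
        return "no probes were run"
    failed = [name for name, passed in results.items() if not passed]
    if not failed:
        return "all dtypes passed: raw matmul works, look higher in the stack"
    if len(failed) == len(results):
        return "all dtypes failed: suspect the PyTorch/cuBLAS/runtime layer"
    return "some dtypes failed: " + ", ".join(failed)


# Typical use on the GPU node: print(diagnose(run_matmul_probe()))
```

The probe also reports failure when no CUDA device is visible at all, so a result of "all dtypes failed" should be read together with the device name printed by the reproduction.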

Versions (after correction)

torch 2.10.0+cu128

packages in environment at /u/tliu10/.conda/envs/trlpy310:

Name                      Version       Build                 Channel
_openmp_mutex             4.5           20_gnu                conda-forge
accelerate                1.13.0        pypi_0                pypi
aiohappyeyeballs          2.6.1         pypi_0                pypi
aiohttp                   3.13.5        pypi_0                pypi
aiosignal                 1.4.0         pypi_0                pypi
annotated-doc             0.0.4         pypi_0                pypi
anyio                     4.13.0        pypi_0                pypi
async-timeout             5.0.1         pypi_0                pypi
attrs                     26.1.0        pypi_0                pypi
bzip2                     1.0.8         h4777abc_9            conda-forge
ca-certificates           2026.2.25     hbd8a1cb_0            conda-forge
certifi                   2026.2.25     pypi_0                pypi
charset-normalizer        3.4.7         pypi_0                pypi
click                     8.3.2         pypi_0                pypi
cuda-bindings             12.9.4        pypi_0                pypi
cuda-pathfinder           1.2.2         pypi_0                pypi
datasets                  4.8.4         pypi_0                pypi
dill                      0.4.1         pypi_0                pypi
einops                    0.8.2         pypi_0                pypi
exceptiongroup            1.3.1         pypi_0                pypi
filelock                  3.25.2        pypi_0                pypi
flash-attn                2.8.3         pypi_0                pypi
frozenlist                1.8.0         pypi_0                pypi
fsspec                    2026.2.0      pypi_0                pypi
h11                       0.16.0        pypi_0                pypi
hf-xet                    1.4.3         pypi_0                pypi
httpcore                  1.0.9         pypi_0                pypi
httpx                     0.28.1        pypi_0                pypi
huggingface-hub           1.10.2        pypi_0                pypi
idna                      3.11          pypi_0                pypi
jinja2                    3.1.6         pypi_0                pypi
ld_impl_linux-aarch64     2.45.1        default_h1979696_102  conda-forge
libexpat                  2.7.5         hfae3067_0            conda-forge
libffi                    3.5.2         h376a255_0            conda-forge
libgcc                    15.2.0        h8acb6b2_18           conda-forge
libgcc-ng                 15.2.0        he9431aa_18           conda-forge
libgomp                   15.2.0        h8acb6b2_18           conda-forge
liblzma                   5.8.3         he30d5cf_0            conda-forge
libnsl                    2.0.1         h86ecc28_1            conda-forge
libsqlite                 3.53.0        h022381a_0            conda-forge
libuuid                   2.42          h1022ec0_0            conda-forge
libxcrypt                 4.4.36        h31becfc_1            conda-forge
libzlib                   1.3.2         hdc9db2a_2            conda-forge
markdown-it-py            4.0.0         pypi_0                pypi
markupsafe                3.0.3         pypi_0                pypi
mdurl                     0.1.2         pypi_0                pypi
mpmath                    1.3.0         pypi_0                pypi
multidict                 6.7.1         pypi_0                pypi
multiprocess              0.70.19       pypi_0                pypi
ncurses                   6.5           ha32ae93_3            conda-forge
networkx                  3.4.2         pypi_0                pypi
numpy                     2.2.6         pypi_0                pypi
nvidia-cublas-cu12        12.9.1.4      pypi_0                pypi
nvidia-cuda-cupti-cu12    12.8.90       pypi_0                pypi
nvidia-cuda-nvrtc-cu12    12.8.93       pypi_0                pypi
nvidia-cuda-runtime-cu12  12.8.90       pypi_0                pypi
nvidia-cudnn-cu12         9.10.2.21     pypi_0                pypi
nvidia-cufft-cu12         11.3.3.83     pypi_0                pypi
nvidia-cufile-cu12        1.13.1.3      pypi_0                pypi
nvidia-curand-cu12        10.3.9.90     pypi_0                pypi
nvidia-cusolver-cu12      11.7.3.90     pypi_0                pypi
nvidia-cusparse-cu12      12.5.8.93     pypi_0                pypi
nvidia-cusparselt-cu12    0.7.1         pypi_0                pypi
nvidia-nccl-cu12          2.27.5        pypi_0                pypi
nvidia-nvjitlink-cu12     12.8.93       pypi_0                pypi
nvidia-nvshmem-cu12       3.4.5         pypi_0                pypi
nvidia-nvtx-cu12          12.8.90       pypi_0                pypi
openssl                   3.6.2         h546c87b_0            conda-forge
packaging                 26.1          pyhc364b38_0          conda-forge
pandas                    2.3.3         pypi_0                pypi
pillow                    12.1.1        pypi_0                pypi
pip                       26.0.1        pyh8b19718_0          conda-forge
propcache                 0.4.1         pypi_0                pypi
psutil                    7.2.2         pypi_0                pypi
pyarrow                   23.0.1        pypi_0                pypi
pygments                  2.20.0        pypi_0                pypi
python                    3.10.20       h28be5d3_0_cpython    conda-forge
python-dateutil           2.9.0.post0   pypi_0                pypi
pytz                      2026.1.post1  pypi_0                pypi
pyyaml                    6.0.3         pypi_0                pypi
readline                  8.3           hb682ff5_0            conda-forge
regex                     2026.4.4      pypi_0                pypi
requests                  2.33.1        pypi_0                pypi
rich                      15.0.0        pypi_0                pypi
safetensors               0.7.0         pypi_0                pypi
setuptools                82.0.1        pyh332efcf_0          conda-forge
shellingham               1.5.4         pypi_0                pypi
six                       1.17.0        pypi_0                pypi
sympy                     1.14.0        pypi_0                pypi
tk                        8.6.13        noxft_h0dc03b3_103    conda-forge
tokenizers                0.22.2        pypi_0                pypi
torch                     2.10.0+cu128  pypi_0                pypi
torchaudio                2.10.0+cu128  pypi_0                pypi
torchvision               0.25.0+cu128  pypi_0                pypi
tqdm                      4.67.3        pypi_0                pypi
transformers              5.5.4         pypi_0                pypi
triton                    3.6.0         pypi_0                pypi
trl                       1.1.0         pypi_0                pypi
typer                     0.24.1        pypi_0                pypi
typing-extensions         4.15.0        pypi_0                pypi
tzdata                    2026.1        pypi_0                pypi
urllib3                   2.6.3         pypi_0                pypi
wheel                     0.46.3        pyhd8ed1ab_0          conda-forge
xxhash                    3.6.0         pypi_0                pypi
yarl                      1.23.0        pypi_0                pypi
zstd                      1.5.7         h85ac4a6_6            conda-forge

cc @seemethere @malfet @atalman @tinglvv @nWEIdia @ptrblck @msaroufim @eqy @jerryzh168 @csarofeen

extent analysis

TL;DR

Update nvidia-cublas-cu12 to version 12.9.1.4 to resolve the CUDA matmul issue.

Guidance

  • The issue is caused by an incompatibility between PyTorch 2.10.0+cu128 and nvidia-cublas-cu12 12.8.4.1 on GH200.
  • To fix it, update nvidia-cublas-cu12 to 12.9.1.4 with pip install -U nvidia-cublas-cu12==12.9.1.4.
  • Verify the fix by rerunning the minimal reproduction and checking that the matmul succeeds for float16, bfloat16, and float32.
  • The update raises a warning, but the issue author reports 12.9.1.4 as the version that works with PyTorch 2.10.0+cu128.

Notes

  • This fix is specific to the combination of PyTorch 2.10.0+cu128 and nvidia-cublas-cu12 version 12.8.4.1.
  • Other versions of PyTorch or nvidia-cublas-cu12 may not be affected by this issue.

Recommendation

Apply the workaround by updating nvidia-cublas-cu12 to version 12.9.1.4 to resolve the CUDA matmul issue.
