pytorch - ✅(Solved) Fix [CUDA 13.2] Smoke test test_cuda_gds_errors_captured fails most likely due to cuFile compatibility mode change [1 pull requests, 1 participants]

pytorch/pytorch#180576 · Fetched 2026-04-17 08:26:19

The nightly manywheel smoke test for CUDA 13.2 (manywheel-py3_10-cuda13_2) is failing in test_cuda_gds_errors_captured().

CI failure: https://github.com/pytorch/test-infra/actions/runs/24004820054/job/70006692110

Error Message

No exception is raised, and the smoke test fails with: RuntimeError: Expected cuFileHandleRegister failed RuntimeError but have not received!

Root Cause

The smoke test at https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/smoke_test/smoke_test.py#L190 expects cuFileHandleRegister to raise a RuntimeError("cuFileHandleRegister failed") when the GDS kernel driver (nvidia-fs) is not installed in the CI environment. It then validates that PyTorch correctly surfaces this error.

This assumption held for cuFile 1.11 (bundled with CUDA 12.6) but no longer holds for cuFile 1.17 (bundled with CUDA 13.2). NVIDIA expanded cuFile's compatibility mode so that cuFileHandleRegister now succeeds when nvidia-fs is absent — subsequent I/O transparently falls back to POSIX. The relevant GDS release notes entries:

  • GDS 1.16: "Removed batch io size limitation... would now perform a posix I/O"
  • GDS 1.17: "Updated the compatibility path to support ZFS and BTRFS across all I/O APIs"

As a result, no exception is raised, and the smoke test fails with: RuntimeError: Expected cuFileHandleRegister failed RuntimeError but have not received!
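The failure mode described above can be reproduced without a GPU. The following is a simplified sketch, not the actual smoke_test.py code: it assumes the test wraps the registration call in a try/except and treats a missing exception as a failure. The names old_gds_check, register_old_cufile, and register_compat_mode are hypothetical stand-ins for the real registration call.

```python
import re
from typing import Callable


def old_gds_check(register: Callable[[], object]) -> None:
    """Pre-fix logic (sketch): pass only if `register` raises the expected error."""
    cuda_exception_missed = True
    try:
        register()
    except RuntimeError as e:
        if re.search("cuFileHandleRegister failed", str(e)):
            cuda_exception_missed = False
        else:
            raise  # an unrelated RuntimeError is a genuine failure
    if cuda_exception_missed:
        raise RuntimeError(
            "Expected cuFileHandleRegister failed RuntimeError but have not received!"
        )


def register_old_cufile() -> None:
    # Hypothetical stand-in: older cuFile without nvidia-fs rejects registration.
    raise RuntimeError("cuFileHandleRegister failed")


def register_compat_mode() -> None:
    # Hypothetical stand-in: newer cuFile registers successfully in compat mode.
    return None
```

Calling old_gds_check(register_old_cufile) returns normally, while old_gds_check(register_compat_mode) raises the RuntimeError quoted above, which matches the CI failure.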

Note: the cuFile library (libcufile.so) is closed-source so we cannot pinpoint the exact version where this changed. The kernel driver is OSS at https://github.com/NVIDIA/gds-nvidia-fs but the compatibility mode logic lives in the userspace library.

  ┌──────────────┬────────────────┐
  │ CUDA Toolkit │ cuFile Version │
  ├──────────────┼────────────────┤
  │ 12.6         │ 1.11.0.17      │
  ├──────────────┼────────────────┤
  │ 13.0         │ 1.15.1.6       │
  ├──────────────┼────────────────┤
  │ 13.2         │ 1.17.1.22      │
  └──────────────┴────────────────┘

Fix Action

Fixed

PR fix notes

PR #180577: Fix GDS smoke test failure on CUDA 13.2

Description (problem / solution / changelog)

Summary

Fixes https://github.com/pytorch/pytorch/issues/180576, the nightly manywheel smoke test failure for CUDA 13.2 builds in test_cuda_gds_errors_captured().

CI failure: https://github.com/pytorch/test-infra/actions/runs/24004820054/job/70006692110

CUDA 13.2 ships cuFile 1.17, which expanded compatibility mode so that cuFileHandleRegister succeeds when the nvidia-fs kernel driver is absent; subsequent I/O silently falls back to POSIX. Previously (cuFile 1.11 in CUDA 12.6), registration would fail with "cuFileHandleRegister failed".

The smoke test assumed registration always fails without the GDS driver. When registration succeeded instead, the test itself raised "Expected cuFileHandleRegister failed RuntimeError but have not received!".

┌──────────────┬────────────────┬─────────────────────────────────────┐                                
│ CUDA Toolkit │ cuFile Version │ cuFileHandleRegister without driver │
├──────────────┼────────────────┼─────────────────────────────────────┤                                
│ 12.6         │ 1.11.0.17      │ Fails (error)                       │
├──────────────┼────────────────┼─────────────────────────────────────┤
│ 13.0         │ 1.15.1.6       │ Unknown                             │
├──────────────┼────────────────┼─────────────────────────────────────┤                                
│ 13.2         │ 1.17.1.22      │ Succeeds (compat mode)              │
└──────────────┴────────────────┴─────────────────────────────────────┘

The fix accepts both outcomes: an exception with "cuFileHandleRegister failed" (older cuFile) and successful registration via compatibility mode (cuFile >= 1.17).
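A minimal sketch of a check that accepts both outcomes, under the assumption that the registration call can be passed in as a callable. This is not the code from the PR (which changes smoke_test.py by +6/-1); check_gds_registration and the two fake callables are hypothetical names used only for illustration.

```python
import re
from typing import Callable

EXPECTED_ERROR = "cuFileHandleRegister failed"


def check_gds_registration(register: Callable[[], object]) -> str:
    """Run `register` and classify the outcome instead of requiring a failure.

    Returns "error_captured" when the expected RuntimeError surfaces (older
    cuFile) and "compat_mode" when registration succeeds (newer cuFile).
    Any other RuntimeError is re-raised unchanged.
    """
    try:
        register()
    except RuntimeError as e:
        if re.search(EXPECTED_ERROR, str(e)):
            return "error_captured"
        raise
    return "compat_mode"


def fake_register_fails() -> None:
    # Hypothetical stand-in for registration on older cuFile without nvidia-fs.
    raise RuntimeError("cuFileHandleRegister failed")


def fake_register_succeeds() -> None:
    # Hypothetical stand-in for registration that succeeds in compatibility mode.
    return None


if __name__ == "__main__":
    print(check_gds_registration(fake_register_fails))
    print(check_gds_registration(fake_register_succeeds))
```

Classifying by outcome rather than by cuFile version avoids needing to know the exact release where the behavior changed, which the closed-source library makes hard to determine.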

Note: the cuFile userspace library (libcufile.so) is closed-source so we cannot pinpoint the exact version where this changed. The kernel driver is OSS at https://github.com/NVIDIA/gds-nvidia-fs but the compat mode logic lives in the userspace library.

Test plan

  • Verify CUDA 13.2 nightly manywheel smoke test passes
  • Verify CUDA 12.6 smoke test still passes (exception path unchanged)

Changed files

  • .ci/pytorch/smoke_test/smoke_test.py (modified, +6/-1)


Proposed fix

https://github.com/pytorch/pytorch/pull/180577

Version

2.12.0

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @mruberry

extent analysis

TL;DR

The most likely fix is to update the smoke test to account for the changed behavior of cuFileHandleRegister in cuFile 1.17, as proposed in the pull request https://github.com/pytorch/pytorch/pull/180577.

Guidance

  • Review the proposed fix in the pull request to understand the changes needed to update the smoke test.
  • Verify that the updated smoke test correctly handles the case where cuFileHandleRegister succeeds when the GDS kernel driver is not installed.
  • Test the updated smoke test with different versions of CUDA and cuFile to ensure compatibility.
  • Consider adding additional test cases to cover other scenarios where the compatibility mode of cuFile may affect the behavior of PyTorch.

Example

No code snippet is provided as the issue does not contain sufficient information to create a minimal example.

Notes

The behavior of cuFileHandleRegister without the driver is listed as unknown for cuFile 1.15.1.6 (CUDA 13.0), so additional testing on that version may be necessary to confirm the fix covers it.

Recommendation

Apply the fix proposed in the pull request https://github.com/pytorch/pytorch/pull/180577, as it updates the smoke test to account for the changed behavior of cuFileHandleRegister in cuFile 1.17.
