pytorch - 💡(How to fix) Fix Silent CUDA hang (no exception raised) under high VRAM pressure on Blackwell RTX 5090 (SM100) — async error never propagated [7 comments, 3 participants]

pytorch · 2026-03-26 08:28:58


GitHub stats

pytorch/pytorch#178491 • Fetched 2026-04-08 01:30:12

Author: azizketata

Participants: azizketata, eqy, johannesz-codes

Timeline (top): mentioned ×12, subscribed ×12, labeled ×9, commented ×7

Error Message

When training a YOLO model on an RTX 5090 (Blackwell, SM100) with PyTorch 2.9.0+cu128, CUDA operations silently fail under high VRAM utilization (~93%, 29.9/32 GB), causing the process to hang indefinitely. No Python exception is raised: the process spins in userspace (State: R, wchan: 0) with 0% GPU compute utilization while holding all allocated VRAM. The GPU stops executing kernels but never returns an error to the host.

The watchdog log marks the transition:

21:11:07 [nvidia-smi NVML query error] ← GPU enters error state

When CUDA_LAUNCH_BLOCKING=1 is set to force synchronous error checking, the underlying CUDA error surfaces immediately as a proper exception:

torch.AcceleratorError: CUDA error: the launch timed out and was terminated

However, CUDA_LAUNCH_BLOCKING=1 is not a viable workaround: synchronous execution monopolizes the GPU, triggering the X11 display driver's watchdog timeout (cudaErrorLaunchTimeout) since the desktop runs on the same GPU. The actual error with async execution is a different, silent failure mode.

No exception, no OOM error — just infinite spin.

Expected behavior: a torch.cuda.OutOfMemoryError or torch.AcceleratorError exception is raised when the CUDA operation fails, allowing the training framework to handle the error (retry with smaller batch, log the failure, etc.).

Observed behavior: nvidia-smi briefly returns an NVML error at the exact moment of the hang (one poll cycle), then resumes returning 0% utilization. This suggests the GPU enters and exits an error recovery state without notifying the CUDA runtime on the host side.

Root Cause

Minimal reproduction context: this cannot be easily reduced to a self-contained snippet because it depends on sustained high VRAM pressure (~93%) on a Blackwell GPU. The failure occurs stochastically after ~800 batches of a training loop, suggesting that memory fragmentation or a specific allocation pattern triggers it. The root cause itself has not been identified.
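One way to probe the fragmentation hypothesis is to log how much memory PyTorch's caching allocator has reserved beyond what live tensors occupy. The following is a minimal sketch, not something taken from the issue: `fragmentation_report` and `sample_cuda_memory` are hypothetical helper names, while `torch.cuda.memory_allocated`, `torch.cuda.memory_reserved`, and `torch.cuda.mem_get_info` are existing PyTorch calls. A large or growing gap between reserved and allocated memory is only a hint of fragmentation, not proof.

```python
def fragmentation_report(allocated_bytes, reserved_bytes, total_bytes):
    """Summarize allocator state; all arguments are byte counts."""
    # Memory held by the caching allocator but not backing any live tensor.
    cached_unused = reserved_bytes - allocated_bytes
    return {
        "allocated_pct": 100.0 * allocated_bytes / total_bytes,
        "reserved_pct": 100.0 * reserved_bytes / total_bytes,
        "cached_unused_mib": cached_unused / 2**20,
    }


def sample_cuda_memory(device="cuda:0"):
    """Hypothetical helper: read the current allocator counters from PyTorch."""
    import torch  # imported lazily so fragmentation_report works without a GPU

    dev = torch.device(device)
    _free_bytes, total_bytes = torch.cuda.mem_get_info(dev)
    return fragmentation_report(
        torch.cuda.memory_allocated(dev),
        torch.cuda.memory_reserved(dev),
        total_bytes,
    )
```

Logging `sample_cuda_memory()` every few dozen batches and comparing the last values before a hang against earlier ones would show whether the unused cached memory grows over time, which is an assumption the issue does not confirm.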

Fix Action

Fix / Workaround

However, CUDA_LAUNCH_BLOCKING=1 is not a viable workaround — synchronous execution monopolizes the GPU, triggering the X11 display driver's watchdog timeout (cudaErrorLaunchTimeout) since the desktop runs on the same GPU. The actual error with async execution is a different, silent failure mode.


🐛 Describe the bug

This was reproduced 7 times across multiple training runs with identical behavior. A custom watchdog monitoring GPU metrics captured the exact transition:

Watchdog metrics: timestamp, gpu_util%, mem_util%, mem_used_MiB, mem_free_MiB, temp, ...log_size, stall_sec

21:10:21  88%  52%  29989 MiB  2121 MiB  69°C  ← last normal batch
21:11:07  [nvidia-smi NVML query error]        ← GPU enters error state
21:11:37   0%   0%  29989 MiB  2121 MiB  54°C  ← hung: 0% compute, VRAM held
21:12:07   0%   0%  29989 MiB  2121 MiB  51°C  ← still hung
... (continues for 5+ minutes until externally killed)
21:16:07   0%   0%  29989 MiB  2121 MiB  42°C  ← temp dropping, GPU idle
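The author's watchdog script is not included in the issue. As an illustration of the pattern it captured, here is a minimal sketch of a poller that flags the "0% compute, VRAM held" state. The function names and thresholds (`poll_gpu`, `parse_sample`, `is_hung`, `min_idle_polls`, `held_fraction`) are assumptions made for this example; the `nvidia-smi --query-gpu` fields are standard ones.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.free,temperature.gpu"


def parse_sample(line):
    """Parse one CSV line such as '88, 29989, 2121, 69' into a dict."""
    util, used, free, temp = (int(field) for field in line.split(","))
    return {"util": util, "used_mib": used, "free_mib": free, "temp_c": temp}


def poll_gpu():
    """Return one sample, or None when the query fails (as at 21:11:07)."""
    try:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
        return parse_sample(out.splitlines()[0])
    except (OSError, subprocess.SubprocessError, ValueError, IndexError):
        return None


def is_hung(samples, min_idle_polls=2, held_fraction=0.5):
    """True when the last polls all show 0% compute with VRAM still held."""
    recent = samples[-min_idle_polls:]
    if len(recent) < min_idle_polls or any(s is None for s in recent):
        return False
    return all(
        # used + free approximates total VRAM; good enough for a threshold.
        s["util"] == 0
        and s["used_mib"] / (s["used_mib"] + s["free_mib"]) >= held_fraction
        for s in recent
    )
```

A legitimately idle phase (for example CPU-side validation) can also show 0% utilization, so a real monitor would probably combine this check with a log-growth or stall-time signal, as the `log_size` and `stall_sec` columns above suggest the author's watchdog did.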

Key observations from /proc/<pid>/status during the hang:

State: R (running)   ← CPU is spinning, not sleeping
wchan: 0             ← not blocked on any kernel syscall
Threads: 41
GPU: 0% compute, 29,989 MiB held

When CUDA_LAUNCH_BLOCKING=1 is set to force synchronous error checking, the underlying CUDA error surfaces immediately as a proper exception:

  File ".../ultralytics/nn/modules/conv.py", line 78, in forward
    return self.act(self.bn(self.conv(x)))
  File ".../torch/nn/modules/activation.py", line 473, in forward
    return F.silu(input, inplace=self.inplace)
torch.AcceleratorError: CUDA error: the launch timed out and was terminated
Search for `cudaErrorLaunchTimeout' in https://docs.nvidia.com/cuda/cuda-runtime-api/...

The bug: Without CUDA_LAUNCH_BLOCKING=1, async CUDA errors at high VRAM utilization on Blackwell GPUs are never propagated to the Python host. The process hangs forever instead of raising torch.AcceleratorError or torch.cuda.OutOfMemoryError.

import torch
from ultralytics import YOLO

# Model: YOLO26m-OBB (~23.5M params, OBB detection head)
# Dataset: ~10,600 images with oriented bounding box annotations
# At imgsz=1024, batch=8: steady-state VRAM = 28.5-29.9 GB / 32 GB (89-93%)

model = YOLO("yolo26m-obb.pt")
model.train(
    data="dataset.yaml",
    epochs=100,
    batch=8,
    imgsz=1024,      # <-- high VRAM pressure at this resolution
    workers=0,       # single-threaded dataloader
    amp=False,       # FP32 only
    mosaic=0.0,      # no mosaic augmentation
    device="cuda:0",
)

# Training hangs silently around batch 600-810 of any epoch
# when VRAM usage reaches ~29.9/32 GB.
# No exception, no OOM error, just infinite spin.
# Reducing imgsz to 640 (~10GB VRAM, 30% utilization) eliminates the issue entirely.

What I expected: a torch.cuda.OutOfMemoryError or torch.AcceleratorError exception when the CUDA operation fails, allowing the training framework to handle the error (retry with smaller batch, log the failure, etc.).

What actually happens: the process hangs forever. GPU compute drops to 0%, VRAM stays allocated, CPU spins in userspace. The only way to detect the hang is external monitoring (nvidia-smi shows 0% util). The only way to recover is killing the process.

Additional context:

- nvidia-smi briefly returns an NVML error at the exact moment of the hang (one poll cycle), then resumes returning 0% utilization, suggesting the GPU enters and exits an error recovery state without notifying the CUDA runtime on the host side.
- The process was also registered with faulthandler.register(signal.SIGUSR1), but PyTorch overwrites this handler after import (confirmed via /proc/<pid>/status showing SigCgt: 0000000100000002, i.e. SIGUSR1/signal 10 not caught). This is a separate minor issue: PyTorch's signal handler management silently overrides user-registered handlers.
- Reducing VRAM pressure to ~30% utilization (imgsz=640) completely eliminates the issue across 100+ epochs, confirming the bug is tied to high VRAM pressure, not a general kernel defect.
- GPU persistence mode was disabled (nvidia-smi -pm 0).
- X11 desktop running on the same GPU.
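Since the SIGUSR1 handler was reported as overwritten, an in-process alternative that does not depend on signal handlers may be useful. Python's `faulthandler.dump_traceback_later` runs on a watchdog thread rather than a signal, so it should be unaffected by handler replacement. The sketch below is an assumption-laden illustration: the `BatchHeartbeat` class and its parameters are hypothetical names, and the issue does not report having tried this.

```python
import faulthandler
import sys


class BatchHeartbeat:
    """Dump all thread tracebacks if beat() is not called again in time."""

    def __init__(self, timeout_s, file=sys.stderr, exit_on_stall=False):
        self.timeout_s = timeout_s
        self.file = file  # must stay open and expose fileno()
        self.exit_on_stall = exit_on_stall

    def beat(self):
        # Re-arming replaces the previous timer, so a dump is only written
        # when no beat arrives within timeout_s (a stalled batch).
        faulthandler.dump_traceback_later(
            self.timeout_s,
            repeat=False,
            file=self.file,
            exit=self.exit_on_stall,
        )

    def stop(self):
        # Disarm the watchdog, e.g. once training has finished.
        faulthandler.cancel_dump_traceback_later()
```

Calling `beat()` once per batch (for example from a training callback) with a timeout well above the normal batch time would write the Python stack of the stuck thread to the chosen file. With `exit_on_stall=True` the process exits after the dump instead of spinning forever, which lets a supervisor restart the job.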

Versions

PyTorch version: 2.9.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Pop!_OS 22.04 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Python version: 3.12.6 (64-bit runtime)
Python platform: Linux-6.16.3-76061603-generic-x86_64-with-glibc2.35
Is CUDA available: True
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 580.82.09

CPU: AMD Ryzen 9 9950X 16-Core Processor
RAM: 62 GB

[pip3] torch==2.9.0
[pip3] torchvision==0.24.0
[pip3] pytorch-triton==3.5.0+git27664085
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cublas-cu12==12.8.4.1

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @malfet

extent analysis

Fix Plan

To address silent CUDA failures under high VRAM utilization, the following steps are worth trying. Only the reduction of imgsz to 640 is confirmed by the issue; the other items are untested mitigations.

Avoid Relying on CUDA_LAUNCH_BLOCKING: Setting CUDA_LAUNCH_BLOCKING=1 surfaces the error, but it is not a viable workaround here because synchronous execution triggers the display watchdog timeout on a GPU shared with X11. Prefer approaches that keep asynchronous execution and check for errors only periodically.
Implement Custom Hang Detection: Because no exception is raised, an except block alone cannot catch this failure. Detection needs a timeout or an external monitor, such as the author's watchdog.
Optimize VRAM Utilization: Reduce VRAM pressure, for example:
- Image Size or Batch Size Reduction: Lower imgsz (640 eliminated the hang in the issue) or adjust the batch size based on VRAM availability.
- Model Pruning or Quantization: Apply model pruning or quantization techniques to reduce memory requirements.
- Gradient Checkpointing: Discard intermediate activations during the forward pass and recompute them during the backward pass, trading extra compute for lower VRAM usage.
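A middle ground between fully asynchronous execution and CUDA_LAUNCH_BLOCKING=1 is to synchronize only occasionally, and to bound how long that synchronization may take. The sketch below shows the idea with a generic helper; `probe_with_timeout` is a hypothetical name, and the approach assumes that the blocking call releases the GIL and that the hang shows up inside the synchronization call, neither of which the issue confirms.

```python
import threading


def probe_with_timeout(sync_fn, timeout_s):
    """Run sync_fn in a daemon thread.

    Returns ("ok", None), ("error", exception) or ("timeout", None).
    """
    result = {}

    def worker():
        try:
            sync_fn()
            result["status"] = "ok"
        except Exception as exc:  # e.g. an async CUDA error surfaced by the sync
            result["status"] = "error"
            result["exc"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # The call never returned: treat this as a suspected device hang.
        return "timeout", None
    return result["status"], result.get("exc")
```

In a training loop this could be called every N batches as `probe_with_timeout(torch.cuda.synchronize, timeout_s=60)`. On "timeout" the process can log the state and exit so that a supervisor restarts from the last checkpoint. A thread stuck in a native call cannot be cancelled, which is why the worker is a daemon thread and why exiting the process is the realistic recovery path.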

Example Code

Here's a sketch of how OOM handling and batch size reduction could look. It only helps when an exception is actually raised; the silent hang described in this issue never reaches an except block.

import torch
from ultralytics import YOLO

# Surface pending asynchronous CUDA errors as Python exceptions
def surface_async_cuda_error(device):
    # synchronize() blocks until all queued kernels finish; a pending
    # async error is re-raised here as a RuntimeError subclass
    try:
        torch.cuda.synchronize(torch.device(device))
    except RuntimeError as e:
        print(f"CUDA error detected: {e}")
        return True
    return False

# Select the device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Train with OOM handling: halve the batch size and start over on OOM
batch_size = 8
while batch_size >= 1:
    # Reload the model so a failed attempt leaves no partial state behind
    model = YOLO("yolo26m-obb.pt")
    try:
        model.train(
            data="dataset.yaml",
            epochs=100,
            batch=batch_size,
            imgsz=1024,
            workers=0,
            amp=False,
            mosaic=0.0,
            device=device,
        )
        break  # training finished without raising
    except torch.cuda.OutOfMemoryError:
        # Release cached blocks, then retry with a smaller batch
        torch.cuda.empty_cache()
        batch_size //= 2
        print(f"Reduced batch size to {batch_size}")
    except Exception as e:
        # Other CUDA errors (e.g. a launch timeout) typically leave the
        # CUDA context unusable, so no retry is attempted in this process
        print(f"An error occurred: {e}")
        break

# Surface any error still pending on the device after training returns
if device != "cpu":
    surface_async_cuda_error(device)

Verification

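To verify that a mitigation worked, monitor the training process and check that training proceeds past the batch 600-810 window over several epochs, that GPU utilization does not drop to 0% while VRAM remains allocated, and that nvidia-smi returns no NVML query errors.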
