pytorch: [Bug] DataLoader worker segmentation fault in multi-task training with long sequences and numerical instability

GitHub stats
pytorch/pytorch#181853 (fetched 2026-04-30 06:18:12)
Comments: 0
Participants: 1
Timeline events: 41 (mentioned ×18, subscribed ×18, labeled ×5)
Reactions: 0

Error Message

(train_bohb pid=420171, ip=192.168.235.225) ERROR: Unexpected segmentation fault encountered in worker.
(train_bohb pid=420171, ip=192.168.235.225) ERROR: Training failed: DataLoader worker (pid(s) 423100) exited unexpectedly

🐛 Describe the bug

GitHub Issue Title: [Bug] DataLoader worker segmentation fault in multi-task training with long sequences and numerical instability

Description: A DataLoader worker process encountered a segmentation fault during a multi-task BERT training loop. The crash occurred after several thousand steps, consistently preceded by NaN/Inf warnings in the input tensors.

Environment:

  • OS: Linux (Ubuntu 22.04/24.04 candidate)
  • GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
  • Driver Version: 580.126.09
  • CUDA Version: 13.0
  • Python Version: 3.12
  • PyTorch/Transformers: (Please fill in the versions from your specific env, e.g., PyTorch 2.4.0)

Context & Observed Behavior:

  • Training Config: max_length=506, num_workers=4, per_device_eval_batch_size=64.
  • Symptoms:
    1. Training runs stably for ~3 hours (up to epoch 3.45).
    2. Numerical instability occurs: WARNING:tensorboardX.x2num:NaN or Inf found in input tensor.
    3. Shortly after, the DataLoader worker triggers a segmentation fault.
  • Hypothesis: The combination of near-limit sequence lengths (506/512) and NaN gradients might be triggering an illegal memory access in

Versions

Context:

  • Batch Size: 32 (Train) / 64 (Eval)
  • Max Sequence Length: 506 (near the 512 BERT limit)
  • Num Workers: 4
  • Shared Memory (/dev/shm): 32GB (Adequate)
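
The 32GB figure is the size of /dev/shm, not how much of it was free when the worker crashed. PyTorch typically reports shared-memory exhaustion in a worker as a bus error rather than a segmentation fault, so this is more of a cross-check than a likely cause. A minimal sketch for logging free space during training, using only the standard library; the helper name `shm_free_gib` is illustrative and not part of the original report:

```python
import shutil


def shm_free_gib(path: str = "/dev/shm") -> float:
    """Return the free space at `path` in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30
```

Calling `shm_free_gib()` every few hundred steps and writing the value to the training log would show whether free shared memory shrinks as the run approaches the crash point.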

Preliminary Analysis: The crash seems to be triggered when the model encounters numerical instability (NaN/Inf in tensors) combined with high memory/VRAM pressure from long sequences. It does not appear to be a simple OOM, as the system does not trigger the OOM Killer, but rather a segmentation fault in the worker process, possibly within a C++ extension or a CUDA kernel invoked during data collation/processing.

Steps to Reproduce (if applicable):

  1. Use a Multi-task BERT model with max_length set to 500+.
  2. Train with num_workers > 0 and fp16=True.
  3. Introduce or wait for a condition where gradients/loss become inf.
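
Step 3 can be forced rather than waited for. The largest finite float16 value is 65504, so a larger magnitude overflows to inf when converted to half precision. A small sketch of that overflow; the variable names are illustrative:

```python
import torch

# Largest finite value representable in float16.
fp16_max = torch.finfo(torch.float16).max

# A value inside the float16 range stays finite after conversion.
in_range = torch.tensor([60000.0]).half()

# A value above the float16 range overflows to inf on conversion.
overflowed = torch.tensor([70000.0]).half()

print(fp16_max)
print(torch.isfinite(in_range.float()).item())
print(torch.isinf(overflowed.float()).item())
```

Scaling a loss term by a large constant under `fp16=True` is one way to reach this condition deliberately in a reproduction script.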

cc @andrewkho @divyanshk @SsnL @VitalyFedyunin @dzhulgakov @scotts

extent analysis

TL;DR

The segmentation fault in the DataLoader worker process may be resolved by addressing numerical instability and reducing memory pressure, potentially by adjusting the sequence length or implementing gradient clipping.

Guidance

  • Confirm whether numerical instability is actually involved by checking the batch tensors and the loss for NaN/Inf in the steps leading up to the crash. The tensorboardX warning fires when a NaN or Inf value is passed to the writer (for example a logged loss), so it does not by itself show that the model's input batch was corrupted.
  • Consider reducing the maximum sequence length below 500 to lower memory pressure, then check whether the crash still occurs.
  • Apply gradient clipping to bound the gradient norm. Clipping limits large gradients; it does not remove NaN/Inf values that are already present in the data or the loss.
  • Monitor host memory, /dev/shm, and VRAM usage to rule out an out-of-memory condition.
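
The first check in the list can be sketched as a helper that scans a collated batch for non-finite values before it reaches the model. This assumes the batch is a dict of tensors, as is common for BERT-style collators; the function name `non_finite_keys` and the example keys are hypothetical.

```python
import torch


def non_finite_keys(batch: dict) -> list:
    """Return the keys whose floating-point tensors contain NaN or Inf."""
    bad = []
    for key, value in batch.items():
        # Integer tensors such as token ids cannot hold NaN/Inf, so skip them.
        if torch.is_tensor(value) and value.is_floating_point():
            if not torch.isfinite(value).all():
                bad.append(key)
    return bad


# Illustrative batch: integer token ids plus a float label tensor with a NaN.
batch = {
    "input_ids": torch.tensor([[101, 2023, 102]]),
    "labels": torch.tensor([0.5, float("nan")]),
}
print(non_finite_keys(batch))
```

Applying the same `torch.isfinite` test to the loss before logging it helps separate bad input data from instability that arises inside the model.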

Example

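The issue does not include the model or training loop, so the following is only a generic sketch of gradient clipping with a placeholder model:

import torch

# Placeholder model and batch; substitute the real multi-task BERT model and data.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

loss = model(torch.randn(8, 16)).sum()
optimizer.zero_grad()
loss.backward()

# Clip after backward() and before step(); the call returns the total norm before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()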

Notes

The root cause has not been established. The report suggests an interaction between numerical instability, memory pressure, and the DataLoader worker process, but none of these has been confirmed. Further debugging is needed to determine the actual cause and the most effective fix.
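
One way to narrow this down: DataLoader workers only run the dataset and collation code, not the forward or backward pass, so iterating the loader on its own separates data-pipeline faults from training-side instability. A minimal sketch with a stand-in dataset; the `drain` helper, the random token ids, and the vocabulary size of 30522 are illustrative assumptions, and the real dataset and `collate_fn` should be substituted.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def drain(loader, max_batches=None) -> int:
    """Iterate the loader without touching a model; return the number of batches read."""
    count = 0
    for _ in loader:
        count += 1
        if max_batches is not None and count >= max_batches:
            break
    return count


# Stand-in data shaped like the report's config: sequences of length 506.
dataset = TensorDataset(torch.randint(0, 30522, (256, 506)))
# num_workers=0 keeps loading in the main process, so a crash there yields
# a direct traceback; rerun with num_workers=4 to exercise the workers.
loader = DataLoader(dataset, batch_size=32, num_workers=0)
print(drain(loader))
```

If a full pass over the real dataset crashes here with no model involved, the NaN/Inf warnings are probably a separate problem from the segmentation fault.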

Recommendation

As a workaround, reduce the sequence length and apply gradient clipping. This may ease the numerical instability and memory pressure suspected of contributing to the segmentation fault, and it requires no significant changes to the underlying code or infrastructure. It is a mitigation rather than a confirmed fix, since the root cause is still unknown.
