pytorch - 💡(How to fix) Fix RuntimeError: pidfd_getfd: Bad file descriptor [1 comment, 1 participant]

GitHub stats
pytorch/pytorch#179220 · Fetched 2026-04-08 02:32:49
Comments: 1 · Participants: 1 · Timeline events: 110 · Reactions: 0
Timeline (top): mentioned ×49, subscribed ×49, labeled ×7, cross-referenced ×2

Fix Action

Fix / Workaround

According to the reporter's analysis, the consumer is a separate process that has not allocated any CUDA memory yet, so it has never probed for fabric handle support and does not know that fabric handles are in use. It therefore deserializes the handle it receives as a POSIX_FD and reads a portion of the CUmemFabricHandle as a bogus file descriptor. Creating a dummy tensor in the consumer before the first queue.get(), for example torch.empty(1, device=f'cuda:{device_id}'), forces the consumer to realize that it should be using fabric handles.

🐛 Describe the bug

My colleague @sbak5 reported an issue to me regarding CUDA IPC when using expandable segments, with the following code:

#!/usr/bin/env python3
import torch
import torch.multiprocessing as mp
from queue import Empty
import time
import os


def producer(queue, device_id=0):
    """Producer process that creates GPU tensors and puts them in the queue."""
    print(f"Producer: Starting on GPU {device_id}")
    print(f"Producer PYTORCH_CUDA_ALLOC_CONF: {os.environ.get('PYTORCH_CUDA_ALLOC_CONF', 'not set')}")
    print(f"Producer PYTORCH_ALLOC_CONF: {os.environ.get('PYTORCH_ALLOC_CONF', 'not set')}")

    # Set CUDA device
    torch.cuda.set_device(device_id)
    device = torch.device(f'cuda:{device_id}')

    # Create some GPU tensors
    for i in range(5):
        # Create a tensor on GPU
        tensor = torch.randn(3, 4, device=device) * (i + 1)
        print(f"Producer: Created tensor {i} with shape {tensor.shape} on {tensor.device}")
        print(f"Producer: Tensor {i} mean value: {tensor.mean().item():.4f}")

        # Share tensor directly via CUDA IPC (zero-copy)
        queue.put((i, tensor))
        print(f"Producer: Put tensor {i} in queue")
        time.sleep(0.5)

    # Signal completion
    queue.put(None)
    print("Producer: Finished")


def consumer(queue, device_id=0):
    """Consumer process that retrieves GPU tensors from the queue."""
    print(f"Consumer: Starting on GPU {device_id}")
    print(f"Consumer PYTORCH_CUDA_ALLOC_CONF: {os.environ.get('PYTORCH_CUDA_ALLOC_CONF', 'not set')}")
    print(f"Consumer PYTORCH_ALLOC_CONF: {os.environ.get('PYTORCH_ALLOC_CONF', 'not set')}")

    # Set CUDA device
    torch.cuda.set_device(device_id)

    while True:
        try:
            # Get tensor from queue (zero-copy CUDA IPC)
            item = queue.get(timeout=5)

            if item is None:
                print("Consumer: Received termination signal")
                break

            idx, tensor = item
            print(f"Consumer: Received tensor {idx} with shape {tensor.shape} on {tensor.device}")
            print(f"Consumer: Tensor {idx} mean value: {tensor.mean().item():.4f}")

            # Perform some operation on the tensor
            result = tensor * 2
            print(f"Consumer: Processed tensor {idx}, new mean: {result.mean().item():.4f}")

        except Empty:
            print("Consumer: Queue timeout")
            break

    print("Consumer: Finished")


def main():
    """Main function to set up and run the multiprocessing example."""
    # Check if CUDA is available
    if not torch.cuda.is_available():
        print("CUDA is not available. This example requires GPU support.")
        return

    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

    # Check current allocator configuration
    cuda_conf = os.environ.get('PYTORCH_CUDA_ALLOC_CONF', 'not set')
    alloc_conf = os.environ.get('PYTORCH_ALLOC_CONF', 'not set')
    print(f"Main PYTORCH_CUDA_ALLOC_CONF: {cuda_conf}")
    print(f"Main PYTORCH_ALLOC_CONF: {alloc_conf}")

    # Set multiprocessing start method to 'spawn' (required for CUDA)
    mp.set_start_method('spawn', force=True)

    # Create a multiprocessing queue
    queue = mp.Queue()

    # Determine which GPU to use
    device_id = 0

    # Create producer and consumer processes
    print("Starting processes...")
    producer_process = mp.Process(target=producer, args=(queue, device_id))
    consumer_process = mp.Process(target=consumer, args=(queue, device_id))

    # Start processes
    producer_process.start()
    consumer_process.start()

    # Wait for processes to complete
    producer_process.join()
    consumer_process.join()

    print("\nAll processes completed!")


if __name__ == '__main__':
    main()

Without expandable segments, it works fine. With expandable segments enabled (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), the consumer fails, for example on GB200:

$ srun --container-image=nvcr.io/nvidia/pytorch:26.03-py3 --container-mounts ~/test_gpu_tensor_ipc.py:/test_gpu_tensor_ipc.py env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 /test_gpu_tensor_ipc.py
pyxis: importing docker image: nvcr.io/nvidia/pytorch:26.03-py3
pyxis: imported docker image: nvcr.io/nvidia/pytorch:26.03-py3
CUDA available: True
Number of GPUs: 4
Main PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
Main PYTORCH_ALLOC_CONF: not set
Starting processes...
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/test_gpu_tensor_ipc.py", line 48, in consumer
    item = queue.get(timeout=5)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/multiprocessing/reductions.py", line 180, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/storage.py", line 1464, in _new_shared_cuda
    return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: pidfd_getfd: Bad file descriptor
Consumer: Starting on GPU 0
Consumer PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
Consumer PYTORCH_ALLOC_CONF: not set
Producer: Starting on GPU 0
Producer PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
Producer PYTORCH_ALLOC_CONF: not set
Producer: Created tensor 0 with shape torch.Size([3, 4]) on cuda:0
Producer: Tensor 0 mean value: 0.4615
Producer: Put tensor 0 in queue
Producer: Created tensor 1 with shape torch.Size([3, 4]) on cuda:0
Producer: Tensor 1 mean value: 0.3834
Producer: Put tensor 1 in queue
Producer: Created tensor 2 with shape torch.Size([3, 4]) on cuda:0
Producer: Tensor 2 mean value: -0.6317
Producer: Put tensor 2 in queue
Producer: Created tensor 3 with shape torch.Size([3, 4]) on cuda:0
Producer: Tensor 3 mean value: -0.1617
Producer: Put tensor 3 in queue
Producer: Created tensor 4 with shape torch.Size([3, 4]) on cuda:0
Producer: Tensor 4 mean value: -0.4775
Producer: Put tensor 4 in queue
Producer: Finished
[W402 20:55:40.208040611 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

All processes completed

Running the script under strace shows the failing system call:

[pid 1587003] pidfd_getfd(61, 2097152, 0) = -1 EBADF (Bad file descriptor)

The second argument, targetfd, is 2097152. That is a bogus descriptor number for this example, but it is not random: it is exactly 2^21.
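As a quick sanity check on that number, the sketch below inspects only the value reported by strace. It shows that 2097152 has a single bit set and prints its little-endian 32-bit byte pattern, which is the kind of value one would expect when unrelated bytes are reinterpreted as an integer. It makes no claim about the actual layout of the handle.

```python
import struct

target_fd = 2097152  # second argument of pidfd_getfd in the strace line

# A positive integer with exactly one bit set is a power of two.
is_power_of_two = target_fd > 0 and (target_fd & (target_fd - 1)) == 0
exponent = target_fd.bit_length() - 1

# Little-endian 32-bit encoding of the same value.
raw = struct.pack("<i", target_fd)

print(is_power_of_two, exponent)
print(hex(target_fd), raw.hex(" "))
```

A descriptor number this large is far outside what a small test script would normally hold open, which is consistent with the value not being a real file descriptor at all.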

I observed that adding a dummy tensor creation in the consumer solves the problem:

 torch.empty(1, device=f'cuda:{device_id}')

I think this shows a mismatch between the serialization and deserialization paths.

The producer allocates a buffer and probes for fabric handle support. Fabric handles are supported, so a CUmemFabricHandle is serialized.

The consumer is a different process and has not allocated anything yet, so it is not aware that fabric handles are supported, and it does not check for that support before deserializing. It therefore tries to deserialize the handle it received as a POSIX_FD, reading a portion of the CUmemFabricHandle as a bogus file descriptor. The workaround of creating a dummy tensor forces the consumer to realize that it should be using fabric handles.
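The suspected mismatch can be illustrated with a toy model. This is not PyTorch's code: ToyProcess, its fields, and the blob contents are all hypothetical, and the blob bytes were picked only so that the toy reproduces the number seen in strace. The one assumption carried over from the analysis above is that the handle type is decided lazily, as a side effect of the first allocation.

```python
import struct
from dataclasses import dataclass

POSIX_FD = "posix_fd"
FABRIC = "fabric"


@dataclass
class ToyProcess:
    """Hypothetical stand-in for a process that picks its handle type lazily."""
    fabric_supported: bool
    handle_type: str = POSIX_FD  # default until an allocation runs the probe

    def allocate(self) -> None:
        # Assumption: the capability probe only happens when allocating.
        if self.fabric_supported:
            self.handle_type = FABRIC

    def serialize(self, fd: int, fabric_blob: bytes) -> bytes:
        if self.handle_type == FABRIC:
            return fabric_blob
        return struct.pack("<i", fd)

    def deserialize(self, payload: bytes):
        if self.handle_type == FABRIC:
            return (FABRIC, payload)
        # Reads the first four bytes of whatever arrived as a descriptor.
        return (POSIX_FD, struct.unpack("<i", payload[:4])[0])


# Made-up opaque blob; the real handle layout is not known here.
blob = bytes([0x00, 0x00, 0x20, 0x00]) + bytes(60)

producer = ToyProcess(fabric_supported=True)
producer.allocate()
payload = producer.serialize(fd=7, fabric_blob=blob)

cold_consumer = ToyProcess(fabric_supported=True)
print(cold_consumer.deserialize(payload))  # decoded as a descriptor number

warm_consumer = ToyProcess(fabric_supported=True)
warm_consumer.allocate()  # plays the role of the dummy torch.empty(...)
print(warm_consumer.deserialize(payload)[0])
```

In the toy, the only difference between the two consumers is whether allocate() ran before deserialize(), which mirrors the effect of the dummy tensor.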

Versions

$ python collect_env.py
Collecting environment information...
PyTorch version: 2.11.0a0+a6c236b9fd.nv26.03.46836102
Is debug build: False
CUDA used to build PyTorch: 13.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (aarch64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39

cc @VitalyFedyunin @albanD @pragupta @ppwwyyxx @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @mruberry @mikaylagawarecki

extent analysis

TL;DR

The issue can be resolved by adding a dummy tensor creation in the consumer process to force it to realize that it should be using fabric handles for CUDA IPC.

Guidance

  • The error occurs due to a mismatch between the serialization and deserialization processes of CUDA tensors when using expandable segments.
  • The producer process allocates a buffer and probes for fabric handle support, but the consumer process is not aware of this and tries to deserialize the handle as a POSIX_FD, resulting in a bogus file descriptor error.
  • Creating a dummy tensor in the consumer process, such as torch.empty(1, device=f'cuda:{device_id}'), forces the consumer to realize that it should be using fabric handles, resolving the issue.
  • That the workaround helps suggests the consumer only learns about fabric handle support once it has made a CUDA allocation of its own, so the failure depends on whether the consumer allocates before it receives its first tensor.

Example

# In the consumer function, add the following line before the while loop
torch.empty(1, device=f'cuda:{device_id}')
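For context, here is a sketch of where that line sits in a trimmed-down consumer. The helper name warm_up_cuda_allocator is hypothetical, and the ImportError and is_available guards are additions so the snippet can also be imported on a machine without torch or CUDA. The dummy allocation itself is the one from the issue.

```python
from queue import Empty


def warm_up_cuda_allocator(device_id: int = 0) -> bool:
    """Make one tiny CUDA allocation in this process (hypothetical helper).

    Returns True if the allocation was made, False if torch or CUDA is missing.
    """
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.set_device(device_id)
    # The dummy allocation from the issue; the tensor is discarded immediately.
    torch.empty(1, device=f"cuda:{device_id}")
    return True


def consumer(queue, device_id=0):
    # Must run before the first queue.get() that can return a CUDA tensor.
    warm_up_cuda_allocator(device_id)

    while True:
        try:
            item = queue.get(timeout=5)
        except Empty:
            print("Consumer: Queue timeout")
            break
        if item is None:
            print("Consumer: Received termination signal")
            break
        idx, tensor = item
        print(f"Consumer: Received tensor {idx} on {tensor.device}")
```

Keeping the allocation in a named helper makes it easy to remove once a proper fix is available.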

Notes

  • The issue is specific to the use of expandable segments and CUDA IPC.
  • The workaround provided may not be the only solution, and further investigation into the underlying cause of the mismatch between serialization and deserialization processes may be necessary.
  • The use of strace to diagnose the issue was helpful in identifying the bogus file descriptor error.
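Since the failure only appears with expandable segments, a script may want to detect that setting and apply the workaround conditionally. The sketch below only inspects the two environment variables printed by the reproducer and does not query what the allocator actually did. The function names are hypothetical, and treating any capitalisation of "true" as enabled is an assumption.

```python
import os


def parse_alloc_conf(value: str) -> dict:
    """Split a 'key:value,key:value' string into a dict of stripped strings."""
    options = {}
    for entry in value.split(","):
        key, sep, val = entry.partition(":")
        if sep:  # skip empty or malformed entries that have no colon
            options[key.strip()] = val.strip()
    return options


def expandable_segments_requested(environ=os.environ) -> bool:
    """Return True if either allocator variable asks for expandable segments."""
    for name in ("PYTORCH_ALLOC_CONF", "PYTORCH_CUDA_ALLOC_CONF"):
        options = parse_alloc_conf(environ.get(name, ""))
        # Assumption: compare case-insensitively against "true".
        if options.get("expandable_segments", "").lower() == "true":
            return True
    return False


if __name__ == "__main__":
    print(expandable_segments_requested())
```

With the command line from the report, where PYTORCH_CUDA_ALLOC_CONF is set to expandable_segments:True, this check returns True.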

Recommendation

Apply the workaround by adding a dummy tensor creation in the consumer process, as it is a simple and effective solution to the problem. This will allow the consumer process to correctly deserialize the CUDA tensors using fabric handles.
