vllm - 💡(How to fix) Fix [Bug]: Forked workers retain stale CUDA primary contexts from parent process

Your current environment

vLLM main (latest) and v0.8.5.post1, tested on H100 multi-GPU (tp=2).

How would you like to use vllm

I'm working on integrating NVIDIA's cuda-checkpoint tool with vLLM for near-zero cold starts (related to RFC #34303). During multi-GPU testing, I discovered that forked worker processes retain the parent's CUDA primary context for GPU 0, even when the worker is assigned to GPU 1.

Before submitting a new issue...

  • I have searched existing issues
  • I have read the relevant documentation

Describe the bug

When vLLM uses the fork multiprocessing method (the default on Linux), child worker processes inherit the parent's active CUDA primary contexts for all devices. A worker assigned to GPU 1 ends up with two active primary contexts: GPU 0 (inherited from parent) and GPU 1 (its own).

This causes two problems:

  1. Wasted GPU memory - the stale GPU 0 context in the GPU 1 worker holds driver-level allocations that never get freed
  2. NVIDIA cuda-checkpoint failures - cuda-checkpoint --action restore fails with "invalid argument" because it tries to restore both contexts in the worker process, but the GPU 0 context is stale and cannot be restored

Reproduction

# Run inside a forked worker process assigned to GPU 1.
import ctypes

import torch

libcuda = ctypes.CDLL("libcuda.so.1")
assert libcuda.cuInit(0) == 0, "cuInit failed"  # 0 is CUDA_SUCCESS

for dev_id in range(torch.cuda.device_count()):
    dev = ctypes.c_int()  # CUdevice handle for this ordinal
    libcuda.cuDeviceGet(ctypes.byref(dev), dev_id)
    flags = ctypes.c_uint()
    state = ctypes.c_int()  # nonzero when the primary context is active
    libcuda.cuDevicePrimaryCtxGetState(dev, ctypes.byref(flags), ctypes.byref(state))
    print(f"GPU {dev_id}: active={state.value != 0}")
# Output:
#   GPU 0: active=True   <-- STALE, inherited from parent
#   GPU 1: active=True   <-- worker's actual device

Root cause

In vllm/v1/worker/gpu_worker.py, Worker.init_device() calls torch.accelerator.set_device_index(self.device) to set the worker's device, but never releases inherited primary contexts from other devices. The parent process may have initialized CUDA on GPU 0 before forking, and that context persists in the child.

Fix

Call cuDevicePrimaryCtxRelease() for all non-assigned devices after setting the worker's device. I have a PR ready with the fix and a test.
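
A minimal sketch of what that release step could look like, not the code from the PR. The names CudaDriver and release_stale_primary_contexts are hypothetical and exist only for this illustration; the driver calls are the same ones used in the reproduction, plus cuDevicePrimaryCtxRelease. It assumes libcuda.so.1 is loadable in the worker and that the caller passes the worker's own device ordinal.

```python
import ctypes

CUDA_SUCCESS = 0


class CudaDriver:
    """Hypothetical thin ctypes wrapper over the driver calls used below."""

    def __init__(self, path="libcuda.so.1"):
        self._lib = ctypes.CDLL(path)
        if self._lib.cuInit(0) != CUDA_SUCCESS:
            raise RuntimeError("cuInit failed")

    def _device(self, ordinal):
        # Translate a device ordinal into a CUdevice handle.
        dev = ctypes.c_int()
        if self._lib.cuDeviceGet(ctypes.byref(dev), ordinal) != CUDA_SUCCESS:
            raise RuntimeError(f"cuDeviceGet failed for ordinal {ordinal}")
        return dev

    def primary_ctx_active(self, ordinal):
        flags, active = ctypes.c_uint(), ctypes.c_int()
        rc = self._lib.cuDevicePrimaryCtxGetState(
            self._device(ordinal), ctypes.byref(flags), ctypes.byref(active)
        )
        return rc == CUDA_SUCCESS and active.value != 0

    def release_primary_ctx(self, ordinal):
        rc = self._lib.cuDevicePrimaryCtxRelease(self._device(ordinal))
        return rc == CUDA_SUCCESS


def release_stale_primary_contexts(driver, device_count, assigned_ordinal):
    """Release active primary contexts on every device except the assigned one.

    Returns the ordinals for which the release call reported success.
    """
    released = []
    for ordinal in range(device_count):
        if ordinal == assigned_ordinal:
            continue  # never touch the worker's own device
        if driver.primary_ctx_active(ordinal) and driver.release_primary_ctx(ordinal):
            released.append(ordinal)
    return released
```

In a worker assigned to GPU 1 this might be called as release_stale_primary_contexts(CudaDriver(), torch.cuda.device_count(), 1), after the worker's device has been set. Primary contexts are reference counted, so a single release only drops one reference; re-running the state check from the reproduction afterwards is the way to confirm the context is actually inactive.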

Impact

This is a correctness issue for any external tooling that enumerates per-process CUDA contexts (cuda-checkpoint, GPU memory profilers, container checkpoint/restore). It also causes a small but unnecessary memory waste per worker process.
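
For tooling or a regression test, the condition to check can be stated as "no device other than the assigned one has an active primary context". A small sketch under that assumption follows; stale_context_ordinals is a hypothetical helper, and is_active stands for any callable that reports the primary context state for an ordinal (for example, one built on cuDevicePrimaryCtxGetState as in the reproduction).

```python
def stale_context_ordinals(is_active, device_count, assigned_ordinal):
    """Return ordinals with an active primary context that this worker does not own."""
    return [
        ordinal
        for ordinal in range(device_count)
        if ordinal != assigned_ordinal and is_active(ordinal)
    ]


# States taken from the reproduction output: both GPUs report active=True
# inside the worker assigned to GPU 1.
observed = {0: True, 1: True}
print(stale_context_ordinals(observed.get, 2, 1))  # [0]
```

A worker is clean when this list is empty.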
