vllm - [Bug]: Forked workers retain stale CUDA primary contexts from parent process

vllm · 2026-05-17 08:16:43

Root Cause

In vllm/v1/worker/gpu_worker.py, Worker.init_device() calls torch.accelerator.set_device_index(self.device) to set the worker's device, but never releases inherited primary contexts from other devices. The parent process may have initialized CUDA on GPU 0 before forking, and that context persists in the child.

Fix

Call cuDevicePrimaryCtxRelease() for all non-assigned devices after setting the worker's device. I have a PR ready with the fix and a test.
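The PR itself is not shown here, so the following is only a minimal sketch of what the fix described above could look like, not the actual patch. The function name release_stale_primary_contexts and its parameters are hypothetical. The driver handle is passed in as an argument (normally ctypes.CDLL("libcuda.so.1")) so the logic can be exercised without a GPU. It uses only the driver API calls already named in this issue: cuDeviceGet, cuDevicePrimaryCtxGetState, and cuDevicePrimaryCtxRelease.

```python
import ctypes

CUDA_SUCCESS = 0  # CUresult value for success in the CUDA driver API


def release_stale_primary_contexts(driver, assigned_device, device_count):
    """Release active primary contexts on every device except assigned_device.

    `driver` is any object exposing the driver API entry points used below,
    for example ctypes.CDLL("libcuda.so.1") after cuInit(0) has been called.
    Returns the ordinals for which a release call succeeded.
    Hypothetical helper for illustration; not taken from the vLLM source.
    """
    released = []
    for ordinal in range(device_count):
        if ordinal == assigned_device:
            continue  # never touch the worker's own context

        dev = ctypes.c_int()  # CUdevice handle
        if driver.cuDeviceGet(ctypes.pointer(dev), ordinal) != CUDA_SUCCESS:
            continue

        flags = ctypes.c_uint()
        active = ctypes.c_int()
        rc = driver.cuDevicePrimaryCtxGetState(
            dev, ctypes.pointer(flags), ctypes.pointer(active)
        )
        if rc != CUDA_SUCCESS or active.value == 0:
            continue  # nothing to release on this device

        if driver.cuDevicePrimaryCtxRelease(dev) == CUDA_SUCCESS:
            released.append(ordinal)
    return released
```

In a worker assigned to GPU 1, this would be called after the device is set, for example release_stale_primary_contexts(libcuda, 1, torch.cuda.device_count()). One caveat: cuDevicePrimaryCtxRelease drops a single reference, so if the inherited context was retained more than once, a single call may leave it active. Re-querying the state afterwards is a reasonable way to confirm.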

Code Example

# Run inside a forked worker process assigned to GPU 1.
import ctypes

import torch

libcuda = ctypes.CDLL("libcuda.so.1")  # CUDA driver API
libcuda.cuInit(0)

for dev_id in range(torch.cuda.device_count()):
    dev = ctypes.c_int()  # CUdevice handle for this ordinal
    libcuda.cuDeviceGet(ctypes.byref(dev), dev_id)
    flags = ctypes.c_uint()
    state = ctypes.c_int()  # nonzero means the primary context is active
    libcuda.cuDevicePrimaryCtxGetState(dev, ctypes.byref(flags), ctypes.byref(state))
    print(f"GPU {dev_id}: active={state.value != 0}")
# Output:
#   GPU 0: active=True   <-- STALE, inherited from parent
#   GPU 1: active=True   <-- worker's actual device

Your current environment

vLLM main (latest) and v0.8.5.post1, tested on H100 multi-GPU (tp=2).

How would you like to use vllm

I'm working on integrating NVIDIA's cuda-checkpoint tool with vLLM for near-zero cold starts (related to RFC #34303). During multi-GPU testing, I discovered that forked worker processes retain the parent's CUDA primary context for GPU 0, even when the worker is assigned to GPU 1.

Before submitting a new issue...

I have searched existing issues
I have read the relevant documentation

Describe the bug

When vLLM uses the fork multiprocessing method (the default on Linux), child worker processes inherit the parent's active CUDA primary contexts for all devices. A worker assigned to GPU 1 ends up with two active primary contexts: GPU 0 (inherited from parent) and GPU 1 (its own).

This causes two problems:

1. Wasted GPU memory: the stale GPU 0 context in the GPU 1 worker holds driver-level allocations that are never freed.
2. NVIDIA cuda-checkpoint failures: cuda-checkpoint --action restore fails with "invalid argument" because it tries to restore both contexts in the worker process, but the GPU 0 context is stale and cannot be restored.

Impact

This is a correctness issue for any external tooling that enumerates per-process CUDA contexts (cuda-checkpoint, GPU memory profilers, container checkpoint/restore). It also causes a small but unnecessary memory waste per worker process.
