vllm - [Feature]: Async cudaHostRegister in SimpleCPUOffloadConnector to unblock startup

Motivation

KVConnectorBase_V1.register_kv_caches() is called synchronously from the worker init path (vllm/v1/worker/gpu_model_runner.py:7041, inside initialize_kv_cache). For SimpleCPUOffloadConnector, the worker-side handler issues a single cudaHostRegister over the entire CPU KV region:

# vllm/v1/simple_kv_offload/worker.py:165-167
tensor = torch.zeros(cpu_shape, dtype=gpu_tensor.dtype, device="cpu")
if pin_memory:
    pin_tensor(tensor)   # -> cudaHostRegister, vllm/v1/simple_kv_offload/cuda_mem_ops.py:24

At TB scale this is operationally fatal. Modern offload deployments size cpu_bytes_to_use into the multi-TB range to maximize prefix reuse. A cudaHostRegister call over that much memory is dominated by the kernel pinning every page (NVIDIA driver lock + RDMA cgroup accounting + page-table walks). We have observed this single call running for many minutes on commodity hardware.

For the entire pin window:

register_kv_caches() does not return.
Worker initialize_cache collective_rpc does not return → engine startup does not finish.
The HTTP server cannot accept any request, even short prompts that would never touch the offload tier.
/health stays un-ready; autoscaler readiness probes time out.

This breaks elastic scaling under bursty traffic. When the autoscaler brings up new vLLM replicas in response to a spike, each replica has to sit through the entire pin window before serving a single request. The burst arrives, the replicas are still pinning, and traffic goes unserved. Slow-start probes don't help, because the engine has not finished booting.

The fix is to let the engine become serve-ready first and pin in the background. While cudaHostRegister runs, requests are served on the GPU-only path; once pinning completes, the connector starts participating without a restart and prefix hits from CPU become available.

The same blocking pattern exists in OffloadingConnector (CPU spec, vllm/v1/kv_offload/cpu/gpu_worker.py:89, pin_mmap_region) and is a candidate for a follow-up PR.

Proposal

Make SimpleCPUOffloadConnector's host-memory pin asynchronous, with the scheduler gating connector traffic until every TP rank reports "register done". The feature is off by default behind an extra_config flag, so existing behavior is preserved.

Worker side (vllm/v1/simple_kv_offload/worker.py):

When async_register_cache is set, register_kv_caches() allocates CPU tensors on the main thread (no pin work) and kicks off a threading.Thread that does:
1. pin_tensor(...) per tensor (the long part).
2. _backend.init(...) (wires load_stream / store_stream and the copy backend).
3. Flips self._ready = True.
The main thread returns immediately, so worker init completes and the engine becomes serve-ready.
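
The worker-side steps above can be sketched roughly as follows. This is a minimal illustration, not the vLLM implementation: AsyncRegisterSketch, pin_fn and init_backend_fn are hypothetical stand-ins for the worker handler, pin_tensor(...) and _backend.init(...), and the exception handling mirrors the failure mode described later in this issue.

```python
import logging
import threading
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger(__name__)


class AsyncRegisterSketch:
    """Hypothetical stand-in for the worker-side handler (not vLLM code)."""

    def __init__(self,
                 pin_fn: Callable[[object], None],
                 init_backend_fn: Callable[[], None]) -> None:
        self._pin_fn = pin_fn                    # stands in for pin_tensor(...)
        self._init_backend_fn = init_backend_fn  # stands in for _backend.init(...)
        self._ready = False
        self._thread: Optional[threading.Thread] = None

    def register_kv_caches(self, cpu_tensors: Sequence[object]) -> None:
        # The CPU tensors are assumed to be allocated already on the main
        # thread; only the slow pin and the backend init are deferred.
        self._thread = threading.Thread(
            target=self._pin_and_init,
            args=(list(cpu_tensors),),
            name="kv-offload-pin",
            daemon=True,
        )
        self._thread.start()  # returns immediately; init can proceed

    def _pin_and_init(self, cpu_tensors: List[object]) -> None:
        try:
            for tensor in cpu_tensors:
                self._pin_fn(tensor)      # step 1: the long part
            self._init_backend_fn()       # step 2: streams + copy backend
            self._ready = True            # step 3: publish readiness last
        except Exception:
            # _ready stays False, so the scheduler gate never opens and
            # serving continues on the GPU-only path.
            logger.exception("async pin failed; CPU offload stays disabled")

    @property
    def ready(self) -> bool:
        return self._ready
```

Setting _ready only after the backend init has finished is the important ordering here: anything that observes ready == True can assume both the pin and the backend are in place.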

Scheduler side (vllm/v1/simple_kv_offload/manager.py):

SimpleCPUOffloadScheduler gains _ready: bool (initialized to not async_register_cache) and _ready_ranks: set[int].
get_num_new_matched_tokens(...) short-circuits to (0, False) while not self._ready — requests are served, just without prefix hits from CPU offload, until the gate opens.
build_connector_meta(...) returns an empty SimpleCPUOffloadMetadata while not ready so the worker never tries to load/store before the backend is initialized.
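
A rough sketch of the scheduler-side gate, under simplifying assumptions: GatedSchedulerSketch and OffloadMetadataSketch are hypothetical stand-ins, the method signatures are reduced to the minimum needed to show the gate (the real connector methods take request and scheduler-output arguments), and _pending_loads is an invented placeholder for whatever the scheduler would normally hand to the worker.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OffloadMetadataSketch:
    """Hypothetical stand-in for SimpleCPUOffloadMetadata."""
    load_block_ids: List[int] = field(default_factory=list)
    store_block_ids: List[int] = field(default_factory=list)


class GatedSchedulerSketch:
    """Hypothetical stand-in for SimpleCPUOffloadScheduler (gate only)."""

    def __init__(self, async_register_cache: bool) -> None:
        # In synchronous mode the pin finished before init returned,
        # so the gate starts open.
        self._ready = not async_register_cache
        self._pending_loads: List[int] = []  # invented placeholder state

    def get_num_new_matched_tokens(self, num_cpu_hit_tokens: int) -> Tuple[int, bool]:
        if not self._ready:
            # Serve the request, but report no CPU prefix hit.
            return 0, False
        return num_cpu_hit_tokens, num_cpu_hit_tokens > 0

    def build_connector_meta(self) -> OffloadMetadataSketch:
        if not self._ready:
            # Empty metadata: the worker has nothing to load or store.
            return OffloadMetadataSketch()
        meta = OffloadMetadataSketch(load_block_ids=list(self._pending_loads))
        self._pending_loads.clear()
        return meta
```

Both entry points check the same flag, so a closed gate means no matched tokens are promised and no transfer work is described to the worker.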

Ready signal (SimpleCPUOffloadWorkerMetadata):

Adds ready_tprank: int | None = None. Worker's build_connector_worker_meta() sets it once when self._ready flips True. update_connector_output(...) accumulates ready_tprank across ranks; when the set size hits _expected_worker_count, scheduler flips self._ready = True. Re-uses the existing worker→scheduler channel — no new RPC.
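
The accumulation logic could look roughly like this. WorkerMetaSketch and ReadyAggregatorSketch are hypothetical names, and update_connector_output is simplified to take an iterable of per-rank metadata objects; how the real connector output is shaped is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class WorkerMetaSketch:
    """Hypothetical stand-in for SimpleCPUOffloadWorkerMetadata."""
    ready_tprank: Optional[int] = None  # set once, when the rank finishes pinning


class ReadyAggregatorSketch:
    """Hypothetical scheduler-side bookkeeping for the ready signal."""

    def __init__(self, expected_worker_count: int, async_register_cache: bool) -> None:
        self._expected_worker_count = expected_worker_count
        self._ready = not async_register_cache
        self._ready_ranks: Set[int] = set()

    def update_connector_output(self, worker_metas: Iterable[WorkerMetaSketch]) -> None:
        if self._ready:
            return  # gate already open; nothing to track
        for meta in worker_metas:
            if meta.ready_tprank is not None:
                # A set makes repeated reports from one rank harmless.
                self._ready_ranks.add(meta.ready_tprank)
        if len(self._ready_ranks) >= self._expected_worker_count:
            self._ready = True
```

Tracking ranks in a set rather than a counter means a rank that reports more than once cannot open the gate on behalf of a rank that is still pinning.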

Config:

New extra_config["async_register_cache"] = bool (default False). Documented in the connector docstring as recommended once cpu_bytes_to_use_per_rank enters the multi-hundred-GB / TB range.
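
For illustration, enabling the flag might look like the sketch below. The kv_connector / kv_role / kv_connector_extra_config layout follows vLLM's KV-transfer config fields; the exact key name and unit for the CPU budget are assumptions taken from the names used in this issue, and async_register_enabled is a hypothetical helper that assumes the value has already been parsed into a Python bool.

```python
from typing import Any, Mapping


def async_register_enabled(extra_config: Mapping[str, Any]) -> bool:
    """Hypothetical helper: read the proposed flag, defaulting to off."""
    return bool(extra_config.get("async_register_cache", False))


# Illustrative settings only; key names for the CPU budget are assumed.
kv_transfer_config = {
    "kv_connector": "SimpleCPUOffloadConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "cpu_bytes_to_use": 200 * 1024**3,  # 200 GiB, expressed in bytes (assumed unit)
        "async_register_cache": True,       # proposed flag, default False
    },
}

enabled = async_register_enabled(kv_transfer_config["kv_connector_extra_config"])
```

With the flag absent, the helper returns False and the connector keeps today's synchronous pin.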

Failure modes considered

If the background pin thread fails, the worker logs the error via logger.exception(...) and the gate never opens → safe degradation to GPU-only serving (no CPU offload).
Crash inside the thread can't reach transfer_async: the scheduler-side gate prevents transfer calls from being built at all while not ready.
Race with bind_connector_metadata during the gap: build_connector_meta produces empty metadata while not ready, so there is nothing for the worker to bind.

Scope of the proposed first PR

Only SimpleCPUOffloadConnector. Follow-ups (separate PRs, after this lands and the pattern is reviewed):

OffloadingConnector with CPU spec: same pattern around pin_mmap_region.
NixlConnector / MooncakeConnector: their registration goes through third-party libs (NIXL register_memory, Mooncake batch_register_memory); handshake-vs-register ordering needs separate discussion.

Test plan

Unit: SimpleCPUOffloadScheduler returns (0, False) while not ready; flips after tp_size ready_tprank reports.
Unit: SimpleCPUOffloadWorker.register_kv_caches(async_register_cache=True) returns in milliseconds; self._ready flips after the thread joins.
Integration: launch with cpu_bytes_to_use=200GB, async_register_cache=true; mock cudaHostRegister with a sleep, confirm /health becomes ready during the pin and a request submitted in the gap completes (without prefix hit from CPU offload).
Regression: with async_register_cache=false (default), behavior identical to today.
