vllm: [Feature]: Async cudaHostRegister in SimpleCPUOffloadConnector to unblock startup




Motivation

KVConnectorBase_V1.register_kv_caches() is called synchronously from the worker init path (vllm/v1/worker/gpu_model_runner.py:7041, inside initialize_kv_cache). For SimpleCPUOffloadConnector, the worker-side handler issues a single cudaHostRegister over the entire CPU KV region:

# vllm/v1/simple_kv_offload/worker.py:165-167
tensor = torch.zeros(cpu_shape, dtype=gpu_tensor.dtype, device="cpu")
if pin_memory:
    pin_tensor(tensor)   # -> cudaHostRegister, vllm/v1/simple_kv_offload/cuda_mem_ops.py:24

At TB-scale this is operationally fatal. Modern offload deployments are sizing cpu_bytes_to_use into the multi-TB range to maximize prefix reuse. cudaHostRegister over that much memory is dominated by the kernel pinning every page (NVIDIA driver lock + RDMA cgroup accounting + page-table walks) and we have observed this single call running for many minutes on commodity hardware.

For the entire pin window:

  • register_kv_caches() does not return.
  • Worker initialize_cache collective_rpc does not return → engine startup does not finish.
  • The HTTP server cannot accept any request, even short prompts that would never touch the offload tier.
  • /health stays un-ready; autoscaler readiness probes time out.

This breaks elastic scaling under bursty traffic. When the autoscaler brings new vLLM replicas up in response to a spike, those replicas sit through the entire pin window before serving a single request. The burst arrives, the replicas are still pinning, and traffic goes unserved. Slow-start probes don't help, because the engine has not finished booting.

The fix is to let the engine become serve-ready first and pin in the background. While cudaHostRegister runs, requests are served on the GPU-only path; once pinning completes, the connector joins in transparently and prefix hits from CPU become available.

The same blocking pattern exists in OffloadingConnector (CPU spec, vllm/v1/kv_offload/cpu/gpu_worker.py:89, pin_mmap_region) and is a candidate for a follow-up PR.

Proposal

Make SimpleCPUOffloadConnector's host-memory pin asynchronous, with the scheduler gating connector traffic until every TP rank reports "register done". The feature is off by default behind an extra_config flag, so existing behavior is preserved.

Worker side (vllm/v1/simple_kv_offload/worker.py):

  • When async_register_cache is set, register_kv_caches() allocates CPU tensors on the main thread (no pin work) and kicks off a threading.Thread that does:
    1. pin_tensor(...) per tensor (the long part).
    2. _backend.init(...) (wires load_stream / store_stream and the copy backend).
    3. Flips self._ready = True.
  • Main thread returns immediately so worker init completes and the engine becomes serve-ready.
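A minimal sketch of the worker-side flow above, under stated assumptions: `WorkerSketch`, `slow_pin`, and the injected `pin_fn` / `init_backend_fn` callables are hypothetical stand-ins for the real worker, `pin_tensor(...)`, and `_backend.init(...)`. No CUDA or torch is involved, and a sleep simulates the long pin.

```python
import threading
import time


class WorkerSketch:
    """Toy stand-in for the worker-side handler; not vLLM's real class."""

    def __init__(self, pin_fn, init_backend_fn):
        self._pin_fn = pin_fn                    # stands in for pin_tensor(...)
        self._init_backend_fn = init_backend_fn  # stands in for _backend.init(...)
        self._ready = False
        self._pin_thread = None

    def register_kv_caches(self, cpu_tensors, async_register_cache=False):
        if not async_register_cache:
            # Blocking path: pin and init on the caller's thread.
            self._pin_and_init(cpu_tensors)
            return
        # Async path: hand the long work to a background thread and return.
        self._pin_thread = threading.Thread(
            target=self._pin_and_init,
            args=(cpu_tensors,),
            name="kv-offload-pin",
            daemon=True,
        )
        self._pin_thread.start()

    def _pin_and_init(self, cpu_tensors):
        for tensor in cpu_tensors:
            self._pin_fn(tensor)                 # 1. the long part
        self._init_backend_fn(cpu_tensors)       # 2. wire up the copy backend
        self._ready = True                       # 3. open the worker-side gate


def slow_pin(tensor):
    time.sleep(0.2)  # simulated pin latency (assumption, not a measurement)


worker = WorkerSketch(slow_pin, lambda tensors: None)
start = time.monotonic()
worker.register_kv_caches([bytearray(8), bytearray(8)], async_register_cache=True)
returned_in = time.monotonic() - start
ready_at_return = worker._ready   # pin is still running at this point
worker._pin_thread.join()
ready_after_join = worker._ready
```

In this sketch the readiness flag is a plain attribute written once by the background thread and only read elsewhere; a real implementation might prefer a `threading.Event` if other threads need to wait on it.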

Scheduler side (vllm/v1/simple_kv_offload/manager.py):

  • SimpleCPUOffloadScheduler gains _ready: bool (initialized to not async_register_cache) and _ready_ranks: set[int].
  • get_num_new_matched_tokens(...) short-circuits to (0, False) while not self._ready — requests are served, just without prefix hits from CPU offload, until the gate opens.
  • build_connector_meta(...) returns an empty SimpleCPUOffloadMetadata while not ready so the worker never tries to load/store before the backend is initialized.
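The scheduler-side gate can be sketched as follows. This is an illustration only: `SchedulerGateSketch` and `MetadataSketch` are hypothetical stand-ins for `SimpleCPUOffloadScheduler` and `SimpleCPUOffloadMetadata`, and the `cpu_hit_tokens` / `pending_loads` fields fake the results of a real prefix lookup.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for SimpleCPUOffloadMetadata."""
    load_block_ids: list = field(default_factory=list)


class SchedulerGateSketch:
    def __init__(self, async_register_cache, cpu_hit_tokens=0, pending_loads=()):
        # Ready immediately unless the async path is enabled.
        self._ready = not async_register_cache
        self._ready_ranks = set()
        self._cpu_hit_tokens = cpu_hit_tokens      # faked lookup result
        self._pending_loads = list(pending_loads)  # faked transfer plan

    def get_num_new_matched_tokens(self, request, num_computed_tokens):
        if not self._ready:
            # Gate closed: report no CPU prefix hit, no async load.
            return 0, False
        return self._cpu_hit_tokens, self._cpu_hit_tokens > 0

    def build_connector_meta(self, scheduler_output):
        if not self._ready:
            # Empty metadata: the worker has nothing to load or store.
            return MetadataSketch()
        return MetadataSketch(load_block_ids=list(self._pending_loads))


gate = SchedulerGateSketch(True, cpu_hit_tokens=64, pending_loads=[1, 2, 3])
before = gate.get_num_new_matched_tokens(None, 0)
meta_before = gate.build_connector_meta(None)
gate._ready = True  # what the ready signal would eventually do
after = gate.get_num_new_matched_tokens(None, 0)
meta_after = gate.build_connector_meta(None)
```

Gating both methods keeps the two halves consistent: a request never sees a CPU hit that the worker could not yet service.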

Ready signal (SimpleCPUOffloadWorkerMetadata):

  • Adds ready_tprank: int | None = None. The worker's build_connector_worker_meta() sets it once, when self._ready flips to True. On the scheduler, update_connector_output(...) collects the reported ready_tprank values across ranks into a set; when the set size reaches _expected_worker_count, the scheduler flips self._ready = True. This reuses the existing worker→scheduler channel, so no new RPC is needed.
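A rough sketch of the ready-signal accumulation, assuming a simplified `update_connector_output` that receives a plain list of per-worker metadata objects (the real method's argument is not shown in this issue). `WorkerMetaSketch` and `ReadyTrackerSketch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerMetaSketch:
    """Hypothetical stand-in for SimpleCPUOffloadWorkerMetadata."""
    ready_tprank: Optional[int] = None


class ReadyTrackerSketch:
    def __init__(self, expected_worker_count, async_register_cache=True):
        self._expected_worker_count = expected_worker_count
        self._ready = not async_register_cache
        self._ready_ranks = set()

    def update_connector_output(self, worker_metas):
        # Collect every rank that reports "register done"; duplicates are
        # harmless because the ranks live in a set.
        for meta in worker_metas:
            if meta.ready_tprank is not None:
                self._ready_ranks.add(meta.ready_tprank)
        if len(self._ready_ranks) >= self._expected_worker_count:
            self._ready = True


tracker = ReadyTrackerSketch(expected_worker_count=2)
tracker.update_connector_output([WorkerMetaSketch(ready_tprank=0)])
ready_after_one = tracker._ready
tracker.update_connector_output([WorkerMetaSketch(ready_tprank=0)])  # repeat report
ready_after_repeat = tracker._ready
tracker.update_connector_output([WorkerMetaSketch(), WorkerMetaSketch(ready_tprank=1)])
ready_after_all = tracker._ready
```

Using a set of ranks rather than a counter means a rank that reports more than once cannot open the gate early.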

Config:

  • New extra_config["async_register_cache"] = bool (default False). Documented in the connector docstring as recommended once cpu_bytes_to_use_per_rank enters the multi-hundred-GB / TB range.
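For illustration, enabling the flag might look like the fragment below. Only `async_register_cache` comes from this proposal. The outer keys (`kv_connector`, `kv_role`, `kv_connector_extra_config`) follow vLLM's usual kv-transfer configuration layout as an assumption, and the CPU sizing key is deliberately left out because its exact name and units are not settled here.

```python
import json

# Assumed layout of a kv-transfer config; adjust to the connector's real docs.
kv_transfer_config = {
    "kv_connector": "SimpleCPUOffloadConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        # Proposed flag, default False: pin host memory in the background.
        "async_register_cache": True,
    },
}

# JSON form, e.g. for passing the config on a command line.
kv_transfer_config_json = json.dumps(kv_transfer_config)
```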

Failure modes considered

  • If the background pin thread fails, worker logs logger.exception(...) and the gate never opens → safe degradation to GPU-only serving (no CPU offload).
  • Crash inside the thread can't reach transfer_async: the scheduler-side gate prevents transfer calls from being built at all while not ready.
  • Race with bind_connector_metadata during the gap: build_connector_meta produces empty metadata while not ready, so there is nothing for the worker to bind.
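The first failure mode can be sketched like this: if the pin raises inside the background thread, the exception is logged and the ready flag is never set, so no ready rank is ever reported. `FailingPinWorkerSketch`, `broken_pin`, and `ready_tprank_to_report` are hypothetical names used only for this illustration.

```python
import logging
import threading

logger = logging.getLogger("kv_offload_sketch")


class FailingPinWorkerSketch:
    def __init__(self, pin_fn):
        self._pin_fn = pin_fn   # stands in for pin_tensor(...)
        self._ready = False
        self._thread = None

    def start_async_pin(self, cpu_tensors):
        self._thread = threading.Thread(
            target=self._run, args=(cpu_tensors,), daemon=True
        )
        self._thread.start()

    def _run(self, cpu_tensors):
        try:
            for tensor in cpu_tensors:
                self._pin_fn(tensor)
        except Exception:
            # Log with traceback and leave the gate closed: GPU-only serving.
            logger.exception("background pin failed; CPU offload stays disabled")
            return
        self._ready = True

    def ready_tprank_to_report(self, tp_rank):
        # Only a worker that finished pinning reports its rank.
        return tp_rank if self._ready else None


def broken_pin(tensor):
    raise RuntimeError("simulated cudaHostRegister failure")


failing_worker = FailingPinWorkerSketch(broken_pin)
failing_worker.start_async_pin([bytearray(8)])
failing_worker._thread.join()
reported = failing_worker.ready_tprank_to_report(tp_rank=0)
```

Because the scheduler waits for every expected rank, a single failed rank is enough to keep the whole connector gated off.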

Scope of the proposed first PR

Only SimpleCPUOffloadConnector. Follow-ups (separate PRs, after this lands and the pattern is reviewed):

  • OffloadingConnector with CPU spec: same pattern around pin_mmap_region.
  • NixlConnector / MooncakeConnector: their registration goes through third-party libs (NIXL register_memory, Mooncake batch_register_memory); handshake-vs-register ordering needs separate discussion.

Test plan

  • Unit: SimpleCPUOffloadScheduler returns (0, False) while not ready; flips after tp_size ready_tprank reports.
  • Unit: SimpleCPUOffloadWorker.register_kv_caches(async_register_cache=True) returns in milliseconds; self._ready flips after the thread joins.
  • Integration: launch with cpu_bytes_to_use=200GB, async_register_cache=true; mock cudaHostRegister with a sleep, confirm /health becomes ready during the pin and a request submitted in the gap completes (without prefix hit from CPU offload).
  • Regression: with async_register_cache=false (default), behavior identical to today.
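The worker unit test could be sketched as below. This is a sketch under assumptions, not the project's test code: `mem_ops` and `AsyncWorkerSketch` are hypothetical stand-ins, and a `threading.Event` replaces the sleep suggested above so the test is deterministic.

```python
import threading
import types
import unittest
from unittest import mock

# Hypothetical stand-in for the module that owns pin_tensor.
mem_ops = types.SimpleNamespace(pin_tensor=lambda tensor: None)


class AsyncWorkerSketch:
    def __init__(self):
        self._ready = False
        self._thread = None

    def register_kv_caches(self, cpu_tensors):
        self._thread = threading.Thread(
            target=self._pin_all, args=(cpu_tensors,), daemon=True
        )
        self._thread.start()

    def _pin_all(self, cpu_tensors):
        for tensor in cpu_tensors:
            mem_ops.pin_tensor(tensor)  # looked up at call time, so patchable
        self._ready = True


class AsyncRegisterTest(unittest.TestCase):
    def test_returns_before_pin_finishes(self):
        release = threading.Event()

        def blocked_pin(tensor):
            release.wait(timeout=5)  # simulated long pin, released by the test

        with mock.patch.object(mem_ops, "pin_tensor", side_effect=blocked_pin) as fake_pin:
            worker = AsyncWorkerSketch()
            worker.register_kv_caches([bytearray(8)])
            self.assertFalse(worker._ready)   # call returned while pin is blocked
            release.set()
            worker._thread.join(timeout=5)
            self.assertTrue(worker._ready)    # flag flips once the pin completes
            fake_pin.assert_called_once()


def run_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AsyncRegisterTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` runs the single case; in a real test suite the `TestCase` would simply be collected by the project's test runner.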
