vllm - [RFC]: Zero-copy LoRA loading from tmpfs via mmap + cudaHostRegister

Motivation.

vLLM's LoRA load path goes through safetensors.safe_open(...).get_tensor(...). The library's Mmap backend, while called "mmap", actually copies each tensor's bytes from the mmap'd file region into a freshly-allocated Python PyByteArray (safetensors/bindings/python/src/lib.rs:825-830):

Storage::Mmap(mmap) => {
    let data = &mmap[...];
    PyByteArray::new(py, data)   // alloc + memcpy, NOT a view
}

The resulting tensor views the PyByteArray, not the original mmap region. This has three consequences for a downstream project that stages LoRA weights into tmpfs and calls cudaHostRegister on them to skip disk IO and pinning overhead (e.g. verl-project/verl#5616):

  1. The caller-side cudaHostRegister on the tmpfs file region is invisible to PyTorch: is_pinned() returns False because the tensor's data_ptr points into the Python heap, not into the registered region.
  2. from_lora_tensors's .pin_memory() allocates a fresh pinned buffer and memcpys into it, which is exactly the work the caller was trying to avoid.
  3. .to("cuda", non_blocking=True) goes through PyTorch's bounce buffer because the source is unpinned.

For SSD-backed LoRA loading (the common case), this is fine — the copy fits in the disk read latency. For tmpfs-backed loading (the in-memory rollout case), the copy + pin + bounce buffer dominate end-to-end cost and explain why verl had to monkey-patch _load_adapter to bypass the safetensors path entirely.
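
The copy-versus-view distinction described above can be reproduced with the standard library alone. The following is a minimal sketch, not the safetensors code path itself: it maps a scratch file, takes one slice as a copy and one as a memoryview, and then writes through the mapping. The file name and sizes are arbitrary and chosen only for illustration.

```python
import mmap
import os
import tempfile

# Create a 16-byte scratch file to map (stand-in for a staged adapter file).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 16)
    path = f.name

fd = os.open(path, os.O_RDWR)
try:
    mm = mmap.mmap(fd, 16, flags=mmap.MAP_SHARED,
                   prot=mmap.PROT_READ | mmap.PROT_WRITE)
finally:
    os.close(fd)

copied = bytearray(mm[0:4])    # slicing an mmap allocates new bytes: a copy
viewed = memoryview(mm)[0:4]   # a memoryview slice aliases the mapping: a view

mm[0] = 0xFF                   # write through the mapping

copy_sees_write = copied[0] == 0xFF   # the copy is detached from the mapping
view_sees_write = viewed[0] == 0xFF   # the view observes the write

viewed.release()               # drop the buffer export before closing the map
mm.close()
os.unlink(path)
```

Only the view shares an address range with the mapping, which is why pinning the mapped region helps a view-backed tensor but not a copy-backed one.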

Proposed Change.

Add a new helper that performs a zero-copy safetensors load when lora_path resolves to a host-RAM-backed filesystem (tmpfs, ramfs, or hugetlbfs). The existing safetensors.safe_open path remains the default for everything else.

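# vllm/lora/lora_model.py
import ctypes
import ctypes.util
import json
import mmap
import os

import torch

# _cuda_host_register / _cuda_host_unregister are not shown in this RFC; they are
# assumed to be thin wrappers over cudaHostRegister / cudaHostUnregister that
# return 0 on success.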

_SAFETENSORS_DTYPE_MAP = {
    "F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16,
    "F64": torch.float64, "I64": torch.int64,  "I32": torch.int32,
    "I16": torch.int16,   "I8": torch.int8,    "U8": torch.uint8,
    "BOOL": torch.bool,
}

# Filesystem magic numbers from <linux/magic.h>.
# tmpfs returns either TMPFS_MAGIC or RAMFS_MAGIC depending on kernel.
_HOST_RAM_FS_TYPES = frozenset([
    0x01021994,  # TMPFS_MAGIC
    0x858458F6,  # RAMFS_MAGIC
    0x958458F6,  # HUGETLBFS_MAGIC (2M / 1G huge pages — even better)
])


class _Statfs(ctypes.Structure):
    _fields_ = [
        ("f_type",    ctypes.c_long),  ("f_bsize",   ctypes.c_long),
        ("f_blocks",  ctypes.c_ulong), ("f_bfree",   ctypes.c_ulong),
        ("f_bavail",  ctypes.c_ulong), ("f_files",   ctypes.c_ulong),
        ("f_ffree",   ctypes.c_ulong), ("f_fsid",    ctypes.c_int * 2),  # glibc fsid_t is int[2]
        ("f_namelen", ctypes.c_long),  ("f_frsize",  ctypes.c_long),
        ("f_flags",   ctypes.c_long),  ("f_spare",   ctypes.c_long * 4),
    ]


_libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6", use_errno=True)
_libc.statfs.argtypes = [ctypes.c_char_p, ctypes.POINTER(_Statfs)]
_libc.statfs.restype = ctypes.c_int


def _is_on_host_ram_fs(path: str) -> bool:
    """True iff path lives on tmpfs / ramfs / hugetlbfs — i.e., the file's
    bytes are already in host RAM (not on disk), so cudaHostRegister doesn't
    trigger any disk IO during page-pinning."""
    try:
        buf = _Statfs()
        if _libc.statfs(os.fsencode(path), ctypes.byref(buf)) != 0:
            return False
        return buf.f_type in _HOST_RAM_FS_TYPES
    except Exception:
        return False


class _ZeroCopyMmapHandle:
    """Owns mmap + cudaHostRegister lifetime. Attached to LoRAModel so that
    cudaHostUnregister runs when the LoRAModel is dropped."""
    def __init__(self, mm, base_ptr: int, size: int):
        self.mm, self.base_ptr, self.size = mm, base_ptr, size
        self._registered = True

    def close(self):
        if self._registered:
            _cuda_host_unregister(self.base_ptr)
            self._registered = False

    def __del__(self):
        # Best-effort cleanup; never raise from a finalizer.
        try:
            self.close()
        except Exception:
            pass


def _load_safetensors_zero_copy(path: str):
    """mmap the file + cudaHostRegister once + reconstruct tensors as views.
    Returns (tensors_dict, handle) or (None, None) if any dtype is unsupported
    or registration fails — caller falls back to safe_open."""
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        mm = mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)
    cbuf = (ctypes.c_char * size).from_buffer(mm)
    base_ptr = ctypes.addressof(cbuf)
    if _cuda_host_register(base_ptr, size) != 0:
        return None, None
    handle = _ZeroCopyMmapHandle(mm, base_ptr, size)

    hsz = int.from_bytes(mm[:8], "little")
    header = json.loads(bytes(mm[8 : 8 + hsz]))
    data_off = 8 + hsz
    tensors = {}
    for k, meta in header.items():
        if k == "__metadata__":
            continue
        dt = _SAFETENSORS_DTYPE_MAP.get(meta["dtype"])
        if dt is None:
            handle.close()
            return None, None  # unsupported dtype → fallback
        s, e = meta["data_offsets"]
        raw = torch.frombuffer(mm, dtype=torch.uint8, count=e - s, offset=data_off + s)
        tensors[k] = raw.view(dt).view(meta["shape"])
    return tensors, handle
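
The header walk in _load_safetensors_zero_copy relies on the safetensors file layout: an 8-byte little-endian header length, a JSON header, then the raw tensor bytes, with data_offsets relative to the end of the header. The sketch below illustrates only that layout using the standard library. build_safetensors_blob and parse_safetensors_layout are hypothetical helper names, not part of vLLM or safetensors, and no torch or CUDA is involved.

```python
import json
import struct


def build_safetensors_blob(tensors):
    """Assemble file bytes from {name: (dtype_str, shape, raw_bytes)}."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian header size, then the JSON header, then the data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def parse_safetensors_layout(buf):
    """Return {name: (dtype, shape, abs_start, abs_end)} without copying data."""
    view = memoryview(buf)
    (hsz,) = struct.unpack_from("<Q", view, 0)
    header = json.loads(bytes(view[8:8 + hsz]))
    data_off = 8 + hsz  # data_offsets are relative to this position
    layout = {}
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        s, e = meta["data_offsets"]
        layout[name] = (meta["dtype"], meta["shape"], data_off + s, data_off + e)
    return layout


blob = build_safetensors_blob(
    {"lora_A": ("F32", [2, 2], struct.pack("<4f", 1.0, 2.0, 3.0, 4.0))}
)
layout = parse_safetensors_layout(blob)
dtype, shape, start, end = layout["lora_A"]
values = struct.unpack("<4f", blob[start:end])
```

The absolute offsets computed here correspond to the offset=data_off + s argument passed to torch.frombuffer in the proposed loader.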

The integration into from_local_checkpoint (single new branch, existing branch untouched):

elif os.path.isfile(lora_tensor_path):
    _zc_handle = None
    if _is_on_host_ram_fs(lora_tensor_path):
        tensors, _zc_handle = _load_safetensors_zero_copy(lora_tensor_path)
    if _zc_handle is None:
        # Existing path: SSD-backed LoRAs, unsupported dtypes, register failure
        tensors = {}
        with safetensors.safe_open(lora_tensor_path, framework="pt") as f:
            check_unexpected_modules(f)
            for module in f.keys():
                tensors[module] = f.get_tensor(module)
    else:
        check_unexpected_modules(tensors)  # works on dict directly

LoRAModel gains one optional attribute populated after construction:

lora = cls.from_lora_tensors(...)
if _zc_handle is not None:
    lora._zero_copy_handle = _zc_handle  # __del__ chain releases mmap+register
return lora

Total: ~80 lines added to vllm/lora/lora_model.py, zero changes anywhere else.

Behavior contract

  • Existing safe_open path runs byte-for-byte unchanged for any LoRA whose underlying filesystem is not tmpfs / ramfs / hugetlbfs (as determined by statfs(2) f_type).
  • For tmpfs-located LoRAs, the new path is tried first; on any failure (unsupported dtype, registration error) it falls through to the existing path.
  • cudaHostUnregister + munmap lifecycle is tied to LoRAModel's lifetime via attribute attachment — no manual cleanup required from callers.

Benchmark

tools/profile_lora_pin.py (to be added in the implementation PR), bf16 adapter, 5-run P50 (3-run for ≥512 MiB). e2e = load + pin_memory + H2D (the LoRA-load critical path).

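| adapter size | path | load (ms) | pin (ms) | H2D (ms) | e2e (ms) | Δ saved (ms) | vs B |
|---|---|---|---|---|---|---|---|
| 4 MiB | B safe_open (today) | 0.93 | 0.23 | 0.82 | 1.90 | | ref |
| 4 MiB | F zero-copy (this RFC) | 1.41 | 0.04 | 0.65 | 2.14 | −0.24 | +13% |
| 32 MiB | B safe_open (today) | 0.86 | 0.41 | 1.94 | 3.30 | | ref |
| 32 MiB | F zero-copy (this RFC) | 2.07 | 0.04 | 0.80 | 3.15 | 0.15 | −4% |
| 128 MiB | B safe_open (today) | 0.97 | 6.05 | 4.92 | 12.25 | | ref |
| 128 MiB | F zero-copy (this RFC) | 5.63 | 0.04 | 2.72 | 8.39 | 3.86 | −32% |
| 512 MiB | B safe_open (today) | 3.02 | 24.27 | 21.29 | 50.41 | | ref |
| 512 MiB | F zero-copy (this RFC) | 17.94 | 0.07 | 10.50 | 29.54 | 20.87 | −41% |
| 1024 MiB | B safe_open (today) | 3.48 | 45.78 | 44.08 | 91.28 | | ref |
| 1024 MiB | F zero-copy (this RFC) | 35.79 | 0.07 | 22.82 | 58.68 | 32.60 | −36% |
| 2048 MiB | B safe_open (today) | 2.64 | 76.46 | 78.34 | 160.51 | | ref |
| 2048 MiB | F zero-copy (this RFC) | 74.28 | 0.09 | 44.74 | 118.50 | 42.01 | −27% |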

Three observations:

  • cudaHostRegister has a per-call cost that grows ~linearly with bytes pinned: 0.5 ms at 4 MiB, 5.6 ms at 128 MiB, 74 ms at 2 GiB. This pushes F slower than B on tiny adapters and bounds the relative win on very large ones.
  • pin_memory cost in B grows linearly with adapter size (0.23 ms → 76 ms) because it's alloc + memcpy over the whole state_dict. F's cost stays flat at 0.04–0.09 ms because the tensors are already pinned (no allocation, no copy).
  • H2D in B takes roughly 1.75–2× as long as in F on adapters of 128 MiB and above (for example 78.34 ms vs 44.74 ms at 2 GiB) because PyTorch falls back to bounce-buffered H2D when the source is not is_pinned. The bounce-buffer cost also grows linearly.
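
The linear-growth claim can be checked against the reported numbers. The sketch below assumes that F's load column is a reasonable proxy for cudaHostRegister cost (it also includes the mmap call and header parsing) and divides each P50 value by the adapter size.

```python
# F path "load (ms)" P50 values from the benchmark table, keyed by size in MiB.
f_load_ms = {4: 1.41, 32: 2.07, 128: 5.63, 512: 17.94, 1024: 35.79, 2048: 74.28}

# Per-MiB cost; a flat value across sizes indicates linear scaling.
ms_per_mib = {size: ms / size for size, ms in f_load_ms.items()}

for size, rate in ms_per_mib.items():
    print(f"{size:>5} MiB: {rate * 1000:.1f} us/MiB")
```

From 512 MiB upward the per-MiB cost settles at about 35 to 36 µs, while the small sizes are dominated by fixed per-call overhead. That matches the observation that F loses on tiny adapters and scales linearly on large ones.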

Crossover ≈ 8–16 MiB. For typical training LoRAs (8–128 MiB), F wins by 5–32%. For large adapters (512 MiB–2 GiB, e.g. rank-128+ full-coverage LoRA or fused multi-adapter staging), F wins by 27–41% — saving 20–42 ms per load in absolute terms.

For tiny adapters (<8 MiB), F is ~0.2 ms slower in absolute terms — but tiny adapters don't get staged into tmpfs to begin with (no point), so the gate auto-routes them to the unchanged safe_open path.

Hardware: RTX-class GPU, CUDA 13.0, PyTorch 2.11. Variance: B's pin_memory cost has high jitter (6–14 ms at 128 MiB, 76–90 ms at 2 GiB); F is tight throughout.

Disk-backed comparison (FYI; this RFC does not change disk path behavior)

For full transparency, we also benchmarked F against B with the file living on a local NVMe SSD instead of tmpfs. This is not the path the RFC proposes; the new code is gated to RAM-backed filesystems via statfs(2) precisely because of the trade-offs below.

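| size | storage | B today (ms, P50) | F (ms, P50) | Δ (ms) |
|---|---|---|---|---|
| 32 MiB | tmpfs (this RFC) | 3.30 | 3.15 | −0.15 |
| 32 MiB | disk (SSD, warm) | 3.35 | 3.46 | +0.11 |
| 128 MiB | tmpfs (this RFC) | 12.25 | 8.39 | −3.86 |
| 128 MiB | disk (SSD, warm) | 9.99 | 10.17 | +0.18 |
| 512 MiB | tmpfs (this RFC) | 50.41 | 29.54 | −20.87 |
| 512 MiB | disk (SSD, warm) | 39.31 | 34.95 | (would be −4.36 if enabled) |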

Why disk-warm looks so close to tmpfs: the kernel page cache does its job. Once the file has been read once (the bench's 3-run warmup populates it), its bytes live in RAM as page cache pages. Subsequent mmap / read / cudaHostRegister against that file all touch those in-RAM pages without going back to the block device. From a steady-state read-path perspective, a warm SSD-backed file is, byte-for-byte, the same as a tmpfs file — same page cache pages, same memory access patterns.

The "disk vs tmpfs" distinction matters only at three specific moments:

  1. First load — file isn't cached yet; mmap-then-touch (B does this implicitly through safe_open, F does it explicitly through cudaHostRegister) page-faults to read from the block device. For 2 GiB on NVMe (~3 GB/s) this is roughly 700 ms; on slower storage / network FS it's worse.
  2. After page cache eviction — kernel reclaims pages under memory pressure; next access re-faults to disk. tmpfs pages are not subject to reclaim (modulo swap, which is rarely configured in inference deployments).
  3. F path specifically: cudaHostRegister makes the affected pages unevictable. On tmpfs this is identity (the file is its memory); on disk it converts the page cache into permanent host residency for the LoRA's lifetime, diverging from typical "file in page cache" memory accounting.
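
The cold-read figure in item 1 follows from dividing the file size by the assumed sequential read bandwidth of about 3 GB/s:

```latex
t_{\text{cold}} \approx \frac{2 \times 2^{30}\ \text{B}}{3 \times 10^{9}\ \text{B/s}} \approx 0.72\ \text{s}
```

This is a lower bound for a fully cold cache, since it ignores page-fault and registration overhead on top of the raw read.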

Two reasons the gate stays on regardless:

  • Cold-cache cliff: the first load of a multi-GiB file on cold pages takes ~1 second through cudaHostRegister's synchronous page-in. This is a hard SLO break for latency-sensitive pipelines that can't pre-warm.
  • Semantic divergence: pinning page cache changes memory accounting in ways users would have to debug if it surprised them. tmpfs makes the same operation a no-op semantically.

If a user explicitly wants F-tier perf on an SSD-resident LoRA, the supported route is to stage it to tmpfs first (single cp + load). Operational layering, not vLLM core behavior.
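
One way to do that staging step safely is to copy into a temporary name inside the tmpfs directory and rename it into place, so a concurrent loader never sees a partially written file. This is a minimal sketch of that idea; stage_adapter is a hypothetical helper, and the demo uses ordinary temporary directories rather than a real tmpfs mount.

```python
import os
import shutil
import tempfile


def stage_adapter(src_path, stage_dir):
    """Copy src_path into stage_dir and publish it with an atomic rename."""
    os.makedirs(stage_dir, exist_ok=True)
    final_path = os.path.join(stage_dir, os.path.basename(src_path))
    # Temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=stage_dir, suffix=".partial")
    os.close(fd)
    try:
        shutil.copyfile(src_path, tmp_path)
        os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


# Demo with scratch directories (a real deployment would pass e.g. a tmpfs mount).
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "adapter_model.safetensors")
    with open(src, "wb") as f:
        f.write(b"demo-bytes")
    stage_dir = os.path.join(workdir, "stage")
    staged = stage_adapter(src, stage_dir)
    staged_name = os.path.basename(staged)
    with open(staged, "rb") as f:
        staged_bytes = f.read()
    leftovers = [n for n in os.listdir(stage_dir) if n.endswith(".partial")]
```

The staged path can then be passed as lora_path; whether the fast path is taken depends only on the filesystem the staging directory lives on.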

A note on hugetlbfs for very large adapters

The _HOST_RAM_FS_TYPES set also includes HUGETLBFS_MAGIC. For multi-GiB adapters, staging to a hugetlbfs mount (mount -t hugetlbfs hugetlbfs /mnt/huge) is meaningfully better than tmpfs:

  • 2 MiB (or 1 GiB) pages instead of 4 KiB → 512×–262144× fewer page table entries
  • cudaHostRegister walks the page table during pinning; fewer entries → linearly less work
  • Estimated cudaHostRegister cost on 2 GiB hugetlbfs ≈ a few ms instead of 74 ms

The proposed detection picks this up automatically — no code or RFC scope change needed.
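
The page-count reduction quoted above is plain arithmetic. A short sketch for a 2 GiB adapter, assuming the mapping is backed entirely by pages of a single size:

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def pages_needed(nbytes, page_size):
    """Number of pages required to back nbytes (ceiling division)."""
    return -(-nbytes // page_size)


adapter_bytes = 2 * GIB
page_sizes = {"4 KiB": 4 * KIB, "2 MiB": 2 * MIB, "1 GiB": GIB}
counts = {label: pages_needed(adapter_bytes, size) for label, size in page_sizes.items()}

# Reduction factors relative to 4 KiB base pages.
ratio_2m = counts["4 KiB"] // counts["2 MiB"]
ratio_1g = counts["4 KiB"] // counts["1 GiB"]
```

With 4 KiB pages the adapter spans 524288 entries, versus 1024 with 2 MiB pages and 2 with 1 GiB pages, which gives the 512× and 262144× factors.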

Non-goals

  • Not modifying the existing safe_open path. SSD-backed LoRA loading sees zero change.
  • Not touching .bin / .pt / tensorizer branches.
  • Not asking safetensors or PyTorch upstream to change anything. This RFC is self-contained to vLLM.
  • Not adding a LoRARequest flag. tmpfs detection from lora_path is sufficient signal.
  • Not changing post-pack pin_memory() in _create_merged_loras_inplace — that's intentional and load-bearing for packed MoE layers.

Compatibility

Two storage tiers, two code paths, no overlap:

| storage tier | example paths | code path | behavior change |
|---|---|---|---|
| disk-backed | local SSD, NFS, FUSE, S3-FUSE, ~/.cache/huggingface/... | existing safe_open | none — byte-for-byte unchanged |
| host-RAM-backed | tmpfs, ramfs, hugetlbfs, K8s emptyDir { medium: Memory } | new zero-copy F path | new path activated automatically |

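The fast path and the existing path share no loading code. SSD users never enter the new path; tmpfs users reach the existing path only through the fallback described in the behavior contract (unsupported dtype or registration failure). Routing is by statfs(2) f_type, evaluated once per load.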

Other compatibility points:

  • statfs(2) detection covers every host-RAM-backed filesystem regardless of mount path: /dev/shm, /run/user/$UID, /run, /tmp (when distro mounts it tmpfs, e.g. Arch / systemd default), custom mounts (mount -t tmpfs -o size=8G tmpfs /mnt/lora_stage), K8s emptyDir { medium: Memory } (path chosen by user via volumeMounts), Docker --mount type=tmpfs,destination=/cache, and hugetlbfs mounts. No prefix maintenance required as new conventions emerge.
  • Non-tmpfs paths (SSD, NFS, FUSE, network FS) bypass the new code entirely.
  • Unsupported safetensors dtypes (FP8 variants, FP4) fall back to safe_open automatically — no surprises.
  • CPU-only builds: _cuda_host_register returns an error, so loading falls back transparently to safe_open.
  • statfs syscall failure (path race, permission edge case): caught by the try/except, falls back to safe_open.

Test coverage:

  • Existing tests/lora/test_lora_checkpoints.py continues to validate SSD path (unchanged).
  • New tests/lora/test_lora_zero_copy_tmpfs.py:
    • Loads a synthetic adapter from /dev/shm, asserts data_ptr is inside the registered mmap region and is_pinned() returns True.
    • Verifies cudaHostUnregister is called exactly once on LoRAModel drop (mock the runtime call).
    • Verifies fallback to safe_open when registration fails (mock).
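
The second test above could look roughly like the following. This sketch uses stand-in classes rather than the real LoRAModel and _ZeroCopyMmapHandle, injects the unregister call as a mock, and relies on CPython's reference counting to run the finalizer when the owner is dropped.

```python
import gc
from unittest import mock


class Handle:
    """Stand-in for _ZeroCopyMmapHandle with an injectable unregister call."""

    def __init__(self, base_ptr, unregister):
        self.base_ptr = base_ptr
        self._unregister = unregister
        self._registered = True

    def close(self):
        if self._registered:
            self._unregister(self.base_ptr)
            self._registered = False

    def __del__(self):
        try:
            self.close()
        except Exception:
            pass


class Model:
    """Stand-in for LoRAModel; it only needs to hold the handle attribute."""


unregister = mock.Mock(return_value=0)
model = Model()
model._zero_copy_handle = Handle(0x1000, unregister)

calls_before_drop = unregister.call_count
del model        # dropping the owner drops the handle
gc.collect()     # not needed for refcounting, included for other collectors
calls_after_drop = unregister.call_count
```

The _registered flag is what makes the unregister call idempotent, so an explicit close() followed by garbage collection still results in exactly one call.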

Use cases unblocked

Concrete downstream patterns where this RFC eliminates workarounds or unlocks new perf headroom:

  1. Large fused / merged LoRA staging: rank-128+ adapters across all attention + MLP modules, multi-domain fused adapters, or stacked composition LoRAs commonly hit 1–4 GiB. These are where the absolute savings (tens of ms per load) compound across reload-heavy workloads.
  2. Encrypted-weight serving: decrypt to a tmpfs directory, hand off path, never touch disk. Works the same whether the tmpfs is /dev/shm, a custom mount -t tmpfs target, or K8s emptyDir { medium: Memory }.
  3. Multi-tenant inference: LRU adapter cache in /dev/shm (or custom tmpfs mount) with OS-enforced memory bounding via tmpfs size limit. New cache entries written from upstream loaders get the fast path automatically.
  4. Hot-swap inference with shm-backed adapter cache: runtime adapter download / refresh, with /dev/shm as the working area. Fast path active without any user-side flag.
  5. In-process actor → rollout weight sync (concrete instance: verl-project/verl#5616): the staging-into-tmpfs pattern an in-memory rollout backend uses to avoid disk IO becomes a first-class path; the workarounds that currently bypass _load_adapter are no longer needed.

None of these require new vLLM APIs. They all work today through the existing LoRARequest(lora_path=...) interface — they just leave performance on the table because of the safetensors copy. This RFC reclaims that performance without breaking the existing API contract or downstream expectations.

Deployment patterns

Three common ways operators expose a host-RAM-backed filesystem for LoRA staging — all auto-detected by this RFC:

# 1. Built-in /dev/shm (raise default 64 MB → enough for your adapters)
docker run --shm-size=8G ...

# 2. Custom tmpfs mount (no shm constraints, full control)
mount -t tmpfs -o size=8G tmpfs /mnt/lora_stage
# then stage to /mnt/lora_stage/

# 3. Kubernetes emptyDir backed by RAM
volumes:
- name: lora-stage
  emptyDir:
    medium: Memory
    sizeLimit: 8Gi

For multi-GiB adapters specifically, hugetlbfs trades a one-time setup for materially better register cost:

# 4. hugetlbfs (1 GiB or 2 MiB pages — ~10× faster cudaHostRegister at 2 GiB scale)
echo 2048 > /proc/sys/vm/nr_hugepages          # reserve 2048 × 2 MiB
mount -t hugetlbfs -o pagesize=2M hugetlbfs /mnt/huge

All four patterns are recognized identically by _is_on_host_ram_fs() via statfs(2). No code change as deployment patterns evolve.

Feedback Period.

1 week (until 2026-05-19), or longer if discussion is active.

CC List.

@jeejeelee @varun-sundar-rabindranath @xyang16

Any Other Things.

Duplicate-work check (per AGENTS.md):

gh search issues "LoRA pin_memory" --repo vllm-project/vllm
gh search issues "safetensors mmap LoRA" --repo vllm-project/vllm
gh search prs "LoRA load tmpfs" --repo vllm-project/vllm
gh search issues "RFC LoRA loading" --repo vllm-project/vllm
gh search issues "cudaHostRegister" --repo vllm-project/vllm
gh search issues "pinned memory LoRA" --repo vllm-project/vllm

None of these returned an open issue or PR addressing this specific gap (safetensors PyByteArray copy defeating caller-side cudaHostRegister). The closest prior work, safetensors/safetensors#760 (merged upstream), adds an internal pinned buffer for the safe_open(device="cuda") path but does not address external pinning of device="cpu" mmap regions.

AI assistance disclosure: Drafted with Claude Code. The human submitter has reviewed the proposal end-to-end, run the companion benchmark (tools/profile_lora_pin.py) locally on the target hardware to produce the numbers in the table above, and is committed to verifying the implementation patch passes the existing tests/lora/ suite before opening the implementation PR.

Companion benchmark script: a stand-alone, ~150-line reproducer (profile_lora_pin.py) covers all six adapter sizes and both storage tiers reported above. It will be included in tools/ as part of the implementation PR so reviewers can reproduce the numbers on their own hardware.
