vllm - [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_lora_adapter`?

GitHub stats
vllm-project/vllm#42207 (fetched 2026-05-11 03:13:49)
Comments: 0 · Participants: 1 · Timeline events: 2 (assigned ×1, labeled ×1) · Reactions: 0

We are using runtime LoRA loading/unloading through the REST API.

We understand that after calling:

POST /v1/unload_lora_adapter

the LoRA adapter is correctly unregistered and becomes unavailable for inference.
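For reference, the unload call could be issued from Python roughly as follows. This is a minimal sketch using only the standard library. The `lora_name` body field follows the vLLM dynamic-LoRA documentation as we understand it, port 80 comes from the deployment config, and the host `localhost` and the adapter name `my_adapter` are placeholders, not values from our setup.

```python
import json
import urllib.request


def build_unload_request(base_url: str, lora_name: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that unloads one adapter."""
    body = json.dumps({"lora_name": lora_name}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/unload_lora_adapter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_adapter(base_url: str, lora_name: str, timeout: float = 30.0) -> int:
    """Send the unload request and return the HTTP status code."""
    request = build_unload_request(base_url, lora_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example call (placeholder host and adapter name):
# unload_adapter("http://localhost:80", "my_adapter")
```

Building the request separately from sending it makes the payload easy to inspect or log before it reaches the server.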

At the same time, we observed that the server process's resident set size (RSS) on the CPU side does not noticeably decrease after unload. We assume this is expected behavior, since /v1/unload_lora_adapter may primarily remove the LoRA registration state rather than actively reclaim CPU-resident memory.
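The RSS observation can be reproduced with a small helper like the one below. It is a sketch that assumes a Linux host, where `/proc/<pid>/status` exposes a `VmRSS` line in kB; the function names are our own and not part of vLLM. If the server runs as several processes, each PID has its own RSS and would need to be sampled separately.

```python
import re
from pathlib import Path


def parse_vm_rss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found in status text")
    return int(match.group(1))


def read_rss_kib(pid: int) -> int:
    """Read the current RSS of a running process on Linux."""
    return parse_vm_rss_kib(Path(f"/proc/{pid}/status").read_text())


# Example: compare read_rss_kib(server_pid) before and after an unload call,
# where server_pid is the PID of the vLLM process inside the container.
```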

So our main question is:

After a LoRA adapter has been unloaded via /v1/unload_lora_adapter, is there an officially recommended and safe way to clean up or release the LoRA-related CPU-resident/cache memory?


Code Example

OS:
  Ubuntu 22.04.4 LTS

vLLM version:
  v0.17.1

Docker image:
  vllm/vllm-openai:v0.17.1

Model:
  Qwen/Qwen3-4B-Instruct-2507

Relevant env:
  VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

---

sampler-Qwen3-4B-Instruct-2507-trio002-1:
  container_name: release-sampler-Qwen3-4B-Instruct-2507-trio002-1
  image: vllm/vllm-openai:v0.17.1

  environment:
    HOSTNAME: release-sampler-Qwen3-4B-Instruct-2507-trio002-1
    VLLM_ALLOW_RUNTIME_LORA_UPDATING: "True"
    VLLM_LOGGING_LEVEL: INFO

  volumes:
    - /data/models/Qwen:/data/models/Qwen:ro
    - /data/release/nano-tinker/outputs:/data/release/nano-tinker/outputs

  shm_size: "16g"

  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["5"]
            capabilities: [gpu]

  command:
    - --model
    - /data/models/Qwen/Qwen3-4B-Instruct-2507
    - --served-model-name
    - Qwen/Qwen3-4B-Instruct-2507
    - --gpu-memory-utilization
    - "0.85"
    - --host
    - 0.0.0.0
    - --port
    - "80"
    - --trust-remote-code
    - --max-model-len
    - "32768"
    - --enable-lora
    - --max-loras
    - "4"
    - --max-lora-rank
    - "64"

  restart: unless-stopped

---

POST /v1/unload_lora_adapter

Questions

  1. After calling /v1/unload_lora_adapter, is there an officially recommended way to clean up CPU-resident memory left behind by unloaded LoRA adapters?

  2. If proactively or fully reclaiming LoRA-related CPU memory is desired, is restarting the worker/server currently the only reliable approach, or is there a more fine-grained cleanup mechanism available?


Additional Context

Our workload is a long-running dynamic multi-LoRA service where adapters are frequently loaded and unloaded.
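One way to characterize such a workload is to sample RSS after every load/unload cycle and look at the trend late in the run. The sketch below is only a heuristic, and all names in it are hypothetical: the `load`, `unload`, and `read_rss_kib` callables are supplied by the caller (for example, thin wrappers around the REST endpoints and a `/proc` reader). Tail growth near zero would be consistent with a bounded cache or allocator retention, while steady growth would point toward a leak, but neither reading is confirmed by vLLM maintainers.

```python
from typing import Callable, Iterable, List


def sample_rss_over_cycles(
    load: Callable[[str], None],
    unload: Callable[[str], None],
    read_rss_kib: Callable[[], int],
    adapter_names: Iterable[str],
) -> List[int]:
    """Load and unload each adapter once, recording RSS after every cycle."""
    samples: List[int] = []
    for name in adapter_names:
        load(name)
        unload(name)
        samples.append(read_rss_kib())
    return samples


def growth_per_cycle_kib(samples: List[int]) -> float:
    """Average RSS change per cycle over the second half of the run."""
    tail = samples[len(samples) // 2:]
    if len(tail) < 2:
        raise ValueError("need at least 3 samples")
    return (tail[-1] - tail[0]) / (len(tail) - 1)
```

Looking only at the second half of the samples skips the warm-up phase, during which caches are still filling and RSS is expected to rise.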

So we would like to better understand:

  1. the officially recommended CPU memory management strategy;

  2. the expected memory lifecycle behavior after LoRA unload;

  3. whether there is a safe and proactive way to release CPU-resident/cache memory for already-unloaded LoRA adapters (instead of relying only on passive LRU-based eviction).

Thanks!
