vllm - [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_lora_adapter`?

vllm · 2026-05-10 05:57:33

vllm-project/vllm#42207 • Fetched 2026-05-11 03:13:49

Author

HuskyLYL

Participants

HuskyLYL

Assignees

jeejeelee

We are using runtime LoRA loading/unloading through the REST API.

We understand that after calling:

POST /v1/unload_lora_adapter

the LoRA adapter is correctly unregistered and becomes unavailable for inference.
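For reference, a minimal sketch of how these calls could be issued from Python. The JSON fields `lora_name` and `lora_path` follow the vLLM dynamic LoRA serving documentation as we understand it; the helper name `build_lora_request`, the adapter name, and the base URL are hypothetical and only for illustration.

```python
import json
import urllib.request
from typing import Optional


def build_lora_request(
    base_url: str,
    action: str,
    lora_name: str,
    lora_path: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a runtime LoRA endpoint.

    `action` is "load" or "unload"; loading additionally needs `lora_path`.
    """
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")

    payload = {"lora_name": lora_name}
    if action == "load":
        if lora_path is None:
            raise ValueError("lora_path is required when loading an adapter")
        payload["lora_path"] = lora_path

    # Resolves to /v1/load_lora_adapter or /v1/unload_lora_adapter.
    url = f"{base_url.rstrip('/')}/v1/{action}_lora_adapter"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (assumes a server reachable at this address; not executed here):
#   req = build_lora_request("http://localhost:80", "unload", "my_adapter")
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(resp.status, resp.read().decode())
```

Building the request separately from sending it keeps the payload easy to inspect or log before it reaches the server.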

At the same time, we observed that the server process's CPU RSS does not noticeably decrease after unload. We assume this is expected behavior, since /v1/unload_lora_adapter may primarily remove the LoRA registration state rather than actively reclaim CPU-resident memory.
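A minimal sketch of how RSS can be read on Linux, by parsing the `VmRSS` line of `/proc/<pid>/status`. The function names are hypothetical. Which PID to inspect is also an assumption: a vLLM deployment may run several processes (API server, engine, workers), so the relevant process has to be identified first, or the values summed.

```python
import re
from pathlib import Path


def parse_vm_rss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in kB, as reported by the kernel) from the
    contents of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_mib(pid: int) -> float:
    """Read the current resident set size of `pid` in MiB (Linux only)."""
    status_text = Path(f"/proc/{pid}/status").read_text()
    return parse_vm_rss_kib(status_text) / 1024
```

Sampling this value before load, after load, and after unload gives a simple way to see how much memory stays resident.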

So our main question is:

After a LoRA adapter has been unloaded via /v1/unload_lora_adapter, is there an officially recommended and safe way to clean up or release the LoRA-related CPU-resident/cache memory?

Environment

OS:
  Ubuntu 22.04.4 LTS

vLLM version:
  v0.17.1

Docker image:
  vllm/vllm-openai:v0.17.1

Model:
  Qwen/Qwen3-4B-Instruct-2507

Relevant env:
  VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

Deployment Config

sampler-Qwen3-4B-Instruct-2507-trio002-1:
  container_name: release-sampler-Qwen3-4B-Instruct-2507-trio002-1
  image: vllm/vllm-openai:v0.17.1

  environment:
    HOSTNAME: release-sampler-Qwen3-4B-Instruct-2507-trio002-1
    VLLM_ALLOW_RUNTIME_LORA_UPDATING: "True"
    VLLM_LOGGING_LEVEL: INFO

  volumes:
    - /data/models/Qwen:/data/models/Qwen:ro
    - /data/release/nano-tinker/outputs:/data/release/nano-tinker/outputs

  shm_size: "16g"

  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["5"]
            capabilities: [gpu]

  command:
    - --model
    - /data/models/Qwen/Qwen3-4B-Instruct-2507
    - --served-model-name
    - Qwen/Qwen3-4B-Instruct-2507
    - --gpu-memory-utilization
    - "0.85"
    - --host
    - 0.0.0.0
    - --port
    - "80"
    - --trust-remote-code
    - --max-model-len
    - "32768"
    - --enable-lora
    - --max-loras
    - "4"
    - --max-lora-rank
    - "64"

  restart: unless-stopped

Questions

After calling /v1/unload_lora_adapter, is there an officially recommended way to clean up CPU-resident memory left behind by unloaded LoRA adapters?
If proactively (or fully) reclaiming LoRA-related CPU memory is desired, is restarting the worker/server currently the only reliable approach, or is there a more fine-grained cleanup mechanism available?

Additional Context

Our workload is a long-running dynamic multi-LoRA service where adapters are frequently loaded and unloaded.

So we would like to better understand:

the officially recommended CPU memory management strategy, the expected memory lifecycle behavior after LoRA unload, and whether there is a safe and proactive way to release CPU-resident/cache memory for already-unloaded LoRA adapters (instead of only relying on passive LRU-based eviction).
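To make the memory lifecycle question concrete, here is a minimal sketch of a harness that records RSS around each load/unload cycle. It is a diagnostic aid only, not a cleanup mechanism. The callables `load`, `unload`, and `read_rss_mib` are hypothetical and supplied by the caller (for example, thin wrappers around the REST endpoints and a `/proc` reader).

```python
from typing import Callable, Dict, Iterable, List


def measure_cycles(
    load: Callable[[str], None],
    unload: Callable[[str], None],
    read_rss_mib: Callable[[], float],
    adapter_names: Iterable[str],
) -> List[Dict[str, object]]:
    """Load and unload each adapter once, sampling RSS at each step.

    `retained_mib` is the RSS after unload minus the RSS before load, i.e.
    the memory that stayed resident across one full cycle.
    """
    records: List[Dict[str, object]] = []
    for name in adapter_names:
        before = read_rss_mib()
        load(name)
        loaded = read_rss_mib()
        unload(name)
        after = read_rss_mib()
        records.append(
            {
                "adapter": name,
                "before_mib": before,
                "loaded_mib": loaded,
                "after_unload_mib": after,
                "retained_mib": after - before,
            }
        )
    return records
```

A `retained_mib` that keeps growing across many distinct adapters would suggest memory is not being reclaimed, while a value that plateaus would be consistent with a bounded cache.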

Thanks!
