Replacing or reloading a LoRA adapter at runtime under an existing `lora_name` should either: - invalidate all prefix-cache blocks that depend on the previous adapter version, or - include a stable adapter version, path, content hash, or runtime generation identity in the prefix-cache key.

vllm - 💡(How to fix) Fix [Bug]: Runtime LoRA same-name reload can reuse stale prefix-cache blocks from previous adapter version

vllm2026-05-09 00:04:36

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Root Cause

Runtime LoRA replacement is useful for workflows where adapters are continuously updated under a stable tenant or adapter name. If prefix-cache keys only represent lora_name, or if runtime load/unload does not invalidate adapter-dependent KV cache state, requests after a reload can consume prefix-cache blocks produced with the previous adapter's weights.

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.10.12
Python platform              : Linux-6.8.0-1053-gcp-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.41
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version        : 580.126.20
CUDA version from nvidia-smi : 13.0

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] torch==2.11.0+cu130
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.8.0
[pip3] triton==3.6.0
[pip3] peft==0.19.1
[pip3] accelerate==1.13.0
[pip3] datasets==4.8.5

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.2rc1.dev148+g0c2e9d489 (git sha: 0c2e9d489)
vLLM Build Flags             : CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled

Validation host summary:
- Cloud VM type: GCP Spot A100, a2-highgpu-1g, us-west4-b
- GPU type: nvidia-tesla-a100
- Boot image: pytorch-2-9-cu129-ubuntu-2204-nvidia-580-v20260430
- Boot disk: 300GB pd-ssd

The raw collect_env output was checked before posting. Project ids, public IPs, service accounts, local user paths, and other host identifiers are intentionally omitted from this public report.

---

Now answer with exactly the adapter's private label and nothing else.

---

{
  "temperature": 0,
  "top_p": 1,
  "max_tokens": 16,
  "seed": 0
}

---

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
vllm serve Qwen/Qwen2.5-3B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name base \
  --trust-remote-code \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching

---

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
vllm serve Qwen/Qwen2.5-3B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name base \
  --trust-remote-code \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching

---

{
  "lora_name": "tenant-adapter",
  "lora_path": "/tmp/lora_A"
}

---

{
  "lora_name": "tenant-adapter",
  "lora_path": "/tmp/lora_B",
  "load_inplace": true
}

---

Success: LoRA adapter 'tenant-adapter' added successfully.

---

server_phase4_main.txt: after loading A at 23:32:27 and loading B with load_inplace=true at 23:32:29,
vLLM logged "Prefix cache hit rate: 94.4%" at 23:32:33.

server_phase4_main.txt: the same main run later logged "Prefix cache hit rate: 91.2%".

server_mode_enforce_eager_prefix.txt: enforce-eager with prefix caching logged "Prefix cache hit rate: 85.5%".

server_concurrency.txt: concurrent same-name reload logged "Prefix cache hit rate: 97.2%".

server_control_no_prefix.txt: the no-prefix-cache control logged "Prefix cache hit rate: 0.0%" and returned exact B.

RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>Sanitized output of <code>python collect_env.py</code></summary>

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.10.12
Python platform              : Linux-6.8.0-1053-gcp-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.41
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version        : 580.126.20
CUDA version from nvidia-smi : 13.0

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] torch==2.11.0+cu130
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.8.0
[pip3] triton==3.6.0
[pip3] peft==0.19.1
[pip3] accelerate==1.13.0
[pip3] datasets==4.8.5

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.2rc1.dev148+g0c2e9d489 (git sha: 0c2e9d489)
vLLM Build Flags             : CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled

Validation host summary:
- Cloud VM type: GCP Spot A100, a2-highgpu-1g, us-west4-b
- GPU type: nvidia-tesla-a100
- Boot image: pytorch-2-9-cu129-ubuntu-2204-nvidia-580-v20260430
- Boot disk: 300GB pd-ssd

The raw collect_env output was checked before posting. Project ids, public IPs, service accounts, local user paths, and other host identifiers are intentionally omitted from this public report.

</details>

Describe the bug

When replacing a LoRA adapter at runtime under the same lora_name, vLLM can reuse prefix-cache blocks computed with the previous adapter version.

In this repro, adapter B is validated independently and produces BETA_ADAPTER_VERSION_B on a cold server. After warming adapter A under the same name and replacing it with adapter B using /v1/load_lora_adapter with load_inplace=true, the same prompt returns BETA_VERSION_B instead.

That non-cold B output only appears when prefix-cache blocks can be reused. The exact B output is restored when prefix caching is disabled, when cache_salt is changed, when the first prompt block is changed, when the server is restarted and B is loaded cold, or when B is loaded under a unique adapter name.

I also reproduced the same non-cold B output after /v1/unload_lora_adapter followed by /v1/load_lora_adapter under the same lora_name, so the issue appears to be same-name runtime adapter reload/replacement without adapter-version cache invalidation, not only the load_inplace branch.

Related but not exact duplicates:

#30931 covers the known/general same-name LoRA prefix-cache collision with separately constructed LoRA requests. This report is narrower: runtime /v1/load_lora_adapter and /v1/unload_lora_adapter replacement under the same name, with fresh-server and cache-bypass controls.
#38606 covers KV block corruption under rapid LoRA adapter alternation. This repro does not require alternating adapters or a rapid alternation race; the main failing path is a single same-name runtime A to B reload with prefix-cache reuse.

Why this matters

Reproduction summary

I trained two tiny PEFT LoRA adapters for the same base model:

Base model: Qwen/Qwen2.5-3B-Instruct
Adapter A expected output: ALPHA_ADAPTER_VERSION_A
Adapter B expected output: BETA_ADAPTER_VERSION_B
PEFT type: LORA
Rank: 16
Alpha: 32
Dropout: 0.0
Target modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj
Adapter A directory hash: d09a1037c91af6216a3ddbf77efbd361a35c22ddbf54596c47da91fe3d63e8ea
Adapter B directory hash: 4c216f2f29105225e01e20e190665a8f9f68a2afd3af6e484729d52073f43032

The fixed prompt was 3,932 prompt tokens. It used a repeated deterministic prefix and ended with:

Now answer with exactly the adapter's private label and nothing else.

Sampling parameters for all comparisons:

{
  "temperature": 0,
  "top_p": 1,
  "max_tokens": 16,
  "seed": 0
}

Prefix-cache server command:

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
vllm serve Qwen/Qwen2.5-3B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name base \
  --trust-remote-code \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching

No-prefix-cache control command:

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
vllm serve Qwen/Qwen2.5-3B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name base \
  --trust-remote-code \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching

Load adapter A:

{
  "lora_name": "tenant-adapter",
  "lora_path": "/tmp/lora_A"
}

Warm A by querying the fixed prompt repeatedly with model="tenant-adapter".

Replace A with B in the same running server:

{
  "lora_name": "tenant-adapter",
  "lora_path": "/tmp/lora_B",
  "load_inplace": true
}

The endpoint returned HTTP 200:

Success: LoRA adapter 'tenant-adapter' added successfully.

Then query the exact same prompt again with the same sampling parameters.

Observed results

Scenario	Result
Cold no-prefix A, same name	`ALPHA_ADAPTER_VERSION_A` in 10/10 requests
Cold no-prefix B, same name	`BETA_ADAPTER_VERSION_B` in 10/10 requests
Fresh prefix-cache server, load B directly under same name	`BETA_ADAPTER_VERSION_B` in 10/10 requests
Prefix-cache server, warm A	`ALPHA_ADAPTER_VERSION_A` in 12/12 requests
Prefix-cache server, A to B via `load_inplace=true`, same name	`BETA_VERSION_B` in 10/10 requests
Same A to B reload with prefix caching disabled	`BETA_ADAPTER_VERSION_B` in 10/10 requests
Same A to B reload, then query with changed `cache_salt`	`BETA_ADAPTER_VERSION_B` in 10/10 requests
Same A to B reload, then query with modified first prompt block	`BETA_ADAPTER_VERSION_B` in 10/10 requests
Same process after A warm, load B under `tenant-adapter-v2`	`BETA_ADAPTER_VERSION_B` in 10/10 requests
Prefix-cache server, unload A then load B under same name	`BETA_VERSION_B` in 10/10 requests
Prefix-cache server with `--enforce-eager`, A to B via `load_inplace=true`	`BETA_VERSION_B` in 4/4 requests
Default no-prefix-cache mode matrix, A to B via `load_inplace=true`	`BETA_ADAPTER_VERSION_B` in 4/4 requests
Concurrent run: 32 A warm requests, idle A to B reload, 32 B requests	A warm 32/32 exact A; B after reload 32/32 `BETA_VERSION_B`

Representative token IDs:

Cold/wanted B output BETA_ADAPTER_VERSION_B: [33, 20695, 79602, 10678, 1668]
B after same-name runtime reload with prefix-cache hits: BETA_VERSION_B: [33, 20695, 10678, 1668]
A output ALPHA_ADAPTER_VERSION_A: [969, 28222, 79602, 10678, 1566]

The reverse B to A run returned exact A in this validation, so I do not claim symmetry. The forward A to B failure reproduced across the main run, unload/reload same-name run, enforce-eager prefix-cache run, and concurrent run.

Prefix-cache evidence

The validation harness attempted to scrape /metrics, but prefix-cache counters were not present or not parsed in this run. The raw vLLM server logs show cache reuse during the failing path:

server_phase4_main.txt: after loading A at 23:32:27 and loading B with load_inplace=true at 23:32:29,
vLLM logged "Prefix cache hit rate: 94.4%" at 23:32:33.

server_phase4_main.txt: the same main run later logged "Prefix cache hit rate: 91.2%".

server_mode_enforce_eager_prefix.txt: enforce-eager with prefix caching logged "Prefix cache hit rate: 85.5%".

server_concurrency.txt: concurrent same-name reload logged "Prefix cache hit rate: 97.2%".

server_control_no_prefix.txt: the no-prefix-cache control logged "Prefix cache hit rate: 0.0%" and returned exact B.

Expected behavior

Replacing or reloading a LoRA adapter at runtime under an existing lora_name should either:

invalidate all prefix-cache blocks that depend on the previous adapter version, or
include a stable adapter version, path, content hash, or runtime generation identity in the prefix-cache key.

Actual behavior

After same-name runtime replacement, vLLM can reuse prefix-cache blocks computed with the previous adapter version. The request after reload returns a deterministic non-cold B output (BETA_VERSION_B) while cold B, fresh-server B, unique-name B, changed-salt B, modified-prompt B, and no-prefix-cache B all return the correct BETA_ADAPTER_VERSION_B.

Suspected cause

Static inspection suggests the runtime update path and prefix-cache key path are not versioning the adapter:

Runtime /v1/load_lora_adapter accepts lora_name, lora_path, and load_inplace.
Same-name load_inplace=true reuses the existing adapter's lora_int_id.
The runtime load path replaces or reloads adapter weights but does not appear to call a prefix-cache reset or targeted eviction path.
The runtime unload path deletes the serving-layer adapter entry but does not appear to reset prefix-cache state either.
vllm/v1/core/kv_cache_utils.py uses LoRA extra hash keys, but the LoRA-specific key is only request.lora_request.lora_name. I did not see lora_path, adapter content hash, adapter config hash, lora_int_id, or a runtime version counter in the block hash key.
cache_salt is included in the first block hash, and changing cache_salt restores the correct B output, consistent with stale blocks being bypassed.

Duplicate search performed

I searched open and closed issues for:

load_inplace prefix cache lora
load_lora_adapter prefix cache
unload_lora_adapter prefix cache
runtime lora prefix cache
dynamic lora prefix cache wrong output
same lora_name prefix cache
lora path prefix cache
lora adapter version cache
load_inplace stale kv cache
LoRA runtime updating stale output
LoRA cache invalidation vLLM

I found related issues #30931 and #38606, described above, but did not find an exact issue for runtime /v1/load_lora_adapter or /v1/unload_lora_adapter replacement/reload failing to invalidate prefix-cache blocks after the adapter behind the same lora_name changes.

Before submitting a new issue...

I searched existing and past issues for relevant reports. I could not access the docs-page chatbot from this noninteractive environment, but I checked the current issue template and public contributing docs before posting.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

Replacing or reloading a LoRA adapter at runtime under an existing lora_name should either:

invalidate all prefix-cache blocks that depend on the previous adapter version, or
include a stable adapter version, path, content hash, or runtime generation identity in the prefix-cache key.

#api #ssr #agent setup #task chaining #parallel task

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: Runtime LoRA same-name reload can reuse stale prefix-cache blocks from previous adapter version

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Code Example

Your current environment

Describe the bug

Why this matters

Reproduction summary

Observed results

Prefix-cache evidence

Expected behavior

Actual behavior

Suspected cause

Duplicate search performed

Before submitting a new issue...

FAQ

Expected behavior

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Runtime LoRA same-name reload can reuse stale prefix-cache blocks from previous adapter version

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Code Example

Your current environment

Describe the bug

Why this matters

Reproduction summary

Observed results

Prefix-cache evidence

Expected behavior

Actual behavior

Suspected cause

Duplicate search performed

Before submitting a new issue...

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING