vllm - [Bug]: Runtime LoRA same-name reload can reuse stale prefix-cache blocks from previous adapter version



Your current environment

<details> <summary>Sanitized output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.10.12
Python platform              : Linux-6.8.0-1053-gcp-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.41
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version        : 580.126.20
CUDA version from nvidia-smi : 13.0

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] torch==2.11.0+cu130
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.8.0
[pip3] triton==3.6.0
[pip3] peft==0.19.1
[pip3] accelerate==1.13.0
[pip3] datasets==4.8.5

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.2rc1.dev148+g0c2e9d489 (git sha: 0c2e9d489)
vLLM Build Flags             : CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled

Validation host summary:
- Cloud VM type: GCP Spot A100, a2-highgpu-1g, us-west4-b
- GPU type: nvidia-tesla-a100
- Boot image: pytorch-2-9-cu129-ubuntu-2204-nvidia-580-v20260430
- Boot disk: 300GB pd-ssd

The raw collect_env output was checked before posting. Project ids, public IPs, service accounts, local user paths, and other host identifiers are intentionally omitted from this public report.
</details>

Describe the bug

When replacing a LoRA adapter at runtime under the same lora_name, vLLM can reuse prefix-cache blocks computed with the previous adapter version.

In this repro, adapter B is validated independently and produces BETA_ADAPTER_VERSION_B on a cold server. After warming adapter A under the same name and replacing it with adapter B using /v1/load_lora_adapter with load_inplace=true, the same prompt returns BETA_VERSION_B instead.

That non-cold B output (BETA_VERSION_B, which is missing the _ADAPTER segment) appears only when prefix-cache blocks can be reused. The exact B output is restored when prefix caching is disabled, when cache_salt is changed, when the first prompt block is changed, when the server is restarted and B is loaded cold, or when B is loaded under a unique adapter name.
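Since a changed cache_salt restored the exact B output, one possible interim mitigation is to derive the salt from the adapter's contents so that it changes whenever the weights behind a name change. The sketch below is only an illustration: `directory_sha256` and `salted_completion_payload` are hypothetical helpers, the hashing scheme is not necessarily the one used for the directory hashes in this report, and it assumes the generation endpoint in use accepts a `cache_salt` field.

```python
import hashlib
from pathlib import Path


def directory_sha256(adapter_dir: str) -> str:
    """Hash every file under adapter_dir (relative path plus bytes) in sorted order.

    Hypothetical scheme for illustration; not necessarily how the report's
    adapter directory hashes were produced.
    """
    digest = hashlib.sha256()
    root = Path(adapter_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def salted_completion_payload(prompt: str, lora_name: str, adapter_dir: str) -> dict:
    """Build a request body whose cache_salt follows the adapter contents."""
    return {
        "model": lora_name,
        "prompt": prompt,
        # Sampling parameters copied from the report.
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 16,
        "seed": 0,
        # Assumption: the endpoint accepts cache_salt, as in the report's control.
        "cache_salt": directory_sha256(adapter_dir),
    }
```

With this approach, requests issued after a reload would carry a different salt and therefore miss the blocks cached for the previous adapter version, at the cost of the client having to know which adapter directory is currently loaded.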

I also reproduced the same non-cold B output after /v1/unload_lora_adapter followed by /v1/load_lora_adapter under the same lora_name, so the issue appears to be same-name runtime adapter reload/replacement without adapter-version cache invalidation, not only the load_inplace branch.

Related but not exact duplicates:

  • #30931 covers the known/general same-name LoRA prefix-cache collision with separately constructed LoRA requests. This report is narrower: runtime /v1/load_lora_adapter and /v1/unload_lora_adapter replacement under the same name, with fresh-server and cache-bypass controls.
  • #38606 covers KV block corruption under rapid LoRA adapter alternation. This repro does not require alternating adapters or a rapid alternation race; the main failing path is a single same-name runtime A to B reload with prefix-cache reuse.

Why this matters

Runtime LoRA replacement is useful for workflows where adapters are continuously updated under a stable tenant or adapter name. If prefix-cache keys only represent lora_name, or if runtime load/unload does not invalidate adapter-dependent KV cache state, requests after a reload can consume prefix-cache blocks produced with the previous adapter's weights.

Reproduction summary

I trained two tiny PEFT LoRA adapters for the same base model:

  • Base model: Qwen/Qwen2.5-3B-Instruct
  • Adapter A expected output: ALPHA_ADAPTER_VERSION_A
  • Adapter B expected output: BETA_ADAPTER_VERSION_B
  • PEFT type: LORA
  • Rank: 16
  • Alpha: 32
  • Dropout: 0.0
  • Target modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj
  • Adapter A directory hash: d09a1037c91af6216a3ddbf77efbd361a35c22ddbf54596c47da91fe3d63e8ea
  • Adapter B directory hash: 4c216f2f29105225e01e20e190665a8f9f68a2afd3af6e484729d52073f43032

The fixed prompt was 3,932 tokens long. It used a repeated deterministic prefix and ended with:

Now answer with exactly the adapter's private label and nothing else.

Sampling parameters for all comparisons:

{
  "temperature": 0,
  "top_p": 1,
  "max_tokens": 16,
  "seed": 0
}

Prefix-cache server command:

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
vllm serve Qwen/Qwen2.5-3B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name base \
  --trust-remote-code \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching

No-prefix-cache control command:

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
vllm serve Qwen/Qwen2.5-3B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name base \
  --trust-remote-code \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching

Load adapter A:

{
  "lora_name": "tenant-adapter",
  "lora_path": "/tmp/lora_A"
}

Warm A by querying the fixed prompt repeatedly with model="tenant-adapter".

Replace A with B in the same running server:

{
  "lora_name": "tenant-adapter",
  "lora_path": "/tmp/lora_B",
  "load_inplace": true
}

The endpoint returned HTTP 200:

Success: LoRA adapter 'tenant-adapter' added successfully.

Then query the exact same prompt again with the same sampling parameters.
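The steps above can be scripted roughly as follows. This is a minimal sketch using only the standard library, not the validation harness from this report. The adapter paths and name are taken from the request bodies above; the use of `/v1/completions` is an assumption (the report does not say which generation endpoint was queried), and the fixed prompt itself is not reproduced here, so it is passed in as an argument.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # matches --host and --port in the serve command


def load_adapter_payload(lora_name, lora_path, load_inplace=False):
    """Build the JSON body for POST /v1/load_lora_adapter."""
    payload = {"lora_name": lora_name, "lora_path": lora_path}
    if load_inplace:
        payload["load_inplace"] = True
    return payload


def post_json(route, payload, base_url=BASE_URL):
    """POST a JSON body and return the raw response text."""
    request = urllib.request.Request(
        base_url + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def query(prompt, lora_name="tenant-adapter"):
    """Send the fixed prompt with the report's sampling parameters."""
    payload = {
        "model": lora_name,
        "prompt": prompt,
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 16,
        "seed": 0,
    }
    # Assumption: the completions endpoint; adjust if the chat endpoint is used.
    body = json.loads(post_json("/v1/completions", payload))
    return body["choices"][0]["text"]


def run_repro(prompt, warm_requests=12, check_requests=10):
    """Load A, warm the prefix cache, replace A with B in place, query again."""
    post_json("/v1/load_lora_adapter",
              load_adapter_payload("tenant-adapter", "/tmp/lora_A"))
    warm = [query(prompt) for _ in range(warm_requests)]
    post_json("/v1/load_lora_adapter",
              load_adapter_payload("tenant-adapter", "/tmp/lora_B", load_inplace=True))
    after = [query(prompt) for _ in range(check_requests)]
    return warm, after
```

Calling `run_repro(prompt)` against the prefix-cache server and then against the no-prefix-cache control gives the two sides of the comparison described in this report.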

Observed results

| Scenario | Result |
| --- | --- |
| Cold no-prefix A, same name | ALPHA_ADAPTER_VERSION_A in 10/10 requests |
| Cold no-prefix B, same name | BETA_ADAPTER_VERSION_B in 10/10 requests |
| Fresh prefix-cache server, load B directly under same name | BETA_ADAPTER_VERSION_B in 10/10 requests |
| Prefix-cache server, warm A | ALPHA_ADAPTER_VERSION_A in 12/12 requests |
| Prefix-cache server, A to B via load_inplace=true, same name | BETA_VERSION_B in 10/10 requests |
| Same A to B reload with prefix caching disabled | BETA_ADAPTER_VERSION_B in 10/10 requests |
| Same A to B reload, then query with changed cache_salt | BETA_ADAPTER_VERSION_B in 10/10 requests |
| Same A to B reload, then query with modified first prompt block | BETA_ADAPTER_VERSION_B in 10/10 requests |
| Same process after A warm, load B under tenant-adapter-v2 | BETA_ADAPTER_VERSION_B in 10/10 requests |
| Prefix-cache server, unload A then load B under same name | BETA_VERSION_B in 10/10 requests |
| Prefix-cache server with --enforce-eager, A to B via load_inplace=true | BETA_VERSION_B in 4/4 requests |
| Default no-prefix-cache mode matrix, A to B via load_inplace=true | BETA_ADAPTER_VERSION_B in 4/4 requests |
| Concurrent run: 32 A warm requests, idle A to B reload, 32 B requests | A warm 32/32 exact A; B after reload 32/32 BETA_VERSION_B |

Representative token IDs:

  • Cold/wanted B output BETA_ADAPTER_VERSION_B: [33, 20695, 79602, 10678, 1668]
  • B after same-name runtime reload with prefix-cache hits: BETA_VERSION_B: [33, 20695, 10678, 1668]
  • A output ALPHA_ADAPTER_VERSION_A: [969, 28222, 79602, 10678, 1566]
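To make the difference between these sequences explicit, the token IDs can be compared directly. The snippet below is a small illustrative check using `difflib`; the helper name `dropped_tokens` is hypothetical, and the snippet makes no claim about which text each token ID decodes to.

```python
from difflib import SequenceMatcher

# Token IDs copied from the list above.
cold_b = [33, 20695, 79602, 10678, 1668]      # BETA_ADAPTER_VERSION_B
stale_b = [33, 20695, 10678, 1668]            # BETA_VERSION_B after same-name reload
alpha_a = [969, 28222, 79602, 10678, 1566]    # ALPHA_ADAPTER_VERSION_A


def dropped_tokens(expected, observed):
    """Return the IDs present in expected but missing or replaced in observed."""
    matcher = SequenceMatcher(a=expected, b=observed, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            dropped.extend(expected[i1:i2])
    return dropped


missing = dropped_tokens(cold_b, stale_b)
print(missing)                 # [79602]
print(missing[0] in alpha_a)   # True
```

The reloaded B output differs from cold B by exactly one missing token ID (79602), and the remaining four IDs are unchanged. That same ID also appears in A's output.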

The reverse B to A run returned exact A in this validation, so I do not claim symmetry. The forward A to B failure reproduced across the main run, unload/reload same-name run, enforce-eager prefix-cache run, and concurrent run.

Prefix-cache evidence

The validation harness attempted to scrape /metrics, but prefix-cache counters were not present or not parsed in this run. The raw vLLM server logs show cache reuse during the failing path:

server_phase4_main.txt: after loading A at 23:32:27 and loading B with load_inplace=true at 23:32:29,
vLLM logged "Prefix cache hit rate: 94.4%" at 23:32:33.

server_phase4_main.txt: the same main run later logged "Prefix cache hit rate: 91.2%".

server_mode_enforce_eager_prefix.txt: enforce-eager with prefix caching logged "Prefix cache hit rate: 85.5%".

server_concurrency.txt: concurrent same-name reload logged "Prefix cache hit rate: 97.2%".

server_control_no_prefix.txt: the no-prefix-cache control logged "Prefix cache hit rate: 0.0%" and returned exact B.
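Because /metrics could not be scraped in this run, the hit rates were read from the server logs. A minimal sketch of extracting them programmatically, assuming the log lines contain the literal text quoted above (the helper name is hypothetical):

```python
import re

# Matches the literal message quoted from the server logs in this report.
HIT_RATE = re.compile(r"Prefix cache hit rate: (\d+(?:\.\d+)?)%")


def prefix_cache_hit_rates(log_text):
    """Return every logged prefix-cache hit rate as a float percentage."""
    return [float(value) for value in HIT_RATE.findall(log_text)]


sample = (
    'vLLM logged "Prefix cache hit rate: 94.4%" at 23:32:33.\n'
    'the same main run later logged "Prefix cache hit rate: 91.2%".\n'
    'the no-prefix-cache control logged "Prefix cache hit rate: 0.0%".\n'
)
print(prefix_cache_hit_rates(sample))  # [94.4, 91.2, 0.0]
```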

Expected behavior

Replacing or reloading a LoRA adapter at runtime under an existing lora_name should either:

  • invalidate all prefix-cache blocks that depend on the previous adapter version, or
  • include a stable adapter version, path, content hash, or runtime generation identity in the prefix-cache key.
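Both options can be illustrated with a toy model. The sketch below is not vLLM code: `AdapterRegistry`, `ToyPrefixCache`, and `block_key` are hypothetical names, and the cache is a plain dictionary. It only shows that a per-name generation counter changes the key on reload, and that targeted eviction removes blocks tied to a reloaded name.

```python
import hashlib


class AdapterRegistry:
    """Tracks a generation counter per lora_name (hypothetical)."""

    def __init__(self):
        self._generation = {}

    def load(self, lora_name):
        # Every load or reload under the same name bumps the generation.
        self._generation[lora_name] = self._generation.get(lora_name, 0) + 1
        return self._generation[lora_name]

    def cache_identity(self, lora_name):
        return (lora_name, self._generation[lora_name])


def block_key(token_ids, adapter_identity):
    """Toy block key: hash of the tokens plus the adapter identity."""
    material = repr((tuple(token_ids), adapter_identity))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


class ToyPrefixCache:
    """Dictionary-backed stand-in for a prefix cache."""

    def __init__(self):
        self._blocks = {}

    def put(self, key, lora_name, value):
        self._blocks[key] = (lora_name, value)

    def get(self, key):
        entry = self._blocks.get(key)
        return None if entry is None else entry[1]

    def evict_adapter(self, lora_name):
        """Option 1: drop every block produced under lora_name."""
        stale = [k for k, (name, _) in self._blocks.items() if name == lora_name]
        for key in stale:
            del self._blocks[key]
        return len(stale)


registry = AdapterRegistry()
cache = ToyPrefixCache()

registry.load("tenant-adapter")  # adapter A
key_a = block_key([1, 2, 3], registry.cache_identity("tenant-adapter"))
cache.put(key_a, "tenant-adapter", "kv-from-A")

registry.load("tenant-adapter")  # adapter B, same name
key_b = block_key([1, 2, 3], registry.cache_identity("tenant-adapter"))

# Option 2: the generation is part of the key, so B cannot hit A's block.
print(cache.get(key_b))  # None
```

Either mechanism alone would prevent the reuse in this toy; which one fits vLLM's scheduler and block manager is a design decision for the maintainers.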

Actual behavior

After same-name runtime replacement, vLLM can reuse prefix-cache blocks computed with the previous adapter version. The request after reload returns a deterministic non-cold B output (BETA_VERSION_B) while cold B, fresh-server B, unique-name B, changed-salt B, modified-prompt B, and no-prefix-cache B all return the correct BETA_ADAPTER_VERSION_B.

Suspected cause

Static inspection suggests that neither the runtime update path nor the prefix-cache key path versions the adapter:

  • Runtime /v1/load_lora_adapter accepts lora_name, lora_path, and load_inplace.
  • Same-name load_inplace=true reuses the existing adapter's lora_int_id.
  • The runtime load path replaces or reloads adapter weights but does not appear to call a prefix-cache reset or targeted eviction path.
  • The runtime unload path deletes the serving-layer adapter entry but does not appear to reset prefix-cache state either.
  • vllm/v1/core/kv_cache_utils.py uses LoRA extra hash keys, but the LoRA-specific key is only request.lora_request.lora_name. I did not see lora_path, adapter content hash, adapter config hash, lora_int_id, or a runtime version counter in the block hash key.
  • cache_salt is included in the first block hash, and changing cache_salt restores the correct B output, consistent with stale blocks being bypassed.
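The last two points can be illustrated with a toy chained block hash. This is a simplified model and not the code in vllm/v1/core/kv_cache_utils.py: `hash_block` and `chain_hashes` are hypothetical, and the block size and token values are arbitrary. It assumes only what is described above, namely that each block hash depends on its parent hash, the block's tokens, and extra keys that contain lora_name, with cache_salt added to the first block.

```python
import hashlib


def hash_block(parent_hash, block_tokens, extra_keys):
    """Toy block hash over the parent hash, the tokens, and extra keys."""
    material = repr((parent_hash, tuple(block_tokens), tuple(extra_keys)))
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def chain_hashes(token_ids, block_size, lora_name, cache_salt=None):
    """Hash full blocks in order; each hash chains on the previous one."""
    hashes = []
    parent = None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        extra = [lora_name]  # name only: no path, content hash, or version
        if start == 0 and cache_salt is not None:
            extra.append(cache_salt)  # salt enters only the first block
        parent = hash_block(parent, token_ids[start:start + block_size], extra)
        hashes.append(parent)
    return hashes


tokens = list(range(64))

# Nothing about the adapter weights enters the key, so the hashes computed
# before and after a same-name reload are identical.
before_reload = chain_hashes(tokens, 16, "tenant-adapter")
after_reload = chain_hashes(tokens, 16, "tenant-adapter")
print(before_reload == after_reload)  # True

# Changing the salt changes the first block hash, and chaining propagates
# the change to every later block.
salted = chain_hashes(tokens, 16, "tenant-adapter", cache_salt="v2")
print(any(x == y for x, y in zip(before_reload, salted)))  # False
```

In this model a changed salt or a modified first prompt block bypasses every cached block for the prompt, which is consistent with both of those controls restoring the exact B output.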

Duplicate search performed

I searched open and closed issues for:

  • load_inplace prefix cache lora
  • load_lora_adapter prefix cache
  • unload_lora_adapter prefix cache
  • runtime lora prefix cache
  • dynamic lora prefix cache wrong output
  • same lora_name prefix cache
  • lora path prefix cache
  • lora adapter version cache
  • load_inplace stale kv cache
  • LoRA runtime updating stale output
  • LoRA cache invalidation vLLM

I found related issues #30931 and #38606, described above, but did not find an exact issue for runtime /v1/load_lora_adapter or /v1/unload_lora_adapter replacement/reload failing to invalidate prefix-cache blocks after the adapter behind the same lora_name changes.

Before submitting a new issue...

  • I searched existing and past issues for relevant reports. I could not access the docs-page chatbot from this noninteractive environment, but I checked the current issue template and public contributing docs before posting.
