vllm - ✅(Solved) Fix [Bug]: LMCache does not work with vLLM 0.17.0 (Qwen3Next) [1 pull request, 3 comments, 3 participants]

Sanches166 · 2026-03-11T10:02:06Z


GitHub stats: vllm-project/vllm#36771, fetched 2026-04-08 00:34:55

Timeline (top): subscribed ×6, commented ×3, cross-referenced ×2, labeled ×1

Error Message

ImportError: /usr/local/lib/python3.12/dist-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib
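
To see which library the missing symbol belongs to, it helps to read the namespace out of the mangled name. The sketch below is a deliberately minimal parser for the simple `_ZN<len><name>...E` form of Itanium C++ mangling; it is not a full demangler (a tool such as `c++filt` handles the general case), and the function name `nested_name_components` is made up for this example.

```python
def nested_name_components(mangled: str) -> list[str]:
    """Return the namespace/function parts of a simple Itanium nested name.

    Handles only the `_ZN<len><id><len><id>...E` form; templates and
    substitutions inside the name are not supported.
    """
    prefix = "_ZN"
    if not mangled.startswith(prefix):
        raise ValueError("not a nested Itanium-mangled name")
    pos = len(prefix)
    parts = []
    while pos < len(mangled) and mangled[pos] != "E":
        start = pos
        while pos < len(mangled) and mangled[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("unsupported construct in mangled name")
        length = int(mangled[start:pos])
        parts.append(mangled[pos:pos + length])
        pos += length
    return parts


SYMBOL = "_ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib"
print("::".join(nested_name_components(SYMBOL)))
```

The `c10::cuda` namespace belongs to PyTorch, which points at a mismatch between the LMCache binary extension and the installed PyTorch rather than at LMCache's Python code.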

Fix Action

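Fix proposed (PR open, not merged)

Addressed by PR: Support hybrid KV cache models (Mamba + attention) in GPU connector V3 (https://github.com/LMCache/LMCache/pull/2863). The PR targets the hybrid KV cache failure (Case 2 in the issue body); its description does not mention the undefined-symbol import error shown above (Case 1).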

PR fix notes

PR #2863: Support hybrid KV cache models (Mamba + attention) in GPU connector V3

Repository: LMCache/LMCache
Author: oceanplexian
State: open | merged: False
Link: https://github.com/LMCache/LMCache/pull/2863

Description (problem / solution / changelog)

Summary

Adds support for hybrid KV cache models (Mamba/GDN + attention) in the V3 GPU connector. Models like Qwen3.5, Falcon-H1, and Jamba store multiple state tensors per recurrent layer, which crashes `build_kv_layer_groups` and `VLLMPagedMemGPUConnectorV3`. Fixes #2845. Related: vllm-project/vllm#36771, #2221.
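
As a rough illustration of the problem described in the summary, the sketch below groups layers by the full tuple of tensor specs they carry, so a recurrent layer with several state tensors does not break a grouping step that would otherwise assume one tensor per layer. This is not LMCache's `build_kv_layer_groups`; the function name, layer names, shapes, and dtypes are all hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in: each layer maps to one or more (shape, dtype) specs.
# Attention layers carry a single KV tensor; recurrent layers carry several
# state tensors.
LayerSpecs = dict[str, tuple[tuple[tuple[int, ...], str], ...]]


def group_layers_by_spec(layers: LayerSpecs) -> list[list[str]]:
    """Group layer names whose tensor specs are identical (first-seen order)."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for name, specs in layers.items():
        # Key on the whole tuple of specs, not on a single tensor.
        groups[tuple(specs)].append(name)
    return list(groups.values())


example_layers: LayerSpecs = {
    "layers.0.attn": (((2, 128, 16, 8, 64), "float16"),),
    "layers.1.gdn": (((128, 4, 32), "float16"), ((128, 8, 64, 64), "float32")),
    "layers.2.gdn": (((128, 4, 32), "float16"), ((128, 8, 64, 64), "float32")),
    "layers.3.attn": (((2, 128, 16, 8, 64), "float16"),),
}
print(group_layers_by_spec(example_layers))
```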

Changes

- Import `SupportsHMA` and add it to the `LMCacheConnectorV1` class bases
- Implement `request_finished_all_groups()`, which combines block IDs from all KV cache groups and delegates to the existing `request_finished` handler
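
The second change can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the PR's code: `SketchConnector` is hypothetical, the request is reduced to a plain string ID, and the return value of `request_finished` is a placeholder.

```python
from typing import Any, Optional


class SketchConnector:
    """Hypothetical stand-in for a connector; not the LMCache class."""

    def request_finished(
        self, request_id: str, block_ids: list[int]
    ) -> tuple[bool, Optional[dict[str, Any]]]:
        # Placeholder for the existing single-group handler.
        return False, {"request_id": request_id, "num_blocks": len(block_ids)}

    def request_finished_all_groups(
        self, request_id: str, block_ids: tuple[list[int], ...]
    ) -> tuple[bool, Optional[dict[str, Any]]]:
        # Flatten the per-group block IDs into one list, then delegate.
        combined = [bid for group in block_ids for bid in group]
        return self.request_finished(request_id, combined)


result = SketchConnector().request_finished_all_groups("req-1", ([1, 2], [7], []))
print(result)
```

Delegating keeps a single code path for cleanup, at the cost of discarding which group each block came from.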

Testing

- [x] `pytest tests/v1/test_kv_layer_groups_manager.py`: 10/10 pass (9 existing + 1 new)
- [x] `ruff check` + `ruff format` clean
- [x] Qwen3.5-35B-A3B-GPTQ-Int4 (GDN + attention, 30 recurrent + 10 attn layers) on 2x RTX 3090, TP=2, 256K context, LMCache V3 + vllm + prefix caching, tested with 1-8 parallel requests
- [x] Falcon-H1-7B-Instruct (Mamba-2 + attention, 44 recurrent + 44 attn layers), same setup, tested with 1-8 parallel requests

Changed files

- `lmcache/v1/gpu_connector/gpu_connectors.py` (modified, +21/-8)
- `lmcache/v1/kv_layer_groups.py` (modified, +30/-11)
- `tests/v1/test_kv_layer_groups_manager.py` (modified, +21/-0)

Your current environment

- vLLM version: 0.17.0
- LMCache: latest nightly-2026-03-10 & vllm/vllm-openai:v0.17.0
- Model: Qwen3-Coder-Next-FP8
- Python: 3.12
- Deployment: Kubernetes
- Load format: runai_streamer

🐛 Describe the bug

We encountered two different problems when trying to use LMCache with vLLM 0.17.0.

Case 1: vllm/vllm-openai:v0.17.0 image

Using the original vllm/vllm-openai:v0.17.0 image with LMCache enabled fails for all tested models (GLM, Qwen-Coder, etc.).

LMCache initialization starts but crashes with a binary import error:

ImportError: /usr/local/lib/python3.12/dist-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib

Relevant log snippet:

Creating v1 connector with name: LMCacheConnectorV1
Initializing latest dev LMCache connector
ImportError: lmcache/c_ops.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib

Case 2: lmcache/vllm-openai image

Using the lmcache/vllm-openai image:

- GLM-4.7-Flash works with LMCache
- Qwen3-Coder-Next fails during startup

The server crashes with:

ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.

This happens when LMCache is enabled via:

--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
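
When this flag is generated by a deployment script rather than typed by hand, building the JSON programmatically avoids quoting mistakes. A minimal sketch, using only the two keys that appear in the issue:

```python
import json
import shlex

# The two keys below are the ones shown in the issue; nothing else is assumed.
kv_transfer_config = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}

# Compact separators reproduce the spacing used in the issue.
value = json.dumps(kv_transfer_config, separators=(",", ":"))
args = ["--kv-transfer-config", value]

# shlex.join quotes the JSON so it survives a POSIX shell unchanged.
print(shlex.join(args))
```

In a Kubernetes manifest the `args` list can be passed as separate container arguments, in which case no shell quoting is needed at all.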

Actual behavior

- vllm-openai:v0.17.0 + LMCache → crashes with lmcache.c_ops undefined symbol error.
- lmcache/vllm-openai + hybrid KV models → crashes due to hybrid KV cache manager being disabled.

extent analysis

Fix Plan

To resolve the issues with LMCache and vLLM 0.17.0, follow these steps:

For Case 1: vllm/vllm-openai:v0.17.0 image

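Match LMCache to the installed PyTorch: The missing symbol demangles to c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool), which is provided by PyTorch's CUDA support library (libc10_cuda). The error therefore most likely means that lmcache/c_ops was compiled against a different PyTorch build than the one shipped in vllm/vllm-openai:v0.17.0.
Update LMCache: Install an LMCache release or nightly that was built for the PyTorch version used by vLLM 0.17.0.
Rebuild LMCache: If no matching wheel exists, or if you use a custom build, rebuild LMCache from source inside the target image so that the extension links against the installed PyTorch and CUDA toolkit.

Example version checks:

python -c "import torch; print(torch.__version__, torch.version.cuda)"
nvcc --version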
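
A quick way to reproduce Case 1 outside the server is to import the extension directly inside the container. The sketch below assumes only the module name `lmcache.c_ops` from the issue's error message and that `torch` is the PyTorch package in the image; the helper names are made up for this example.

```python
import importlib


def try_import(module_name: str) -> tuple[bool, str]:
    """Import a module by name and report success or the error text."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # an undefined symbol surfaces as ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


def report(modules=("torch", "lmcache.c_ops")) -> dict[str, tuple[bool, str]]:
    """Try each module in order and collect the results."""
    return {name: try_import(name) for name in modules}


if __name__ == "__main__":
    for name, (ok, detail) in report().items():
        print(f"{name}: {'OK' if ok else 'FAILED'} ({detail})")
```

If `torch` imports but `lmcache.c_ops` fails with the undefined-symbol message, the extension and the installed PyTorch do not match.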

For Case 2: lmcache/vllm-openai image

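Hybrid KV cache manager: The error message says that vLLM's hybrid KV cache manager is disabled while the model's KV cache specs cannot be converted to one unified type, which is the situation for hybrid models such as Qwen3-Coder-Next. The connector itself has to support hybrid models; PR #2863 adds this by making LMCacheConnectorV1 implement SupportsHMA.
Use a build that contains the fix: PR #2863 is listed as open and unmerged, so hybrid models need an LMCache build that includes it (or an equivalent change). Non-hybrid models such as GLM-4.7-Flash already work with the lmcache/vllm-openai image.

The connector configuration from the issue stays the same; neither the issue nor the PR describes an extra key for enabling hybrid KV support:

--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'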
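
One hedged way to tell whether an installed connector class carries the PR's change is to look for the `request_finished_all_groups` method named in the PR description. Treating the presence of that method as a proxy for hybrid support is an assumption, and the two classes below are dummies that stand in for the real connector class.

```python
def implements_all_groups_hook(connector_cls: type) -> bool:
    """True if the class exposes a callable request_finished_all_groups."""
    return callable(getattr(connector_cls, "request_finished_all_groups", None))


class OldConnector:
    """Dummy: a connector without the hook."""

    def request_finished(self, request, block_ids):
        return False, None


class PatchedConnector(OldConnector):
    """Dummy: a connector with the hook added."""

    def request_finished_all_groups(self, request, block_ids):
        combined = [bid for group in block_ids for bid in group]
        return self.request_finished(request, combined)


print(implements_all_groups_hook(OldConnector), implements_all_groups_hook(PatchedConnector))
```

In a real environment, pass the imported connector class instead of the dummies.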

Verification

For Case 1, verify that startup proceeds past "Initializing latest dev LMCache connector" without the ImportError for lmcache/c_ops.
For Case 2, verify that the server starts with Qwen3-Coder-Next and that the ValueError about the hybrid KV cache manager no longer appears in the startup log.
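
Both checks can be automated by scanning the startup log for the two error strings quoted in the issue. A minimal sketch; the signature labels and the function name are made up for this example:

```python
# Error substrings copied from the issue; the dictionary keys are labels
# invented for this sketch.
SIGNATURES = {
    "case1_undefined_symbol": (
        "undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib"
    ),
    "case2_hybrid_kv": (
        "Hybrid KV cache manager is disabled but failed to convert "
        "the KV cache specs to one unified type"
    ),
}


def find_known_failures(log_text: str) -> list[str]:
    """Return the labels of all known failure signatures found in the log."""
    return [label for label, needle in SIGNATURES.items() if needle in log_text]


sample_log = (
    "Creating v1 connector with name: LMCacheConnectorV1\n"
    "Initializing latest dev LMCache connector\n"
    "ImportError: lmcache/c_ops.cpython-312-x86_64-linux-gnu.so: "
    "undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib\n"
)
print(find_known_failures(sample_log))
```

An empty list means neither known failure appeared; it does not by itself prove that the server is healthy.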

Extra Tips

Check that your LMCache build matches both the PyTorch and the CUDA version in your image.
Refer to the official LMCache documentation for the latest configuration options and troubleshooting guides.
