vllm - ✅(Solved) Fix [Bug]: LMCache does not work with vLLM 0.17.0 (Qwen3Next) [1 pull request, 3 comments, 3 participants]

GitHub stats
vllm-project/vllm#36771 (fetched 2026-04-08 00:34:55)
Comments: 3 · Participants: 3 · Timeline events: 13 · Reactions: 3
Timeline (top): subscribed ×6, commented ×3, cross-referenced ×2, labeled ×1

Error Message

ImportError: /usr/local/lib/python3.12/dist-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib
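You can decode the mangled symbol to see which function the extension fails to find. In practice, `c++filt` does this. The sketch below is a minimal, hypothetical demangler that understands only the few Itanium C++ ABI constructs in this particular symbol: nested names, a handful of builtin types, `const`, pointers, and `S<n>_` back-references. It is not a general-purpose tool.

```python
from typing import List

# Builtin type codes from the Itanium C++ ABI (only a small subset).
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "i": "int", "j": "unsigned int",
    "l": "long", "m": "unsigned long", "f": "float", "d": "double",
}


def demangle(symbol: str) -> str:
    """Demangle a small subset of Itanium C++ ABI names: _ZN<names>E<params>."""
    if not symbol.startswith("_ZN"):
        raise ValueError("this sketch only handles nested names starting with _ZN")
    pos = 3
    substitutions: List[str] = []  # candidates for S_, S0_, S1_, ... in order

    def source_name() -> str:
        nonlocal pos
        start = pos
        while symbol[pos].isdigit():
            pos += 1
        length = int(symbol[start:pos])
        name = symbol[pos:pos + length]
        pos += length
        return name

    # Nested name: each <length><identifier> component up to the closing 'E'.
    parts: List[str] = []
    while symbol[pos] != "E":
        parts.append(source_name())
    pos += 1
    # Enclosing scopes (e.g. c10, c10::cuda) are substitution candidates;
    # the function's own name is not.
    for k in range(1, len(parts)):
        substitutions.append("::".join(parts[:k]))

    def parse_type() -> str:
        nonlocal pos
        code = symbol[pos]
        pos += 1
        if code in BUILTIN_TYPES:
            return BUILTIN_TYPES[code]  # builtins are never substitution candidates
        if code == "K":
            result = parse_type() + " const"
        elif code == "P":
            result = parse_type() + "*"
        elif code == "S":
            end = symbol.index("_", pos)
            seq = symbol[pos:end]
            pos = end + 1
            # S_ is entry 0; S<n>_ (n in base 36) is entry n + 1.
            return substitutions[0 if seq == "" else int(seq, 36) + 1]
        else:
            raise ValueError(f"unsupported type code {code!r}")
        substitutions.append(result)  # const-qualified and pointer types are candidates
        return result

    params: List[str] = []
    while pos < len(symbol):
        params.append(parse_type())
    if params == ["void"]:
        params = []  # f(void) is printed as f()
    return "::".join(parts) + "(" + ", ".join(params) + ")"


if __name__ == "__main__":
    print(demangle("_ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib"))
```

For this symbol the sketch returns `c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool)`, and `c++filt` should print the same result. `c10::cuda` belongs to PyTorch. The undefined symbol therefore typically means LMCache's compiled `c_ops` extension expects a PyTorch function signature that the installed PyTorch does not export, which happens when the two were built against different PyTorch versions.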

Fix Action

Fixed

PR fix notes

PR #2863: Support hybrid KV cache models (Mamba + attention) in GPU connector V3

Description (problem / solution / changelog)

Summary

Adds support for hybrid KV cache models (Mamba/GDN + attention) in the V3 GPU connector. Models like Qwen3.5, Falcon-H1, and Jamba store multiple state tensors per recurrent layer, which causes build_kv_layer_groups and VLLMPagedMemGPUConnectorV3 to crash. Fixes #2845. Related: vllm-project/vllm#36771, #2221.
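Grouping code that expects exactly one cache tensor per layer breaks when a layer has several state tensors. The hypothetical sketch below shows one way to handle both cases; it is not LMCache's build_kv_layer_groups. It groups layers by the shapes of their cache states and first normalizes an attention layer's single shape into a one-element list, so attention layers and multi-state recurrent layers go through the same path. The layer names and shapes are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple, Union

Shape = Tuple[int, ...]
# Hypothetical model: an attention layer has one cache tensor (a single shape),
# while a recurrent (Mamba/GDN) layer has several state tensors (a list of shapes).
LayerCache = Union[Shape, Sequence[Shape]]


def as_state_shapes(cache: LayerCache) -> List[Shape]:
    """Normalize a layer's cache description to a list of state shapes."""
    if len(cache) > 0 and all(isinstance(dim, int) for dim in cache):
        return [tuple(cache)]  # a single tensor shape such as (2, 16, 8, 128)
    return [tuple(shape) for shape in cache]  # several state tensors


def group_layers(caches: Dict[str, LayerCache]) -> Dict[Tuple[Shape, ...], List[str]]:
    """Group layer names that share exactly the same list of state shapes."""
    groups: Dict[Tuple[Shape, ...], List[str]] = defaultdict(list)
    for name, cache in caches.items():
        groups[tuple(as_state_shapes(cache))].append(name)
    return dict(groups)


if __name__ == "__main__":
    caches = {
        "layers.0.attn": (2, 16, 8, 128),
        "layers.1.gdn": [(16, 4, 128), (16, 8, 128, 128)],
        "layers.2.attn": (2, 16, 8, 128),
    }
    for shapes, names in group_layers(caches).items():
        print(shapes, names)
```

The two attention layers end up in one group, and the recurrent layer forms its own group keyed by both of its state shapes. A version that read a single `.shape` per layer would fail on the recurrent entry instead.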

Changes

  • Import SupportsHMA and add it to LMCacheConnectorV1 class bases
  • Implement request_finished_all_groups() which combines block IDs from all KV cache groups and delegates to the existing request_finished handler
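The request_finished_all_groups() change can be pictured with the sketch below. ToyConnector, Request, and both method signatures are hypothetical stand-ins written for this illustration, not LMCache's or vLLM's actual classes. The sketch assumes that "combines" means concatenating the per-group block ID lists in group order. The (bool, dict or None) return value loosely mirrors vLLM's connector hook, but treat it as an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence, Tuple


@dataclass
class Request:
    """Hypothetical minimal request object."""
    request_id: str


class ToyConnector:
    """Hypothetical connector illustrating the delegation pattern only."""

    def __init__(self) -> None:
        self.finished: Dict[str, List[int]] = {}

    def request_finished(
        self, request: Request, block_ids: List[int]
    ) -> Tuple[bool, Optional[Dict[str, Any]]]:
        # Existing single-group handler: remember which blocks the request used.
        self.finished[request.request_id] = list(block_ids)
        return False, None  # (delay freeing the blocks?, extra transfer params)

    def request_finished_all_groups(
        self, request: Request, block_ids: Sequence[Sequence[int]]
    ) -> Tuple[bool, Optional[Dict[str, Any]]]:
        # Hybrid models hand over one block-ID list per KV cache group
        # (e.g. attention and recurrent groups); concatenate them in group
        # order and reuse the existing handler.
        combined = [block_id for group in block_ids for block_id in group]
        return self.request_finished(request, combined)


if __name__ == "__main__":
    connector = ToyConnector()
    connector.request_finished_all_groups(Request("req-1"), ([3, 4], [9]))
    print(connector.finished["req-1"])
```

Running it prints `[3, 4, 9]`. Delegating keeps a single code path for freeing and saving blocks, so the multi-group entry point only has to flatten its input.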

Testing

  • pytest tests/v1/test_kv_layer_groups_manager.py — 10/10 pass (9 existing + 1 new)
  • ruff check + ruff format clean
  • Qwen3.5-35B-A3B-GPTQ-Int4 (GDN + attention, 30 recurrent + 10 attn layers) on 2x RTX 3090, TP=2, 256K context, LMCache V3 + vLLM + prefix caching; tested with 1-8 parallel requests
  • Falcon-H1-7B-Instruct (Mamba-2 + attention, 44 recurrent + 44 attn layers): same setup, tested with 1-8 parallel requests

Changed files

  • lmcache/v1/gpu_connector/gpu_connectors.py (modified, +21/-8)
  • lmcache/v1/kv_layer_groups.py (modified, +30/-11)
  • tests/v1/test_kv_layer_groups_manager.py (modified, +21/-0)


Your current environment

  • vLLM version: 0.17.0
  • LMCache: latest nightly-2026-03-10 & vllm/vllm-openai:v0.17.0
  • Model: Qwen3-Coder-Next-FP8
  • Python: 3.12
  • Deployment: Kubernetes
  • Load format: runai_streamer

🐛 Describe the bug

We encountered two different problems when trying to use LMCache with vLLM 0.17.0.

Case 1: vllm/vllm-openai:v0.17.0 image

Using the original vllm/vllm-openai:v0.17.0 image with LMCache enabled fails for all tested models (GLM, Qwen-Coder, etc.).

LMCache initialization starts but crashes with a binary import error:

ImportError: /usr/local/lib/python3.12/dist-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib

Relevant log snippet:

Creating v1 connector with name: LMCacheConnectorV1
Initializing latest dev LMCache connector
ImportError: lmcache/c_ops.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib

Case 2: lmcache/vllm-openai image

Using the lmcache/vllm-openai image, GLM-4.7-Flash works with LMCache, but Qwen3-Coder-Next fails during startup.

The server crashes with:

ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.

This happens when LMCache is enabled via:

--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

Actual behavior

  • vllm-openai:v0.17.0 + LMCache → crashes with an undefined-symbol error when loading lmcache.c_ops.
  • lmcache/vllm-openai + hybrid KV models → crashes because the hybrid KV cache manager is disabled.



Fix Plan

To resolve the issues with LMCache and vLLM 0.17.0, follow these steps:

For Case 1: vllm/vllm-openai:v0.17.0 image

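  1. Update LMCache: Install an LMCache build compiled against the same PyTorch version that ships in the vLLM 0.17.0 image. A build made for a different PyTorch can fail to load with exactly this error.
  2. Check the PyTorch and CUDA versions: The missing symbol, c10::cuda::c10_cuda_check_implementation, is exported by PyTorch's c10 CUDA library (libc10_cuda). The error therefore usually means LMCache's compiled extension (c_ops) was built against a different PyTorch (or PyTorch CUDA) version than the one installed in the image. Compare torch.__version__ and torch.version.cuda with the versions your LMCache build targets.
  3. Rebuild LMCache: If no matching build is available, rebuild LMCache from source inside the image so that c_ops links against the installed PyTorch.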

Example CUDA version check:

nvcc --version
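The undefined symbol lives in PyTorch's c10 namespace rather than in the CUDA toolkit, so nvcc alone may not reveal the mismatch. It helps to print the PyTorch build inside the same container and to try loading the extension directly. The sketch below assumes only that torch and lmcache are (or should be) importable in the image. diagnose() is a hypothetical helper; lmcache.c_ops is the module path from the error message.

```python
import importlib
from typing import Dict, Optional


def diagnose() -> Dict[str, Optional[str]]:
    """Collect PyTorch build info and try to load LMCache's compiled extension."""
    report: Dict[str, Optional[str]] = {"torch": None, "torch_cuda": None, "lmcache_c_ops": None}
    try:
        import torch  # load PyTorch first so its c10 libraries are available

        report["torch"] = torch.__version__
        report["torch_cuda"] = torch.version.cuda  # None on CPU-only builds
    except (ImportError, OSError) as exc:
        report["torch"] = f"not importable: {exc}"
    try:
        importlib.import_module("lmcache.c_ops")
        report["lmcache_c_ops"] = "ok"
    except (ImportError, OSError) as exc:
        # An 'undefined symbol' message here means the extension and torch disagree.
        report["lmcache_c_ops"] = f"failed: {exc}"
    return report


if __name__ == "__main__":
    for key, value in diagnose().items():
        print(f"{key}: {value}")
```

Run it with the same Python interpreter that vLLM uses, for example from a shell inside the pod. If lmcache_c_ops reports an undefined symbol and the torch version differs from the one LMCache was built for, the cause is a version mismatch.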

For Case 2: lmcache/vllm-openai image

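  1. Enable the hybrid KV cache manager: The error means vLLM is running with the hybrid KV cache manager disabled. vLLM disables it when the active KV connector does not declare hybrid-model support (SupportsHMA). Qwen3-Coder-Next is a hybrid model (linear-attention plus full-attention layers), so it needs a connector that supports the hybrid manager. LMCache PR #2863 adds SupportsHMA to LMCacheConnectorV1 for this purpose.
  2. Unified KV Cache Specs: With the hybrid manager disabled, vLLM must convert every layer's KV cache spec to one unified type. Recurrent and attention specs cannot be unified, so for such models configuration alone cannot satisfy this step; step 1 is the actual fix.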
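The failure in step 2 can be modeled in a few lines. This is a simplified toy, not vLLM's implementation. AttentionSpec, MambaSpec, and unify_kv_cache_specs are hypothetical names, and the real conversion logic handles more spec types. The toy only shows why a mix of attention and recurrent specs cannot be collapsed into one type when the hybrid manager is off.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class AttentionSpec:
    """Hypothetical spec for a full-attention layer."""
    block_size: int
    num_kv_heads: int
    head_size: int


@dataclass(frozen=True)
class MambaSpec:
    """Hypothetical spec for a recurrent (Mamba/GDN) layer with several state tensors."""
    block_size: int
    state_shapes: Tuple[Tuple[int, ...], ...]


Spec = Union[AttentionSpec, MambaSpec]


def unify_kv_cache_specs(specs: Dict[str, Spec], hybrid_manager_enabled: bool) -> Dict[str, Spec]:
    """Toy check: without the hybrid manager, every layer must share one spec type."""
    spec_types = {type(spec) for spec in specs.values()}
    if hybrid_manager_enabled or len(spec_types) <= 1:
        # Either the hybrid manager keeps separate groups, or there is nothing to unify.
        return specs
    raise ValueError(
        "Hybrid KV cache manager is disabled but failed to convert "
        "the KV cache specs to one unified type."
    )


if __name__ == "__main__":
    # Illustrative numbers only; they are not Qwen3-Coder-Next's real dimensions.
    specs = {
        "layers.0.linear_attn": MambaSpec(block_size=16, state_shapes=((4, 128), (8, 128, 128))),
        "layers.3.self_attn": AttentionSpec(block_size=16, num_kv_heads=2, head_size=256),
    }
    print(len(unify_kv_cache_specs(specs, hybrid_manager_enabled=True)))
    try:
        unify_kv_cache_specs(specs, hybrid_manager_enabled=False)
    except ValueError as exc:
        print(exc)
```

A model with only attention layers, such as GLM-4.7-Flash in the report, passes the check even with the hybrid manager disabled. That matches the observation that it works while Qwen3-Coder-Next does not.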

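Example configuration change (the enable_hybrid_kv key is unverified; confirm that your vLLM version's KVTransferConfig accepts it before relying on it):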

--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both", "enable_hybrid_kv": true}'

Verification

  • For Case 1, verify that LMCache initializes without crashing after updating and rebuilding.
  • For Case 2, check that the server starts successfully with the hybrid KV cache manager enabled and that the startup log no longer shows the "failed to convert the KV cache specs" error.
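For the Case 2 check, you can confirm startup by polling the /health endpoint that vLLM's OpenAI-compatible server exposes. The sketch assumes the default port 8000 and uses only the standard library. wait_for_health is a hypothetical helper name, and the timeout values are arbitrary.

```python
import time
import urllib.request


def wait_for_health(url: str = "http://localhost:8000/health",
                    timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll a health URL until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError, connection refused, and socket timeouts are all OSError
            # subclasses: the server is not (yet) serving, so keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    print("server healthy" if wait_for_health() else "server did not become healthy")
```

If the server crashes during startup, the endpoint never answers and the helper returns False after the timeout; check the pod logs for the ValueError. In the Kubernetes deployment from the report, the same endpoint could back a readiness probe instead of a polling script.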

Extra Tips

  • Always check that your PyTorch and CUDA versions match the ones your LMCache build targets.
  • Refer to the official LMCache documentation for the latest configuration options and troubleshooting guides.
