vllm - ✅(Solved) Fix Hybrid KV offload: MultiConnector + planner for mamba+attention models [3 pull requests, 1 participants]

malaiwah · 2026-03-26T12:06:46Z

# PR #2863: Support hybrid KV cache models (Mamba + attention) in GPU connector V3

- Repository: LMCache/LMCache
- Author: oceanplexian
- State: open | merged: False
- Link: https://github.com/LMCache/LMCache/pull/2863

## Description (problem / solution / changelog)

## Summary

Adds support for hybrid KV cache models (Mamba/GDN + attention) in the V3 GPU connector. Models like Qwen3.5, Falcon-H1, and Jamba store multiple state tensors per recurrent layer, which crashes `build_kv_layer_groups` and `VLLMPagedMemGPUConnectorV3`.

Fixes #2845. Related: vllm-project/vllm#36771, #2221.

**Changes**

- Import `SupportsHMA` and add it to `LMCacheConnectorV1` class bases
- Implement `request_finished_all_groups()` which combines block IDs from all KV cache groups and delegates to the existing `request_finished` handler

## Testing

- [x] `pytest tests/v1/test_kv_layer_groups_manager.py`: 10/10 pass (9 existing + 1 new)
- [x] `ruff check` + `ruff format` clean
- [x] Qwen3.5-35B-A3B-GPTQ-Int4 (GDN + attention, 30 recurrent + 10 attn layers) on 2x RTX 3090, TP=2, 256K context, LMCache V3 + vllm + prefix caching, tested 1-8 parallel requests
- [x] Falcon-H1-7B-Instruct (Mamba-2 + attention, 44 recurrent + 44 attn layers). Same setup, tested 1-8 parallel requests

## Changed files

- `lmcache/integration/vllm/vllm_v1_adapter.py` (modified, +22/-0)
- `lmcache/v1/gpu_connector/gpu_connectors.py` (modified, +26/-8)
- `lmcache/v1/kv_layer_groups.py` (modified, +30/-11)
- `tests/v1/test_kv_layer_groups_manager.py` (modified, +21/-0)

---

# PR #466: Hybrid KV cache support for mamba+attention models

- Repository: llm-d/llm-d-kv-cache
- Author: malaiwah
- State: closed | merged: False
- Link: https://github.com/llm-d/llm-d-kv-cache/pull/466

## Description (problem / solution / changelog)

## Summary

Extends the llm-d fs-backend to support hybrid models like Qwen3.5 that interleave mamba and attention layers with different KV cache group structures.

Key changes:

- **Per-group file mappers and storage engines** in `worker.py`: each KV cache group gets its own file layout, tensor set, and I/O engine
- **Canonical tensor normalization**: attention tensors are reshaped from the backend's kernel block size to the deterministic vLLM page size, ensuring file sizes are identical across restarts and GPU hardware
- **Graceful group-disagree handling**: if some groups fail to load (stale cache), the scheduler takes the minimum prefix length and warns instead of crashing
- **Separated load/store engines** to avoid polling races
- **C++ tensor copier** extended for partial sub-block transfers and hybrid block offset/count support

## Test plan

Validated on Qwen3.5-4B-FP8 (4 KV groups: 3 mamba + 1 attention):

- [x] All 4 groups store/load correctly to NFS across container restarts
- [x] 99% cache hit on cold restart (17408/17435 tokens from NFS)
- [x] Canonical format deterministic: no file size mismatches across 3 consecutive restarts
- [x] Graceful fallback when group prefix lengths disagree (no crash)
- [x] Cross-references: LMCache/LMCache#2879, vllm-project/vllm#38230

Closes #465

> AI-assisted: developed with Claude. All changes reviewed and tested by a human.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Changed files

- `kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp` (modified, +99/-17)
- `kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.hpp` (modified, +7/-2)
- `kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp` (modified, +17/-4)
- `kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.cu` (modified, +84/-17)
- `kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp` (modified, +14/-2)
- `kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier_kernels.cu` (modified, +4/-0)
- `kv_connectors/llmd_fs_backend/llmd_fs_backend/manager.py` (modified, +24/-6)
- `kv_connectors/llmd_fs_backend/llmd_fs_backend/spec.py` (modified, +44/-26)
- `kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py` (modified, +547/-243)
- `kv_connectors/llmd_fs_backend/tests/conftest.py` (modified, +3/-2)
- `kv_connectors/llmd_fs_backend/tests/test_fs_backend.py` (modified, +383/-4)

---

# PR #38261: Hybrid KV offload: planner, MultiConnector, and mamba alignment for hybrid models

- Repository: vllm-project/vllm
- Author: malaiwah
- State: open | merged: False
- Link: https://github.com/vllm-project/vllm/pull/38261

## Description (problem / solution / changelog)

## Summary

Enables external KV cache offloading for hybrid models (mamba + attention) like Qwen3.5. The stock offload path requires LCM of all group block sizes, which is impractical when mamba groups have very different sizes from attention groups.

### Core changes

**HybridOffloadPlanner** (`v1/kv_offload/planner.py`):

- Configurable `hybrid_chunk_size` splits groups where `g


GitHub stats

vllm-project/vllm#38230•Fetched 2026-04-08 01:37:12

Author

malaiwah

Participants

malaiwah

Timeline (top)

cross-referenced ×4, subscribed ×1


Problem

Hybrid models like Qwen3.5 (mamba + attention layers) can't use external KV cache offloading out of the box. The stock offload path requires a block size equal to the least common multiple (LCM) of all group block sizes, which is impractical for hybrid models where mamba groups have very different block sizes than attention groups.
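To make the LCM problem concrete, here is a small arithmetic sketch. The block sizes are hypothetical and chosen only for illustration; real values depend on the model and attention backend.

```python
from math import lcm


def common_block_size(group_block_sizes):
    """Smallest block size (in tokens) that is a multiple of every group's block size."""
    return lcm(*group_block_sizes)


# Hypothetical sizes: one attention group with 48-token blocks,
# one mamba group with 1000-token blocks.
sizes = [48, 1000]
print(common_block_size(sizes))  # 6000
```

With these made-up numbers, a single offload unit would span 6000 tokens, far coarser than either group's native block size.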

What we built

A working hybrid offload stack for Qwen3.5-4B-FP8 (24 mamba + 8 attention layers), validated on RTX 4080 Super with max_model_len=98304:

HybridOffloadPlanner (vllm/v1/kv_offload/planner.py): configurable hybrid_chunk_size splits groups where gpu_block_size % chunk_size == 0, with per-group coverage tracking and binary search for efficient chunk counting.
MultiConnector (multi_connector.py): wraps multiple KV connectors (e.g., LMCache CPU + llm-d disk) with weighted load selection, HMA support (SupportsHMA), and preemption compatibility with stock vLLM's set[str] signature.
Metrics safety: clamp negative prompt_tokens_by_source values in loggers.py that crash Prometheus counters under high concurrency with external cache hits.
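One plausible reading of "per-group coverage tracking and binary search for efficient chunk counting" can be sketched as follows. This is an assumption, not the planner's actual code: the function names, the `coverage` layout (sorted chunk end offsets per group), and the group labels are all hypothetical. The minimum-across-groups rule mirrors the group-disagree handling described in PR #466.

```python
from bisect import bisect_right


def count_covered_chunks(chunk_ends, prefix_len):
    """Count chunks that end at or before prefix_len (chunk_ends sorted ascending)."""
    return bisect_right(chunk_ends, prefix_len)


def usable_prefix(coverage, prefix_len):
    """Longest prefix, in tokens, that every group can serve from its stored chunks."""
    per_group = []
    for chunk_ends in coverage.values():
        n = count_covered_chunks(chunk_ends, prefix_len)
        per_group.append(chunk_ends[n - 1] if n else 0)
    return min(per_group, default=0)


# Hypothetical coverage: the attention group has finer chunks than the mamba group.
coverage = {
    "attention": [256, 512, 768, 1024],
    "mamba": [512, 1024],
}
print(usable_prefix(coverage, 900))  # 512
```

Here the attention group could serve 768 tokens, but the mamba group only has a chunk boundary at 512, so the shared usable prefix is 512.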

Results

79% cross-restart cache hit rate (llm-d disk, PYTHONHASHSEED=0)
97% same-session hit rate (vLLM APC + LMCache CPU)
50 concurrent requests stable at 1061 tok/s, 95% external hit rate
Identified and fixed garbled output bug in LMCache hybrid support (LMCache/LMCache#2879)
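The PYTHONHASHSEED=0 setting in the first result presumably matters because Python randomizes string hashes per process by default, so any cache key derived from `hash()` would change across restarts. The sketch below only demonstrates that general Python behavior; the helper name is hypothetical and nothing here is taken from the vLLM or llm-d code.

```python
import os
import subprocess
import sys


def str_hash_in_fresh_process(text, seed):
    """Return hash(text) as computed by a new interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate processes with the same fixed seed agree on the hash.
first = str_hash_in_fresh_process("prefix-block", "0")
second = str_hash_in_fresh_process("prefix-block", "0")
print(first == second)  # True
```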

Branch

malaiwah/vllm:codex/hybrid-kv-offload — 261 files changed (includes upstream merge), key changes in v1/kv_offload/, v1/core/sched/scheduler.py, distributed/kv_transfer/kv_connector/v1/.

Related: LMCache/LMCache#2879, LMCache/LMCache#2845

AI-assisted: developed with Claude. All changes reviewed and tested by a human.

extent analysis

Fix Plan

To enable external KV cache offloading for hybrid models like Qwen3.5, follow these steps:

Implement a HybridOffloadPlanner that splits groups based on a configurable hybrid_chunk_size.
Create a MultiConnector that wraps multiple KV connectors with weighted load selection and HMA support.
Update Metrics safety to clamp negative prompt_tokens_by_source values.

Example Code

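# Illustrative sketch only; the real implementation is on the branch named above.
import random

# HybridOffloadPlanner (vllm/v1/kv_offload/planner.py)
class HybridOffloadPlanner:
    def __init__(self, hybrid_chunk_size):
        if hybrid_chunk_size <= 0:
            raise ValueError("hybrid_chunk_size must be positive")
        self.hybrid_chunk_size = hybrid_chunk_size

    def split_groups(self, groups):
        # Keep only the groups whose GPU block size is a whole multiple of the chunk size.
        return [
            group for group in groups
            if group['gpu_block_size'] % self.hybrid_chunk_size == 0
        ]

# MultiConnector (multi_connector.py)
class MultiConnector:
    def __init__(self, connectors, weights=None):
        self.connectors = list(connectors)
        # Default to equal weights when none are given.
        self.weights = list(weights) if weights is not None else [1] * len(self.connectors)
        if len(self.weights) != len(self.connectors):
            raise ValueError("need exactly one weight per connector")

    def get_connector(self):
        # Weighted load selection (assumed policy): pick a connector with
        # probability proportional to its weight.
        return random.choices(self.connectors, weights=self.weights, k=1)[0]

# Metrics safety (loggers.py)
def clamp_negative_values(value):
    # Prometheus counters can only increase, so negative increments are clamped to zero.
    return max(value, 0)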

Verification

To verify the fix, test the hybrid offload stack with a model like Qwen3.5-4B-FP8 and measure the cache hit rate and concurrent request stability.
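When measuring the hit rate, the calculation itself is just cached tokens over total prompt tokens. A minimal sketch, using the token counts reported in the PR #466 test plan; the function name is hypothetical:

```python
def cache_hit_rate(cached_tokens, total_tokens):
    """Fraction of prompt tokens served from the external cache."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return cached_tokens / total_tokens


# Figures from the PR #466 test plan: 17408 of 17435 tokens loaded from NFS.
rate = cache_hit_rate(17408, 17435)
print(f"{rate:.1%}")  # 99.8%
```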

Extra Tips

Ensure the hybrid_chunk_size is configured correctly for the specific model architecture.
Monitor the cache hit rate and adjust the hybrid_chunk_size as needed to optimize performance.
Refer to the malaiwah/vllm:codex/hybrid-kv-offload branch for the complete implementation and key changes.
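As a starting point for choosing hybrid_chunk_size, the sketch below lists the chunk sizes that divide every group's block size, so that no group fails the `gpu_block_size % chunk_size == 0` condition. The helper name and the block sizes are hypothetical, and the planner may well accept chunk sizes that split only some groups.

```python
from functools import reduce
from math import gcd


def chunk_sizes_dividing_all(gpu_block_sizes):
    """Chunk sizes that evenly divide every group's GPU block size."""
    if not gpu_block_sizes:
        return []
    common = reduce(gcd, gpu_block_sizes)
    return [d for d in range(1, common + 1) if common % d == 0]


# Hypothetical block sizes for one attention group and one mamba group.
print(chunk_sizes_dividing_all([48, 1000]))  # [1, 2, 4, 8]
```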
