vllm - ✅(Solved) Fix Hybrid KV offload: MultiConnector + planner for mamba+attention models [3 pull requests, 1 participant]

GitHub stats
vllm-project/vllm#38230 (fetched 2026-04-08 01:37:12)
Comments: 0
Participants: 1
Timeline events: 5 (cross-referenced ×4, subscribed ×1)
Reactions: 0

Fix Action

Fixed

PR fix notes

PR #2863: Support hybrid KV cache models (Mamba + attention) in GPU connector V3

Description (problem / solution / changelog)

Summary

Adds support for hybrid KV cache models (Mamba/GDN + attention) in the V3 GPU connector. Models like Qwen3.5, Falcon-H1, and Jamba store multiple state tensors per recurrent layer, which crashes build_kv_layer_groups and VLLMPagedMemGPUConnectorV3. Fixes #2845. Related: vllm-project/vllm#36771, #2221.

Changes

  • Import SupportsHMA and add it to LMCacheConnectorV1 class bases
  • Implement request_finished_all_groups() which combines block IDs from all KV cache groups and delegates to the existing request_finished handler
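
The delegation described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the LMCache code: ConnectorSketch, its stubbed request_finished handler, and the argument names are hypothetical, and the per-group block IDs are assumed to arrive as a tuple of lists.

```python
from itertools import chain


class ConnectorSketch:
    """Hypothetical stand-in for a connector; not the LMCache class."""

    def request_finished(self, request_id, block_ids):
        # Existing single-group handler (stubbed): echo what it received.
        return {"request_id": request_id, "block_ids": block_ids}

    def request_finished_all_groups(self, request_id, block_ids_per_group):
        # Flatten the per-group block ID lists into one list, preserving
        # group order, then delegate to the existing handler.
        combined = list(chain.from_iterable(block_ids_per_group))
        return self.request_finished(request_id, combined)


result = ConnectorSketch().request_finished_all_groups("req-1", ([1, 2], [7], []))
print(result["block_ids"])  # [1, 2, 7]
```

Flattening keeps the existing handler unchanged, which matches the small diff size reported for the adapter file.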

Testing

  • pytest tests/v1/test_kv_layer_groups_manager.py — 10/10 pass (9 existing + 1 new)
  • ruff check + ruff format clean
  • Qwen3.5-35B-A3B-GPTQ-Int4 (GDN + attention, 30 recurrent + 10 attn layers) on 2x RTX 3090, TP=2, 256K context, LMCache V3 + vLLM + prefix caching; tested with 1-8 parallel requests
  • Falcon-H1-7B-Instruct (Mamba-2 + attention, 44 recurrent + 44 attn layers), same setup; tested with 1-8 parallel requests

Changed files

  • lmcache/integration/vllm/vllm_v1_adapter.py (modified, +22/-0)
  • lmcache/v1/gpu_connector/gpu_connectors.py (modified, +26/-8)
  • lmcache/v1/kv_layer_groups.py (modified, +30/-11)
  • tests/v1/test_kv_layer_groups_manager.py (modified, +21/-0)

PR #466: Hybrid KV cache support for mamba+attention models

Description (problem / solution / changelog)

Summary

Extends the llm-d fs-backend to support hybrid models like Qwen3.5 that interleave mamba and attention layers with different KV cache group structures.

Key changes:

  • Per-group file mappers and storage engines in worker.py — each KV cache group gets its own file layout, tensor set, and I/O engine
  • Canonical tensor normalization — attention tensors are reshaped from the backend's kernel block size to the deterministic vLLM page size, ensuring file sizes are identical across restarts and GPU hardware
  • Graceful group-disagree handling — if some groups fail to load (stale cache), the scheduler takes the minimum prefix length and warns instead of crashing
  • Separated load/store engines to avoid polling races
  • C++ tensor copier extended for partial sub-block transfers and hybrid block offset/count support
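
The idea behind canonical tensor normalization can be illustrated with plain Python lists. This is only a sketch under assumptions: the real backend works on GPU tensors, and to_canonical_pages and its arguments are hypothetical names. It shows that regrouping per-token entries into fixed-size pages gives the same layout regardless of the kernel block size they started in.

```python
def to_canonical_pages(kernel_blocks, page_size):
    """Regroup per-token entries from kernel-sized blocks into page-sized blocks."""
    # Flatten the kernel blocks back into one token-ordered sequence.
    tokens = [tok for block in kernel_blocks for tok in block]
    if len(tokens) % page_size != 0:
        raise ValueError("token count must be a multiple of page_size")
    # Cut the sequence into pages of the canonical size.
    return [tokens[i:i + page_size] for i in range(0, len(tokens), page_size)]


# Two hypothetical backends holding the same 8 tokens with different kernel block sizes.
backend_a = [[0, 1], [2, 3], [4, 5], [6, 7]]   # kernel block size 2
backend_b = [[0, 1, 2, 3, 4, 5, 6, 7]]         # kernel block size 8

print(to_canonical_pages(backend_a, 4) == to_canonical_pages(backend_b, 4))  # True
```

Because the page layout no longer depends on the kernel block size, the bytes written per page, and so the file sizes, stay the same across restarts on different hardware.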

Test plan

Validated on Qwen3.5-4B-FP8 (4 KV groups: 3 mamba + 1 attention):

  • All 4 groups store/load correctly to NFS across container restarts
  • 99% cache hit on cold restart (17408/17435 tokens from NFS)
  • Canonical format deterministic: no file size mismatches across 3 consecutive restarts
  • Graceful fallback when group prefix lengths disagree (no crash)
  • Cross-references: LMCache/LMCache#2879, vllm-project/vllm#38230

Closes #465

AI-assisted: developed with Claude. All changes reviewed and tested by a human.

🤖 Generated with Claude Code

Changed files

  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.cpp (modified, +99/-17)
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload.hpp (modified, +7/-2)
  • kv_connectors/llmd_fs_backend/csrc/storage/storage_offload_bindings.cpp (modified, +17/-4)
  • kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.cu (modified, +84/-17)
  • kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier.hpp (modified, +14/-2)
  • kv_connectors/llmd_fs_backend/csrc/storage/tensor_copier_kernels.cu (modified, +4/-0)
  • kv_connectors/llmd_fs_backend/llmd_fs_backend/manager.py (modified, +24/-6)
  • kv_connectors/llmd_fs_backend/llmd_fs_backend/spec.py (modified, +44/-26)
  • kv_connectors/llmd_fs_backend/llmd_fs_backend/worker.py (modified, +547/-243)
  • kv_connectors/llmd_fs_backend/tests/conftest.py (modified, +3/-2)
  • kv_connectors/llmd_fs_backend/tests/test_fs_backend.py (modified, +383/-4)

PR #38261: Hybrid KV offload: planner, MultiConnector, and mamba alignment for hybrid models

Description (problem / solution / changelog)

Summary

Enables external KV cache offloading for hybrid models (mamba + attention) like Qwen3.5. The stock offload path requires the least common multiple (LCM) of all group block sizes, which is impractical when mamba groups have very different block sizes from attention groups.

Core changes

HybridOffloadPlanner (v1/kv_offload/planner.py):

  • Configurable hybrid_chunk_size splits groups where gpu_block_size % chunk_size == 0
  • Per-group coverage tracking, binary search for chunk counting
  • Handles groups that can't be split (mamba with block_size=max_model_len in non-align mode)
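
One way to read the splitting rule and the binary-search chunk counting is sketched below. This is an interpretation, not the planner's code: chunk_boundaries and covered_chunks are hypothetical helpers, and the fallback to whole GPU blocks for unsplittable groups is an assumption.

```python
from bisect import bisect_right


def chunk_boundaries(num_tokens, gpu_block_size, chunk_size):
    """End offsets (in tokens) of the offload chunks for one KV cache group."""
    # Splittable groups use chunk_size; others fall back to whole GPU blocks.
    step = chunk_size if gpu_block_size % chunk_size == 0 else gpu_block_size
    return list(range(step, num_tokens + 1, step))


def covered_chunks(boundaries, prefix_len):
    """Count chunks that end at or before prefix_len, using binary search."""
    return bisect_right(boundaries, prefix_len)


splittable = chunk_boundaries(64, gpu_block_size=32, chunk_size=16)
unsplittable = chunk_boundaries(64, gpu_block_size=24, chunk_size=16)

print(splittable, covered_chunks(splittable, 40))      # [16, 32, 48, 64] 2
print(unsplittable, covered_chunks(unsplittable, 40))  # [24, 48] 1
```

Since the boundaries are sorted, the count for any prefix length costs O(log n) per group instead of a linear scan.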

MultiConnector (multi_connector.py):

  • Wraps multiple KV connectors (e.g., LMCache CPU + llm-d disk)
  • Weighted load selection: matched_tokens × load_weight scoring
  • HMA support (SupportsHMA mixin) for hybrid memory allocator compatibility
  • Preemption handling compatible with stock vLLM's set[str] signature
  • Per-connector Prometheus metrics

Scheduler (scheduler.py):

  • Skip mamba block alignment during async KV load
  • Handle disagreeing KV group prefix lengths gracefully (warn + use minimum)
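
The "warn + use minimum" behavior might look something like the sketch below. The function name, the logger name, and the dictionary of per-group prefix lengths are assumptions for illustration; the actual scheduler code is not shown on this page.

```python
import logging

logger = logging.getLogger("hybrid_offload_sketch")


def agreed_prefix_len(group_prefix_lens):
    """Return the longest prefix (in tokens) that every KV cache group can serve."""
    if not group_prefix_lens:
        return 0
    shortest = min(group_prefix_lens.values())
    if shortest != max(group_prefix_lens.values()):
        # Groups disagree (for example, one group's cache is stale):
        # warn and fall back to the shortest prefix instead of raising.
        logger.warning(
            "KV groups disagree on prefix length %s; using %d",
            group_prefix_lens,
            shortest,
        )
    return shortest


print(agreed_prefix_len({"mamba_0": 1024, "mamba_1": 1024, "attn": 768}))  # 768
```

Taking the minimum is the conservative choice: tokens beyond it are recomputed rather than served from a group that may not have them.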

Metrics (loggers.py):

  • Clamp negative prompt_tokens_by_source values that crash Prometheus counters under concurrent external cache hits

Test plan

Validated on Qwen3.5-4B-FP8 (24 mamba + 8 attention layers, RTX 4080 Super):

  • 99% cross-restart cache hit (llm-d NFS disk, PYTHONHASHSEED=0)
  • 97% same-session hit (APC + LMCache CPU)
  • 50 concurrent requests stable at 1061 tok/s
  • Preemption under memory pressure handled correctly
  • Metrics counter crash fixed under high concurrency
  • Cross-references: LMCache/LMCache#2879, llm-d/llm-d-kv-cache#466

Closes #38230

AI-assisted: developed with Claude. All changes reviewed and tested by a human.

🤖 Generated with Claude Code

Changed files

  • .buildkite/ci_config_intel.yaml (added, +23/-0)
  • .buildkite/hardware_tests/cpu.yaml (modified, +1/-7)
  • .buildkite/image_build/image_build_xpu.sh (added, +34/-0)
  • .buildkite/intel_jobs/test-intel.yaml (added, +64/-0)
  • .buildkite/release-pipeline.yaml (modified, +195/-219)
  • .buildkite/scripts/annotate-release.sh (modified, +4/-2)
  • .buildkite/scripts/annotate-rocm-release.sh (modified, +6/-5)
  • .buildkite/scripts/cache-rocm-base-wheels.sh (modified, +7/-16)
  • .buildkite/scripts/cleanup-nightly-builds.sh (modified, +10/-7)
  • .buildkite/scripts/hardware_ci/run-amd-test.sh (modified, +3/-1)
  • .buildkite/scripts/hardware_ci/run-cpu-test-arm.sh (modified, +7/-2)
  • .buildkite/scripts/hardware_ci/run-intel-test.sh (added, +276/-0)
  • .buildkite/scripts/push-nightly-builds-rocm.sh (added, +62/-0)
  • .buildkite/test-amd.yaml (modified, +40/-0)
  • .buildkite/test_areas/misc.yaml (modified, +15/-0)
  • .buildkite/test_areas/model_runner_v2.yaml (modified, +2/-0)
  • .buildkite/test_areas/pytorch.yaml (modified, +11/-1)
  • .github/workflows/new_pr_bot.yml (modified, +10/-4)
  • .pre-commit-config.yaml (modified, +36/-1)
  • AGENTS.md (modified, +27/-13)
  • CMakeLists.txt (modified, +7/-6)
  • benchmarks/benchmark_long_document_qa_throughput.py (modified, +1/-2)
  • benchmarks/benchmark_prefix_caching.py (modified, +1/-2)
  • benchmarks/benchmark_prioritization.py (modified, +1/-2)
  • benchmarks/kernels/benchmark_fused_collective.py (modified, +19/-7)
  • cmake/utils.cmake (modified, +4/-2)
  • csrc/cache_kernels.cu (modified, +2/-1)
  • csrc/cpu/torch_bindings.cpp (modified, +12/-0)
  • csrc/cpu/utils.cpp (modified, +35/-0)
  • csrc/layernorm_kernels.cu (modified, +1/-1)
  • csrc/layernorm_quant_kernels.cu (modified, +1/-1)
  • csrc/libtorch_stable/dispatch_utils.h (added, +60/-0)
  • csrc/libtorch_stable/ops.h (modified, +21/-0)
  • csrc/libtorch_stable/quantization/vectorization.cuh (renamed, +2/-2)
  • csrc/libtorch_stable/quantization/vectorization_utils.cuh (renamed, +0/-0)
  • csrc/libtorch_stable/quantization/w8a8/fp8/per_token_group_quant.cu (renamed, +50/-46)
  • csrc/libtorch_stable/quantization/w8a8/int8/per_token_group_quant.cu (added, +12/-0)
  • csrc/libtorch_stable/quantization/w8a8/per_token_group_quant_8bit.h (added, +10/-0)
  • csrc/libtorch_stable/torch_bindings.cpp (modified, +35/-4)
  • csrc/libtorch_stable/torch_utils.h (modified, +3/-1)
  • csrc/ops.h (modified, +0/-19)
  • csrc/quantization/fused_kernels/layernorm_utils.cuh (modified, +1/-1)
  • csrc/quantization/fused_kernels/quant_conversions.cuh (modified, +1/-1)
  • csrc/quantization/w8a8/cutlass/c3x/scaled_mm_blockwise_sm120_fp8_dispatch.cuh (modified, +36/-5)
  • csrc/quantization/w8a8/fp8/common.cu (modified, +1/-1)
  • csrc/quantization/w8a8/fp8/common.cuh (modified, +1/-1)
  • csrc/quantization/w8a8/int8/per_token_group_quant.cu (removed, +0/-12)
  • csrc/quantization/w8a8/int8/scaled_quant.cu (modified, +1/-1)
  • csrc/quantization/w8a8/per_token_group_quant_8bit.h (removed, +0/-9)
  • csrc/torch_bindings.cpp (modified, +0/-28)
  • docker/Dockerfile (modified, +9/-6)
  • docker/Dockerfile.cpu (modified, +1/-1)
  • docker/Dockerfile.rocm (modified, +5/-0)
  • docker/docker-bake.hcl (modified, +30/-0)
  • docker/versions.json (modified, +3/-0)
  • docs/api/README.md (modified, +3/-12)
  • docs/contributing/editing-agent-instructions.md (added, +74/-0)
  • docs/contributing/model/transcription.md (modified, +2/-2)
  • docs/design/attention_backends.md (modified, +1/-1)
  • docs/design/cuda_graphs.md (modified, +1/-0)
  • docs/design/cuda_graphs_multimodal.md (added, +169/-0)
  • docs/design/custom_op.md (modified, +2/-2)
  • docs/design/debug_vllm_compile.md (modified, +20/-0)
  • docs/design/fusions.md (modified, +1/-1)
  • docs/design/optimization_levels.md (modified, +3/-1)
  • docs/features/multimodal_inputs.md (modified, +1/-1)
  • docs/features/reasoning_outputs.md (modified, +75/-0)
  • docs/models/pooling_models/README.md (modified, +35/-10)
  • docs/models/pooling_models/classify.md (modified, +1/-1)
  • docs/models/pooling_models/embed.md (modified, +1/-1)
  • docs/models/pooling_models/token_classify.md (modified, +6/-0)
  • docs/models/pooling_models/token_embed.md (modified, +6/-0)
  • docs/models/supported_models.md (modified, +2/-1)
  • examples/offline_inference/audio_language.py (modified, +2/-4)
  • examples/offline_inference/encoder_decoder_multimodal.py (modified, +5/-7)
  • examples/offline_inference/load_sharded_state.py (modified, +1/-3)
  • examples/offline_inference/save_sharded_state.py (modified, +1/-2)
  • examples/offline_inference/vision_language.py (modified, +6/-7)
  • examples/offline_inference/vision_language_multi_image.py (modified, +8/-7)
  • examples/online_serving/batched_chat_completions.py (added, +194/-0)
  • examples/pooling/embed/vision_embedding_offline.py (modified, +19/-25)
  • examples/pooling/score/vision_reranker_offline.py (modified, +1/-2)
  • examples/pooling/token_embed/jina_embeddings_v4_offline.py (modified, +1/-1)
  • requirements/cuda.txt (modified, +1/-0)
  • requirements/rocm-test.in (added, +83/-0)
  • requirements/rocm-test.txt (modified, +1354/-97)
  • requirements/test.in (modified, +1/-1)
  • setup.py (modified, +65/-0)
  • tests/compile/fullgraph/test_basic_correctness.py (modified, +2/-2)
  • tests/compile/h100/__init__.py (renamed, +0/-0)
  • tests/compile/h100/test_startup.py (added, +249/-0)
  • tests/compile/passes/test_fusion_attn.py (modified, +7/-1)
  • tests/compile/test_aot_compile.py (modified, +31/-0)
  • tests/compile/test_config.py (modified, +62/-2)
  • tests/compile/test_startup.py (removed, +0/-87)
  • tests/entrypoints/offline_mode/test_offline_mode.py (modified, +9/-6)
  • tests/entrypoints/openai/chat_completion/test_batched_chat_completions.py (added, +113/-0)
  • tests/entrypoints/openai/chat_completion/test_chat_error.py (modified, +4/-6)
  • tests/entrypoints/openai/chat_completion/test_enable_force_include_usage.py (modified, +12/-14)
  • tests/entrypoints/openai/chat_completion/test_serving_chat.py (modified, +11/-13)

Problem

Hybrid models like Qwen3.5 (mamba + attention layers) can't use external KV cache offloading out of the box. The stock offload path requires the least common multiple (LCM) of all group block sizes, which is impractical for hybrid models where mamba groups have very different block sizes than attention groups.
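
A rough calculation shows why the LCM requirement hurts. The block sizes here are assumptions for illustration only: a 16-token attention block is hypothetical, and the mamba block is taken to equal a max_model_len of 98304.

```python
import math


def offload_unit(block_sizes):
    """Smallest token count that is a whole number of blocks in every group."""
    return math.lcm(*block_sizes)


attention_block = 16   # assumed attention block size, in tokens
mamba_block = 98304    # assumed: mamba block equal to max_model_len

unit = offload_unit([attention_block, mamba_block])
print(unit)                     # 98304
print(unit // attention_block)  # 6144
```

Under these assumptions the smallest offloadable unit is the entire context, so nothing shorter than a full-length prompt could be stored or reused.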

What we built

A working hybrid offload stack for Qwen3.5-4B-FP8 (24 mamba + 8 attention layers), validated on RTX 4080 Super with max_model_len=98304:

  1. HybridOffloadPlanner (vllm/v1/kv_offload/planner.py): configurable hybrid_chunk_size splits groups where gpu_block_size % chunk_size == 0, with per-group coverage tracking and binary search for efficient chunk counting.

  2. MultiConnector (multi_connector.py): wraps multiple KV connectors (e.g., LMCache CPU + llm-d disk) with weighted load selection, HMA support (SupportsHMA), and preemption compatibility with stock vLLM's set[str] signature.

  3. Metrics safety: clamp negative prompt_tokens_by_source values in loggers.py that crash Prometheus counters under high concurrency with external cache hits.

Results

  • 99% cross-restart cache hit rate (llm-d disk, PYTHONHASHSEED=0)
  • 97% same-session hit rate (vLLM APC + LMCache CPU)
  • 50 concurrent requests stable at 1061 tok/s, 95% external hit rate
  • Identified and fixed garbled output bug in LMCache hybrid support (LMCache/LMCache#2879)

Branch

malaiwah/vllm:codex/hybrid-kv-offload — 261 files changed (includes upstream merge), key changes in v1/kv_offload/, v1/core/sched/scheduler.py, distributed/kv_transfer/kv_connector/v1/.

Related: LMCache/LMCache#2879, LMCache/LMCache#2845

AI-assisted: developed with Claude. All changes reviewed and tested by a human.

extent analysis

Fix Plan

To enable external KV cache offloading for hybrid models like Qwen3.5, follow these steps:

  • Implement a HybridOffloadPlanner that splits groups whose gpu_block_size is a multiple of a configurable hybrid_chunk_size.
  • Create a MultiConnector that wraps multiple KV connectors with weighted load selection and HMA support.
  • Clamp negative prompt_tokens_by_source values in loggers.py so they cannot crash Prometheus counters.

Example Code

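# Illustrative sketches only; the real classes on the branch do considerably more.

# HybridOffloadPlanner (vllm/v1/kv_offload/planner.py)
class HybridOffloadPlanner:
    def __init__(self, hybrid_chunk_size):
        if hybrid_chunk_size <= 0:
            raise ValueError("hybrid_chunk_size must be positive")
        self.hybrid_chunk_size = hybrid_chunk_size

    def split_groups(self, groups):
        # Keep only the groups whose GPU block size is a whole multiple of
        # the chunk size; only those can be split into chunks.
        return [
            group for group in groups
            if group['gpu_block_size'] % self.hybrid_chunk_size == 0
        ]

# MultiConnector (multi_connector.py)
class MultiConnector:
    def __init__(self, connectors, load_weights):
        # connectors and load_weights are parallel lists (assumed layout).
        self.connectors = connectors
        self.load_weights = load_weights

    def get_connector(self, matched_tokens):
        # Weighted load selection: score = matched_tokens * load_weight,
        # where matched_tokens holds one count per connector.
        scores = [m * w for m, w in zip(matched_tokens, self.load_weights)]
        return self.connectors[scores.index(max(scores))]

# Metrics safety (loggers.py)
def clamp_negative_values(value):
    # Prometheus counters reject negative increments, so floor at zero.
    return max(value, 0)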

Verification

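To verify the fix, run the hybrid offload stack with a model like Qwen3.5-4B-FP8 and compare the cache hit rate and concurrent request stability against the reported results (for example, a 97% same-session hit rate and 50 concurrent requests stable at 1061 tok/s).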

Extra Tips

  • Choose a hybrid_chunk_size that evenly divides the gpu_block_size of the groups you want split; groups where gpu_block_size % hybrid_chunk_size != 0 are not split.
  • Monitor the cache hit rate after changing hybrid_chunk_size, and adjust it if the hit rate drops.
  • Refer to the malaiwah/vllm:codex/hybrid-kv-offload branch for the complete implementation and key changes.
