vllm - ✅(Solved) Fix [Bug]: --kv-transfer-config unconditionally disables HMA, ignoring SupportsHMA on the connector [1 pull requests, 1 comments, 2 participants]

vllm • 2026-05-06 15:03:53


GitHub stats

vllm-project/vllm#41830 • Fetched 2026-05-07 03:32:37

Author: v1b3coder

Participants: chfeng-cs, v1b3coder

Timeline (top): commented ×1, cross-referenced ×1, mentioned ×1, subscribed ×1

Error Message

WARNING [config.py:1297] Turning off hybrid kv cache manager because --kv-transfer-config is set...
WARNING [kv_cache_utils.py:1334] Hybrid KV cache manager is disabled for this hybrid model...
ValueError: To serve at least one request with the models's max seq len (128000), (38.87 GiB KV cache is needed, which is larger than the available KV cache memory (8.44 GiB).

Root Cause

vllm/config/vllm.py:1292-1306:

if self.scheduler_config.disable_hybrid_kv_cache_manager is None:
    if self.kv_transfer_config is not None:
        # NOTE(Kuntai): turn HMA off for connector unless specifically enabled.
        need_disable_hybrid_kv_cache_manager = True
        logger.warning(...)

This disables HMA for every connector, even ones that subclass SupportsHMA. The factory at vllm/distributed/kv_transfer/kv_connector/factory.py:56-62 already does the correct check at connector-creation time:

hma_enabled = not config.scheduler_config.disable_hybrid_kv_cache_manager
if hma_enabled and not supports_hma(connector_cls):
    raise ValueError(...)

So the auto-disable in config/vllm.py is making a more pessimistic decision than the factory's source-of-truth check would.
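The gap between the two checks can be illustrated with a minimal, self-contained sketch. Everything below is a simplified stand-in written for illustration, not vLLM's actual code: `SupportsHMA` is modelled as a plain marker class, `HmaConnector` and `PlainConnector` are hypothetical connectors, and the two `auto_disable_*` functions are hypothetical names for the current and the gated decision.

```python
from typing import Optional


class SupportsHMA:
    """Stand-in marker class (assumption: support is signalled by subclassing)."""


class HmaConnector(SupportsHMA):
    """Hypothetical connector that declares HMA support."""


class PlainConnector:
    """Hypothetical connector without HMA support."""


def supports_hma(connector_cls: type) -> bool:
    # Simplified stand-in for the factory's check.
    return issubclass(connector_cls, SupportsHMA)


def auto_disable_unconditional(user_flag: Optional[bool],
                               connector_cls: Optional[type]) -> bool:
    """Behaviour described in the issue: any configured connector turns HMA off."""
    if user_flag is not None:
        return user_flag  # an explicit user setting skips the auto-decision
    return connector_cls is not None


def auto_disable_gated(user_flag: Optional[bool],
                       connector_cls: Optional[type]) -> bool:
    """Gated variant: only connectors without HMA support turn HMA off."""
    if user_flag is not None:
        return user_flag
    return connector_cls is not None and not supports_hma(connector_cls)


# The two policies disagree only for connectors that support HMA.
print(auto_disable_unconditional(None, HmaConnector))
print(auto_disable_gated(None, HmaConnector))
```

In this sketch the two policies agree whenever the user sets the flag explicitly or the connector lacks HMA support; they differ only in the case the issue reports.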

Fix Action

Fix / Workaround

Current workaround

Pass --no-disable-hybrid-kv-cache-manager explicitly. The factory check then accepts it because the connector does support HMA.

Related

#29805: the PR that introduced the current unconditional auto-disable + the manual override flag (cc @NickLucche).
#39269: open PR addressing the analogous bug in the kv_events_config branch (lines 1226-1228), with a different approach: respect explicit --no-disable-hybrid-kv-cache-manager rather than auto-detect. The auto-detect approach proposed here would also generalize to that branch.
#39702: independent runtime crash (TOCTOU race) in SimpleCPUOffloadConnector. Only reachable on hybrid models after the workaround above is applied. See my comment there for cross-context.

Disclosure

Investigation done with AI assistance reading the connector / config source. The proposed fix has not been built and tested locally (current production has the workaround in place); upstream maintainers should validate before merging.

PR fix notes

PR #41847: [Bugfix][KV Transfer] Respect HMA support when auto-disabling

Repository: vllm-project/vllm
Author: chfeng-cs
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41847

Description (problem / solution / changelog)

Purpose

Fix #41830: --kv-transfer-config unconditionally disables the Hybrid KV cache Manager (HMA), even when the connector implements SupportsHMA.

For hybrid-attention models (DeepSeek-V4-Flash, Mixtral, Mamba hybrids), HMA is required to allocate per-group KV pools sized to each layer type's actual needs. With HMA unconditionally disabled, unify_hybrid_kv_cache_specs collapses all sliding-window/chunked-local specs into FullAttentionSpec, ballooning KV memory and causing startup OOM.

Root cause: vllm/config/vllm.py:1292-1306 auto-disables HMA whenever kv_transfer_config is not None, without checking whether the connector supports HMA. The factory at factory.py:56-62 already has the correct supports_hma() check, but it never gets a chance to run because config has already disabled HMA.

Fix: Add KVConnectorFactory.supports_hma_config() that resolves effective HMA support from the transfer config:

Plain connectors: checks SupportsHMA on the class.
MultiConnector: recurses into all configured children and returns True only if every child supports HMA. Empty/missing children → False.

Both the config auto-disable path and the factory runtime check now use this single helper, keeping them consistent. No change for users without --kv-transfer-config. Non-HMA connectors retain the existing auto-disable + warning.

Related:

#29805 introduced the unconditional auto-disable
#39269 addresses the analogous issue in the kv_events_config path (different approach); this PR only changes the kv_transfer_config path
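The resolution rule described for supports_hma_config() can be sketched as a small recursive function over plain dictionaries. This is an illustrative sketch, not the PR's implementation: the `HMA_CAPABLE` lookup table is a hypothetical replacement for inspecting the connector class, the assumption that MultiConnector children live under `kv_connector_extra_config["connectors"]` is mine, and treating unknown connector names as non-HMA is a conservative choice made for this example.

```python
from typing import Any, Mapping

# Hypothetical table: connector name -> whether its class subclasses SupportsHMA.
# The two entries follow the connectors named in the test plan.
HMA_CAPABLE = {
    "SimpleCPUOffloadConnector": True,
    "ExampleConnector": False,
}


def supports_hma_config(cfg: Mapping[str, Any]) -> bool:
    """Resolve effective HMA support from a transfer-config-like mapping."""
    name = cfg.get("kv_connector")
    if name == "MultiConnector":
        extra = cfg.get("kv_connector_extra_config") or {}
        children = extra.get("connectors") or []
        # all() over an empty list is True, so the empty case needs its own guard.
        return bool(children) and all(supports_hma_config(c) for c in children)
    # Assumption for this sketch: unknown connectors count as non-HMA.
    return HMA_CAPABLE.get(name, False)


def multi(*names: str) -> dict:
    """Build a MultiConnector-style config from child connector names."""
    return {
        "kv_connector": "MultiConnector",
        "kv_connector_extra_config": {
            "connectors": [{"kv_connector": n} for n in names]
        },
    }


print(supports_hma_config(multi("SimpleCPUOffloadConnector", "ExampleConnector")))
```

The explicit guard for empty children matters because a bare `all()` would report an empty MultiConnector as HMA-capable.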

Test Plan

pytest tests/v1/kv_connector/unit/test_hma_auto_config.py -v

5 unit tests covering:

HMA-capable connector (SimpleCPUOffloadConnector) → not auto-disabled
Non-HMA connector (ExampleConnector) → auto-disabled (with explicit class-level assertion)
Explicit --no-disable-hybrid-kv-cache-manager + non-HMA connector → factory raises ValueError
MultiConnector with all HMA children → not auto-disabled
MultiConnector with mixed HMA/non-HMA children → auto-disabled

Hardware validation on a hybrid model with SimpleCPUOffloadConnector is pending — will update before marking ready for review.

Test Result

$ pytest tests/v1/kv_connector/unit/test_hma_auto_config.py -q
.....                                                                    [100%]
5 passed in 2.51s

Changed files

tests/v1/kv_connector/unit/test_hma_auto_config.py (added, +127/-0)
vllm/config/vllm.py (modified, +19/-11)
vllm/distributed/kv_transfer/kv_connector/factory.py (modified, +33/-1)



Bug

vllm/config/vllm.py:1292-1306 unconditionally disables the Hybrid KV cache Manager (HMA) whenever --kv-transfer-config is set, regardless of whether the connector subclasses SupportsHMA.

For hybrid models (sliding-window, chunked-local, Mamba), HMA is what allocates per-group KV pools sized to each layer type's actual needs. With HMA disabled, unify_hybrid_kv_cache_specs (vllm/v1/core/kv_cache_utils.py:1334) collapses all sliding-window / chunked-local specs into FullAttentionSpec, ballooning per-request KV memory and OOMing the engine at startup.

Reproduction

DeepSeek-V4-Flash, 2× RTX PRO 6000 Blackwell, TP=2:

- --kv-transfer-config={"kv_connector":"SimpleCPUOffloadConnector","kv_role":"kv_both","kv_connector_extra_config":{"cpu_bytes_to_use":137438953472}}
- --enable-prefix-caching --enable-chunked-prefill
- --max-model-len=128000 --tensor-parallel-size=2
- --kv-cache-dtype=fp8 --block-size=256

Result:

WARNING [config.py:1297] Turning off hybrid kv cache manager because `--kv-transfer-config` is set...
WARNING [kv_cache_utils.py:1334] Hybrid KV cache manager is disabled for this hybrid model...
ValueError: To serve at least one request with the models's max seq len (128000),
  (38.87 GiB KV cache is needed, which is larger than the available KV cache memory (8.44 GiB).

With HMA off, the KV cache needed for a single max-length request is 38.87 GiB against 8.44 GiB available, because sliding-window groups, which would normally consume small per-token state, are now budgeted as full-attention. Per-GPU KV requirement balloons by ~30 GiB. SimpleCPUOffloadConnector itself never gets a chance to load: startup fails first.
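A back-of-the-envelope sketch shows why budgeting sliding-window layers as full attention inflates the per-request requirement. The layer layout, window size, and bytes-per-token figure below are hypothetical values chosen for illustration; they are not DeepSeek-V4-Flash's real configuration, and the sketch ignores block granularity and any extra blocks a real allocator reserves.

```python
from typing import List, Optional


def resident_tokens(max_model_len: int, sliding_window: Optional[int]) -> int:
    """Tokens whose KV stays resident in one layer for a max-length request."""
    if sliding_window is None:  # full attention keeps every token
        return max_model_len
    return min(max_model_len, sliding_window)


def request_kv_bytes(windows: List[Optional[int]], max_model_len: int,
                     bytes_per_token_per_layer: int, hma_enabled: bool) -> int:
    """KV bytes one max-length request needs across all layers."""
    total = 0
    for window in windows:
        # With HMA off, every layer is budgeted as if it were full attention.
        effective = window if hma_enabled else None
        total += resident_tokens(max_model_len, effective) * bytes_per_token_per_layer
    return total


# Hypothetical layout: 8 full-attention layers and 40 sliding-window layers.
WINDOWS: List[Optional[int]] = [None] * 8 + [4096] * 40
MAX_LEN = 128_000
BYTES_PER_TOKEN_PER_LAYER = 1024  # hypothetical figure

with_hma = request_kv_bytes(WINDOWS, MAX_LEN, BYTES_PER_TOKEN_PER_LAYER, hma_enabled=True)
without_hma = request_kv_bytes(WINDOWS, MAX_LEN, BYTES_PER_TOKEN_PER_LAYER, hma_enabled=False)
print(f"HMA on:  {with_hma / 2**30:.2f} GiB")
print(f"HMA off: {without_hma / 2**30:.2f} GiB")
```

The more layers use a window much shorter than the maximum sequence length, the larger the gap between the two budgets becomes.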

Root cause

vllm/config/vllm.py:1292-1306:

if self.scheduler_config.disable_hybrid_kv_cache_manager is None:
    if self.kv_transfer_config is not None:
        # NOTE(Kuntai): turn HMA off for connector unless specifically enabled.
        need_disable_hybrid_kv_cache_manager = True
        logger.warning(...)

This disables HMA for every connector, even ones that subclass SupportsHMA. The factory at vllm/distributed/kv_transfer/kv_connector/factory.py:56-62 already does the correct check at connector-creation time:

hma_enabled = not config.scheduler_config.disable_hybrid_kv_cache_manager
if hma_enabled and not supports_hma(connector_cls):
    raise ValueError(...)

So the auto-disable in config/vllm.py is making a more pessimistic decision than the factory's source-of-truth check would.

Affected connectors

Several upstream connectors subclass SupportsHMA and are unnecessarily penalized:

SimpleCPUOffloadConnector
OffloadingConnector
NixlConnector
MultiConnector (when all children support HMA — see #39571)

Affected models

Any hybrid-attention model:

DeepSeek-V4-Flash (full + sliding-window + chunked-local)
Mixtral with sliding-window
Mamba / SSM hybrids (e.g. Nemotron-H — see #39269)
Llama-3.1+ with chunked-local attention

For these models, the auto-disable causes startup OOM that a typical user will misdiagnose as "the connector is incompatible with my model" or "I need more VRAM."

Proposed fix

Gate the auto-disable on supports_hma(connector_cls), mirroring the factory check:

if self.kv_transfer_config is not None:
    from vllm.distributed.kv_transfer.kv_connector.factory import KVConnectorFactory
    from vllm.distributed.kv_transfer.kv_connector.v1 import supports_hma
    connector_cls, _ = KVConnectorFactory._get_connector_class_with_compat(
        self.kv_transfer_config
    )
    if not supports_hma(connector_cls):
        need_disable_hybrid_kv_cache_manager = True
        logger.warning(...)

Connectors that support HMA (the majority of upstream ones) keep HMA enabled. Connectors that do not support it keep the existing warning + auto-disable behavior, so there is no user-visible change for non-HMA connectors.

A cleaner version would expose a public class-resolution helper on KVConnectorFactory so this doesn't depend on the _-prefixed _get_connector_class_with_compat.
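One possible shape for such a helper is a thin public wrapper around the private resolver. The sketch below uses a hypothetical stand-in factory (`ConnectorFactorySketch`), a hypothetical public method name (`get_connector_class`), and an assumed return shape for `_get_connector_class_with_compat`; none of these are confirmed against vLLM's code.

```python
from typing import Dict, Mapping, Tuple


class ConnectorFactorySketch:
    """Hypothetical stand-in for KVConnectorFactory, for illustration only."""

    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str, connector_cls: type) -> None:
        cls._registry[name] = connector_cls

    @classmethod
    def _get_connector_class_with_compat(
        cls, config: Mapping[str, str]
    ) -> Tuple[type, None]:
        # Assumed shape: (connector class, compatibility info); the second
        # element is unused in this sketch.
        name = config["kv_connector"]
        if name not in cls._registry:
            raise ValueError(f"Unsupported connector type: {name}")
        return cls._registry[name], None

    @classmethod
    def get_connector_class(cls, config: Mapping[str, str]) -> type:
        """Hypothetical public wrapper so callers avoid the private method."""
        connector_cls, _ = cls._get_connector_class_with_compat(config)
        return connector_cls


class DummyConnector:
    """Placeholder connector class used only by this sketch."""


ConnectorFactorySketch.register("DummyConnector", DummyConnector)
resolved = ConnectorFactorySketch.get_connector_class({"kv_connector": "DummyConnector"})
print(resolved.__name__)
```

With a wrapper like this, the config code would depend on a stable public name while the private resolver stays free to change.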

Current workaround

Pass --no-disable-hybrid-kv-cache-manager explicitly. The factory check at factory.py:58 then accepts it because the connector does support HMA. This is undocumented and unintuitive — a user has to discover both that HMA was auto-disabled silently, and that a specific opt-in flag fixes it.
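For reference, the reproduction flags plus the override can be assembled programmatically, which avoids hand-escaping the JSON argument. This is a sketch under assumptions: `MODEL_PLACEHOLDER` is a hypothetical placeholder for the model path, the `vllm serve` entry point is assumed to be the launch command in use, and `build_serve_command` is a helper name invented for this example. The flags themselves are the ones quoted in the reproduction.

```python
import json
import shlex
from typing import List

# Same connector settings as the reproduction; 128 GiB expressed in bytes.
KV_TRANSFER_CONFIG = {
    "kv_connector": "SimpleCPUOffloadConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"cpu_bytes_to_use": 128 * 1024**3},
}


def build_serve_command(model: str) -> List[str]:
    """Return the argument list with the explicit HMA opt-in added."""
    return [
        "vllm", "serve", model,
        "--kv-transfer-config=" + json.dumps(KV_TRANSFER_CONFIG, separators=(",", ":")),
        "--no-disable-hybrid-kv-cache-manager",  # the workaround flag
        "--enable-prefix-caching",
        "--enable-chunked-prefill",
        "--max-model-len=128000",
        "--tensor-parallel-size=2",
        "--kv-cache-dtype=fp8",
        "--block-size=256",
    ]


command = build_serve_command("MODEL_PLACEHOLDER")
# shlex.join quotes the JSON argument so it can be pasted into a shell.
print(shlex.join(command))
```

Passing the list directly to `subprocess.run` would sidestep shell quoting entirely; the `shlex.join` output is only for copy-pasting.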

Related

#29805: the PR that introduced the current unconditional auto-disable + the manual override flag (cc @NickLucche).
#39269: open PR addressing the analogous bug in the kv_events_config branch (lines 1226-1228), with a different approach: respect explicit --no-disable-hybrid-kv-cache-manager rather than auto-detect. The auto-detect approach proposed here would also generalize to that branch.
#39702: independent runtime crash (TOCTOU race) in SimpleCPUOffloadConnector. Only reachable on hybrid models after the workaround above is applied. See my comment there for cross-context.

Environment

vLLM jasl/vllm ds4-sm120-full. The affected file (vllm/config/vllm.py) is byte-identical to upstream/main (verified via git diff); this is reproducible against vanilla vLLM.
DeepSeek-V4-Flash, 2× RTX PRO 6000 Blackwell, TP=2.

Disclosure

Investigation done with AI assistance reading the connector / config source. The proposed fix has not been built and tested locally (current production has the workaround in place); upstream maintainers should validate before merging.
