vllm - ✅(Solved) Fix [Bug]: --kv-transfer-config unconditionally disables HMA, ignoring SupportsHMA on the connector [1 pull requests, 1 comments, 2 participants]

vllm-project/vllm#41830 (fetched 2026-05-07 03:32:37)

Error Message

WARNING [config.py:1297] Turning off hybrid kv cache manager because --kv-transfer-config is set...
WARNING [kv_cache_utils.py:1334] Hybrid KV cache manager is disabled for this hybrid model...
ValueError: To serve at least one request with the models's max seq len (128000), (38.87 GiB KV cache is needed, which is larger than the available KV cache memory (8.44 GiB).

Root Cause

vllm/config/vllm.py:1292-1306:

if self.scheduler_config.disable_hybrid_kv_cache_manager is None:
    if self.kv_transfer_config is not None:
        # NOTE(Kuntai): turn HMA off for connector unless specifically enabled.
        need_disable_hybrid_kv_cache_manager = True
        logger.warning(...)

This disables HMA for every connector, even ones that subclass SupportsHMA. The factory at vllm/distributed/kv_transfer/kv_connector/factory.py:56-62 already does the correct check at connector-creation time:

hma_enabled = not config.scheduler_config.disable_hybrid_kv_cache_manager
if hma_enabled and not supports_hma(connector_cls):
    raise ValueError(...)

So the auto-disable in config/vllm.py is making a more pessimistic decision than the factory's source-of-truth check would.

Fix Action

Current workaround

Pass --no-disable-hybrid-kv-cache-manager explicitly. The factory check at factory.py:58 then accepts it because the connector does support HMA.

Related

  • #29805: the PR that introduced the current unconditional auto-disable + the manual override flag (cc @NickLucche).
  • #39269: open PR addressing the analogous bug in the kv_events_config branch (lines 1226-1228), with a different approach: respect explicit --no-disable-hybrid-kv-cache-manager rather than auto-detect. The auto-detect approach proposed here would also generalize to that branch.
  • #39702: independent runtime crash (TOCTOU race) in SimpleCPUOffloadConnector. Only reachable on hybrid models after the workaround above is applied. See my comment there for cross-context.

Investigation done with AI assistance reading the connector / config source. The proposed fix has not been built and tested locally (current production has the workaround in place); upstream maintainers should validate before merging.

PR fix notes

PR #41847: [Bugfix][KV Transfer] Respect HMA support when auto-disabling

Description (problem / solution / changelog)

Purpose

Fix #41830: --kv-transfer-config unconditionally disables the Hybrid KV cache Manager (HMA), even when the connector implements SupportsHMA.

For hybrid-attention models (DeepSeek-V4-Flash, Mixtral, Mamba hybrids), HMA is required to allocate per-group KV pools sized to each layer type's actual needs. With HMA unconditionally disabled, unify_hybrid_kv_cache_specs collapses all sliding-window/chunked-local specs into FullAttentionSpec, ballooning KV memory and causing startup OOM.

Root cause: vllm/config/vllm.py:1292-1306 auto-disables HMA whenever kv_transfer_config is not None, without checking whether the connector supports HMA. The factory at factory.py:56-62 already has the correct supports_hma() check, but it never gets a chance to run because config has already disabled HMA.

Fix: Add KVConnectorFactory.supports_hma_config() that resolves effective HMA support from the transfer config:

  • Plain connectors: checks SupportsHMA on the class.
  • MultiConnector: recurses into all configured children and returns True only if every child supports HMA. Empty/missing children → False.

Both the config auto-disable path and the factory runtime check now use this single helper, keeping them consistent. No change for users without --kv-transfer-config. Non-HMA connectors retain the existing auto-disable + warning.

Related:

  • #29805 introduced the unconditional auto-disable
  • #39269 addresses the analogous issue in the kv_events_config path (different approach); this PR only changes the kv_transfer_config path
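The resolution rule described above (class check for plain connectors, all-children recursion for MultiConnector, empty children treated as unsupported) can be sketched in isolation. This is a toy model, not the PR's code: the SupportsHMA stand-in, the REGISTRY dict, the two connector classes, and the plain-dict config shape (including the "connectors" key under kv_connector_extra_config) are assumptions made for illustration.

```python
from typing import Any


class SupportsHMA:
    """Toy stand-in for the SupportsHMA marker; not the real vLLM class."""


class HMAConnector(SupportsHMA):
    """Hypothetical connector that supports HMA."""


class LegacyConnector:
    """Hypothetical connector that does not support HMA."""


# Hypothetical name -> class registry used only by this sketch.
REGISTRY = {"HMAConnector": HMAConnector, "LegacyConnector": LegacyConnector}


def supports_hma_config(cfg: dict[str, Any]) -> bool:
    """Resolve effective HMA support from a dict-shaped transfer config."""
    name = cfg.get("kv_connector")
    if name == "MultiConnector":
        # Assumed layout: children live under extra_config["connectors"].
        extra = cfg.get("kv_connector_extra_config") or {}
        children = extra.get("connectors") or []
        # all([]) is True, so an empty child list must be rejected explicitly.
        return bool(children) and all(supports_hma_config(c) for c in children)
    cls = REGISTRY.get(name)
    return cls is not None and issubclass(cls, SupportsHMA)


def multi(*names: str) -> dict[str, Any]:
    """Build a MultiConnector-style config from child connector names."""
    return {
        "kv_connector": "MultiConnector",
        "kv_connector_extra_config": {
            "connectors": [{"kv_connector": n} for n in names]
        },
    }
```

The explicit empty-list guard is the detail worth noticing: without it, a MultiConnector with no children would be reported as HMA-capable.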

Test Plan

pytest tests/v1/kv_connector/unit/test_hma_auto_config.py -v

5 unit tests covering:

  • HMA-capable connector (SimpleCPUOffloadConnector) → not auto-disabled
  • Non-HMA connector (ExampleConnector) → auto-disabled (with explicit class-level assertion)
  • Explicit --no-disable-hybrid-kv-cache-manager + non-HMA connector → factory raises ValueError
  • MultiConnector with all HMA children → not auto-disabled
  • MultiConnector with mixed HMA/non-HMA children → auto-disabled

Hardware validation on a hybrid model with SimpleCPUOffloadConnector is pending — will update before marking ready for review.
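The outcomes the test plan lists reduce to a small decision table. The sketch below models it with a hypothetical resolve_hma function. It only covers the kv_transfer_config path and ignores every other reason vLLM may have for disabling HMA, so treat it as a reading aid rather than the PR's logic.

```python
from typing import Optional


def resolve_hma(
    disable_flag: Optional[bool],
    has_connector: bool,
    connector_supports_hma: bool,
) -> bool:
    """Return True if HMA ends up enabled in this simplified model.

    disable_flag mirrors disable_hybrid_kv_cache_manager:
    None means "not set by the user", so the config decides automatically.
    """
    if disable_flag is None:
        # Auto mode: disable only when a connector is set and lacks HMA support.
        disable = has_connector and not connector_supports_hma
    else:
        # An explicit user choice is taken as-is.
        disable = disable_flag

    hma_enabled = not disable
    # Factory-style check: HMA on with a non-HMA connector is an error.
    if has_connector and hma_enabled and not connector_supports_hma:
        raise ValueError("HMA is enabled but the connector does not support it")
    return hma_enabled
```

In this model the only way to reach the ValueError is the explicit opt-in with a non-HMA connector, which matches the third test case.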

Test Result

$ pytest tests/v1/kv_connector/unit/test_hma_auto_config.py -q
.....                                                                    [100%]
5 passed in 2.51s

Changed files

  • tests/v1/kv_connector/unit/test_hma_auto_config.py (added, +127/-0)
  • vllm/config/vllm.py (modified, +19/-11)
  • vllm/distributed/kv_transfer/kv_connector/factory.py (modified, +33/-1)


Bug

vllm/config/vllm.py:1292-1306 unconditionally disables the Hybrid KV cache Manager (HMA) whenever --kv-transfer-config is set, regardless of whether the connector subclasses SupportsHMA.

For hybrid models (sliding-window, chunked-local, Mamba), HMA is what allocates per-group KV pools sized to each layer type's actual needs. With HMA disabled, unify_hybrid_kv_cache_specs (vllm/v1/core/kv_cache_utils.py:1334) collapses all sliding-window / chunked-local specs into FullAttentionSpec, ballooning per-request KV memory and OOMing the engine at startup.

Reproduction

DeepSeek-V4-Flash, 2× RTX PRO 6000 Blackwell, TP=2:

- --kv-transfer-config={"kv_connector":"SimpleCPUOffloadConnector","kv_role":"kv_both","kv_connector_extra_config":{"cpu_bytes_to_use":137438953472}}
- --enable-prefix-caching --enable-chunked-prefill
- --max-model-len=128000 --tensor-parallel-size=2
- --kv-cache-dtype=fp8 --block-size=256

Result:

WARNING [config.py:1297] Turning off hybrid kv cache manager because `--kv-transfer-config` is set...
WARNING [kv_cache_utils.py:1334] Hybrid KV cache manager is disabled for this hybrid model...
ValueError: To serve at least one request with the models's max seq len (128000),
  (38.87 GiB KV cache is needed, which is larger than the available KV cache memory (8.44 GiB).

With HMA off, serving one max-length request needs 38.87 GiB of KV cache against only 8.44 GiB available, because sliding-window groups, which would normally consume small per-token state, are now budgeted as full-attention. Per-GPU KV requirement balloons by ~30 GiB. SimpleCPUOffloadConnector itself never gets a chance to load: startup fails first.
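A rough way to see the blow-up is to count how many token slots one max-length request must reserve per layer. The sketch below uses made-up numbers (4 full-attention layers, 28 sliding-window layers, a 4096-token window); they are not DeepSeek-V4-Flash's real layout, and the real allocator works in blocks with extra bookkeeping that this ignores.

```python
def kv_tokens_per_request(layers, max_len, hma_enabled):
    """Sum the token slots one max-length request reserves across layers.

    layers: list of (kind, window) tuples, kind is "full" or "sliding".
    With HMA off, every layer is budgeted as full attention.
    """
    total = 0
    for kind, window in layers:
        if kind == "sliding" and hma_enabled:
            # A sliding-window layer only needs the last `window` tokens.
            total += min(window, max_len)
        else:
            total += max_len
    return total


# Illustrative layer mix, not taken from any real model config.
LAYERS = [("full", None)] * 4 + [("sliding", 4096)] * 28
MAX_LEN = 128000

with_hma = kv_tokens_per_request(LAYERS, MAX_LEN, hma_enabled=True)
without_hma = kv_tokens_per_request(LAYERS, MAX_LEN, hma_enabled=False)
print(f"with HMA: {with_hma} slots, without: {without_hma} slots")
print(f"ratio: {without_hma / with_hma:.1f}x")
```

The exact ratio depends entirely on the layer mix and window size; the point is only that budgeting windowed layers at full length multiplies the per-request requirement.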

Root cause

vllm/config/vllm.py:1292-1306:

if self.scheduler_config.disable_hybrid_kv_cache_manager is None:
    if self.kv_transfer_config is not None:
        # NOTE(Kuntai): turn HMA off for connector unless specifically enabled.
        need_disable_hybrid_kv_cache_manager = True
        logger.warning(...)

This disables HMA for every connector, even ones that subclass SupportsHMA. The factory at vllm/distributed/kv_transfer/kv_connector/factory.py:56-62 already does the correct check at connector-creation time:

hma_enabled = not config.scheduler_config.disable_hybrid_kv_cache_manager
if hma_enabled and not supports_hma(connector_cls):
    raise ValueError(...)

So the auto-disable in config/vllm.py is making a more pessimistic decision than the factory's source-of-truth check would.

Affected connectors

Several upstream connectors subclass SupportsHMA and are unnecessarily penalized:

  • SimpleCPUOffloadConnector
  • OffloadingConnector
  • NixlConnector
  • MultiConnector (when all children support HMA — see #39571)

Affected models

Any hybrid-attention model:

  • DeepSeek-V4-Flash (full + sliding-window + chunked-local)
  • Mixtral with sliding-window
  • Mamba / SSM hybrids (e.g. Nemotron-H — see #39269)
  • Llama-3.1+ with chunked-local attention

For these models, the auto-disable causes startup OOM that a typical user will misdiagnose as "the connector is incompatible with my model" or "I need more VRAM."

Proposed fix

Gate the auto-disable on supports_hma(connector_cls), mirroring the factory check:

if self.kv_transfer_config is not None:
    from vllm.distributed.kv_transfer.kv_connector.factory import KVConnectorFactory
    from vllm.distributed.kv_transfer.kv_connector.v1 import supports_hma
    connector_cls, _ = KVConnectorFactory._get_connector_class_with_compat(
        self.kv_transfer_config
    )
    if not supports_hma(connector_cls):
        need_disable_hybrid_kv_cache_manager = True
        logger.warning(...)

Connectors that support HMA (the majority of upstream ones) get HMA left alone. Connectors that don't support it keep the existing warning + auto-disable behavior. No user-visible change for non-HMA connectors.

A cleaner version would expose a public class-resolution helper on KVConnectorFactory so this doesn't depend on the _-prefixed _get_connector_class_with_compat.

Current workaround

Pass --no-disable-hybrid-kv-cache-manager explicitly. The factory check at factory.py:58 then accepts it because the connector does support HMA. This is undocumented and unintuitive — a user has to discover both that HMA was auto-disabled silently, and that a specific opt-in flag fixes it.
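As a sketch of the workaround, the reproduction flags can be assembled with the override added. All flags are the ones quoted in this issue; the `vllm serve` entry point and the MODEL placeholder are assumptions, since the report shows only the argument list and not how the server is launched.

```python
import json
import shlex

# Same connector config as the reproduction; 128 * 2**30 bytes is 128 GiB.
kv_transfer_config = {
    "kv_connector": "SimpleCPUOffloadConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"cpu_bytes_to_use": 128 * 2**30},
}

MODEL = "<model>"  # placeholder; the issue does not give a model path

config_json = json.dumps(kv_transfer_config, separators=(",", ":"))

args = [
    "vllm", "serve", MODEL,  # assumed launch command
    f"--kv-transfer-config={config_json}",
    "--no-disable-hybrid-kv-cache-manager",  # the explicit opt-in workaround
    "--enable-prefix-caching", "--enable-chunked-prefill",
    "--max-model-len=128000", "--tensor-parallel-size=2",
    "--kv-cache-dtype=fp8", "--block-size=256",
]

# shlex.join quotes the JSON so it survives a POSIX shell intact.
command = shlex.join(args)
print(command)
```

Building the JSON with json.dumps and quoting with shlex.join avoids hand-escaping the braces and quotes in the connector config.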

Related

  • #29805 — the PR that introduced the current unconditional auto-disable + the manual override flag (cc @NickLucche).
  • #39269 — open PR addressing the analogous bug in the kv_events_config branch (lines 1226-1228), with a different approach: respect explicit --no-disable-hybrid-kv-cache-manager rather than auto-detect. The auto-detect approach proposed here would also generalize to that branch.
  • #39702 — independent runtime crash (TOCTOU race) in SimpleCPUOffloadConnector. Only reachable on hybrid models after the workaround above is applied. See my comment there for cross-context.

Environment

  • vLLM jasl/vllm ds4-sm120-full. The affected file (vllm/config/vllm.py) is byte-identical to upstream/main (verified via git diff); this is reproducible against vanilla vLLM.
  • DeepSeek-V4-Flash, 2× RTX PRO 6000 Blackwell, TP=2.

Disclosure

Investigation done with AI assistance reading the connector / config source. The proposed fix has not been built and tested locally (current production has the workaround in place); upstream maintainers should validate before merging.
