vllm - 💡(How to fix) Fix [Bug]: NIXL connector silently disables HMA, halving KV cache capacity — flip default to HMA=on even with connectors

Root Cause

In kv_cache_utils.py, the hybrid KV cache manager is disabled when any KV connector is configured. This was originally a safety measure before NIXL+HMA compatibility was implemented. Now that compatibility exists, the default should be flipped.

The relevant code paths:

kv_cache_utils.py:1175 — HMA disabled warning
gpu_worker.py:436-470 — KV cache memory profiling (reports identical memory in both cases, masking the problem)

Fix Action

Workaround

Adding --no-disable-hybrid-kv-cache-manager to the launch command re-enables HMA alongside NIXL and restores expected KV cache capacity:

vllm serve \
  --host 0.0.0.0 --port 8000 \
  --model openai/gpt-oss-120b \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --load-format dummy \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
  --no-disable-hybrid-kv-cache-manager

Confirmed: This restores max concurrency to expected levels.

Your current environment

GPU: RTX 6000 Pro (reproduced); expected to affect all GPU types
Model: openai/gpt-oss-120b (hybrid architecture with sliding window layers)
vLLM version: 0.18.x (current)
OS/Container: Standard vLLM container image

🐛 Describe the bug

Summary

Enabling the NIXL connector (NixlConnector) for prefill/decode KV transfer silently disables the Hybrid Memory Allocator (HMA). This causes all transformer layers to be treated as full attention, roughly doubling per-layer KV cache footprint and cutting effective KV cache capacity in half. There is no warning at startup that names the connector as the cause.

Severity: High (affects all P/D disaggregation deployments using NixlConnector)
Component: KV cache / NIXL connector / Hybrid Memory Allocator
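
The "roughly doubling" claim in the summary can be sanity-checked with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that half of the layers use sliding-window attention with a 128-token window and that requests run at a 131,072-token context. Neither value is stated in this report, and the function name is hypothetical.

```python
def relative_kv_footprint(full_layer_fraction: float, window: int, context_len: int) -> float:
    """Per-request KV footprint with HMA on, relative to sizing every layer as full attention."""
    sliding_fraction = 1.0 - full_layer_fraction
    # A sliding-window layer only needs to keep the most recent `window` tokens.
    sliding_cost = min(window, context_len) / context_len
    return full_layer_fraction + sliding_fraction * sliding_cost


# Assumed values for illustration; they are not taken from the report.
ratio = relative_kv_footprint(full_layer_fraction=0.5, window=128, context_len=131_072)
print(f"HMA footprint relative to full attention: {ratio:.4f}")
print(f"Footprint increase when HMA is off: {1 / ratio:.2f}x")
```

Under these assumptions the sliding-window layers contribute almost nothing at long context, so sizing them as full attention nearly doubles the per-request footprint.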

Steps to Reproduce

1. Run with NIXL connector (TP=2)

vllm serve \
  --host 0.0.0.0 --port 8000 \
  --model openai/gpt-oss-120b \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --load-format dummy \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

2. Run without NIXL connector (TP=2, same config otherwise)

vllm serve \
  --host 0.0.0.0 --port 8000 \
  --model openai/gpt-oss-120b \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --load-format dummy

3. Compare KV cache metrics

curl http://localhost:8000/metrics | grep blocks

Observed Behavior

Both configurations report identical available KV cache memory (52.4 GiB) and identical GPU KV cache size (1,526,336 tokens). However, the maximum concurrency diverges dramatically:

Config	KV Cache Memory	KV Cache Tokens	Max Concurrency (131K context)
With NIXL	52.4 GiB	1,526,336	11.65x
Without NIXL	52.4 GiB	1,526,336	21.90x
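
The 11.65x figure is what one would expect if every layer were sized for the full context. A quick check, assuming the "131K context" in the table means 131,072 tokens (the report does not spell out the exact value):

```python
kv_cache_tokens = 1_526_336   # GPU KV cache size from the startup log, same in both runs
context_len = 131_072         # assumed meaning of "131K context"

# If every layer stores the full context, concurrency is simply tokens / context length.
full_attention_concurrency = kv_cache_tokens / context_len
print(f"{full_attention_concurrency:.2f}x")

# Max concurrency values copied from the table above.
with_nixl, without_nixl = 11.65, 21.90
capacity_lost = 1 - with_nixl / without_nixl
print(f"capacity lost with NIXL: {capacity_lost:.1%}")
```

The first value matches the "With NIXL" row, which supports the reading that all layers are being treated as full attention in that configuration. The second value puts the loss at a little under half.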

With NIXL enabled, the startup logs show:

WARNING [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled for this hybrid model,
This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the
KV cache outside the sliding window). The compute of layers like sliding window is still saved.

This warning is emitted but does not mention that the NIXL connector is the cause, nor does it suggest a remediation.

Without NIXL, this warning is absent and HMA is enabled by default.
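
Because the warning is easy to miss, a deployment script can check a captured startup log for it. This is a minimal sketch that assumes the log is available as text; the function name is hypothetical and the match string is taken from the warning quoted above.

```python
HMA_DISABLED_MARKER = "Hybrid KV cache manager is disabled"


def hma_disabled_in_log(log_text: str) -> bool:
    """Return True if any line of the startup log contains the HMA-disabled warning."""
    return any(HMA_DISABLED_MARKER in line for line in log_text.splitlines())


# Sample built from the first line of the warning quoted in this report.
sample_log = (
    "WARNING [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled "
    "for this hybrid model,\n"
)
print(hma_disabled_in_log(sample_log))
```

A check like this could fail a rollout early instead of letting the capacity loss surface later as reduced concurrency.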

Expected Behavior

HMA should remain enabled by default when a KV connector is specified, since NIXL+HMA compatibility has been implemented.
If HMA must be disabled for a connector, the startup log should explicitly state: HMA disabled because KV connector [NixlConnector] is active. To re-enable: --no-disable-hybrid-kv-cache-manager.
The KV cache token count and max concurrency should be consistent regardless of connector presence (given the same model and GPU memory).

Additional Issue: Prometheus Metrics Discrepancy

When NIXL is not activated, there is a discrepancy between the KV cache size reported by the Prometheus /metrics endpoint (blocks metrics) and the startup log (GPU KV cache size: 1,526,336 tokens). These should always agree.
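
One way to make such a discrepancy concrete is to convert the block count from /metrics into tokens and compare it with the startup log. The sketch below is an assumption-laden illustration: the block size of 16 is assumed rather than taken from this report, the helper names are hypothetical, and hybrid models may report blocks per KV cache group, which this simple check ignores.

```python
def tokens_from_blocks(num_blocks: int, block_size: int) -> int:
    """Convert a KV cache block count into a token count."""
    return num_blocks * block_size


def metrics_match_log(num_blocks: int, block_size: int, log_tokens: int) -> bool:
    """Return True if the block count implies the token count printed at startup."""
    return tokens_from_blocks(num_blocks, block_size) == log_tokens


log_tokens = 1_526_336   # "GPU KV cache size" from the startup log
block_size = 16          # assumed; read the real value from the deployment

# Block count that would be consistent with the log under the assumed block size.
expected_blocks = log_tokens // block_size
print(expected_blocks)
print(metrics_match_log(expected_blocks, block_size, log_tokens))
```

Substituting the block count actually reported by /metrics for `expected_blocks` shows whether the two sources agree.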

Proposed Fix

Flip the default: Set HMA=enabled regardless of connector presence (one-line change + update tests).
Improve the warning: If HMA is ever disabled, the warning should name the cause and the remediation flag.
Rename the flag: --no-disable-hybrid-kv-cache-manager is a double-negative; consider --enable-hybrid-kv-cache-manager or --hybrid-kv-cache-manager (on/off).
Fix Prometheus discrepancy: Ensure /metrics KV cache block counts match the startup log token count.
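
The improved warning could be produced by a small formatting helper. This is a hypothetical sketch, not vLLM's actual code; the function name is invented for illustration and the message text follows the wording requested under Expected Behavior.

```python
def hma_disabled_warning(connector_name: str) -> str:
    """Build a warning that names the connector and the flag that re-enables HMA."""
    return (
        f"HMA disabled because KV connector [{connector_name}] is active. "
        "To re-enable: --no-disable-hybrid-kv-cache-manager."
    )


print(hma_disabled_warning("NixlConnector"))
```

Naming both the cause and the remediation in one line lets an operator act on the warning without reading the source.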

Impact

This affects every deployment using NixlConnector for P/D disaggregation on hybrid-architecture models (models with sliding window layers). Operators lose roughly half of their KV cache capacity (max concurrency drops from 21.90x to 11.65x in the reproduction above) with no indication other than a generic warning buried in startup logs. The performance impact is silent and difficult to diagnose: the token count looks correct, and only the max concurrency reveals the problem.
