vllm [Bug]: NIXL connector silently disables HMA, halving KV cache capacity (proposal: flip the default to HMA=on even with connectors)


Your current environment

  • GPU: RTX 6000 Pro (reproduced); expected to affect all GPU types
  • Model: openai/gpt-oss-120b (hybrid architecture with sliding window layers)
  • vLLM version: 0.18.x (current)
  • OS/Container: Standard vLLM container image

🐛 Describe the bug

Summary

Enabling the NIXL connector (NixlConnector) for prefill/decode KV transfer silently disables the Hybrid Memory Allocator (HMA). With HMA off, every transformer layer is sized as if it were full attention, including sliding window layers. This roughly doubles the per-request KV cache footprint and cuts effective KV cache capacity roughly in half. The startup warning that is emitted does not name the connector as the cause.

Severity: High. Affects P/D disaggregation deployments that use NixlConnector with hybrid-architecture models.
Component: KV cache / NIXL connector / Hybrid Memory Allocator
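
To see where a factor of roughly two can come from, the back-of-envelope sketch below counts the KV tokens a single request holds across all layers, with and without HMA. The layer split (18 full-attention and 18 sliding window layers) and the 128-token window are illustrative assumptions, not values taken from this report; only the 131K context length comes from the reproduction.

```python
def kv_tokens_per_request(context_len, n_full, n_sliding, window, hma_enabled):
    """Tokens of KV state one request holds, summed over all layers."""
    full = n_full * context_len
    if hma_enabled:
        # Sliding window layers only keep the most recent `window` tokens.
        sliding = n_sliding * min(context_len, window)
    else:
        # HMA off: sliding window layers are sized like full attention.
        sliding = n_sliding * context_len
    return full + sliding


CONTEXT_LEN = 131_072                     # "131K context" from the report
N_FULL, N_SLIDING, WINDOW = 18, 18, 128   # assumed, for illustration only

with_hma = kv_tokens_per_request(CONTEXT_LEN, N_FULL, N_SLIDING, WINDOW, True)
without_hma = kv_tokens_per_request(CONTEXT_LEN, N_FULL, N_SLIDING, WINDOW, False)
footprint_ratio = without_hma / with_hma

print(f"per-request footprint grows by {footprint_ratio:.2f}x with HMA off")
```

Under these assumptions the ratio approaches 2 for long contexts. The gap measured in this report (11.65x vs 21.90x max concurrency) is somewhat smaller, which this simple model does not try to explain.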

Steps to Reproduce

1. Run with NIXL connector (TP=2)

vllm serve \
  --host 0.0.0.0 --port 8000 \
  --model openai/gpt-oss-120b \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --load-format dummy \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

2. Run without NIXL connector (TP=2, same config otherwise)

vllm serve \
  --host 0.0.0.0 --port 8000 \
  --model openai/gpt-oss-120b \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --load-format dummy

3. Compare KV cache metrics

curl http://localhost:8000/metrics | grep blocks
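
For scripted comparisons between the two runs, the same filtering can be done in Python. This is a minimal sketch that assumes the endpoint returns Prometheus text exposition without trailing timestamps; the metric names in SAMPLE are hypothetical placeholders, not real vLLM metric names.

```python
import urllib.request


def fetch_metrics(url="http://localhost:8000/metrics"):
    """Download the metrics text from a running server."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def filter_samples(exposition_text, needle="blocks"):
    """Return {series: value} for sample lines that contain `needle`."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if needle not in line:
            continue
        # Assumes no timestamp field: the value is the last token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical exposition text, for illustration only.
SAMPLE = (
    "# HELP example_num_gpu_blocks Hypothetical gauge.\n"
    "# TYPE example_num_gpu_blocks gauge\n"
    'example_num_gpu_blocks{model_name="demo"} 1000.0\n'
    'example_requests_running{model_name="demo"} 2.0\n'
)

print(filter_samples(SAMPLE))
```

Against a live server, filter_samples(fetch_metrics()) can be run once per configuration and the two dictionaries compared directly. Unlike the grep above, the sketch drops comment lines so that only numeric samples remain.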

Observed Behavior

Both configurations report identical available KV cache memory (52.4 GiB) and identical GPU KV cache size (1,526,336 tokens). However, the maximum concurrency diverges dramatically:

Config         KV Cache Memory   KV Cache Tokens   Max Concurrency (131K context)
With NIXL      52.4 GiB          1,526,336         11.65x
Without NIXL   52.4 GiB          1,526,336         21.90x

With NIXL enabled, the startup logs show:

WARNING [kv_cache_utils.py:1175] Hybrid KV cache manager is disabled for this hybrid model,
This means we do not enable any optimizations for saving KV cache memory (e.g., dropping the
KV cache outside the sliding window). The compute of layers like sliding window is still saved.

This warning is emitted but does not mention that the NIXL connector is the cause, nor does it suggest a remediation.

Without NIXL, this warning is absent and HMA is enabled by default.
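
The numbers above can be cross-checked with a little arithmetic. The sketch below assumes "131K context" means 131,072 tokens; all other values are copied from the table. Dividing the reported token count by the context length reproduces the with-NIXL concurrency, which is what one would expect if every layer is sized as full attention.

```python
KV_CACHE_TOKENS = 1_526_336   # reported identically in both runs
CONTEXT_LEN = 131_072         # assumption: "131K" means 131,072 tokens

# If every layer is sized like full attention, one request at full context
# consumes CONTEXT_LEN tokens of the pool.
full_attention_concurrency = KV_CACHE_TOKENS / CONTEXT_LEN

WITH_NIXL = 11.65      # max concurrency from the table
WITHOUT_NIXL = 21.90   # max concurrency from the table
capacity_lost = 1 - WITH_NIXL / WITHOUT_NIXL

print(f"{full_attention_concurrency:.2f}x")  # 11.65x
print(f"{capacity_lost:.1%}")                # 46.8%
```

The first figure matches the with-NIXL row, and the second puts the measured capacity loss at a little under half.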

Expected Behavior

  1. HMA should remain enabled by default when a KV connector is specified, since NIXL+HMA compatibility has been implemented.
  2. If HMA must be disabled for a connector, the startup log should explicitly state: "HMA disabled because KV connector [NixlConnector] is active. To re-enable: --no-disable-hybrid-kv-cache-manager."
  3. The KV cache token count and max concurrency should be consistent regardless of connector presence (given the same model and GPU memory).
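
A warning of the kind requested in item 2 could be assembled as sketched below. The function name and logger name are hypothetical and this is not vLLM's logging code; the message text follows the wording proposed above.

```python
import logging

logger = logging.getLogger("hma_warning_sketch")  # hypothetical logger name


def hma_disabled_message(connector_name,
                         remediation_flag="--no-disable-hybrid-kv-cache-manager"):
    """Build a warning that names both the cause and the remediation."""
    return (
        f"HMA disabled because KV connector [{connector_name}] is active. "
        f"To re-enable: {remediation_flag}."
    )


message = hma_disabled_message("NixlConnector")
logger.warning(message)
```

Keeping the connector name and the flag as parameters means the same message would work for other connectors, or for a renamed flag.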

Workaround

Adding --no-disable-hybrid-kv-cache-manager to the launch command re-enables HMA alongside NIXL and restores expected KV cache capacity:

vllm serve \
  --host 0.0.0.0 --port 8000 \
  --model openai/gpt-oss-120b \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching \
  --load-format dummy \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
  --no-disable-hybrid-kv-cache-manager

Confirmed: This restores max concurrency to expected levels.
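
When the server is launched from a script, building the argument list in Python avoids shell-quoting mistakes in the JSON value. The helper below is a hypothetical sketch; it uses only the flags and values shown in the commands in this report.

```python
import json
import shlex


def build_serve_command(use_nixl, keep_hma):
    """Assemble the `vllm serve` argv used in this report."""
    cmd = [
        "vllm", "serve",
        "--host", "0.0.0.0", "--port", "8000",
        "--model", "openai/gpt-oss-120b",
        "--gpu-memory-utilization", "0.95",
        "--tensor-parallel-size", "2",
        "--no-enable-prefix-caching",
        "--load-format", "dummy",
    ]
    if use_nixl:
        config = {"kv_connector": "NixlConnector", "kv_role": "kv_both"}
        # Compact separators reproduce the JSON exactly as written above.
        cmd += ["--kv-transfer-config", json.dumps(config, separators=(",", ":"))]
    if keep_hma:
        # Workaround flag: keep HMA enabled alongside the connector.
        cmd.append("--no-disable-hybrid-kv-cache-manager")
    return cmd


workaround_cmd = build_serve_command(use_nixl=True, keep_hma=True)
print(shlex.join(workaround_cmd))
```

The resulting list can be passed to subprocess.run without shell=True, so the JSON needs no extra quoting.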

Root Cause Analysis

In kv_cache_utils.py, the hybrid KV cache manager is disabled when any KV connector is configured. This was originally a safety measure before NIXL+HMA compatibility was implemented. Now that compatibility exists, the default should be flipped.

The relevant code paths:

  • kv_cache_utils.py:1175 — HMA disabled warning
  • gpu_worker.py:436-470 — KV cache memory profiling (reports identical memory in both cases, masking the problem)
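
The decision described above can be modeled as a small function. This is a hypothetical sketch of the logic as this report describes it, not the code in kv_cache_utils.py; the function and parameter names are invented for illustration.

```python
from typing import Optional


def resolve_hma_enabled(user_choice: Optional[bool],
                        has_kv_connector: bool,
                        flip_default: bool) -> bool:
    """Decide whether HMA ends up enabled.

    user_choice:      True/False if the operator passed an explicit flag, else None.
    has_kv_connector: whether any KV connector (e.g. NixlConnector) is configured.
    flip_default:     False models the behavior in this report, True the proposal.
    """
    if user_choice is not None:
        # An explicit flag always wins; this is how the workaround takes effect.
        return user_choice
    if has_kv_connector and not flip_default:
        # Reported behavior: a connector turns HMA off by default.
        return False
    return True


current = resolve_hma_enabled(None, has_kv_connector=True, flip_default=False)
proposed = resolve_hma_enabled(None, has_kv_connector=True, flip_default=True)
print(current, proposed)  # False True
```

In this model the proposed change only alters the case where the operator gave no explicit flag, so existing launch commands that already set the flag would behave as before.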

Additional Issue: Prometheus Metrics Discrepancy

When NIXL is not activated, there is a discrepancy between the KV cache size reported by the Prometheus /metrics endpoint (blocks metrics) and the startup log (GPU KV cache size: 1,526,336 tokens). These should always agree.

Proposed Fix

  1. Flip the default: Set HMA=enabled regardless of connector presence (one-line change + update tests).
  2. Improve the warning: If HMA is ever disabled, the warning should name the cause and the remediation flag.
  3. Rename the flag: --no-disable-hybrid-kv-cache-manager is a double negative; consider --enable-hybrid-kv-cache-manager or --hybrid-kv-cache-manager (on/off).
  4. Fix Prometheus discrepancy: Ensure /metrics KV cache block counts match the startup log token count.
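
The on/off flag suggested in item 3 can be sketched with the standard library's argparse.BooleanOptionalAction (Python 3.9+), which generates the negated form automatically. This is an illustration of the flag shape only, not vLLM's argument parsing code, and the parser name is a placeholder.

```python
import argparse

parser = argparse.ArgumentParser(prog="hma-flag-sketch")
parser.add_argument(
    "--hybrid-kv-cache-manager",
    action=argparse.BooleanOptionalAction,  # also accepts --no-hybrid-kv-cache-manager
    default=None,                           # None means "not specified by the operator"
    help="Enable or disable the hybrid KV cache manager.",
)

unset = parser.parse_args([]).hybrid_kv_cache_manager
enabled = parser.parse_args(["--hybrid-kv-cache-manager"]).hybrid_kv_cache_manager
disabled = parser.parse_args(["--no-hybrid-kv-cache-manager"]).hybrid_kv_cache_manager

print(unset, enabled, disabled)  # None True False
```

A default of None lets the program distinguish "not specified" from an explicit choice, which a connector-dependent default would need.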

Impact

This affects every deployment using NixlConnector for P/D disaggregation on hybrid-architecture models (models with sliding window layers). Operators lose roughly half of their KV cache capacity (max concurrency drops from 21.90x to 11.65x in the reproduction above, about 47%) with no indication other than a generic warning buried in startup logs. The performance impact is silent and difficult to diagnose: the token count looks correct, and only the max concurrency reveals the problem.
