vllm - 💡(How to fix) Fix [Bug]: Prefix caching causes AssertionError for pooling models (embedding models) with empty kv_cache_groups [1 comments, 2 participants]

1998ming · 2026-04-23T06:49:34Z

[vllm]

### Your current environment

    python3 -m vllm.entrypoints.openai.api_server --model bge-m3 \
        --trust-remote-code \
        --served-model-name embed \
        --enable-prefix-caching \
        --gpu-memory-utilization 0.9 \
        --max-model-len 8192 \
        --convert embed \
        --dtype=bfloat16 \
        --enforce-eager \
        --max_num_seqs $BATCH_SIZE

```text
Your output of `python collect_env.py` here
```

When launching a pooling model (such as the BGE-M3 embedding model) with the --enable-prefix-caching flag enabled, the following error occurs:

    AssertionError: HybridKVCacheCoordinator requires at least two attention groups.

### 🐛 Describe the bug

Complete Stack Trace

    File "/path/to/vllm/v1/core/kv_cache_coordinator.py", line 434, in verify_and_split_kv_cache_groups
        assert len(attention_groups) > 1, (
            "HybridKVCacheCoordinator requires at least two attention groups."
        )
    AssertionError: HybridKVCacheCoordinator requires at least two attention groups.

Reproduction Steps

1. Start the vLLM API server with a pooling model and prefix caching enabled:

       python3 -m vllm.entrypoints.openai.api_server \
           --model bge-m3 \
           --enable-prefix-caching \
           --trust-remote-code

2. Observe the pooling model's KVCacheConfig:

       KVCacheConfig(num_blocks=1, kv_cache_tensors=[], kv_cache_groups=[])

3. The error is raised.

Root Cause

The coordinator selection logic in the get_kv_cache_coordinator() function (lines 547-595) has a flaw:

    if not enable_caching:
        return KVCacheCoordinatorNoPrefixCache(...)
    if len(kv_cache_config.kv_cache_groups) == 1:
        return UnitaryKVCacheCoordinator(...)
    return HybridKVCacheCoordinator(...)  # ❌ Problem here

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
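The fall-through described under Root Cause can be reproduced without vLLM. The sketch below is a stand-in, not vLLM's code: `select_coordinator`, `check_hybrid`, and the string labels are hypothetical, and only the branch order and the assertion message are copied from the report. With an empty `kv_cache_groups` list the length is 0, so neither of the first two branches matches and the hybrid path is chosen, where the assertion then fails.

```python
def select_coordinator(enable_caching: bool, kv_cache_groups: list) -> str:
    """Mirror the branch order quoted in the report (labels are stand-ins)."""
    if not enable_caching:
        return "no_prefix_cache"
    if len(kv_cache_groups) == 1:
        return "unitary"
    # Every other length, including 0, lands here.
    return "hybrid"


def check_hybrid(kv_cache_groups: list) -> None:
    """Stand-in for the assertion quoted in the stack trace."""
    assert len(kv_cache_groups) > 1, (
        "HybridKVCacheCoordinator requires at least two attention groups."
    )


if __name__ == "__main__":
    groups = []  # what the report shows for the pooling model
    print(select_coordinator(True, groups))
    try:
        check_hybrid(groups)
    except AssertionError as exc:
        print(f"AssertionError: {exc}")
```

Passing `enable_caching=False` with the same empty list returns the no-prefix-cache label, which matches the report's observation that the error appears only when prefix caching is enabled.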


GitHub stats

vllm-project/vllm#40682 • Fetched 2026-04-24 05:52:09

Author: 1998ming

Participants: 1998ming, noooop

Timeline (top): closed ×1, commented ×1, labeled ×1


extent analysis

TL;DR

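The reported failure occurs when kv_cache_config.kv_cache_groups is empty (length 0), as it is for the pooling model in the report. The existing check already handles a length of exactly 1, so the fix is to make the coordinator selection logic in get_kv_cache_coordinator() handle the empty case instead of falling through to HybridKVCacheCoordinator.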

Guidance

Review the get_kv_cache_coordinator() function to ensure it handles every possible number of kv_cache_groups: zero, one, and two or more.
Confirm that HybridKVCacheCoordinator is selected only when the KVCacheConfig has at least two attention groups; the pooling model in the report has none.
Consider adding a check that len(kv_cache_config.kv_cache_groups) is greater than 1 before returning HybridKVCacheCoordinator.
Check the documentation for any specific requirements or recommendations on using prefix caching with pooling models.

Example

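if not enable_caching or len(kv_cache_config.kv_cache_groups) == 0:
    # No KV cache groups (as for the pooling model in the report): nothing to prefix-cache.
    return KVCacheCoordinatorNoPrefixCache(...)
if len(kv_cache_config.kv_cache_groups) == 1:
    return UnitaryKVCacheCoordinator(...)
return HybridKVCacheCoordinator(...)

Notes

The snippet assumes that the problem lies only in the coordinator selection logic and that KVCacheCoordinatorNoPrefixCache accepts an empty kv_cache_groups list; the issue confirms neither. Simply widening the second check to <= 1 would send the empty case to UnitaryKVCacheCoordinator, which may expect exactly one group. Without the full codebase, treat this as a sketch rather than a definitive fix.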

Recommendation

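Apply workaround: Modify the get_kv_cache_coordinator() function to handle the case where kv_cache_config.kv_cache_groups is empty (length 0), as shown in the example above. This should keep the empty case away from HybridKVCacheCoordinator and its assertion. Based on the snippet in the report, launching the pooling model without --enable-prefix-caching should also avoid the error, since the first branch returns KVCacheCoordinatorNoPrefixCache.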
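A guard for the empty case can be checked in isolation before touching vLLM. The sketch below is an illustration under stated assumptions: `KVCacheConfigStub` and `pick_coordinator_name` are hypothetical names, the stub holds only the three fields printed in the report, and routing an empty `kv_cache_groups` to the no-prefix-cache coordinator is one possible choice, not a confirmed upstream fix.

```python
from dataclasses import dataclass, field


@dataclass
class KVCacheConfigStub:
    """Hypothetical stand-in with only the fields shown in the report."""
    num_blocks: int = 1
    kv_cache_tensors: list = field(default_factory=list)
    kv_cache_groups: list = field(default_factory=list)


def pick_coordinator_name(config: KVCacheConfigStub, enable_caching: bool) -> str:
    """Return the name of the coordinator a guarded selection would choose."""
    # Assumption: with no KV cache groups there is nothing to prefix-cache,
    # so the empty case is treated like caching being disabled.
    if not enable_caching or not config.kv_cache_groups:
        return "KVCacheCoordinatorNoPrefixCache"
    if len(config.kv_cache_groups) == 1:
        return "UnitaryKVCacheCoordinator"
    return "HybridKVCacheCoordinator"


if __name__ == "__main__":
    pooling = KVCacheConfigStub()  # kv_cache_groups=[] as in the report
    single = KVCacheConfigStub(kv_cache_groups=["group0"])
    hybrid = KVCacheConfigStub(kv_cache_groups=["group0", "group1"])
    for cfg in (pooling, single, hybrid):
        print(len(cfg.kv_cache_groups), pick_coordinator_name(cfg, True))
```

Returning a name instead of constructing a coordinator keeps the check independent of vLLM's constructor arguments, which the report elides.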
