vllm - [Bug]: Prefix caching causes AssertionError for pooling models (embedding models) with empty kv_cache_groups

GitHub stats
vllm-project/vllm#40682, fetched 2026-04-24 05:52:09
Comments: 1 · Participants: 2 · Timeline events: 3 (closed ×1, commented ×1, labeled ×1) · Reactions: 0

Error Message

When launching a pooling model (such as the BGE-M3 embedding model) with the --enable-prefix-caching flag, the server fails with:

AssertionError: HybridKVCacheCoordinator requires at least two attention groups.

Your current environment

python3 -m vllm.entrypoints.openai.api_server --model bge-m3 \
    --trust-remote-code \
    --served-model-name embed \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --convert embed \
    --dtype=bfloat16 \
    --enforce-eager \
    --max_num_seqs $BATCH_SIZE

When launching a pooling model (such as BGE-M3 embedding model) with --enable-prefix-caching flag enabled, the following error occurs:

AssertionError: HybridKVCacheCoordinator requires at least two attention groups.

🐛 Describe the bug

Complete Stack Trace

File "/path/to/vllm/v1/core/kv_cache_coordinator.py", line 434, in verify_and_split_kv_cache_groups
    assert len(attention_groups) > 1, (
        "HybridKVCacheCoordinator requires at least two attention groups."
    )
AssertionError: HybridKVCacheCoordinator requires at least two attention groups.

Reproduction Steps

  1. Start the vLLM API server with a pooling model and prefix caching enabled:
    python3 -m vllm.entrypoints.openai.api_server \
        --model bge-m3 \
        --enable-prefix-caching \
        --trust-remote-code

  2. Observe the pooling model's KVCacheConfig: KVCacheConfig(num_blocks=1, kv_cache_tensors=[], kv_cache_groups=[])

  3. The AssertionError shown above is raised.

Root Cause

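The coordinator selection logic in the get_kv_cache_coordinator() function (lines 547-595) does not account for an empty kv_cache_groups list. With prefix caching enabled, only the case of exactly one group is routed to UnitaryKVCacheCoordinator; every other length, including 0, falls through to HybridKVCacheCoordinator, whose assertion then fails: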

if not enable_caching:
    return KVCacheCoordinatorNoPrefixCache(...)
if len(kv_cache_config.kv_cache_groups) == 1:
    return UnitaryKVCacheCoordinator(...)
return HybridKVCacheCoordinator(...)  # ❌ Problem here
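The fall-through can be illustrated without installing vLLM. The sketch below is not vLLM code: the KVCacheConfig dataclass is a stand-in holding only the three fields quoted in the reproduction steps, and select_coordinator is a hypothetical helper that mirrors the quoted branch and returns the class name as a string instead of constructing a coordinator.

```python
from dataclasses import dataclass, field


@dataclass
class KVCacheConfig:
    # Stand-in for vLLM's KVCacheConfig (assumption: only the fields
    # quoted in the issue are modelled here).
    num_blocks: int = 1
    kv_cache_tensors: list = field(default_factory=list)
    kv_cache_groups: list = field(default_factory=list)


def select_coordinator(config: KVCacheConfig, enable_caching: bool) -> str:
    """Hypothetical mirror of the quoted branch; returns a class name."""
    if not enable_caching:
        return "KVCacheCoordinatorNoPrefixCache"
    if len(config.kv_cache_groups) == 1:
        return "UnitaryKVCacheCoordinator"
    # A length of 0 also reaches this line, not only lengths of 2 or more.
    return "HybridKVCacheCoordinator"


# The pooling model's config from the issue: kv_cache_groups=[]
pooling_config = KVCacheConfig()
chosen = select_coordinator(pooling_config, enable_caching=True)
print(chosen)
```

With the empty group list and caching enabled, the helper selects the hybrid branch, which is where the real coordinator's "at least two attention groups" assertion fires.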

extent analysis

TL;DR

The issue can be fixed by modifying the coordinator selection logic in get_kv_cache_coordinator() to handle the case where kv_cache_config.kv_cache_groups is empty (length 0), as it is for pooling models, instead of letting that case fall through to HybridKVCacheCoordinator.

Guidance

  • Review the get_kv_cache_coordinator() function to ensure it correctly handles every possible number of kv_cache_groups, including zero.
  • Verify that the KVCacheConfig actually contains at least two attention groups whenever HybridKVCacheCoordinator is selected; a pooling model with kv_cache_groups=[] cannot satisfy this.
  • Consider adding a check that len(kv_cache_config.kv_cache_groups) is greater than 1 before returning HybridKVCacheCoordinator.
  • Check the documentation for any specific requirements or recommendations for configuring KVCacheConfig with pooling models.

Example

if not enable_caching:
    return KVCacheCoordinatorNoPrefixCache(...)
if len(kv_cache_config.kv_cache_groups) <= 1:
    return UnitaryKVCacheCoordinator(...)
return HybridKVCacheCoordinator(...)

Notes

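The snippet assumes that the fault lies in the coordinator selection logic and that UnitaryKVCacheCoordinator tolerates an empty kv_cache_groups list; neither is verified here. If it does not, routing the empty case to KVCacheCoordinatorNoPrefixCache may be the safer branch, since a model with no KV cache groups has nothing to prefix-cache. Without the full codebase, it is difficult to provide a definitive solution.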
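As an alternative to the `<= 1` variant, the empty case could be sent to the no-prefix-cache coordinator. The sketch below only illustrates that branching and is not confirmed to be the upstream fix: select_coordinator_patched is a hypothetical helper that takes the group count directly and returns class names as strings, so it runs without vLLM.

```python
def select_coordinator_patched(num_groups: int, enable_caching: bool) -> str:
    """Hypothetical variant of the selection branch with a zero-group guard."""
    # No groups means there is no KV cache to share across requests, so
    # treat it like caching being disabled (assumption, not upstream code).
    if not enable_caching or num_groups == 0:
        return "KVCacheCoordinatorNoPrefixCache"
    if num_groups == 1:
        return "UnitaryKVCacheCoordinator"
    return "HybridKVCacheCoordinator"


# Enumerate the group counts that matter with prefix caching enabled.
results = {n: select_coordinator_patched(n, enable_caching=True) for n in (0, 1, 2)}
for n, name in results.items():
    print(n, name)
```

Under this guard, HybridKVCacheCoordinator is reached only with two or more groups, which matches the precondition stated in its assertion message.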

Recommendation

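Apply workaround: Modify the get_kv_cache_coordinator() function so that an empty kv_cache_config.kv_cache_groups (length 0) no longer reaches HybridKVCacheCoordinator, which prevents the AssertionError from being raised. Judging from the quoted branch, launching the pooling model without --enable-prefix-caching should also avoid this code path, because the first branch then returns KVCacheCoordinatorNoPrefixCache.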
