vllm - 💡(How to fix) Fix [Bug]: Infinite loop in EngineCore during multimodal cache eviction after 400/500 sequence with reused image

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

  1. Send a new request containing the exact same image from step 1, but with a token count within the context limit. The server returns a 500 Internal Server Error, and the logs print: Expected a cached item for mm_hash='xxx'.
  • Step 2 fails with a 500 error and logs a cache expectation warning.
  1. Step 2 (500 Cache Miss): On the subsequent request with the same image (now within limits), BaseMultiModalProcessor finds a cache hit. It forwards the request containing only the mm_hash but omits the actual mm_item. When EngineCore receives this, it cannot locate the corresponding item, triggering the 500 error and the log: Expected a cached item for mm_hash='xxx'.

Root Cause

Based on my investigation, the issue stems from a desynchronization between the two multimodal caches maintained by vLLM:

  1. Dual Cache Architecture: vLLM maintains two separate mm-caches: one in the service preprocessing stage (BaseMultiModalProcessor) and another in the inference stage (EngineCore).

    vllm/multimodal/processing/processor.py

    class BaseMultiModalProcessor(ABC, Generic[_I]):
        ......
        def _merge_mm_kwargs(
            self,
            cache: BaseMultiModalProcessorCache,
            mm_hashes: MultiModalHashes,
            mm_is_cached: MultiModalIsCached,
            mm_missing_kwargs: MultiModalKwargsItems,
            mm_missing_prompt_updates: MultiModalPromptUpdates,
        ) -> tuple[MultiModalKwargsOptionalItems, MultiModalPromptUpdates]:
            ......
            # update cache
            for hashes in mm_hashes.values():
                for item_hash in hashes:
                    cache.touch_sender_cache_item(item_hash)
            for modality, hashes in mm_hashes.items():
                ......
                    kwargs, updates = cache.get_and_update_item(item, item_hash)
    
             ......

    vllm/v1/engine/core.py

    class EngineCore:
        ......
        def preprocess_add_request(self, request: EngineCoreRequest) -> tuple[Request, int]:
            """Preprocess the request.
    
            This function could be directly used in input processing thread to allow
            request initialization running in parallel with Model forward
            """
            # Note on thread safety: no race condition.
            # `mm_receiver_cache` is reset at the end of LLMEngine init,
            # and will only be accessed in the input processing thread afterwards.
            if self.mm_receiver_cache is not None and request.mm_features:
                request.mm_features = self.mm_receiver_cache.get_and_update_features(
                    request.mm_features
                )
  2. Step 1 (400 Rejection): When a request exceeds the context limit, it is rejected before reaching the engine. However, BaseMultiModalProcessor has already registered the image's mm_hash and mm_item in its cache, while EngineCore's cache remains unupdated.

  3. Step 2 (500 Cache Miss): On the subsequent request with the same image (now within limits), BaseMultiModalProcessor finds a cache hit. It forwards the request containing only the mm_hash but omits the actual mm_item. When EngineCore receives this, it cannot locate the corresponding item, triggering the 500 error and the log: Expected a cached item for mm_hash='xxx'.

    vllm/multimodal/cache.py

        def get_and_update_item(
            self,
            mm_item: MultiModalKwargsItem | None,
            mm_hash: str,
        ) -> MultiModalKwargsItem:
            # Already updated _lru_order in "self._cache.get(mm_hash)"
            if (cached_item := self._cache.get(mm_hash)) is not None:
                return cached_item
            # Receives mm_item is None, fails to populate the actual `_cache_data` dictionary 
            assert mm_item is not None, f"Expected a cached item for {mm_hash=}"
    
            self._cache[mm_hash] = mm_item
            return mm_item
  4. Step 3 (Infinite Loop during Eviction): Both caches are implemented via LRUCache. During the failed Step 2 request, EngineCore's LRU updates its internal _lru_order list with the hash but fails to populate the actual _cache_data dictionary (since the item was never provided). Later, when cache reclamation is triggered and the LRU traversal reaches this orphaned hash, it encounters a missing key in _cache_data, causing the eviction logic to fall into an infinite loop.

    cachetools/__init__.py

        def __setitem__(self, key, value):
            maxsize = self.__maxsize
            size = self.getsizeof(value)
            if size > maxsize:
                raise ValueError("value too large")
            if key not in self.__data or self.__size[key] < size:
                # INFINITY LOOP!! Since LRUCache cannot normally pop item when oder has a key that not in data.
                while self.__currsize + size > maxsize:
                    self.popitem()
            if key in self.__data:
                diffsize = size - self.__size[key]
            else:
                diffsize = size
            self.__data[key] = value
            self.__size[key] = size
            self.__currsize += diffsize

    vllm/utils/cache.py

    class LRUCache(cachetools.LRUCache[_K, _V]):
        ......
        def pop(self, key: _K, default: _V | _T | None = None) -> _V | _T | None:
            value: _V | _T | None
            # DIRECT ROOT CAUSE!! If hash key only in oder but not in data, the key will not be deleted in oder, nothing happend after popitem!
            if key not in self:
                return default
    
            value = self.__getitem__(key, update_info=False)  # type: ignore[call-arg]
            self.__delitem__(key)
            return value    
        
        def popitem(self, remove_pinned: bool = False):
            """Remove and return the `(key, value)` pair least recently used."""
            if not remove_pinned:
                # pop the oldest item in the cache that is not pinned
                lru_key = next(
                    (key for key in self.order if key not in self.pinned_items),
                    ALL_PINNED_SENTINEL,
                )
                if lru_key is ALL_PINNED_SENTINEL:
                    raise RuntimeError(
                        "All items are pinned, cannot remove oldest from the cache."
                    )
            else:
                lru_key = next(iter(self.order))
            value = self.pop(cast(_K, lru_key))
            if lru_key in self.order:
                del self._LRUCache__order[lru_key]
            return (lru_key, value)

Fix Action

Fix / Workaround

============================== CPU Info

Architecture: aarch64 CPU op-mode(s): 64-bit Byte Order: Little Endian CPU(s): 192 On-line CPU(s) list: 0-191 Vendor ID: HiSilicon BIOS Vendor ID: HiSilicon Model name: Kunpeng-920 BIOS Model name: HUAWEI Kunpeng 920 5250 Model: 0 Thread(s) per core: 1 Core(s) per socket: 48 Socket(s): 4 Stepping: 0x1 BogoMIPS: 200.00 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs L1d cache: 12 MiB (192 instances) L1i cache: 12 MiB (192 instances) L2 cache: 96 MiB (192 instances) L3 cache: 192 MiB (8 instances) NUMA node(s): 8 NUMA node0 CPU(s): 0-23 NUMA node1 CPU(s): 24-47 NUMA node2 CPU(s): 48-71 NUMA node3 CPU(s): 72-95 NUMA node4 CPU(s): 96-119 NUMA node5 CPU(s): 120-143 NUMA node6 CPU(s): 144-167 NUMA node7 CPU(s): 168-191 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; __user pointer sanitization Vulnerability Spectre v2: Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (aarch64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : 15.0.7
CMake version                : version 4.3.2
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cpu
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.14 (main, Feb 26 2026, 04:49:14) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-4.19.90-2107.6.0.0251.71.oe1.bclinux.aarch64-aarch64-with-glibc2.35


==============================
          CPU Info
==============================
Architecture:                       aarch64
CPU op-mode(s):                     64-bit
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          HiSilicon
BIOS Vendor ID:                     HiSilicon
Model name:                         Kunpeng-920
BIOS Model name:                    HUAWEI Kunpeng 920 5250
Model:                              0
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          4
Stepping:                           0x1
BogoMIPS:                           200.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                          12 MiB (192 instances)
L1i cache:                          12 MiB (192 instances)
L2 cache:                           96 MiB (192 instances)
L3 cache:                           192 MiB (8 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-23
NUMA node1 CPU(s):                  24-47
NUMA node2 CPU(s):                  48-71
NUMA node3 CPU(s):                  72-95
NUMA node4 CPU(s):                  96-119
NUMA node5 CPU(s):                  120-143
NUMA node6 CPU(s):                  144-167
NUMA node7 CPU(s):                  168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0+cpu
[pip3] torch_npu==2.9.0.post1+gitee7ba04
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton-ascend==3.2.0.dev20260322
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.18.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/cann-8.5.1/tools/aml/lib64:/usr/local/Ascend/cann-8.5.1/tools/aml/lib64/plugin:/usr/local/Ascend/cann-8.5.1/lib64:/usr/local/Ascend/cann-8.5.1/lib64/plugin/opskernel:/usr/local/Ascend/cann-8.5.1/lib64/plugin/nnengine:/usr/local/Ascend/cann-8.5.1/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/python3.11.14/lib::/usr/local/lib
OMP_NUM_THREADS=1
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

---

class BaseMultiModalProcessor(ABC, Generic[_I]):
       ......
       def _merge_mm_kwargs(
           self,
           cache: BaseMultiModalProcessorCache,
           mm_hashes: MultiModalHashes,
           mm_is_cached: MultiModalIsCached,
           mm_missing_kwargs: MultiModalKwargsItems,
           mm_missing_prompt_updates: MultiModalPromptUpdates,
       ) -> tuple[MultiModalKwargsOptionalItems, MultiModalPromptUpdates]:
           ......
           # update cache
           for hashes in mm_hashes.values():
               for item_hash in hashes:
                   cache.touch_sender_cache_item(item_hash)
           for modality, hashes in mm_hashes.items():
               ......
                   kwargs, updates = cache.get_and_update_item(item, item_hash)
   
            ......

---

class EngineCore:
       ......
       def preprocess_add_request(self, request: EngineCoreRequest) -> tuple[Request, int]:
           """Preprocess the request.
   
           This function could be directly used in input processing thread to allow
           request initialization running in parallel with Model forward
           """
           # Note on thread safety: no race condition.
           # `mm_receiver_cache` is reset at the end of LLMEngine init,
           # and will only be accessed in the input processing thread afterwards.
           if self.mm_receiver_cache is not None and request.mm_features:
               request.mm_features = self.mm_receiver_cache.get_and_update_features(
                   request.mm_features
               )

---

def get_and_update_item(
           self,
           mm_item: MultiModalKwargsItem | None,
           mm_hash: str,
       ) -> MultiModalKwargsItem:
           # Already updated _lru_order in "self._cache.get(mm_hash)"
           if (cached_item := self._cache.get(mm_hash)) is not None:
               return cached_item
           # Receives mm_item is None, fails to populate the actual `_cache_data` dictionary 
           assert mm_item is not None, f"Expected a cached item for {mm_hash=}"
   
           self._cache[mm_hash] = mm_item
           return mm_item

---

def __setitem__(self, key, value):
           maxsize = self.__maxsize
           size = self.getsizeof(value)
           if size > maxsize:
               raise ValueError("value too large")
           if key not in self.__data or self.__size[key] < size:
               # INFINITY LOOP!! Since LRUCache cannot normally pop item when oder has a key that not in data.
               while self.__currsize + size > maxsize:
                   self.popitem()
           if key in self.__data:
               diffsize = size - self.__size[key]
           else:
               diffsize = size
           self.__data[key] = value
           self.__size[key] = size
           self.__currsize += diffsize

---

class LRUCache(cachetools.LRUCache[_K, _V]):
       ......
       def pop(self, key: _K, default: _V | _T | None = None) -> _V | _T | None:
           value: _V | _T | None
           # DIRECT ROOT CAUSE!! If hash key only in oder but not in data, the key will not be deleted in oder, nothing happend after popitem!
           if key not in self:
               return default
   
           value = self.__getitem__(key, update_info=False)  # type: ignore[call-arg]
           self.__delitem__(key)
           return value    
       
       def popitem(self, remove_pinned: bool = False):
           """Remove and return the `(key, value)` pair least recently used."""
           if not remove_pinned:
               # pop the oldest item in the cache that is not pinned
               lru_key = next(
                   (key for key in self.order if key not in self.pinned_items),
                   ALL_PINNED_SENTINEL,
               )
               if lru_key is ALL_PINNED_SENTINEL:
                   raise RuntimeError(
                       "All items are pinned, cannot remove oldest from the cache."
                   )
           else:
               lru_key = next(iter(self.order))
           value = self.pop(cast(_K, lru_key))
           if lru_key in self.order:
               del self._LRUCache__order[lru_key]
           return (lru_key, value)
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (aarch64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : 15.0.7
CMake version                : version 4.3.2
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cpu
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.14 (main, Feb 26 2026, 04:49:14) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-4.19.90-2107.6.0.0251.71.oe1.bclinux.aarch64-aarch64-with-glibc2.35


==============================
          CPU Info
==============================
Architecture:                       aarch64
CPU op-mode(s):                     64-bit
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          HiSilicon
BIOS Vendor ID:                     HiSilicon
Model name:                         Kunpeng-920
BIOS Model name:                    HUAWEI Kunpeng 920 5250
Model:                              0
Thread(s) per core:                 1
Core(s) per socket:                 48
Socket(s):                          4
Stepping:                           0x1
BogoMIPS:                           200.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                          12 MiB (192 instances)
L1i cache:                          12 MiB (192 instances)
L2 cache:                           96 MiB (192 instances)
L3 cache:                           192 MiB (8 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-23
NUMA node1 CPU(s):                  24-47
NUMA node2 CPU(s):                  48-71
NUMA node3 CPU(s):                  72-95
NUMA node4 CPU(s):                  96-119
NUMA node5 CPU(s):                  120-143
NUMA node6 CPU(s):                  144-167
NUMA node7 CPU(s):                  168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0+cpu
[pip3] torch_npu==2.9.0.post1+gitee7ba04
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton-ascend==3.2.0.dev20260322
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.18.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/cann-8.5.1/tools/aml/lib64:/usr/local/Ascend/cann-8.5.1/tools/aml/lib64/plugin:/usr/local/Ascend/cann-8.5.1/lib64:/usr/local/Ascend/cann-8.5.1/lib64/plugin/opskernel:/usr/local/Ascend/cann-8.5.1/lib64/plugin/nnengine:/usr/local/Ascend/cann-8.5.1/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:/usr/local/python3.11.14/lib::/usr/local/lib
OMP_NUM_THREADS=1
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

Describe the bug

When serving multimodal models, a specific sequence of requests causes the EngineCore to enter an infinite loop during multimodal (mm) cache eviction, resulting in completely unresponsive requests.

Steps to reproduce

  1. Send a multimodal request containing an image where the total token count exceeds the model's context limit. The server correctly returns a 400 Bad Request (context limit exceeded).
  2. Send a new request containing the exact same image from step 1, but with a token count within the context limit. The server returns a 500 Internal Server Error, and the logs print: Expected a cached item for mm_hash='xxx'.
  3. Continue sending new requests with different images. When EngineCore triggers the multimodal cache eviction/reclamation process, it falls into an infinite loop, causing the service to hang and all subsequent requests to time out.

Expected behavior

  • Step 2 should be processed successfully and return a valid response.
  • Step 3 should handle mm-cache eviction normally without deadlocks or infinite loops.

Actual behavior

  • Step 2 fails with a 500 error and logs a cache expectation warning.
  • Step 3 triggers an infinite loop in the EngineCore during mm-cache recycling, making the entire service unresponsive.

Root Cause Analysis

Based on my investigation, the issue stems from a desynchronization between the two multimodal caches maintained by vLLM:

  1. Dual Cache Architecture: vLLM maintains two separate mm-caches: one in the service preprocessing stage (BaseMultiModalProcessor) and another in the inference stage (EngineCore).

    vllm/multimodal/processing/processor.py

    class BaseMultiModalProcessor(ABC, Generic[_I]):
        ......
        def _merge_mm_kwargs(
            self,
            cache: BaseMultiModalProcessorCache,
            mm_hashes: MultiModalHashes,
            mm_is_cached: MultiModalIsCached,
            mm_missing_kwargs: MultiModalKwargsItems,
            mm_missing_prompt_updates: MultiModalPromptUpdates,
        ) -> tuple[MultiModalKwargsOptionalItems, MultiModalPromptUpdates]:
            ......
            # update cache
            for hashes in mm_hashes.values():
                for item_hash in hashes:
                    cache.touch_sender_cache_item(item_hash)
            for modality, hashes in mm_hashes.items():
                ......
                    kwargs, updates = cache.get_and_update_item(item, item_hash)
    
             ......

    vllm/v1/engine/core.py

    class EngineCore:
        ......
        def preprocess_add_request(self, request: EngineCoreRequest) -> tuple[Request, int]:
            """Preprocess the request.
    
            This function could be directly used in input processing thread to allow
            request initialization running in parallel with Model forward
            """
            # Note on thread safety: no race condition.
            # `mm_receiver_cache` is reset at the end of LLMEngine init,
            # and will only be accessed in the input processing thread afterwards.
            if self.mm_receiver_cache is not None and request.mm_features:
                request.mm_features = self.mm_receiver_cache.get_and_update_features(
                    request.mm_features
                )
  2. Step 1 (400 Rejection): When a request exceeds the context limit, it is rejected before reaching the engine. However, BaseMultiModalProcessor has already registered the image's mm_hash and mm_item in its cache, while EngineCore's cache remains unupdated.

  3. Step 2 (500 Cache Miss): On the subsequent request with the same image (now within limits), BaseMultiModalProcessor finds a cache hit. It forwards the request containing only the mm_hash but omits the actual mm_item. When EngineCore receives this, it cannot locate the corresponding item, triggering the 500 error and the log: Expected a cached item for mm_hash='xxx'.

    vllm/multimodal/cache.py

        def get_and_update_item(
            self,
            mm_item: MultiModalKwargsItem | None,
            mm_hash: str,
        ) -> MultiModalKwargsItem:
            # Already updated _lru_order in "self._cache.get(mm_hash)"
            if (cached_item := self._cache.get(mm_hash)) is not None:
                return cached_item
            # Receives mm_item is None, fails to populate the actual `_cache_data` dictionary 
            assert mm_item is not None, f"Expected a cached item for {mm_hash=}"
    
            self._cache[mm_hash] = mm_item
            return mm_item
  4. Step 3 (Infinite Loop during Eviction): Both caches are implemented via LRUCache. During the failed Step 2 request, EngineCore's LRU updates its internal _lru_order list with the hash but fails to populate the actual _cache_data dictionary (since the item was never provided). Later, when cache reclamation is triggered and the LRU traversal reaches this orphaned hash, it encounters a missing key in _cache_data, causing the eviction logic to fall into an infinite loop.

    cachetools/__init__.py

        def __setitem__(self, key, value):
            maxsize = self.__maxsize
            size = self.getsizeof(value)
            if size > maxsize:
                raise ValueError("value too large")
            if key not in self.__data or self.__size[key] < size:
                # INFINITY LOOP!! Since LRUCache cannot normally pop item when oder has a key that not in data.
                while self.__currsize + size > maxsize:
                    self.popitem()
            if key in self.__data:
                diffsize = size - self.__size[key]
            else:
                diffsize = size
            self.__data[key] = value
            self.__size[key] = size
            self.__currsize += diffsize

    vllm/utils/cache.py

    class LRUCache(cachetools.LRUCache[_K, _V]):
        ......
        def pop(self, key: _K, default: _V | _T | None = None) -> _V | _T | None:
            value: _V | _T | None
            # DIRECT ROOT CAUSE!! If hash key only in oder but not in data, the key will not be deleted in oder, nothing happend after popitem!
            if key not in self:
                return default
    
            value = self.__getitem__(key, update_info=False)  # type: ignore[call-arg]
            self.__delitem__(key)
            return value    
        
        def popitem(self, remove_pinned: bool = False):
            """Remove and return the `(key, value)` pair least recently used."""
            if not remove_pinned:
                # pop the oldest item in the cache that is not pinned
                lru_key = next(
                    (key for key in self.order if key not in self.pinned_items),
                    ALL_PINNED_SENTINEL,
                )
                if lru_key is ALL_PINNED_SENTINEL:
                    raise RuntimeError(
                        "All items are pinned, cannot remove oldest from the cache."
                    )
            else:
                lru_key = next(iter(self.order))
            value = self.pop(cast(_K, lru_key))
            if lru_key in self.order:
                del self._LRUCache__order[lru_key]
            return (lru_key, value)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  • Step 2 should be processed successfully and return a valid response.
  • Step 3 should handle mm-cache eviction normally without deadlocks or infinite loops.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Infinite loop in EngineCore during multimodal cache eviction after 400/500 sequence with reused image