vllm - [Bug]: V1 sleep/wake leaves P0 multimodal sender cache desynced from P1 → AssertionError on next image reuse [1 pull request]


On the V1 engine with sleep mode enabled and the multimodal IPC cache active (--mm-processor-cache-gb > 0, which is the default), a /sleep?level=1 followed by /wake_up deterministically corrupts the multimodal cache state. The next /v1/chat/completions request that re-uses an image whose mm_hash was seen before the sleep crashes the engine's preprocessing thread with:

AssertionError: Expected a cached item for mm_hash='...'

Afterwards, /v1/models and /health keep returning 200 OK, but any endpoint that round-trips to EngineCore (/sleep, /wake_up, /is_sleeping, generation) hangs.
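Since /health stays green after the crash, a health check alone will not detect this state. The following is a minimal sketch of a probe that also calls an endpoint that round-trips to EngineCore and treats a timeout as a hang. The function names, the 5-second timeout, and the use of GET for /is_sleeping are assumptions made for illustration, not part of the report.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> str:
    """Return 'ok', 'hang', or 'error' for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else "error"
    except (socket.timeout, TimeoutError):
        # The server accepted the request but never answered in time.
        return "hang"
    except urllib.error.URLError as exc:
        # urllib wraps connect-time failures (including timeouts) in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "hang"
        return "error"
    except OSError:
        return "error"


def classify(health: str, engine: str) -> str:
    """Combine a /health probe with an EngineCore round-trip probe."""
    if health != "ok":
        return "server down"
    if engine == "hang":
        return "engine core stuck"
    return "healthy" if engine == "ok" else "engine error"


if __name__ == "__main__":
    # Host and port are placeholders; adjust them to the running server.
    base = "http://localhost:8000"
    print(classify(probe(base + "/health"), probe(base + "/is_sleeping")))
```

On a server in the state described above, the /health probe would be expected to return "ok" while the /is_sleeping probe times out, so the combined result would be "engine core stuck".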

Launch flags:

/opt/venv/bin/vllm serve google/gemma-4-26B-A4B-it \
  --host 0.0.0.0 --port 11437 \
  --tensor-parallel-size 1 \
  --dtype auto \
  --gpu-memory-utilization 0.6271517157538793 \
  --kv-cache-memory-bytes 8G \
  --enable-prefix-caching \
  --enable-sleep-mode \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --compilation-config '{"cache_dir": "..."}' \
  --default-chat-template-kwargs '{"enable_thinking": true}'

Required environment: VLLM_SERVER_DEV_MODE=1

Error Message

AssertionError: Expected a cached item for mm_hash='...'

Root Cause

As the issue title states, the sleep/wake cycle leaves the P0 multimodal sender cache out of sync with the P1 (EngineCore) side. The traceback is consistent with this: after /wake_up, a request that re-uses a previously seen image reaches mm_receiver_cache.get_and_update_features in EngineCore, which finds no entry for that mm_hash and fails the assertion in get_and_update_item.
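The failure mode can be illustrated with a toy model of a paired sender/receiver cache. This is a sketch only and is not vLLM's implementation: the class names, the `send` helper, and the `clear` methods are hypothetical, and the choice to clear the receiver side is an assumption based on the traceback, where the receiver lookup comes back empty. The report does not say which side actually loses state during sleep/wake.

```python
class SenderCache:
    """Toy P0 side: remembers which hashes it believes the receiver holds."""

    def __init__(self):
        self.seen = set()

    def prepare(self, mm_hash, data):
        if mm_hash in self.seen:
            return None  # assume the receiver has it; skip the payload
        self.seen.add(mm_hash)
        return data

    def clear(self):
        self.seen.clear()


class ReceiverCache:
    """Toy P1 side: stores payloads keyed by hash."""

    def __init__(self):
        self.items = {}

    def get_and_update_item(self, mm_item, mm_hash):
        if mm_item is None:
            mm_item = self.items.get(mm_hash)
            if mm_item is None:
                raise AssertionError(f"Expected a cached item for {mm_hash=}")
            return mm_item
        self.items[mm_hash] = mm_item
        return mm_item

    def clear(self):
        self.items.clear()


def send(sender, receiver, mm_hash, data):
    """One request: the sender may elide the payload, the receiver resolves it."""
    return receiver.get_and_update_item(sender.prepare(mm_hash, data), mm_hash)


if __name__ == "__main__":
    sender, receiver = SenderCache(), ReceiverCache()
    send(sender, receiver, "h1", b"img")  # first use: payload is transferred
    send(sender, receiver, "h1", b"img")  # reuse: payload elided, cache hit
    receiver.clear()                      # stand-in for the sleep/wake cycle
    try:
        send(sender, receiver, "h1", b"img")
    except AssertionError as exc:
        print("desynced:", exc)
```

In this toy model, resetting both sides together (calling `clear()` on the sender as well) keeps the pair consistent, because the next request transfers the payload again.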

Fix Action

Fixed

Code Example

==============================                                                                                                      
          System Info                                                                                                                 
  ==============================
  OS                           : Ubuntu 24.04.3 LTS (x86_64)                                                                          
  GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0                                                              
  Libc version                 : glibc-2.39                                                                                           
                                                                                                                                      
  ==============================                                                                                                      
         PyTorch Info                                                                                                               
  ==============================                                                                                                      
  PyTorch version              : 2.11.0+cu130
  CUDA used to build PyTorch   : 13.0                                                                                                 
                                                                                                                                    
  ==============================
        Python Environment
  ==============================                                                                                                      
  Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
  Python platform              : Linux-6.8.0-111-generic-x86_64-with-glibc2.39                                                        
                                                                                                                                      
  ==============================                                                                                                      
         CUDA / GPU Info                                                                                                              
  ==============================                                                                                                    
  Is CUDA available            : True
  CUDA runtime version         : 13.1.115
  GPU 0/1/2                    : NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition                                              
  Nvidia driver version        : 580.142                                                                                              
  cuDNN version                : libcudnn.so.9.17.1                                                                                   
                                                                                                                                      
  ==============================                                                                                                    
            CPU Info                                                                                                                  
  ==============================                                                                                                    
  AMD Ryzen Threadripper PRO 9975WX 32-Cores (64 threads, 1 NUMA node)
                                                                                                                                      
  ==============================
  Versions of relevant libraries                                                                                                      
  ==============================                                                                                                    
  [pip3] flashinfer-python==0.6.8.post1
  [pip3] numpy==2.3.5
  [pip3] nvidia-nccl-cu13==2.28.9
  [pip3] pyzmq==27.1.0
  [pip3] torch==2.11.0
  [pip3] torchaudio==2.11.0
  [pip3] torchvision==0.26.0
  [pip3] transformers==5.8.0
  [pip3] triton==3.6.0

  ==============================
           vLLM Info
  ==============================
  vLLM Version                 : 0.20.0
  vLLM Build Flags:
    CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
  GPU Topology:
         GPU0  GPU1  GPU2  CPU Affinity  NUMA Affinity  GPU NUMA ID
  GPU0   X    NODE  NODE  0-63          0              N/A
  GPU1   NODE X    NODE  0-63          0              N/A
  GPU2   NODE NODE X    0-63          0              N/A
  (no NVLink — all PCIe NODE-level)

  ==============================
       Environment Variables
  ==============================
  NCCL_DEBUG=WARN
  NCCL_IB_DISABLE=1
  PYTORCH_NVML_BASED_CUDA_CHECK=1
  TORCHINDUCTOR_COMPILE_THREADS=1
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
  CUDA_VERSION=13.1.1

---

AssertionError: Expected a cached item for mm_hash='...'

---

# Any V1 multimodal model.
# VLLM_SERVER_DEV_MODE=1 is required so that /sleep and /wake_up are exposed.
VLLM_SERVER_DEV_MODE=1 vllm serve <some-vlm> \
  --enable-sleep-mode \
  --enable-prefix-caching \
  --max-model-len 8192 \
  --mm-processor-cache-gb 4

# 1) Send a chat completion with an image.
curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "<some-vlm>",
  "messages": [{"role": "user", "content": [
    {"type": "text", "text": "describe"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,<B64>"}}
  ]}]
}' > /dev/null

# 2) Sleep (level 1), then wake up after a few seconds.
curl -sS -X POST 'http://localhost:8000/sleep?level=1&mode=wait'
curl -sS -X POST http://localhost:8000/wake_up

# 3) Re-send the SAME image with any prompt: the assertion fires.
curl -sS http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '<same image payload>'
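The same three steps can be scripted so that the identical image payload is guaranteed to be sent both times. The sketch below uses only the Python standard library; the helper names `build_payload` and `post`, the 120-second timeout, and the command-line arguments are assumptions for illustration, while the endpoints, query parameters, and port come from the curl commands above.

```python
import base64
import json
import sys
import urllib.request
from typing import Optional

BASE = "http://localhost:8000"  # port taken from the curl commands


def build_payload(model: str, image_png: bytes, prompt: str = "describe") -> dict:
    """Build a chat-completion body that embeds the image as a data URL."""
    b64 = base64.b64encode(image_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }


def post(path: str, payload: Optional[dict] = None, timeout: float = 120.0) -> int:
    """POST JSON (or an empty body) to the server and return the HTTP status."""
    data = json.dumps(payload).encode() if payload is not None else b""
    req = urllib.request.Request(
        BASE + path, data=data, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    model, image_path = sys.argv[1], sys.argv[2]
    with open(image_path, "rb") as f:
        payload = build_payload(model, f.read())
    post("/v1/chat/completions", payload)  # 1) first use of the image
    post("/sleep?level=1&mode=wait")       # 2) sleep, then wake up
    post("/wake_up")
    # 3) same image again; on an affected server this call is expected
    #    to fail or time out rather than return normally.
    post("/v1/chat/completions", payload)
```

Building the payload once and reusing the same object for steps 1 and 3 ensures the image bytes, and therefore the mm_hash, are identical across the sleep/wake cycle.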

---

ERROR core.py:1537 Unexpected error pre-processing request chatcmpl-...
Traceback (most recent call last):
  File "vllm/v1/engine/core.py", line 1449, in process_input_sockets
    request = self.preprocess_add_request(req)
  File "vllm/v1/engine/core.py", line 775, in preprocess_add_request
    request.mm_features = self.mm_receiver_cache.get_and_update_features(...)
  File "vllm/multimodal/cache.py", line 591, in get_and_update_features
    feature.data = self.get_and_update_item(feature.data, cache_key)
  File "vllm/multimodal/cache.py", line 644, in get_and_update_item
    assert mm_item is not None, f"Expected a cached item for {mm_hash=}"
AssertionError: Expected a cached item for mm_hash='ab07189…ddd6941f'

