vllm - ✅ (Solved) Fix [Bug]: POST /wake_up causes vLLM process to crash. 500 Internal Server Error [1 pull request, 20 comments, 4 participants]

GitHub stats
Issue: vllm-project/vllm#36753
Fetched: 2026-04-08 00:35:04
Comments: 20
Participants: 4
Timeline events: 85
Reactions: 0
Timeline (top): mentioned ×31, subscribed ×31, commented ×20, assigned ×1

Error Message


(APIServer pid=68) INFO:     127.0.0.1:44768 - "GET /is_sleeping HTTP/1.1" 200 OK
(APIServer pid=972) INFO:     127.0.0.1:38796 - "GET /is_sleeping HTTP/1.1" 200 OK
(APIServer pid=68) INFO 03-11 07:18:31 [entrypoints/.../sleep/api_router.py:38] wake up the engine with tags: None
(APIServer pid=68) INFO:     127.0.0.1:33724 - "POST /wake_up HTTP/1.1" 500 Internal Server Error
(APIServer pid=68) ERROR:    Exception in ASGI application
(APIServer pid=68) Traceback (most recent call last):
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 416, in run_asgi
(APIServer pid=68)     result = await app(  # type: ignore[func-returns-value]
(APIServer pid=68)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
(APIServer pid=68)     return await self.app(scope, receive, send)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/applications.py", line 1160, in __call__
(APIServer pid=68)     await super().__call__(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/applications.py", line 107, in __call__
(APIServer pid=68)     await self.middleware_stack(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
(APIServer pid=68)     await self.app(scope, receive, _send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/cors.py", line 87, in __call__
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 177, in __call__
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 175, in __call__
(APIServer pid=68)     await self.app(scope, receive, send_wrapper)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
(APIServer pid=68)     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=68)     await app(scope, receive, sender)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 716, in __call__
(APIServer pid=68)     await self.middleware_stack(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 736, in app
(APIServer pid=68)     await route.handle(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 290, in handle
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 119, in app
(APIServer pid=68)     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=68)     await app(scope, receive, sender)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 105, in app
(APIServer pid=68)     response = await f(request)
(APIServer pid=68)                ^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 431, in app
(APIServer pid=68)     raw_response = await run_endpoint_function(
(APIServer pid=68)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 313, in run_endpoint_function
(APIServer pid=68)     return await dependant.call(**values)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/serve/sleep/api_router.py", line 39, in wake_up
(APIServer pid=68)     await engine_client(raw_request).wake_up(tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 755, in wake_up
(APIServer pid=68)     await self.engine_core.wake_up_async(tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 981, in wake_up_async
(APIServer pid=68)     await self.call_utility_async("wake_up", tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 936, in call_utility_async
(APIServer pid=68)     return await self._call_utility_async(method, *args, engine=self.core_engine)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 948, in _call_utility_async
(APIServer pid=68)     await self._send_input_message(message, engine, args)
(APIServer pid=68)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 917, in _send_input_message
(APIServer pid=68)     self.ensure_alive()
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 566, in ensure_alive
(APIServer pid=68)     raise EngineDeadError()
(APIServer pid=68) vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=68) INFO:     Shutting down
(APIServer pid=68) INFO:     Waiting for application shutdown.
(APIServer pid=68) INFO:     Application shutdown complete.
(APIServer pid=68) INFO:     Finished server process [68]

Fix Action


PR fix notes

PR #37065: [Bugfix][sleepmode] Serialize GPU VMM operations to fix sleep/wakeup race condition

Description (problem / solution / changelog)

Purpose

Fixes #36753

This PR addresses a critical race condition in vLLM V1's sleep and wake_up transitions when multiple engine instances share the same GPU.

Problem

In vLLM V1, independent API server processes manage their own physical memory mappings via CUDA VMM. When multiple models on the same GPU attempt concurrent VMM operations, specifically when Model A is sleeping (releasing memory) while Model B is waking up (acquiring memory), the CUDA driver's internal state becomes inconsistent.

This leads to two failure modes:

  • Silent Livelock: The process hangs indefinitely in kernel space.
  • EngineDeadError: The process crashes with a hard error during memory re-mapping (as reported by users).

Solution

Implemented a Node-Local Atomic Signaling mechanism in Executor:

  • Atomic Serialization: Uses os.O_EXCL to ensure only one process per GPU can perform VMM operations (sleep/wake_up) at any given time.
  • Self-Healing: Automatically detects and cleans up stale signal files if a process crashes (through os.kill(pid, 0)), preventing deadlocks.
  • Cluster Safety: Incorporates the hostname in the signal path to prevent false contention on shared filesystems (NFS/Ceph).

Test Plan

Environment

  • GPU: A100 80GB (or H100)
  • Config: LMCache and two V1 instances (Model A and Model B) with gpu_memory_utilization=0.8; see the script in the original issue.

Reproduce Steps

  • Concurrent Wakeup: Trigger wake_up for both models simultaneously.
  • Transition Race: Wake up Model A, then simultaneously send POST /sleep to Model A and POST /wake_up to Model B.

Expected Behavior

Processes should wait for each other in a queue. Logs should show "GPU VMM busy (Owner: ...), waiting..." instead of hanging or crashing.
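The transition race can be driven from a small client. The sketch below is one possible way to do it, assuming the two instances listen on ports 8080 and 8081 as in the issue's script; the helper names `post`, `run_simultaneously` and `race_sleep_and_wake_up` are hypothetical, and a thread barrier only approximates "simultaneously".

```python
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post(url: str, timeout_s: float = 120.0) -> int:
    # Sends an empty POST and returns the HTTP status code.
    # A 500 response raises urllib.error.HTTPError.
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


def run_simultaneously(actions):
    # Every worker blocks on the barrier, so all actions start together.
    barrier = threading.Barrier(len(actions))

    def _run(action):
        barrier.wait()
        try:
            return action()
        except Exception as exc:  # return the failure instead of raising
            return exc

    with ThreadPoolExecutor(max_workers=len(actions)) as pool:
        return list(pool.map(_run, actions))


def race_sleep_and_wake_up(model_a: str = "http://localhost:8080",
                           model_b: str = "http://localhost:8081"):
    # Model A goes to sleep while Model B wakes up on the same GPU.
    return run_simultaneously([
        lambda: post(f"{model_a}/sleep?level=1"),
        lambda: post(f"{model_b}/wake_up"),
    ])
```

Calling `race_sleep_and_wake_up()` returns one entry per request: a status code on success, or the exception that was raised. With the fix in place both entries would be expected to be 200.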

Test Result

Before Fix

Concurrent transitions led to EngineDeadError or indefinite hangs:

(APIServer pid=68) ERROR: Exception in ASGI application
Traceback (most recent call last):
  ...
  File ".../vllm/v1/engine/core_client.py", line 566, in ensure_alive
    raise EngineDeadError()
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.

After Fix

Transitions are now serialized and stable:


----------------------------------------------------------------
ALL READY! Initial sequence complete. Both models ready in Sleep Mode.
----------------------------------------------------------------
(APIServer pid=374540) INFO 03-14 17:24:18 [api_router.py:39] wake up the engine with tags: None
(EngineCore pid=375193) INFO 03-14 17:24:19 [abstract.py:396] It took 1.082721 seconds to wake up tags {'weights', 'kv_cache'}.
(APIServer pid=374540) INFO:     127.0.0.1:58278 - "POST /wake_up HTTP/1.1" 200 OK
(APIServer pid=376649) INFO 03-14 17:24:26 [api_router.py:39] wake up the engine with tags: None
(EngineCore pid=375193) INFO 03-14 17:24:26 [block_pool.py:472] Successfully reset prefix cache
(EngineCore pid=377264) INFO 03-14 17:24:26 [abstract.py:353] GPU VMM busy (Owner: 375193), waiting...
(EngineCore pid=377264) INFO 03-14 17:24:26 [abstract.py:353] GPU VMM busy (Owner: 375193), waiting...
(EngineCore pid=375193) INFO 03-14 17:24:27 [cumem.py:216] CuMemAllocator: sleep freed 38.11 GiB memory in total, of which 15.05 GiB is backed up in CPU and the rest 23.06 GiB is discarded directly.
(EngineCore pid=375193) INFO 03-14 17:24:27 [gpu_worker.py:175] Sleep mode freed 38.11 GiB memory, 2.92 GiB memory is still in use.
(EngineCore pid=375193) INFO 03-14 17:24:27 [abstract.py:375] It took 0.687472 seconds to fall asleep.
(APIServer pid=374540) INFO:     127.0.0.1:50660 - "POST /sleep?level=1 HTTP/1.1" 200 OK
(EngineCore pid=377264) INFO 03-14 17:24:29 [abstract.py:399] It took 1.693557 seconds to wake up tags {'weights', 'kv_cache'}.
(APIServer pid=376649) INFO:     127.0.0.1:47442 - "POST /wake_up HTTP/1.1" 200 OK
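To check a captured log for the serialized behavior shown above, the waiting lines can be extracted with a regular expression. This is a minimal sketch: the helper name `vmm_waits` is hypothetical, and the pattern mirrors the message wording in this particular log, which may differ in other vLLM builds.

```python
import re

# Matches lines such as:
# (EngineCore pid=377264) INFO ... GPU VMM busy (Owner: 375193), waiting...
_WAIT_RE = re.compile(
    r"\(EngineCore pid=(\d+)\).*GPU VMM busy \(Owner: (\d+)\), waiting"
)


def vmm_waits(log_text: str) -> list[tuple[int, int]]:
    """Return (waiting_pid, owner_pid) for every 'GPU VMM busy' line."""
    return [(int(m.group(1)), int(m.group(2)))
            for m in _WAIT_RE.finditer(log_text)]
```

For the log above this yields two entries, both showing process 377264 waiting on owner 375193, which matches the wake-up of the second model being held back until the first model finished falling asleep.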

Changed files

  • vllm/config/device.py (modified, +17/-0)
  • vllm/v1/engine/core_client.py (modified, +14/-1)
  • vllm/v1/executor/abstract.py (modified, +132/-16)

Code Example

sh-5.1$ python collect_env.py
Collecting environment information...

========== System ==========
OS: NAME="Red Hat Enterprise Linux"
VERSION="9.7 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.7"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.7 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.7"
Red Hat Enterprise Linux release 9.7 (Plow)
Red Hat Enterprise Linux release 9.7 (Plow)

========== Python ==========
Python: 3.12.12 (main, Jan 19 2026, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-11)]
Platform: linux

========== PyTorch ==========
Torch version: 2.9.0+cu129
CUDA available: True

========== GPU ==========
Driver: 570.148.08
CUDA runtime: 12.9.86
GPU: GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-ef8e343f-d3e5-bad0-bbe6-3a138de8449f)
cuDNN: Unknown

========== CPU ==========
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               160
On-line CPU(s) list:                  0-159
Vendor ID:                            GenuineIntel
Model name:                           Intel Xeon Processor (SapphireRapids)
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   40
Socket(s):                            2
Stepping:                             4
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Virtualization:                       VT-x
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            5 MiB (160 instances)
L1i cache:                            5 MiB (160 instances)
L2 cache:                             320 MiB (80 instances)
L3 cache:                             32 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-79
NUMA node1 CPU(s):                    80-159
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Vulnerability Vmscape:                Not affected

---

- |
          # 0. Verify lmcache is available
          python3 -c "import lmcache; print('lmcache successfully imported')" || {
            echo "Failed to import lmcache"
            exit 1
          }

          # 1. Start LMCache Server (Background)
          echo "Starting LMCache server..."
          python3 -m lmcache.server 0.0.0.0 8100 cpu &
          sleep 2

          # 2. Start Granite-8B
          CUDA_VISIBLE_DEVICES=0 vllm serve /mnt/models/granite-8b-code-instruct/models--ibm--granite-8b-code-instruct \
            --served-model-name ibm/granite-8b-code-instruct-prashil \
            --port 8080 \
            --enable-sleep-mode \
            --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
            --gpu-memory-utilization 0.8 &

          until curl -s http://localhost:8080/health; do
            echo "Health Check for granite-8b-code-instruct..."
            sleep 5
          done

          echo "Evicting 8080 from GPU..."
          curl -X POST 'http://localhost:8080/sleep?level=1'

          until [ "$(curl -s http://localhost:8080/is_sleeping | grep -o 'true')" == "true" ]; do
            echo "Waiting for 8080 to finish offloading..."
            sleep 2
          done

          # 3. Start granite-3-2-8b-instruct
          CUDA_VISIBLE_DEVICES=0 vllm serve  /mnt/models/granite-3-2-8b-instruct \
            --served-model-name ibm/granite-3-2-8b-instruct-prashil \
            --port 8081 \
            --enable-sleep-mode \
            --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
            --gpu-memory-utilization 0.8 &

          until curl -s http://localhost:8081/health; do
            echo "Health Check for granite-3-2-8b-instruct..."
            sleep 5
          done

          # Put granite-3-2-8b-instruct to sleep (level 1: weights are backed up to CPU RAM, the rest is discarded)
          curl -X POST 'http://localhost:8081/sleep?level=1'

          echo "Initial sequence complete. Both models ready in Sleep Mode."
          wait
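The `until` loops in the script wait forever, which hides the silent hang described in the PR. A bounded wait is one alternative. The sketch below assumes that `/is_sleeping` returns a JSON body of the form `{"is_sleeping": true}`, consistent with the script's `grep -o 'true'` check; the helper names and default timeouts are hypothetical.

```python
import json
import time
import urllib.request


def parse_is_sleeping(body: bytes) -> bool:
    # Assumed response shape: {"is_sleeping": true}
    return bool(json.loads(body).get("is_sleeping", False))


def is_sleeping(base_url: str, timeout_s: float = 5.0) -> bool:
    with urllib.request.urlopen(f"{base_url}/is_sleeping", timeout=timeout_s) as resp:
        return parse_is_sleeping(resp.read())


def wait_until(predicate, timeout_s: float = 120.0, poll_s: float = 2.0) -> bool:
    # Polls predicate() until it returns True or the deadline passes.
    # Returns False on timeout instead of blocking indefinitely.
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, `wait_until(lambda: is_sleeping("http://localhost:8080"))` returns False if the instance has not reported sleeping within two minutes, so the caller can fail fast rather than start the second model against a hung process.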

---

(APIServer pid=68) INFO:     127.0.0.1:44768 - "GET /is_sleeping HTTP/1.1" 200 OK
(APIServer pid=972) INFO:     127.0.0.1:38796 - "GET /is_sleeping HTTP/1.1" 200 OK
(APIServer pid=68) INFO 03-11 07:18:31 [entrypoints/.../sleep/api_router.py:38] wake up the engine with tags: None
(APIServer pid=68) INFO:     127.0.0.1:33724 - "POST /wake_up HTTP/1.1" 500 Internal Server Error
(APIServer pid=68) ERROR:    Exception in ASGI application
(APIServer pid=68) Traceback (most recent call last):
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 416, in run_asgi
(APIServer pid=68)     result = await app(  # type: ignore[func-returns-value]
(APIServer pid=68)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
(APIServer pid=68)     return await self.app(scope, receive, send)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/applications.py", line 1160, in __call__
(APIServer pid=68)     await super().__call__(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/applications.py", line 107, in __call__
(APIServer pid=68)     await self.middleware_stack(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
(APIServer pid=68)     await self.app(scope, receive, _send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/cors.py", line 87, in __call__
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 177, in __call__
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 175, in __call__
(APIServer pid=68)     await self.app(scope, receive, send_wrapper)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
(APIServer pid=68)     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=68)     await app(scope, receive, sender)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 716, in __call__
(APIServer pid=68)     await self.middleware_stack(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 736, in app
(APIServer pid=68)     await route.handle(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 290, in handle
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 119, in app
(APIServer pid=68)     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=68)     await app(scope, receive, sender)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 105, in app
(APIServer pid=68)     response = await f(request)
(APIServer pid=68)                ^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 431, in app
(APIServer pid=68)     raw_response = await run_endpoint_function(
(APIServer pid=68)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 313, in run_endpoint_function
(APIServer pid=68)     return await dependant.call(**values)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/serve/sleep/api_router.py", line 39, in wake_up
(APIServer pid=68)     await engine_client(raw_request).wake_up(tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 755, in wake_up
(APIServer pid=68)     await self.engine_core.wake_up_async(tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 981, in wake_up_async
(APIServer pid=68)     await self.call_utility_async("wake_up", tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 936, in call_utility_async
(APIServer pid=68)     return await self._call_utility_async(method, *args, engine=self.core_engine)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 948, in _call_utility_async
(APIServer pid=68)     await self._send_input_message(message, engine, args)
(APIServer pid=68)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 917, in _send_input_message
(APIServer pid=68)     self.ensure_alive()
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 566, in ensure_alive
(APIServer pid=68)     raise EngineDeadError()
(APIServer pid=68) vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=68) INFO:     Shutting down
(APIServer pid=68) INFO:     Waiting for application shutdown.
(APIServer pid=68) INFO:     Application shutdown complete.
(APIServer pid=68) INFO:     Finished server process [68]
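The traceback above shows only the API server's side: the final EngineDeadError message points to an earlier stack trace for the root cause, and the log interleaves two servers (pid=68 and pid=972). As a rough aid for reading such a log, the sketch below groups lines by their process prefix. It assumes every process writes the `(Name pid=N)` prefix seen here; `PREFIX` and `split_by_process` are illustrative names, not part of vLLM.

```python
import re
from collections import defaultdict

# Matches a prefix such as "(APIServer pid=68) " at the start of a log line.
PREFIX = re.compile(r"^\((?P<name>[^()\s]+) pid=(?P<pid>\d+)\)\s?(?P<rest>.*)$")


def split_by_process(lines):
    """Group log lines by (process name, pid); unprefixed lines go under None."""
    groups = defaultdict(list)
    for line in lines:
        m = PREFIX.match(line)
        key = (m["name"], int(m["pid"])) if m else None
        groups[key].append(m["rest"] if m else line)
    return dict(groups)


sample = [
    '(APIServer pid=68) INFO:     127.0.0.1:44768 - "GET /is_sleeping HTTP/1.1" 200 OK',
    '(APIServer pid=972) INFO:     127.0.0.1:38796 - "GET /is_sleeping HTTP/1.1" 200 OK',
    "(APIServer pid=68) ERROR:    Exception in ASGI application",
]

# Print how many lines each process contributed.
for key, entries in split_by_process(sample).items():
    print(key, len(entries))
```

Reading each group on its own makes it easier to find the first error a given process logged before the API server reported the engine as dead.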

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
sh-5.1$ python collect_env.py
Collecting environment information...

========== System ==========
OS: NAME="Red Hat Enterprise Linux"
VERSION="9.7 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.7"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.7 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.7"
Red Hat Enterprise Linux release 9.7 (Plow)
Red Hat Enterprise Linux release 9.7 (Plow)

========== Python ==========
Python: 3.12.12 (main, Jan 19 2026, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-11)]
Platform: linux

========== PyTorch ==========
Torch version: 2.9.0+cu129
CUDA available: True

========== GPU ==========
Driver: 570.148.08
CUDA runtime: 12.9.86
GPU: GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-ef8e343f-d3e5-bad0-bbe6-3a138de8449f)
cuDNN: Unknown

========== CPU ==========
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               160
On-line CPU(s) list:                  0-159
Vendor ID:                            GenuineIntel
Model name:                           Intel Xeon Processor (SapphireRapids)
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   40
Socket(s):                            2
Stepping:                             4
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Virtualization:                       VT-x
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            5 MiB (160 instances)
L1i cache:                            5 MiB (160 instances)
L2 cache:                             320 MiB (80 instances)
L3 cache:                             32 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-79
NUMA node1 CPU(s):                    80-159
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Vulnerability Vmscape:                Not affected
</details>

🐛 Describe the bug

I am working on placing multiple models on one H100 HBM GPU and using the vLLM sleep/wake-up feature to serve them. Since the models in question are used only for batch workloads, I do not care about latency or TTFT, because I have an upper bound on the window in which each request must be served.

My models were served like this:

        - |
          # 0. Verify lmcache is available
          python3 -c "import lmcache; print('lmcache successfully imported')" || {
            echo "Failed to import lmcache"
            exit 1
          }

          # 1. Start LMCache Server (Background)
          echo "Starting LMCache server..."
          python3 -m lmcache.server 0.0.0.0 8100 cpu &
          sleep 2

          # 2. Start Granite-8B
          CUDA_VISIBLE_DEVICES=0 vllm serve /mnt/models/granite-8b-code-instruct/models--ibm--granite-8b-code-instruct \
            --served-model-name ibm/granite-8b-code-instruct-prashil \
            --port 8080 \
            --enable-sleep-mode \
            --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
            --gpu-memory-utilization 0.8 &

          until curl -s http://localhost:8080/health; do
            echo "Health Check for granite-8b-code-instruct..."
            sleep 5
          done

          echo "Evicting 8080 from GPU..."
          curl -X POST http://localhost:8080/sleep -d '{"level": 1}'

          until [ "$(curl -s http://localhost:8080/is_sleeping | grep -o 'true')" == "true" ]; do
            echo "Waiting for 8080 to finish offloading..."
            sleep 2
          done

          # 3. Start granite-3-2-8b-instruct
          CUDA_VISIBLE_DEVICES=0 vllm serve  /mnt/models/granite-3-2-8b-instruct \
            --served-model-name ibm/granite-3-2-8b-instruct-prashil \
            --port 8081 \
            --enable-sleep-mode \
            --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
            --gpu-memory-utilization 0.8 &

          until curl -s http://localhost:8081/health; do
            echo "Health Check for granite-3-2-8b-instruct..."
            sleep 5
          done

          # Hibernate granite-3-2-8b-instruct (Level 1: offload weights to CPU RAM, discard KV cache)
          curl -X POST http://localhost:8081/sleep -d '{"level": 1}'

          echo "Initial sequence complete. Both models ready in Sleep Mode."
          wait

A simple /chat/completions call, which had worked before, failed on the second attempt and gave:

(APIServer pid=68) INFO:     127.0.0.1:44768 - "GET /is_sleeping HTTP/1.1" 200 OK
(APIServer pid=972) INFO:     127.0.0.1:38796 - "GET /is_sleeping HTTP/1.1" 200 OK
(APIServer pid=68) INFO 03-11 07:18:31 [entrypoints/.../sleep/api_router.py:38] wake up the engine with tags: None
(APIServer pid=68) INFO:     127.0.0.1:33724 - "POST /wake_up HTTP/1.1" 500 Internal Server Error
(APIServer pid=68) ERROR:    Exception in ASGI application
(APIServer pid=68) Traceback (most recent call last):
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 416, in run_asgi
(APIServer pid=68)     result = await app(  # type: ignore[func-returns-value]
(APIServer pid=68)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
(APIServer pid=68)     return await self.app(scope, receive, send)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/applications.py", line 1160, in __call__
(APIServer pid=68)     await super().__call__(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/applications.py", line 107, in __call__
(APIServer pid=68)     await self.middleware_stack(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
(APIServer pid=68)     await self.app(scope, receive, _send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/cors.py", line 87, in __call__
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 177, in __call__
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 175, in __call__
(APIServer pid=68)     await self.app(scope, receive, send_wrapper)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
(APIServer pid=68)     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=68)     await app(scope, receive, sender)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 716, in __call__
(APIServer pid=68)     await self.middleware_stack(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 736, in app
(APIServer pid=68)     await route.handle(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 290, in handle
(APIServer pid=68)     await self.app(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 119, in app
(APIServer pid=68)     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=68)     raise exc
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=68)     await app(scope, receive, sender)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 105, in app
(APIServer pid=68)     response = await f(request)
(APIServer pid=68)                ^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 431, in app
(APIServer pid=68)     raw_response = await run_endpoint_function(
(APIServer pid=68)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 313, in run_endpoint_function
(APIServer pid=68)     return await dependant.call(**values)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/serve/sleep/api_router.py", line 39, in wake_up
(APIServer pid=68)     await engine_client(raw_request).wake_up(tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 755, in wake_up
(APIServer pid=68)     await self.engine_core.wake_up_async(tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 981, in wake_up_async
(APIServer pid=68)     await self.call_utility_async("wake_up", tags)
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 936, in call_utility_async
(APIServer pid=68)     return await self._call_utility_async(method, *args, engine=self.core_engine)
(APIServer pid=68)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 948, in _call_utility_async
(APIServer pid=68)     await self._send_input_message(message, engine, args)
(APIServer pid=68)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 917, in _send_input_message
(APIServer pid=68)     self.ensure_alive()
(APIServer pid=68)   File "/opt/vllm/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 566, in ensure_alive
(APIServer pid=68)     raise EngineDeadError()
(APIServer pid=68) vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=68) INFO:     Shutting down
(APIServer pid=68) INFO:     Waiting for application shutdown.
(APIServer pid=68) INFO:     Application shutdown complete.
(APIServer pid=68) INFO:     Finished server process [68]

This happened when the model that was awake was asked to sleep and the sleeping model was asked to wake up. The model that was instructed to wake up failed and crashed the process with the above exception. Can someone help me get a clearer picture of whether this is a bug or a configuration issue on my side?
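The hand-off described here (sleep the active model, confirm it is asleep, then wake the other) can be sketched in Python as follows. This is a minimal sketch, not vLLM's own tooling: `http`, `wait_until_asleep`, and `swap` are hypothetical helpers, the sleep level is passed as a query parameter (an assumption about the endpoint), and the sleeping check simply looks for `true` in the response body, like the `grep` in the script.

```python
import time
import urllib.request


def http(method, url, timeout=30):
    """Send a request with no payload and return (status, body)."""
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode()


def wait_until_asleep(base_url, request=http, poll_s=2.0, max_polls=60,
                      sleep=time.sleep):
    """Poll /is_sleeping until the body contains 'true'; False on timeout."""
    for _ in range(max_polls):
        _, body = request("GET", f"{base_url}/is_sleeping")
        if "true" in body:
            return True
        sleep(poll_s)
    return False


def swap(active_url, target_url, request=http, **wait_kwargs):
    """Put the active server to sleep, confirm it, then wake the target."""
    # Assumption: the level is accepted as a query parameter.
    request("POST", f"{active_url}/sleep?level=1")
    if not wait_until_asleep(active_url, request=request, **wait_kwargs):
        raise TimeoutError(f"{active_url} did not report sleeping")
    status, _ = request("POST", f"{target_url}/wake_up")
    return status
```

For example, `swap("http://localhost:8081", "http://localhost:8080")` would put the server on port 8081 to sleep before waking the one on 8080. Note that `urllib` raises `HTTPError` on a 500 response, so a failed wake-up surfaces as an exception rather than a return value.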

TIA

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the EngineDeadError, the hand-off must be handled properly when one model is put to sleep and another is woken up. Note that the exception is raised by ensure_alive() before the wake_up message is sent, and its text points to an earlier stack trace for the root cause, so the engine had already failed by the time /wake_up was handled. Steps to try:

  • Revisit GPU memory utilization: adjust the --gpu-memory-utilization flag (e.g., try 0.9) so that the engine being woken has enough memory. Both servers share one GPU, so the other model must be fully asleep first.
  • Implement retry logic: wrap the wake_up call in a retry loop to handle temporary engine failures.
  • Check engine status: before waking up a model, check that its engine is alive and ready.
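The status check in the last bullet could look roughly like the sketch below. It uses only the /health and /is_sleeping endpoints that appear in the script above; `probe` and `wake_decision` are hypothetical names, and treating any non-200 /health result as "restart needed" is an assumption rather than documented vLLM behavior.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5):
    """GET url; return (status, body), or (None, '') if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code, ""
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, and similar failures.
        return None, ""


def wake_decision(base_url, get=probe):
    """Return 'wake', 'already-awake', or 'restart' for one server."""
    status, _ = get(f"{base_url}/health")
    if status != 200:
        # Assumption: an unhealthy or unreachable server cannot be woken.
        return "restart"
    _, body = get(f"{base_url}/is_sleeping")
    return "wake" if "true" in body else "already-awake"
```

In the log above the API server shut down after the error, so a check like this would report `restart` for that process instead of sending another /wake_up.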

Example code snippet:

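import asyncio

from vllm.v1.engine.exceptions import EngineDeadError

# ...

async def wake_up_with_retry(raw_request, tags, max_attempts=3):
    # engine_client is the helper called in vllm/entrypoints/serve/sleep/api_router.py
    for attempt in range(1, max_attempts + 1):
        try:
            await engine_client(raw_request).wake_up(tags)
            return
        except EngineDeadError:
            if attempt == max_attempts:
                raise  # out of retries: propagate the original error
            print("Engine dead, retrying in 2 seconds...")
            # asyncio.sleep yields to the event loop; time.sleep would block it
            await asyncio.sleep(2)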

Alternatively, you can use a decorator to implement retry logic:

import asyncio
import functools

from vllm.v1.engine.exceptions import EngineDeadError

def retry_on_engine_dead(func):
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        for attempt in range(3):
            try:
                return await func(*args, **kwargs)
            except EngineDeadError:
                if attempt == 2:
                    raise  # out of retries: propagate the original error
                print(f"Engine dead, retrying in 2 seconds... (attempt {attempt+1}/3)")
                # Non-blocking pause; time.sleep would stall the event loop
                await asyncio.sleep(2)
    return wrapper

@retry_on_engine_dead
async def wake_up_engine(engine_client, tags):
    await engine_client.wake_up(tags)

Verification

To verify that the fix worked, run the same sequence of commands that previously caused the error. The engine should now wake up successfully, and the model should respond to requests without crashing.

Extra Tips

  • Monitor engine memory usage and adjust the --gpu-memory-utilization flag accordingly to prevent engine failures.
  • Consider implementing a more robust retry mechanism with exponential backoff to handle temporary engine failures.
  • If issues persist, try updating to the latest version of vLLM or seeking support from the vLLM community.
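The exponential backoff mentioned in the tips can be sketched as below. `backoff_delays` and `retry_with_backoff` are hypothetical helpers, and the attempt count, base delay, and cap are arbitrary illustration values. Backoff suits transient failures such as a server that is still starting; per the traceback above, EngineDeadError is raised once the engine is already dead, so retrying against the same process may not recover it.

```python
import asyncio
import random


def backoff_delays(retries, base=1.0, cap=30.0):
    """Return one delay per retry: base * 2**n, never more than cap."""
    return [min(cap, base * 2 ** n) for n in range(retries)]


async def retry_with_backoff(op, retry_on, attempts=4, base=1.0, cap=30.0,
                             sleep=asyncio.sleep):
    """Await op(); on `retry_on`, wait with exponential backoff and try again."""
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return await op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # last attempt failed: propagate the error
            # Up to 10% jitter so parallel callers do not retry in lockstep.
            jitter = random.uniform(0, 0.1 * delays[attempt])
            await sleep(delays[attempt] + jitter)
```

A caller would pass a zero-argument coroutine function, for example `await retry_with_backoff(lambda: client.wake_up(tags), ConnectionError)`, where `client` stands for whatever object issues the wake-up request.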
