openclaw - Event loop saturation during startup: synchronous model-prewarm and session-locks block event loop for 28-64 seconds

OpenClaw 2026.5.19 suffers from severe event loop saturation during startup. Two synchronous startup sidecars — model-prewarm and session-locks — block the Node.js event loop for seconds at a time, producing max event loop delays of 28–64 seconds, utilization of 93–96%, and heap pressure exceeding 1GB. This saturation cascades into multiple user-visible failures:

Discord WS READY never fires — heartbeats can't be sent in time, so Discord closes the connection (code 1000). The bot appears online but can't receive guild messages. (see #79794)
Typing indicator delays — even with typingMode: "instant", typing doesn't fire until the event loop has capacity
Memory pressure — heap hits 1.1–1.3GB (threshold 1GB) during the startup burst
Gateway restart cascade — systemd RestartSec=5 causes rapid restart attempts, each re-triggering the same saturation

Environment

OpenClaw: 2026.5.19 (a185ca2)
Node.js: v24.15.0
OS: Ubuntu 24.04.4 LTS, x86_64, systemd user service
Plugins: 23
Agents: 7
Session stores: 168 sessions totaling ~3.7MB JSON, parsed synchronously on every startup

Evidence

Liveness warnings across 17 startups in 45 minutes

Every successful startup produced a liveness warning within 30–90 seconds:

Startup	Time	p99 delay	Max delay	Event loop util	model-prewarm	session-locks	Heap used
#3	18:22	1,430ms	44,426ms	95.9%	2,662ms	1,206ms	—
#5	18:27	1,985ms	3,496ms	95.3%	3,135ms	1,241ms	1,303MB
#7	18:39	2,321ms	28,739ms	94.1%	1,876ms	830ms	1,110MB
#8	18:43	1,983ms	24,562ms	92.9%	1,820ms	795ms	—
#17	19:03	1,389ms	64,525ms	92.7%	4,199ms	1,624ms	—
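
Figures like the p99 delay, max delay, and utilization in the table can be collected with Node's built-in perf_hooks module. The sketch below is one way to do that, not OpenClaw's actual diagnostic: `startLivenessProbe`, `isSaturated`, and the threshold defaults are hypothetical names and values chosen for illustration.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

interface LivenessSample {
  p99Ms: number;
  maxMs: number;
  utilization: number; // 0..1, fraction of the window the loop was busy
}

// Starts sampling; the returned function stops sampling and reports the window.
function startLivenessProbe(resolutionMs = 20): () => LivenessSample {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  const eluAtStart = performance.eventLoopUtilization();

  return () => {
    histogram.disable();
    const elu = performance.eventLoopUtilization(eluAtStart);
    return {
      p99Ms: histogram.percentile(99) / 1e6, // histogram values are in nanoseconds
      maxMs: histogram.max / 1e6,
      utilization: elu.utilization,
    };
  };
}

// Hypothetical thresholds, not OpenClaw's real ones.
function isSaturated(
  sample: LivenessSample,
  maxDelayMs = 1_000,
  maxUtilization = 0.9,
): boolean {
  return sample.maxMs > maxDelayMs || sample.utilization > maxUtilization;
}
```

Starting the probe at process start and stopping it roughly 90 seconds later would cover the window in which the warnings above appeared.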

Full startup timeline (representative: startup #17)

19:02:56  systemd starts openclaw-gateway.service
19:03:07  http server listening (23 plugins; 9.4s)
19:03:09  [discord] starting provider
19:03:10  gateway ready
19:03:13  [discord] bot probe resolved (REST — works fine)
19:03:14  [discord] channels resolved (6 channels, REST — works fine)
19:03:19  [discord] client initialized; awaiting gateway readiness
          ↕ startup sidecars running: model-prewarm (4.2s) + session-locks (1.6s)
          ↕ agent work queued
          ↕ event loop blocked — max delay 64.5 SECONDS
19:04:48  [diagnostic] liveness warning (p99=1389ms, max=64525ms, util=0.927)
19:04:48  [discord] Gateway websocket closed: 1000
          ↕ ~2 minutes of silence — no log output
~19:05    Discord WS silently auto-reconnects. Bot starts responding.

Breakdown of saturation sources

1. model-prewarm (1.8–4.2s per startup): Synchronously loads model weights immediately after gateway ready, blocking the event loop entirely. It runs on every restart, so 17 restarts in 45 minutes means 17 prewarm cycles.

2. session-locks (0.8–1.6s per startup): Parses the JSON session stores for all agents on every startup. With 168 sessions totaling ~3.7MB of JSON, this is a significant synchronous parse, and it scales with session count, so it will get worse over time.

3. Agent work starts immediately: Queued agent work begins processing before the startup sidecars finish, competing for the already-saturated event loop.

4. Restart cascade amplifies the problem: systemd RestartSec=5 + StartLimitBurst=5 means 5 rapid restarts before systemd gives up. Each restart re-runs prewarm + session-locks.
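
Sources 1 and 3 are about ordering: expensive work runs before the providers have finished connecting. A minimal sketch of one way to sequence it follows, assuming nothing about OpenClaw's internals. `StartupGate` and `lazyOnce` are hypothetical helpers, not existing APIs.

```typescript
// Hypothetical gate: agent work awaits it, the provider opens it on READY.
class StartupGate {
  private ready = false;
  private waiters: Array<() => void> = [];

  // Resolves immediately if the gate is open, otherwise once open() is called.
  whenReady(): Promise<void> {
    if (this.ready) return Promise.resolve();
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  open(): void {
    this.ready = true;
    for (const resolve of this.waiters.splice(0)) resolve();
  }
}

// Runs an expensive loader at most once, on first use rather than at startup.
// Note: a rejected load stays cached; a real version might clear it to retry.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) cached = load();
    return cached;
  };
}
```

With this shape, a queued agent task would `await gate.whenReady()` before running, and the first model call would trigger the prewarm through the `lazyOnce` wrapper instead of it running right after gateway ready.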

Discord READY failure mechanism

The Discord gateway handshake (HELLO → IDENTIFY → READY) requires multiple event loop ticks for WS frame parsing, sending the IDENTIFY payload, and heartbeating. After HELLO, the client must send a heartbeat every heartbeat_interval (typically 41.25s), and Discord answers each one with a heartbeat ACK. With max event loop delays of 28–64s, heartbeats go out late or not at all, and Discord closes the connection with code 1000.

The bot self-heals once the event loop calms down (~2 min post-startup) — Discord.js auto-reconnects silently. But this reconnect is not logged, making it invisible to monitoring.
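
To make that recovery visible, lifecycle events from the gateway client could be logged along with how long the connection was down. The sketch below is written against a plain Node `EventEmitter`. The event names in `LIFECYCLE_EVENTS` are assumptions and would need to be replaced with whatever the Discord client library in use actually emits. `logGatewayLifecycle` is a hypothetical helper.

```typescript
import { EventEmitter } from "node:events";

type LogFn = (line: string) => void;

// Assumed event names; substitute the real ones from the client library.
const LIFECYCLE_EVENTS = ["disconnect", "reconnecting", "resumed", "ready"] as const;

// Logs each lifecycle event and, on recovery, the time spent offline.
// Returns a function that removes the listeners again.
function logGatewayLifecycle(
  client: EventEmitter,
  log: LogFn = console.log,
  now: () => number = Date.now,
): () => void {
  let lastDisconnectAt: number | undefined;

  const handlers = LIFECYCLE_EVENTS.map((name) => {
    const handler = (): void => {
      const t = now();
      const recovered = name === "resumed" || name === "ready";
      if (name === "disconnect") lastDisconnectAt = t;
      const suffix =
        recovered && lastDisconnectAt !== undefined
          ? ` after ${t - lastDisconnectAt}ms offline`
          : "";
      log(`[discord] gateway ${name}${suffix}`);
      if (recovered) lastDisconnectAt = undefined;
    };
    client.on(name, handler);
    return [name, handler] as const;
  });

  return () => {
    for (const [name, handler] of handlers) client.off(name, handler);
  };
}
```

Injecting `log` and `now` keeps the helper easy to test and lets the output go through whatever logger the gateway already uses.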

Memory pressure

Heap exceeded the 1GB threshold on 2 of 5 liveness-warned startups:

Startup #5: heapUsedBytes=1,303MB (RSS=1,437MB)
Startup #7: heapUsedBytes=1,110MB

Suggested fixes

Defer model-prewarm — prewarm is only useful when the first user message arrives. Deferring to after provider READY events (or lazy-loading on first model call) would eliminate the largest single-source block (1.8–4.2s).
Async session-locks parsing — use streaming JSON parse, JSON.parse in a worker thread, or lazy-load sessions on first access instead of parsing all session data synchronously at startup.
Defer agent work until provider READY — don't start processing queued agent messages until provider websocket handshakes complete.
Log WS reconnect events — the silent auto-reconnect after ~2 minutes produces zero log output. Add logging for WS reconnect so operators can distinguish permanent failure from self-healing.
Session store hygiene — an expiry or compaction mechanism would reduce the parse cost over time as sessions accumulate.
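
As an illustration of the lazy-load option for session data, here is a minimal sketch that reads and parses a session only when it is first requested, then caches the result. It assumes one JSON file per session, which may not match OpenClaw's real on-disk layout. `LazySessionStore`, `pathFor`, and `readText` are hypothetical names.

```typescript
import { readFile } from "node:fs/promises";

type ReadText = (path: string) => Promise<string>;

class LazySessionStore<T = unknown> {
  private readonly cache = new Map<string, Promise<T>>();
  private readonly pathFor: (sessionId: string) => string;
  private readonly readText: ReadText;

  constructor(
    pathFor: (sessionId: string) => string,
    readText: ReadText = (p) => readFile(p, "utf8"), // async read, no sync I/O
  ) {
    this.pathFor = pathFor;
    this.readText = readText;
  }

  // Parses a session on first access; later calls reuse the cached promise.
  get(sessionId: string): Promise<T> {
    let entry = this.cache.get(sessionId);
    if (entry === undefined) {
      entry = this.readText(this.pathFor(sessionId)).then(
        (text) => JSON.parse(text) as T,
      );
      this.cache.set(sessionId, entry);
    }
    return entry;
  }

  // Number of sessions that have been requested so far.
  get loadedCount(): number {
    return this.cache.size;
  }
}
```

Each `JSON.parse` call is still synchronous, but it only covers one session at a time. If the ~3.7MB were spread evenly across the 168 sessions, that would be about 22KB per parse rather than the whole set at startup.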

Workaround

The bot self-heals after 1–3 minutes once the startup sidecars complete. Increasing systemd RestartSec (e.g., to 30s) reduces the restart cascade. There is currently no way to disable model-prewarm or defer session-locks parsing from user config.

Related issues

#79794 — Discord gateway READY never fires (multiple reporters, confirmed regression in 2026.5.x)
#78910 — Discord WS 1006 rapid disconnect loop
#81172 — memory-core blocks event loop
