Fix Gateway event loop starvation and HTTP/WS outage during sessions usage/cost under memory pressure

openclaw · 2026-05-26 01:54:23

We observed a recurrent OpenClaw Gateway outage where the dashboard became unreachable and the local HTTP server stopped responding while the Gateway process was still running.

Root Cause

This looks related to heavy sessions.usage / sessions.cost processing combined with high Gateway memory usage. The likely failure mode is event loop starvation plus aggressive GC pauses under memory pressure.

Mitigation applied locally

We installed a temporary preventive systemd user timer that restarts openclaw-gateway.service daily at 08:00 UTC (04:00 in Martinique), because the issue recurred roughly daily as Gateway uptime and memory usage grew.

This is only a workaround. The actual fix likely needs to be in OpenClaw runtime behavior.

Impact

The outage made the dashboard unavailable and temporarily degraded connected integrations. A manual restart recovered the instance, but without mitigation the outage is likely to recur whenever memory grows and usage/cost computation is triggered.

Environment

OpenClaw: 2026.5.22
Install type: git install on VPS
OS: Debian 13
Gateway: systemd user service
Dashboard/Gateway URL: http://127.0.0.1:18789
Approximate data size at incident time: 184 sessions x 9 stores

Incident 1

Date: 2026-05-25 00:55 UTC
Dashboard WebSocket inaccessible for about 12 minutes
HTTP/Gateway still responded
Event loop metrics around the incident showed very high blocking:
- max event loop delay: about 17985 ms
- p99: about 2561 ms
- utilization: about 0.815
sessions.usage calls were observed taking up to about 298s
The incident resolved spontaneously after the long computation finished
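
The event loop figures above (max delay, p99, utilization) can be collected with Node's built-in perf_hooks module. The following is a minimal sketch, assuming the Gateway runs on Node.js; startLoopMonitor and LoopSnapshot are hypothetical names for illustration and are not OpenClaw's diagnostics code.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical snapshot shape, mirroring the metrics quoted in the report.
interface LoopSnapshot {
  maxMs: number;
  p99Ms: number;
  utilization: number;
}

// Starts sampling and returns a function that stops sampling and reports.
function startLoopMonitor(resolutionMs = 20): () => LoopSnapshot {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  const baseline = performance.eventLoopUtilization();

  return () => {
    histogram.disable();
    // Passing the baseline yields utilization for the sampled window only.
    const elu = performance.eventLoopUtilization(baseline);
    return {
      maxMs: histogram.max / 1e6, // histogram values are in nanoseconds
      p99Ms: histogram.percentile(99) / 1e6,
      utilization: elu.utilization,
    };
  };
}
```

Note that the snapshot can only be taken once the loop is free again, so during a long block like the one in this incident the numbers arrive after the fact.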

Incident 2

Date: 2026-05-26 01:09 UTC to about 01:14 UTC
Dashboard showed connection impossible
WebSocket connections closed before auth with code 1006
Local HTTP probe timed out:
- curl --max-time 5 http://127.0.0.1:18789/ timed out
status --deep did not return while the event loop appeared blocked
Gateway process memory was high:
- RSS about 1.75 GB (1752862720 bytes)
- memory warning threshold 1.5 GiB (1610612736 bytes, about 1.61 GB)
Related logs included:
- [diagnostics/memory] memory pressure: level=warning rssBytes=1752862720 thresholdBytes=1610612736
- repeated [ws] closed before connect code=1006
- WhatsApp reconnects / status 408

A manual Gateway restart immediately restored service.

Before restart:

RSS: about 1.75 GB
HTTP: timeout
WebSocket/dashboard: unreachable
status --deep: blocked / no response

After restart:

RSS: about 922 MB
HTTP: 200 OK
WebSocket probe: reachable, about 142 ms
event loop: OK, max about 32 ms, p99 about 21 ms, utilization about 0.058
Telegram and WhatsApp recovered
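
The memory figures in Incident 2 can be checked against the threshold from the log line with a few lines of arithmetic. This is a sketch only: classifyRss and formatGb are hypothetical helpers, and the single "warning" level is taken from the log line rather than from OpenClaw's actual diagnostics code.

```typescript
type PressureLevel = "ok" | "warning";

// 1.5 GiB, which equals thresholdBytes=1610612736 in the log line.
const WARNING_THRESHOLD_BYTES = 1.5 * 1024 ** 3;

// Classify a resident set size against the warning threshold.
function classifyRss(
  rssBytes: number,
  thresholdBytes: number = WARNING_THRESHOLD_BYTES,
): PressureLevel {
  return rssBytes >= thresholdBytes ? "warning" : "ok";
}

// Format bytes as decimal gigabytes, the unit used in the report.
function formatGb(bytes: number): string {
  return (bytes / 1e9).toFixed(2) + " GB";
}

// Example: classify the current process.
const currentLevel: PressureLevel = classifyRss(process.memoryUsage().rss);
console.log("current memory pressure level:", currentLevel);
```

A check like this could also gate expensive work, for example by deferring a usage/cost recomputation while the level is "warning".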

Expected behavior

Heavy usage/cost computation should not block the Gateway's main event loop long enough to make HTTP and WebSocket unavailable.

The dashboard and gateway health endpoints should remain responsive, even if usage/cost computation is slow or has to be degraded, cached, paginated, cancelled, or moved off the main thread.
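
One way to read "degraded" and "cached" here is a response deadline: if the computation has not finished in time, return the last known value marked as stale and let the computation continue in the background. The sketch below assumes the computation itself is asynchronous and does not block the event loop (a timer cannot fire while the loop is blocked). DegradableCache is a hypothetical name, not an OpenClaw API.

```typescript
const TIMEOUT = Symbol("timeout");

interface CachedResult<T> {
  value: T | undefined; // undefined until the first computation completes
  stale: boolean;       // true when the deadline passed before a fresh value
}

class DegradableCache<T> {
  private last: T | undefined;
  private inFlight: Promise<T> | undefined;

  constructor(private readonly compute: () => Promise<T>) {}

  // Start one computation; concurrent callers share it.
  private start(): Promise<T> {
    const p = this.compute()
      .then((v) => {
        this.last = v;
        return v;
      })
      .finally(() => {
        this.inFlight = undefined;
      });
    this.inFlight = p;
    return p;
  }

  async get(deadlineMs: number): Promise<CachedResult<T>> {
    const pending = this.inFlight ?? this.start();
    let timer: ReturnType<typeof setTimeout> | undefined;
    const deadline = new Promise<typeof TIMEOUT>((resolve) => {
      timer = setTimeout(() => resolve(TIMEOUT), deadlineMs);
    });
    try {
      const winner = await Promise.race([pending, deadline]);
      if (winner === TIMEOUT) {
        pending.catch(() => undefined); // avoid an unhandled rejection later
        return { value: this.last, stale: true };
      }
      return { value: winner, stale: false };
    } finally {
      clearTimeout(timer);
    }
  }
}
```

A dashboard handler could call get with a short deadline and render the stale value with a "refreshing" hint instead of waiting several minutes for a fresh one.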

Actual behavior

During usage/cost related processing and high memory pressure, the Gateway can become effectively unavailable:

dashboard WS closes with 1006
local HTTP requests time out
status --deep can hang
messaging integrations start reconnecting or timing out
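
The HTTP timeout symptom can be watched for with a small external probe, equivalent in spirit to the curl --max-time 5 check used during Incident 2. This is a sketch assuming Node.js 18 or later (global fetch and AbortSignal.timeout); probeHttp is a hypothetical helper. It has to run in a separate process from the Gateway, because an in-process check would be starved along with everything else.

```typescript
interface ProbeResult {
  ok: boolean;
  status?: number;
  elapsedMs: number;
  error?: string;
}

// Issue one GET and give up after timeoutMs, like curl --max-time.
async function probeHttp(url: string, timeoutMs = 5000): Promise<ProbeResult> {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { ok: res.ok, status: res.status, elapsedMs: Date.now() - started };
  } catch (err) {
    // Timeouts and connection errors both land here.
    return {
      ok: false,
      elapsedMs: Date.now() - started,
      error: err instanceof Error ? err.name : String(err),
    };
  }
}

// Example usage against the local Gateway URL from the Environment section:
// probeHttp("http://127.0.0.1:18789/").then((r) => console.log(r));
```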

Suggested fixes / guardrails

Move sessions.usage / sessions.cost heavy work off the Gateway main event loop, or run it in bounded chunks.
Add pagination, caching, or incremental aggregation for usage/cost over many sessions/stores.
Add timeout/cancellation/degraded response behavior for expensive dashboard usage views.
Keep health/status/HTTP/WS handshakes responsive even while usage/cost is computing.
Add diagnostic logs around usage/cost computations: session count, store count, duration, cache hit/miss, and whether the result came from a background worker or main thread.
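
The "bounded chunks" suggestion, together with the diagnostic fields from the last item, can be sketched as follows. The record shape (SessionUsage), the function name aggregateUsage, and the chunk size are assumptions made for illustration; the real sessions.usage / sessions.cost data model is not described in this report.

```typescript
import { setImmediate as yieldToLoop } from "node:timers/promises";

// Hypothetical per-session record.
interface SessionUsage {
  sessionId: string;
  tokens: number;
  costUsd: number;
}

interface UsageTotals {
  sessions: number;
  tokens: number;
  costUsd: number;
  chunks: number;     // diagnostic: how many chunks were processed
  durationMs: number; // diagnostic: wall-clock time for the aggregation
}

// Aggregate in chunks, yielding to the event loop between chunks so that
// pending HTTP requests and WebSocket handshakes can be serviced.
async function aggregateUsage(
  records: readonly SessionUsage[],
  chunkSize = 50,
): Promise<UsageTotals> {
  const started = Date.now();
  let tokens = 0;
  let costUsd = 0;
  let chunks = 0;

  for (let i = 0; i < records.length; i += chunkSize) {
    for (const record of records.slice(i, i + chunkSize)) {
      tokens += record.tokens;
      costUsd += record.costUsd;
    }
    chunks += 1;
    await yieldToLoop();
  }

  return {
    sessions: records.length,
    tokens,
    costUsd,
    chunks,
    durationMs: Date.now() - started,
  };
}
```

Chunking keeps the work on the main thread but caps how long each slice can hold the loop. If a single record is itself expensive to process, moving the work to a worker thread is likely the better fit.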
