openclaw: Gateway event loop starvation and HTTP/WS outage during sessions usage/cost under memory pressure

Summary

We observed a recurrent OpenClaw Gateway outage where the dashboard became unreachable and the local HTTP server stopped responding while the Gateway process was still running.

This looks related to heavy sessions.usage / sessions.cost processing combined with high Gateway memory usage. The likely failure mode is event loop starvation plus aggressive GC pauses under memory pressure.

Environment

  • OpenClaw: 2026.5.22
  • Install type: git install on VPS
  • OS: Debian 13
  • Gateway: systemd user service
  • Dashboard/Gateway URL: http://127.0.0.1:18789
  • Approximate data size at incident time: 184 sessions x 9 stores

Incident 1

  • Date: 2026-05-25 00:55 UTC
  • Dashboard WebSocket inaccessible for about 12 minutes
  • HTTP/Gateway still responded
  • Event loop metrics around the incident showed very high blocking:
    • max event loop delay: about 17985 ms
    • p99: about 2561 ms
    • utilization: about 0.815
  • sessions.usage calls were observed taking up to about 298 s
  • The incident resolved spontaneously after the long computation finished
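
The three figures quoted for Incident 1 (max delay, p99, utilization) can be sampled with the built-in `perf_hooks` module, assuming the Gateway runs on Node.js. The report does not say how OpenClaw collects its own metrics, so this is only a minimal sketch: `startLoopMonitor`, `isStarved`, and the alert limits are hypothetical names and values, not part of OpenClaw.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Snapshot of the three figures quoted in the incident notes.
export interface LoopSnapshot {
  maxMs: number;       // worst observed event loop delay
  p99Ms: number;       // 99th percentile delay
  utilization: number; // fraction of time the loop was busy, 0..1
}

// Hypothetical alert limits (assumption); tune them per deployment.
export const DEFAULT_LIMITS: LoopSnapshot = { maxMs: 1000, p99Ms: 250, utilization: 0.8 };

// Starts sampling. Each snapshot() call reports the interval since the
// previous call and then resets the delay histogram.
export function startLoopMonitor(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let previous = performance.eventLoopUtilization();
  return {
    snapshot(): LoopSnapshot {
      const current = performance.eventLoopUtilization();
      const delta = performance.eventLoopUtilization(current, previous);
      previous = current;
      const snap: LoopSnapshot = {
        maxMs: histogram.max / 1e6, // histogram values are in nanoseconds
        p99Ms: histogram.percentile(99) / 1e6,
        // Guard against 0/0 when two snapshots are taken back to back.
        utilization: Number.isFinite(delta.utilization) ? delta.utilization : 0,
      };
      histogram.reset();
      return snap;
    },
    stop(): void {
      histogram.disable();
    },
  };
}

// True when any single figure exceeds its limit.
export function isStarved(s: LoopSnapshot, limits: LoopSnapshot = DEFAULT_LIMITS): boolean {
  return s.maxMs > limits.maxMs || s.p99Ms > limits.p99Ms || s.utilization > limits.utilization;
}
```

With these assumed limits, the Incident 1 figures (17985 ms, 2561 ms, 0.815) classify as starved, while the post-restart figures reported later (32 ms, 21 ms, 0.058) do not.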

Incident 2

  • Date: 2026-05-26 01:09 UTC to about 01:14 UTC
  • Dashboard reported that it could not connect
  • WebSocket connections closed before auth with code 1006
  • Local HTTP probe timed out:
    • curl --max-time 5 http://127.0.0.1:18789/ timed out
  • status --deep did not return while the event loop appeared blocked
  • Gateway process memory was high:
    • RSS about 1.75 GB (about 1.63 GiB)
    • memory warning threshold 1.5 GiB (about 1.61 GB)
  • Related logs included:
    • [diagnostics/memory] memory pressure: level=warning rssBytes=1752862720 thresholdBytes=1610612736
    • repeated [ws] closed before connect code=1006
    • WhatsApp reconnects / status 408
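
To correlate outages with memory pressure, the `[diagnostics/memory]` line quoted above can be parsed from the Gateway log. The sketch below assumes the `key=value` layout is exactly as shown in that one line; `parseMemoryPressure` and `describePressure` are hypothetical helpers, not OpenClaw APIs.

```typescript
const GIB = 1024 * 1024 * 1024;

export interface MemoryPressureEvent {
  level: string;
  rssBytes: number;
  thresholdBytes: number;
}

// Parses a line shaped like the one quoted in the incident logs.
// Returns undefined for unrelated lines or when a field is missing.
export function parseMemoryPressure(line: string): MemoryPressureEvent | undefined {
  if (line.indexOf("memory pressure:") === -1) return undefined;
  const fields: Record<string, string> = {};
  const pattern = /(\w+)=(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    fields[match[1]] = match[2];
  }
  const rssBytes = Number(fields["rssBytes"]);
  const thresholdBytes = Number(fields["thresholdBytes"]);
  if (!fields["level"] || !Number.isFinite(rssBytes) || !Number.isFinite(thresholdBytes)) {
    return undefined;
  }
  return { level: fields["level"], rssBytes, thresholdBytes };
}

// Converts the raw byte counts into GiB and an overshoot ratio.
export function describePressure(event: MemoryPressureEvent) {
  return {
    rssGiB: event.rssBytes / GIB,
    thresholdGiB: event.thresholdBytes / GIB,
    overshootRatio: event.rssBytes / event.thresholdBytes,
  };
}
```

Applied to the logged values, the threshold works out to exactly 1.5 GiB, and RSS is roughly 9% above it.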

A manual Gateway restart immediately restored service.

Before restart:

  • RSS: about 1.75 GB
  • HTTP: timeout
  • WebSocket/dashboard: unreachable
  • status --deep: blocked / no response

After restart:

  • RSS: about 922 MB
  • HTTP: 200 OK
  • WebSocket probe: reachable, about 142 ms
  • event loop: OK, max about 32 ms, p99 about 21 ms, utilization about 0.058
  • Telegram and WhatsApp recovered

Expected behavior

Heavy usage/cost computation should not block the Gateway's main event loop long enough to make HTTP and WebSocket unavailable.

The dashboard and gateway health endpoints should remain responsive, even if usage/cost computation is slow or has to be degraded, cached, paginated, cancelled, or moved off the main thread.
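
One way to picture the "cached" and "degraded" behaviors is a small cache that keeps serving the last known result, flagged as stale, combined with a deadline around the slow computation. This is a sketch under assumptions: `UsageCache` and `withDeadline` are hypothetical names, and nothing here reflects how OpenClaw actually stores usage data.

```typescript
export interface CacheRead<T> {
  value: T;
  ageMs: number;
  stale: boolean; // true once the entry is older than the TTL
}

// Keeps the last computed result per key and never evicts on expiry,
// so a dashboard can show stale numbers instead of waiting.
export class UsageCache<T> {
  private readonly entries = new Map<string, { value: T; storedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  get(key: string): CacheRead<T> | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    const ageMs = this.now() - entry.storedAt;
    return { value: entry.value, ageMs, stale: ageMs > this.ttlMs };
  }
}

// Resolves with the work's result, or with fallback() if the deadline
// passes first. The underlying work is not cancelled.
export function withDeadline<T>(work: Promise<T>, deadlineMs: number, fallback: () => T): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback()), deadlineMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

A deadline like this only helps when the computation yields to the event loop or runs off the main thread. If the work blocks the loop synchronously, as the incident metrics suggest, the timer cannot fire until the work finishes.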

Actual behavior

During usage/cost-related processing under high memory pressure, the Gateway can become effectively unavailable:

  • dashboard WS closes with 1006
  • local HTTP requests time out
  • status --deep can hang
  • messaging integrations start reconnecting or timing out

Mitigation applied locally

We installed a temporary systemd user timer that restarts openclaw-gateway.service daily at 08:00 UTC (04:00 in Martinique) as a preventive measure, because the issue recurred roughly daily as Gateway uptime and memory grew.

This is only a workaround. The actual fix likely needs to be in OpenClaw runtime behavior.

Suggested fixes / guardrails

  • Move sessions.usage / sessions.cost heavy work off the Gateway main event loop, or run it in bounded chunks.
  • Add pagination, caching, or incremental aggregation for usage/cost over many sessions/stores.
  • Add timeout/cancellation/degraded response behavior for expensive dashboard usage views.
  • Keep health/status/HTTP/WS handshakes responsive even while usage/cost is computing.
  • Add diagnostic logs around usage/cost computations: session count, store count, duration, cache hit/miss, and whether the result came from a background worker or main thread.
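
The "bounded chunks" and diagnostic suggestions above can be sketched together as follows, assuming a Node.js runtime. Everything here is illustrative: `splitIntoChunks`, `aggregateInChunks`, the default chunk size of 25, and the shape of the stats object are hypothetical, since the report does not show OpenClaw's usage/cost code.

```typescript
// Splits a list into consecutive slices of at most chunkSize items.
export function splitIntoChunks<T>(items: readonly T[], chunkSize: number): T[][] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}

// Lets pending I/O callbacks (HTTP requests, WS handshakes) run
// before the next chunk starts.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setImmediate(() => resolve()));

export interface AggregationStats {
  itemCount: number;
  chunkCount: number;
  durationMs: number;
}

// Folds items into an accumulator one chunk at a time, yielding to the
// event loop between chunks, and returns stats suitable for logging.
export async function aggregateInChunks<T, A>(
  items: readonly T[],
  initial: A,
  step: (acc: A, item: T) => A,
  chunkSize = 25, // assumed default, not an OpenClaw setting
): Promise<{ result: A; stats: AggregationStats }> {
  const startedAt = Date.now();
  const chunks = splitIntoChunks(items, chunkSize);
  let acc = initial;
  for (const chunk of chunks) {
    for (const item of chunk) acc = step(acc, item);
    await yieldToEventLoop();
  }
  const stats = { itemCount: items.length, chunkCount: chunks.length, durationMs: Date.now() - startedAt };
  return { result: acc, stats };
}
```

With 184 sessions and a chunk size of 25, the work is split into 8 chunks. Chunking bounds how long each slice can block the loop, but it does not reduce total CPU time or memory use; if a single item is itself expensive, moving the computation to a worker thread is the more robust option.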

Impact

This caused dashboard unavailability and temporary degradation of connected integrations. Manual restart recovered the instance, but recurrence is likely without mitigation when memory grows and usage/cost computation is triggered.
