openclaw: [Bug]: gateway heap grows unbounded over time, gets killed by cgroup OOM on long-running Linux systemd --user deployments


Environment

  • OpenClaw version: v2026.5.28 (upgraded from v2026.5.27 on 2026-05-31)
  • Host: Linux x86_64, kernel 6.8.0-117-generic, Ubuntu 24.04
  • Node: v22.22.2
  • Deployment: openclaw-gateway as a user-level systemd service (~/.config/systemd/user/openclaw-gateway.service) with Linger=yes; no MemoryMax / MemoryHigh set on the cgroup
  • Uptime when symptom appeared: ~5 days 16 hours since the v2026.5.28 upgrade; gateway had not been manually restarted in that window.

Summary

After ~5 days of steady traffic, the gateway Node process grew to ~1.6 GB RSS (heapUsedBytes ≈ 1.5 GB, rssBytes ≈ 1.64 GB). systemd-oomd then killed it because the memory pressure on its cgroup (memory.pressure) exceeded the systemd-oomd kill threshold:

Pressure: Avg10: 83.12 Avg60: 33.06 Avg300: 8.24 Total: 1min 9s
Current Memory Usage: 234.8M, Pgscan: 5184818

The gateway exposes no built-in heap bound, and the unit set no MemoryHigh / MemoryMax, so with the default cgroup settings (no memory limit) its growth was effectively unbounded.

Observed timeline (local journal, UTC+8)

Time                Event
03:00:00            memory-core: managed dreaming cron could not be reconciled (cron service unavailable)
03:01:23            [diagnostics/memory] memory pressure: level=warning reason=heap_threshold rssBytes=1403424768 heapUsedBytes=1341035200 thresholdBytes=1073741824
03:02:24            [diagnostics/memory] memory pressure: level=warning reason=rss_threshold rssBytes=1642409984 heapUsedBytes=1546413600 thresholdBytes=1610612736
05:46:14            systemd-journald[431]: Under memory pressure, flushing caches.
06:29:42            systemd-oomd[918]: Considered 16 cgroups for killing, top candidates were: /user.slice/user-<UID>.slice/user@<UID>.service/app.slice/openclaw-gateway.service, Memory Pressure Limit: 0.00%, Pressure: Avg10: 83.12 Avg60: 33.06 Avg300: 8.24 Total: 1min 9s, Current Memory Usage: 234.8M, Pgscan: 5184818
06:29:42 + seconds  node process exited; no further log lines from that PID.

(UID is shown as <UID> in this report; on the affected host it corresponds to the regular non-root user that owns the user-level systemd service.)
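
The byte counts in the two [diagnostics/memory] lines are easier to compare once converted. The following is a minimal sketch that parses those lines (copied from the timeline) and prints each figure in GiB; the function names are illustrative and not part of OpenClaw.

```python
import re

# The two warning lines from the timeline, copied verbatim.
LINES = [
    "[diagnostics/memory] memory pressure: level=warning reason=heap_threshold "
    "rssBytes=1403424768 heapUsedBytes=1341035200 thresholdBytes=1073741824",
    "[diagnostics/memory] memory pressure: level=warning reason=rss_threshold "
    "rssBytes=1642409984 heapUsedBytes=1546413600 thresholdBytes=1610612736",
]

FIELD = re.compile(r"(\w+)=(\S+)")
GIB = 1024 ** 3


def parse_memory_warning(line: str) -> dict:
    """Return the key=value fields of one line; fields ending in 'Bytes' become int."""
    fields = dict(FIELD.findall(line))
    return {k: int(v) if k.endswith("Bytes") else v for k, v in fields.items()}


def summarize(line: str) -> str:
    """Format one warning line as '<reason>: rss=… heap=… threshold=…' in GiB."""
    f = parse_memory_warning(line)
    return (
        f"{f['reason']}: rss={f['rssBytes'] / GIB:.2f} GiB "
        f"heap={f['heapUsedBytes'] / GIB:.2f} GiB "
        f"threshold={f['thresholdBytes'] / GIB:.2f} GiB"
    )


if __name__ == "__main__":
    for line in LINES:
        print(summarize(line))
```

Read this way, the two thresholds are exactly 1 GiB (heap) and 1.5 GiB (RSS), and both were crossed within about a minute of each other.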

Observations

  1. The gateway's own [diagnostics/memory] emitted warnings at the 1 GB (heap) and 1.6 GB (RSS) thresholds, but there is no auto-mitigation path: no graceful restart, no heap cap, no session shedding, no alert escalation. The process kept growing until systemd-oomd killed its cgroup.
  2. The user-level systemd manager (user@<UID>.service) survived the gateway kill (Linger=yes), but its dbus effectively became unresponsive. journalctl --user -u openclaw-gateway has no entries for the same window because the user-level journal ring buffer was on the same memory-starved path.
  3. With no cgroup memory limit by default and no documented recommendation for MemoryHigh / MemoryMax, any long-running Linux systemd --user deployment with non-trivial traffic is likely to be killed eventually, either by systemd-oomd or by the kernel OOM killer.
  4. Adjacent regression that hid this: v2026.5.28 downgraded [model-fetch] request/response logs from log.info to log.debug (tracked separately as #89300). At the default log level these no longer appear in gateway.log, which made it much harder to identify the request pattern that correlated with the heap growth in this incident.

Reproduction

  1. Install v2026.5.28 on a Linux box under systemd --user with Linger=yes.
  2. Drive a moderate steady workload for several days: a couple of channel bridges, periodic cron / heartbeat jobs, a few agents, some dreaming cycles.
  3. Watch the node process RSS climb past 1 GB and never reclaim even after traffic stops. The only signal is the internal [diagnostics/memory] warning.
  4. When system memory pressure rises (any other workload, even briefly), systemd-oomd selects the gateway cgroup first because Pressure.Avg10 is well above 50% and the gateway itself is the dominant Pss on the box, then SIGKILLs it.
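
To watch the pressure figure from step 4 before systemd-oomd acts on it, the cgroup's memory.pressure file can be read directly. This is a minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup and assuming systemd-oomd compares the `full` avg10 value against its limit. The 50% default below is taken from the text above, and the sample line is reconstructed from the systemd-oomd log entry purely for illustration (a real file also contains a `some` line).

```python
from pathlib import Path

# Illustrative sample built from the averages in the systemd-oomd log line.
SAMPLE = "full avg10=83.12 avg60=33.06 avg300=8.24 total=69000000\n"


def parse_psi(text: str) -> dict:
    """Parse pressure-stall text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def read_psi(path: str) -> dict:
    """Read and parse a memory.pressure file, e.g. under /sys/fs/cgroup/<cgroup path>/."""
    return parse_psi(Path(path).read_text())


def over_limit(psi: dict, limit_percent: float = 50.0) -> bool:
    """True when the 10-second 'full' average exceeds limit_percent."""
    return psi["full"]["avg10"] > limit_percent


if __name__ == "__main__":
    psi = parse_psi(SAMPLE)
    print("full avg10:", psi["full"]["avg10"], "over limit:", over_limit(psi))
```

On the affected host, read_psi would be pointed at the memory.pressure file of the openclaw-gateway.service cgroup named in the systemd-oomd log line.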

Expected

One of the following:

  • The gateway self-bounds its heap (e.g. graceful restart at a configured threshold, cap heap size, shed idle sessions / dreaming transcripts, or refuse new work and alert).
  • The gateway ships a recommended MemoryHigh / MemoryMax value (and documents it in the systemd unit template) so operators can set a sane bound before the kernel kills the process.
  • [diagnostics/memory] warnings escalate (e.g. emit a CRITICAL level, trigger a graceful shutdown, or push to a health endpoint) instead of just logging a warning that nothing acts on.
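
The escalation idea in the last bullet can be sketched as a small decision function. Everything here is hypothetical: the action names, the thresholds and the 90% critical band are assumptions chosen for illustration, not OpenClaw behaviour or configuration.

```python
from enum import Enum

GIB = 1024 ** 3


class Action(Enum):
    NONE = "none"
    WARN = "warn"
    CRITICAL = "critical"
    GRACEFUL_RESTART = "graceful_restart"


def decide_action(heap_used_bytes: int,
                  warn_bytes: int = 1 * GIB,
                  restart_bytes: int = int(1.5 * GIB),
                  critical_fraction: float = 0.9) -> Action:
    """Map current heap usage to an escalation step (hypothetical policy).

    - at or above restart_bytes: drain sessions and restart
    - at or above critical_fraction * restart_bytes: raise a critical alert
    - at or above warn_bytes: log a warning (what happens today)
    """
    if heap_used_bytes >= restart_bytes:
        return Action.GRACEFUL_RESTART
    if heap_used_bytes >= critical_fraction * restart_bytes:
        return Action.CRITICAL
    if heap_used_bytes >= warn_bytes:
        return Action.WARN
    return Action.NONE


if __name__ == "__main__":
    # heapUsedBytes values from the two warnings in the timeline.
    for heap in (1341035200, 1546413600):
        print(heap, decide_action(heap).value)
```

With these assumed thresholds, the first warning in the timeline would stay at the warn level and the second would already count as critical, leaving room for a graceful restart before any external killer is involved.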

Actual

  • No auto-mitigation.
  • The gateway's Node process heap grew to ~1.6 GB.
  • systemd-oomd killed the process without a graceful shutdown (in-flight sessions lost, all user-level channel bridges offline).
  • The user-level systemd manager was effectively dead: it took a fresh interactive ssh/login before a new user@<UID>.service was started and the Linger session reattached, which then brought the gateway back.

Workaround (until fixed)

Operators can put a cgroup bound on the service manually. In ~/.config/systemd/user/openclaw-gateway.service:

[Service]
MemoryHigh=1G
MemoryMax=2G

After systemctl --user daemon-reload, the kernel throttles the gateway once its cgroup exceeds 1 GB (MemoryHigh) and OOM-kills it when the cgroup reaches 2 GB (MemoryMax); Restart=always (if set in the unit) will bring it back. This still loses in-flight sessions and is a band-aid: the underlying heap growth remains unfixed.

A weekly systemctl --user restart openclaw-gateway (e.g. via OnCalendar=weekly in a user timer) also keeps the heap small enough to avoid systemd-oomd being triggered by transient system memory pressure.


Thanks for the work on [diagnostics/memory] — the warning at 1 GB was what pointed us at this. The missing piece is the auto-mitigation (or at least a published MemoryHigh default) and the fact that the user-level systemd manager can't recover on its own after a cgroup OOM kill.
