[Bug]: gateway heap grows unbounded over time, gets killed by cgroup OOM on long-running Linux systemd --user deployments

StepCodex · 2026-06-02T03:49:44Z


Environment

OpenClaw version: v2026.5.28 (upgraded from v2026.5.27 on 2026-05-31)
Host: Linux x86_64, kernel 6.8.0-117-generic, Ubuntu 24.04
Node: v22.22.2
Deployment: openclaw-gateway as a user-level systemd service (~/.config/systemd/user/openclaw-gateway.service) with Linger=yes; no MemoryMax / MemoryHigh set on the cgroup
Uptime when symptom appeared: ~5 days 16 hours since the v2026.5.28 upgrade; gateway had not been manually restarted in that window.

Summary

After ~5 days of steady traffic, the gateway Node process grew to ~1.6 GB of resident memory (heapUsedBytes ≈ 1.5 GB, rssBytes ≈ 1.64 GB) and was killed by systemd-oomd because its cgroup's memory pressure crossed systemd-oomd's kill threshold:

Pressure: Avg10: 83.12 Avg60: 33.06 Avg300: 8.24 Total: 1min 9s
Current Memory Usage: 234.8M, Pgscan: 5184818

The gateway has no built-in heap bound, and the service cgroup had no MemoryHigh / MemoryMax set, so its memory use was effectively unbounded.

Observed timeline (local journal, UTC+8)

Time	Event
03:00:00	`memory-core: managed dreaming cron could not be reconciled (cron service unavailable)`
03:01:23	`[diagnostics/memory] memory pressure: level=warning reason=heap_threshold rssBytes=1403424768 heapUsedBytes=1341035200 thresholdBytes=1073741824`
03:02:24	`[diagnostics/memory] memory pressure: level=warning reason=rss_threshold rssBytes=1642409984 heapUsedBytes=1546413600 thresholdBytes=1610612736`
05:46:14	`systemd-journald[431]: Under memory pressure, flushing caches.`
06:29:42	`systemd-oomd[918]: Considered 16 cgroups for killing, top candidates were: /user.slice/user-<UID>.slice/user@<UID>.service/app.slice/openclaw-gateway.service, Memory Pressure Limit: 0.00%, Pressure: Avg10: 83.12 Avg60: 33.06 Avg300: 8.24 Total: 1min 9s, Current Memory Usage: 234.8M, Pgscan: 5184818`
06:29:42 + seconds	`node` process exited; no further log lines from that PID.

(UID is shown as <UID> in this report; on the affected host it corresponds to the regular non-root user that owns the user-level systemd service.)
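
The byte counts in the two [diagnostics/memory] lines above are easier to compare once parsed. The following is a minimal TypeScript sketch, not part of OpenClaw: `parseMemoryPressureLine`, `MemoryPressureSample`, and the helpers are hypothetical names, and the key=value layout is assumed from the two lines quoted in the timeline.

```typescript
// Fields carried by a "[diagnostics/memory] memory pressure" log line
// (layout assumed from the two lines quoted in the timeline).
interface MemoryPressureSample {
  level: string;
  reason: string;
  rssBytes: number;
  heapUsedBytes: number;
  thresholdBytes: number;
}

const BYTES_PER_GIB = 1024 * 1024 * 1024;

// Return the raw text of `key=value` in the line, or null if absent.
function readField(line: string, key: string): string | null {
  const found = new RegExp("\\b" + key + "=(\\S+)").exec(line);
  return found !== null && found[1] !== undefined ? found[1] : null;
}

// Like readField, but numeric; NaN when the key is missing or malformed.
function readNumber(line: string, key: string): number {
  const raw = readField(line, key);
  return raw === null ? NaN : Number(raw);
}

// Parse one log line; returns null for unrelated or incomplete lines.
function parseMemoryPressureLine(line: string): MemoryPressureSample | null {
  if (line.indexOf("[diagnostics/memory] memory pressure:") < 0) return null;
  const level = readField(line, "level");
  const reason = readField(line, "reason");
  const rssBytes = readNumber(line, "rssBytes");
  const heapUsedBytes = readNumber(line, "heapUsedBytes");
  const thresholdBytes = readNumber(line, "thresholdBytes");
  if (level === null || reason === null) return null;
  if (isNaN(rssBytes) || isNaN(heapUsedBytes) || isNaN(thresholdBytes)) return null;
  return { level, reason, rssBytes, heapUsedBytes, thresholdBytes };
}

// Convert a byte count to binary gigabytes (GiB).
function bytesToGiB(bytes: number): number {
  return bytes / BYTES_PER_GIB;
}
```

Applied to the timeline, `bytesToGiB` shows that the two `thresholdBytes` values are exactly 1 GiB (1073741824) and 1.5 GiB (1610612736); the latter is the "1.6 GB" threshold when expressed in decimal gigabytes.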

Observations

1. The gateway's own [diagnostics/memory] emitted warnings at the 1 GB and 1.6 GB thresholds, but nothing acts on them: there is no graceful restart, no heap cap, no session shedding, and no alert escalation. The process kept growing until systemd-oomd stepped in.
2. The user-level systemd manager (user@<UID>.service) survived the gateway kill (Linger=yes), but its dbus effectively became unresponsive. journalctl --user -u openclaw-gateway has no entries for the same window because the user-level journal ring buffer was on the same memory-starved path.
3. With no cgroup memory limit set by default and no operator-visible recommended MemoryHigh / MemoryMax, any long-running Linux systemd --user deployment with non-trivial traffic will eventually be killed under memory pressure.
4. Adjacent regression that hid this: v2026.5.28 downgraded [model-fetch] request/response logs from log.info to log.debug (tracked separately as #89300). At the default log level these no longer appear in gateway.log, which made it much harder to identify the request pattern that correlated with the heap growth in this incident.

Reproduction

1. Install v2026.5.28 on a Linux box under systemd --user with Linger=yes.
2. Drive a moderate steady workload for several days: a couple of channel bridges, periodic cron / heartbeat jobs, a few agents, some dreaming cycles.
3. Watch the node process RSS climb past 1 GB and never reclaim even after traffic stops. The only signal is the internal [diagnostics/memory] warning.
4. When system memory pressure rises (any other workload, even briefly), systemd-oomd selects the gateway cgroup first because Pressure.Avg10 is well above 50% and the gateway itself is the dominant Pss on the box, then SIGKILLs it.
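
To watch the pressure signal before systemd-oomd acts on it, an operator could poll the service cgroup's memory.pressure file. The sketch below assumes the standard cgroup v2 PSI text format (a `some` and a `full` line, each with `avg10=`, `avg60=`, `avg300=`, and `total=` in microseconds). `parsePsi`, `exceedsPressureLimit`, and the choice to compare the `full` line's avg10 are illustrative assumptions; the limit and the exact metric systemd-oomd uses on a given host should be checked against its configuration.

```typescript
// One line of a PSI file such as a cgroup's memory.pressure.
interface PsiLine {
  avg10: number;       // percent of time stalled, 10 s window
  avg60: number;       // percent of time stalled, 60 s window
  avg300: number;      // percent of time stalled, 300 s window
  totalMicros: number; // cumulative stall time in microseconds
}

// Read `key=value` from the whitespace-split parts of one PSI line.
function psiField(parts: string[], key: string): number {
  const prefix = key + "=";
  for (const part of parts) {
    if (part.indexOf(prefix) === 0) return Number(part.slice(prefix.length));
  }
  return NaN;
}

// Parse the whole file text into entries keyed by "some" / "full".
function parsePsi(text: string): { [kind: string]: PsiLine | undefined } {
  const result: { [kind: string]: PsiLine | undefined } = {};
  for (const rawLine of text.split("\n")) {
    const parts = rawLine.trim().split(/\s+/);
    const kind = parts[0];
    if (kind === undefined || parts.length < 5) continue; // skip blank/short lines
    result[kind] = {
      avg10: psiField(parts, "avg10"),
      avg60: psiField(parts, "avg60"),
      avg300: psiField(parts, "avg300"),
      totalMicros: psiField(parts, "total"),
    };
  }
  return result;
}

// True when the "full" avg10 value is above the given percentage limit.
// Comparing the "full" line is an assumption of this sketch.
function exceedsPressureLimit(text: string, limitPercent: number): boolean {
  const full = parsePsi(text)["full"];
  return full !== undefined && full.avg10 > limitPercent;
}
```

The text passed in would come from reading the memory.pressure file under the gateway's cgroup directory; the sketch deliberately leaves file access out so the parsing can be checked on its own.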

Expected

One of the following:

- The gateway self-bounds its heap (e.g. graceful restart at a configured threshold, cap heap size, shed idle sessions / dreaming transcripts, or refuse new work and alert).
- The gateway ships a recommended MemoryHigh / MemoryMax value (and documents it in the systemd unit template) so operators can set a sane bound before the kernel kills the process.
- [diagnostics/memory] warnings escalate (e.g. emit a CRITICAL level, trigger a graceful shutdown, or push to a health endpoint) instead of just logging a warning that nothing acts on.
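
One possible shape for the escalation asked for above is a small policy function that maps a memory sample to an action. This is only a sketch of the idea, not OpenClaw code: `decideMemoryAction`, `MemoryPolicy`, and the threshold values are hypothetical. In a Node process the sample could be filled from `process.memoryUsage()`, whose `rss` and `heapUsed` fields correspond to the two inputs here.

```typescript
// Escalating responses, from doing nothing to a graceful restart.
type MemoryAction = "ok" | "warn" | "shed" | "restart";

// Byte thresholds at which each action starts (hypothetical values below).
interface MemoryPolicy {
  warnBytes: number;
  shedBytes: number;
  restartBytes: number;
}

// The two numbers the existing diagnostics already log.
interface MemorySample {
  rssBytes: number;
  heapUsedBytes: number;
}

const GIB = 1024 * 1024 * 1024;

// Example policy: warn at 1 GiB, shed idle work at 1.25 GiB, and restart
// gracefully at 1.5 GiB, i.e. before a 2G MemoryMax would be reached.
const examplePolicy: MemoryPolicy = {
  warnBytes: 1 * GIB,
  shedBytes: 1.25 * GIB,
  restartBytes: 1.5 * GIB,
};

// Pick the strongest action whose threshold the sample has reached.
function decideMemoryAction(sample: MemorySample, policy: MemoryPolicy): MemoryAction {
  const used = Math.max(sample.rssBytes, sample.heapUsedBytes);
  if (used >= policy.restartBytes) return "restart";
  if (used >= policy.shedBytes) return "shed";
  if (used >= policy.warnBytes) return "warn";
  return "ok";
}
```

With this example policy, the 03:01:23 sample from the timeline (rssBytes=1403424768) would already ask for shedding, and the 03:02:24 sample (rssBytes=1642409984) would ask for a graceful restart, hours before systemd-oomd acted at 06:29:42.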

Actual

- No auto-mitigation.
- The gateway's Node process heap grew to ~1.6 GB.
- systemd-oomd terminated the process without a graceful shutdown (in-flight sessions lost, all user-level channel bridges offline).
- The user-level systemd manager was effectively dead: it required a fresh interactive ssh/login for systemd to start a new user@<UID>.service and reattach the Linger session, which then brought the gateway back.

Workaround (until fixed)

Operators can put a cgroup bound on the service manually. In ~/.config/systemd/user/openclaw-gateway.service:

[Service]
MemoryHigh=1G
MemoryMax=2G

After systemctl --user daemon-reload, the kernel throttles the gateway and reclaims its memory once usage passes 1 GB (MemoryHigh), and the cgroup OOM killer SIGKILLs it at 2 GB (MemoryMax). Restart=always, if set in the unit, will bring it back, but this still loses in-flight sessions and is a band-aid: the underlying heap growth has no upper bound.

A weekly systemctl --user restart openclaw-gateway (e.g. via OnCalendar=weekly in a user timer) also keeps the heap small enough to avoid systemd-oomd being triggered by transient system memory pressure.

Thanks for the work on [diagnostics/memory] — the warning at 1 GB was what pointed us at this. The missing piece is the auto-mitigation (or at least a published MemoryHigh default) and the fact that the user-level systemd manager can't recover on its own after a cgroup OOM kill.
