openclaw - ✅(Solved) Fix [Bug]: openclaw-gateway is killed by OOM due to inherited oom_score_adj from parent, causing workers not to be preferred victims [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#70404 · Fetched 2026-04-23 07:25:14
Comments: 1 · Participants: 2 · Timeline: 10 · Reactions: 0
Timeline (top): referenced ×5, labeled ×2, closed ×1, commented ×1

When deploying OpenClaw in a memory-constrained container environment (e.g., 2 vCPUs, 6GiB RAM), openclaw-gateway is frequently targeted and killed by the Linux cgroup OOM Killer instead of the transient task/worker processes (like openclaw-cron or spawned Chrome instances).

In our environment, openclaw-gateway uses around 2.1 GiB of RAM (maintaining WS connections and V8 heap), while child processes use around 300 MiB. Because child processes spawned via Node.js child_process completely inherit the parent's oom_score_adj value, the OOM killer falls back to selecting the process with the highest absolute memory consumption (RSS). Consequently, the critical gateway process is killed, bringing down the entire control plane.

We need a way to lower the OOM survival priority of child processes so they are sacrificed first during memory exhaustion, preserving the gateway.

Fix Action

Fixed

PR fix notes

PR #70419: fix(gateway): raise child oom_score_adj on linux to spare the gateway under OOM

Description (problem / solution / changelog)

Closes #70404.

Root Cause

On Linux, child processes inherit the gateway's oom_score_adj. In a memory-constrained cgroup, the gateway is often the largest-RSS process because it keeps long-lived WebSocket state and V8 heap resident, while transient children such as agent workers, MCP stdio servers, PTY shells, and Chrome/browser helpers are smaller individually. When the cgroup hits its memory limit, the kernel can therefore kill openclaw-gateway instead of the transient child that pushed the cgroup over the edge. The gateway exits with 137 and all connected sessions drop.
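The victim selection described above can be illustrated with a simplified model of the kernel's OOM badness score. This is a sketch, not the kernel's code: it assumes only RSS counts toward the base score (swap and page tables are ignored), works in MiB rather than pages, and uses the figures from the issue report. The function name `badness` is hypothetical.

```typescript
// Simplified OOM badness model: base usage plus oom_score_adj scaled so that
// a value of 1000 adds the entire memory limit to the score.
// Assumption: swap and page-table usage are ignored for clarity.
function badness(rssMiB: number, oomScoreAdj: number, limitMiB: number): number {
  return rssMiB + (oomScoreAdj * limitMiB) / 1000;
}

const limitMiB = 6 * 1024;   // 6 GiB cgroup limit from the report
const gatewayRssMiB = 2150;  // roughly 2.1 GiB
const childRssMiB = 300;

// Inherited oom_score_adj = 0 on both sides: RSS alone decides, gateway loses.
const gatewayScore = badness(gatewayRssMiB, 0, limitMiB);
const childInherited = badness(childRssMiB, 0, limitMiB);

// Child raised to 1000: its score now exceeds anything the gateway can reach.
const childRaised = badness(childRssMiB, 1000, limitMiB);

console.log({ gatewayScore, childInherited, childRaised });
```

Under this model the gateway scores 2150 against 300 for an unadjusted child, while a child at 1000 scores 6444, so it is selected first regardless of how large the gateway has grown.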

The important constraint: lowering the gateway's OOM score, or having the parent process write a lower score into children, is capability-sensitive in hardened containers. The reliable unprivileged operation is the opposite: a Linux process may voluntarily increase its own OOM kill likelihood.

Fix

Add a shared Linux-only spawn helper that wraps eligible child commands in a short /bin/sh shim:

/bin/sh -c 'echo 1000 > /proc/self/oom_score_adj 2>/dev/null; exec "$0" "$@"' <cmd> <args...>

The shim runs in the post-fork child, raises that child's own oom_score_adj, then execs the real command. There is no extra long-lived shell process, and after exec the process identity, PID, stdio, exit, and kill semantics remain the target process.
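A minimal sketch of how such a wrapper could build the spawn arguments is shown below. The shim text is taken from the PR description; the helper name `wrapWithOomShim` and the `SpawnPlan` shape are hypothetical and are not the names used in `src/process/linux-oom-score.ts`.

```typescript
// Fixed shim text from the PR description. The real command arrives as "$0"
// and its arguments as "$@", so they are never re-parsed as shell source.
const OOM_SHIM_SCRIPT =
  'echo 1000 > /proc/self/oom_score_adj 2>/dev/null; exec "$0" "$@"';

interface SpawnPlan {
  command: string;
  args: string[];
}

// Hypothetical helper: turns (command, args) into a /bin/sh -c invocation.
function wrapWithOomShim(command: string, args: readonly string[]): SpawnPlan {
  return {
    command: "/bin/sh",
    args: ["-c", OOM_SHIM_SCRIPT, command, ...args],
  };
}

// Arguments containing shell metacharacters stay untouched positional values.
const plan = wrapWithOomShim("node", ["worker.js", "--task", "a b; echo $HOME"]);
console.log(plan.command, JSON.stringify(plan.args));
```

In Node.js the resulting pair can be handed to `child_process.spawn(plan.command, plan.args, options)`. Because the shim ends in `exec`, the PID returned by `spawn` is the PID of the target process once the shell has replaced itself.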

Current covered spawn surfaces:

  • src/process/supervisor/adapters/child.ts for regular supervisor-managed children.
  • src/process/supervisor/adapters/pty.ts for PTY-backed shell children.
  • src/agents/mcp-stdio-transport.ts for MCP stdio server children.
  • extensions/browser/src/browser/chrome.ts for launched browser/Chrome processes, through the public plugin SDK process-runtime seam.

The helper is no-op when:

  • the platform is not Linux,
  • OPENCLAW_CHILD_OOM_SCORE_ADJ=0 / false / no / off is set in the child env,
  • /bin/sh is unavailable, so distroless/scratch images degrade to previous behavior instead of failing with ENOENT,
  • the argv is already wrapped,
  • the command name starts with -, because POSIX sh implementations do not support exec -- and a leading-dash command could be parsed as an exec option.
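The no-op conditions above can be sketched as a single eligibility check. Everything except the environment variable name and the opt-out values is an assumption made for illustration: the function name, the input shape, the case-insensitive comparison, and the way an already-wrapped argv is detected are all hypothetical.

```typescript
interface WrapInput {
  platform: string;                         // e.g. process.platform in Node.js
  env: Record<string, string | undefined>;  // environment the child will receive
  shAvailable: boolean;                     // whether /bin/sh exists on this image
  command: string;
  args: readonly string[];
}

const OPT_OUT_VALUES = ["0", "false", "no", "off"];
const SHIM_MARKER = "/proc/self/oom_score_adj";

// Hypothetical check mirroring the listed no-op conditions, in order.
function shouldWrapForOom(input: WrapInput): boolean {
  if (input.platform !== "linux") return false;

  // Assumption: opt-out values are matched case-insensitively after trimming.
  const flag = input.env["OPENCLAW_CHILD_OOM_SCORE_ADJ"];
  if (flag !== undefined && OPT_OUT_VALUES.indexOf(flag.trim().toLowerCase()) !== -1) {
    return false;
  }

  if (!input.shAvailable) return false;

  // Assumption: an argv counts as already wrapped if it is /bin/sh -c <shim>.
  const script = input.args.length > 1 ? input.args[1] : "";
  if (input.command === "/bin/sh" && input.args[0] === "-c" && script.indexOf(SHIM_MARKER) !== -1) {
    return false;
  }

  // A leading dash could be parsed as an option to exec.
  if (input.command.charAt(0) === "-") return false;

  return true;
}

console.log(
  shouldWrapForOom({ platform: "linux", env: {}, shAvailable: true, command: "node", args: ["w.js"] })
);
```

Taking the platform, environment, and shell availability as inputs keeps the check a pure function, which makes each no-op branch easy to unit test without a Linux host.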

Safety Notes

  • Linux-only behavior. macOS, Windows, and other platforms keep their existing spawn shape.
  • Argument-safe execution. The wrapper script is fixed text. The real command and args are passed as shell positional parameters and executed with POSIX-compatible exec "$0" "$@", so user args are not re-parsed as shell source. Leading-dash command names are intentionally left on the original direct-spawn path.
  • Shell env hardening. Wrapped spawns strip BASH_ENV, ENV, and CDPATH so the /bin/sh -c shim cannot source caller-influenced startup files before exec.
  • Transparent failure mode. If /proc/self/oom_score_adj is unavailable or unwritable, stderr is suppressed and the child still runs normally. It just does not get the OOM bias.
  • Plugin boundary kept clean. Browser plugin code uses openclaw/plugin-sdk/process-runtime; it does not deep-import core internals.
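The shell env hardening note can be illustrated with a small filter over the child environment. The three variable names come from the PR description; the function name `hardenShellEnv` and the `Env` type are hypothetical.

```typescript
// Variables that could make /bin/sh source caller-influenced files or change
// path resolution before the shim reaches exec (names from the PR notes).
const SHELL_STARTUP_VARS = ["BASH_ENV", "ENV", "CDPATH"];

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns a copy of env without the startup variables.
function hardenShellEnv(env: Env): Env {
  const cleaned: Env = {};
  for (const key of Object.keys(env)) {
    if (SHELL_STARTUP_VARS.indexOf(key) === -1) {
      cleaned[key] = env[key];
    }
  }
  return cleaned;
}

const childEnv = hardenShellEnv({
  PATH: "/usr/bin",
  ENV: "/tmp/startup.sh",
  BASH_ENV: "/tmp/startup.sh",
  HOME: "/home/app",
});
console.log(Object.keys(childEnv).join(","));
```

Returning a copy rather than mutating the input keeps the caller's environment object intact for spawns that take the direct, unwrapped path.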

Scope Boundary / Related Work

This PR is intentionally a kernel victim-selection fix. It does not try to solve every child-process OOM class.

Related issues/PRs that remain separate work:

  • #70400, #70389, #69145, #64169, #64984: MCP stdio/runtime lifecycle leaks. This PR makes leaked or transient MCP children better OOM victims than the gateway, but it does not replace proper runtime disposal and transport shutdown ordering.
  • #70270, #55698, #30130, #31504: browser/Chrome renderer cleanup and container hardening. This PR covers launched browser process trees with the OOM bias, but stale renderer cleanup/resource caps remain separate lifecycle work.
  • #23409, #28629: broader child resource controls such as cgroup v2 limits, systemd MemoryMax=, spawn caps, and watchdogs. Those are stronger resource-governance features and should not be folded into this focused fix.
  • #68680, #69242: SIGKILL observability. Once children are intentionally preferred OOM victims, surfacing signal-killed subprocesses clearly becomes more useful, but it is an independent reporting improvement.
  • #52205, #47776: process-group and orphan cleanup. The shim uses exec, so it preserves the existing process-tree cleanup model rather than changing it.

Documentation

Added Linux docs for OOM victim selection, covered child process surfaces, opt-out env values, and /proc/<pid>/oom_score_adj verification:

  • docs/platforms/linux.md
  • docs/vps.md

Live Linux Docker Validation

Ran on node:22-bookworm inside Docker and verified real /proc/<pid>/oom_score_adj values for all covered spawn paths:

  • direct shared helper wrapped spawn: 1000
  • direct helper opt-out with OPENCLAW_CHILD_OOM_SCORE_ADJ=0: 0
  • supervisor child adapter: 1000
  • PTY adapter: 1000
  • MCP stdio transport: 1000
  • browser launch path with a fake Chrome executable: 1000
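A check like the one in this validation pass could be sketched as follows. The helper name `readOomScoreAdj` is hypothetical, and the file reader is injected so the example runs without a real `/proc`; this is not the harness used in the PR.

```typescript
type ReadFile = (path: string) => string;

// Hypothetical helper: reads and validates /proc/<pid>/oom_score_adj.
// The kernel accepts values from -1000 to 1000 for this file.
function readOomScoreAdj(pid: number | "self", readFile: ReadFile): number {
  const raw = readFile("/proc/" + pid + "/oom_score_adj").trim();
  if (!/^-?\d+$/.test(raw)) {
    throw new Error("unexpected oom_score_adj content: " + JSON.stringify(raw));
  }
  const value = parseInt(raw, 10);
  if (value < -1000 || value > 1000) {
    throw new Error("oom_score_adj out of range: " + value);
  }
  return value;
}

// Stand-in for the real filesystem: PID 4242 plays the wrapped child.
const fakeProc: ReadFile = (path) =>
  path === "/proc/4242/oom_score_adj" ? "1000\n" : "0\n";

console.log(readOomScoreAdj(4242, fakeProc), readOomScoreAdj("self", fakeProc));
```

On a real Linux host under Node.js, the reader argument could be `(p) => fs.readFileSync(p, "utf8")` and the PID could come from the spawned child's `pid` property.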

Also ran a cgroup memory-pressure simulation with --memory=256m --memory-swap=256m, a gateway-like parent holding ~179 MB RSS, and a child allocating memory in 4 MB chunks:

  • baseline/no wrapper: child inherited oom_score_adj=0; the parent/container was killed with exit 137 while the child was around 141 MB RSS.
  • wrapper enabled: child had oom_score_adj=1000; the child was killed with SIGKILL while the parent stayed alive at ~179 MB RSS.

This live pass also caught a portability bug in the earlier wrapper: Debian's /bin/sh is dash and rejects exec --. The PR now uses portable exec "$0" "$@" and skips wrapping leading-dash command names.

Tests Run

  • pnpm docs:list
  • pnpm test src/process/linux-oom-score.test.ts src/process/supervisor/adapters/child.test.ts src/process/supervisor/adapters/pty.test.ts src/agents/mcp-stdio-transport.test.ts extensions/browser/src/browser/chrome.internal.test.ts
  • node scripts/run-vitest.mjs run --config test/vitest/vitest.extension-browser.config.ts extensions/browser/src/browser/chrome.internal.test.ts
  • pnpm tsgo:prod
  • pnpm plugin-sdk:check-exports
  • pnpm plugin-sdk:api:check
  • pnpm check:changed
  • Linux Docker live harness against node:22-bookworm verifying /proc/<pid>/oom_score_adj for helper, opt-out, supervisor child, PTY, MCP stdio, and browser launch paths.
  • Linux Docker cgroup memory-pressure simulation with --memory=256m --memory-swap=256m, confirming the wrapper changes victim selection from parent/container to child.

Note: after the full pnpm check:changed passed locally on the prior commit, later repeated pnpm check:changed / combined targeted test invocations hit a Vitest unit-fast process stuck at 0% CPU. The focused test lanes above were rerun split by lane and passed.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/.generated/plugin-sdk-api-baseline.sha256 (modified, +2/-2)
  • docs/platforms/linux.md (modified, +37/-0)
  • docs/vps.md (modified, +3/-0)
  • extensions/browser/src/browser/chrome.ts (modified, +6/-2)
  • src/agents/mcp-stdio-transport.test.ts (modified, +20/-3)
  • src/agents/mcp-stdio-transport.ts (modified, +12/-5)
  • src/plugin-sdk/process-runtime.ts (modified, +2/-0)
  • src/process/linux-oom-score.test.ts (added, +105/-0)
  • src/process/linux-oom-score.ts (added, +143/-0)
  • src/process/supervisor/adapters/child.test.ts (modified, +53/-1)
  • src/process/supervisor/adapters/child.ts (modified, +7/-2)
  • src/process/supervisor/adapters/pty.test.ts (modified, +55/-8)
  • src/process/supervisor/adapters/pty.ts (modified, +5/-2)
Raw issue body

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

  1. Deploy openclaw via Docker/Kubernetes with a hard memory limit of 6GiB (memory.max=6GiB).

  2. Keep the openclaw-gateway running so its memory footprint naturally grows (e.g., ~2 GiB).

  3. Trigger workloads that spawn memory-intensive child processes (e.g., headless Chromium or heavy Node tasks).

Once the container hits the 6GiB physical memory ceiling, a cgroup-local OOM is triggered.

Check dmesg or container logs: openclaw-gateway is killed with Exit Code 137 because its memory footprint is larger than any single child process.

Expected behavior

Transient task workers or headless browser processes should have a higher OOM score (e.g., oom_score_adj = 1000) so the Linux kernel prioritizes killing them over the main gateway daemon.

Actual behavior

Once the container hits the 6GiB physical memory ceiling, a cgroup-local OOM is triggered.

Check dmesg or container logs: openclaw-gateway is killed with Exit Code 137 because its memory footprint is larger than any single child process.
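Exit code 137 follows the common convention of reporting a signal death as 128 plus the signal number, and SIGKILL is signal 9. A small sketch of decoding it is shown below; the function names are hypothetical, and a SIGKILL alone does not prove an OOM kill, so dmesg or the cgroup's memory.events oom_kill counter should still be consulted.

```typescript
const SIGKILL = 9;

// Returns the signal number encoded in a shell-style exit status, or null
// when the status is an ordinary exit code.
function signalFromExitCode(code: number): number | null {
  return code > 128 ? code - 128 : null;
}

// True when the status matches death by SIGKILL, the signal the OOM killer sends.
function wasSigkilled(code: number): boolean {
  return signalFromExitCode(code) === SIGKILL;
}

console.log(signalFromExitCode(137), wasSigkilled(137), wasSigkilled(1));
```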

OpenClaw version

HEAD

Operating system

Ubuntu

Install method

docker

Model

N/A

Provider / routing chain

openclaw

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

High. When the gateway is killed by OOM, all connected clients, web sockets, and agent sessions are immediately disconnected, causing a severe service outage. Implementing this would make OpenClaw significantly more resilient in cloud-native/containerized environments.

Additional information

Root Cause & Proposed Solution

By default, lowering a process's oom_score_adj (making it more immune) requires the CAP_SYS_RESOURCE capability, which is typically stripped from unprivileged containers. Furthermore, if a parent process tries to modify a child's score externally, it often hits Permission denied errors because of the Linux dumpable attribute and user namespace restrictions.

Optimization:

However, any non-root process is allowed to voluntarily INCREASE its own oom_score_adj without needing CAP_SYS_RESOURCE or sudo. (The oom_score_adj file itself has existed since Linux 2.6.36; its predecessor oom_adj is older.)

Instead of having openclaw-gateway modify the child's score, the gateway can spawn child processes wrapped in a shell script that raises their own OOM score before replacing the shell with the target executable.

Extended analysis

TL;DR

Modify the child process spawning mechanism to increase the oom_score_adj value of child processes, making them more likely to be killed by the OOM killer instead of the gateway process.

Guidance

  • Identify the code responsible for spawning child processes in openclaw-gateway and modify it to wrap the child process execution in a shell script.
  • In the shell script, increase the oom_score_adj value of the process to a high value (e.g., 1000) using the proc filesystem, for example, by writing to /proc/self/oom_score_adj.
  • Ensure the shell script replaces itself with the target executable using exec to avoid leaving the shell process running.
  • Verify that the oom_score_adj value of child processes is correctly set by checking the /proc/<pid>/oom_score_adj file for a child process.

Example

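#!/bin/sh
# Raise this process's own OOM score so the kernel prefers it as a victim.
echo 1000 > /proc/self/oom_score_adj
# Replace the shell with the target so no wrapper process stays resident.
exec /path/to/target/executable "$@"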

This script sets the oom_score_adj value to 1000 and then replaces itself with the target executable.

Notes

This solution relies on the fact that a non-root process can voluntarily increase its own oom_score_adj value without needing special capabilities. This approach avoids granting the parent process extra capabilities or using sudo.

Recommendation

Apply the workaround by modifying the child process spawning mechanism to use a shell script that increases the oom_score_adj value of child processes. This will make child processes more likely to be killed by the OOM killer, preserving the gateway process and improving the overall resilience of the system.
