openclaw - ✅(Solved) Fix web_fetch: Chrome renderer processes accumulate and are never cleaned up (memory leak) [1 pull requests, 2 comments, 3 participants]

GitHub stats
openclaw/openclaw#70270 · Fetched 2026-04-23 07:26:55
Comments: 2 · Participants: 3 · Timeline events: 3 · Reactions: 0
Timeline (top): commented ×2, cross-referenced ×1

When using web_fetch (headless Chrome browser tool), renderer processes accumulate over time and are never terminated, causing a memory leak that can crash the gateway on constrained servers.

Observed behavior

After approximately 24 hours of normal use (briefing cron jobs, web_fetch calls), the Chrome renderer count grew to 46 processes consuming 2.6 GB RAM on a 3.7 GB VPS (Hetzner CX22). RAM usage reached 91%, with 1.8 GB swap in use.

Root Cause

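The issue report does not pin down a root cause. The linked PR #70419 lists #70270 among the browser/Chrome renderer cleanup issues that it leaves as separate work. The PR makes launched browser processes preferred OOM victims so that the gateway survives memory pressure, but it does not terminate stale renderers.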

Fix Action

Workaround

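Daily gateway restart via cron. The entry below belongs in the crontab of the user that runs the gateway (UID 1000, matching XDG_RUNTIME_DIR) and restarts the service every day at 03:45: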

45 3 * * * XDG_RUNTIME_DIR=/run/user/1000 systemctl --user restart openclaw-gateway.service

PR fix notes

PR #70419: fix(gateway): raise child oom_score_adj on linux to spare the gateway under OOM

Description (problem / solution / changelog)

Closes #70404.

Root Cause

On Linux, child processes inherit the gateway's oom_score_adj. In a memory-constrained cgroup, the gateway is often the largest-RSS process because it keeps long-lived WebSocket state and its V8 heap resident, while transient children such as agent workers, MCP stdio servers, PTY shells, and Chrome/browser helpers are individually smaller. When the cgroup hits its memory limit, the kernel can therefore kill openclaw-gateway instead of the transient child that pushed the cgroup over the edge. The gateway exits with status 137 (128 + SIGKILL), and all connected sessions drop.
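You can see which process the kernel currently ranks as the likeliest OOM victim by reading /proc/<pid>/oom_score. This is the kernel's per-process badness value: it is derived from the process's memory footprint (RSS, swap, page tables) and shifted by oom_score_adj. The following is a minimal sketch, assuming Linux with /proc mounted. rankOomCandidates is a hypothetical helper, not part of OpenClaw. Under a cgroup memory limit the kernel only considers tasks inside that cgroup, so a host-wide ranking like this one is an approximation.

```typescript
import { readdirSync, readFileSync } from "node:fs";

// One row per process: pid, command name, kernel badness score, and the adjustment.
interface OomCandidate {
  pid: number;
  comm: string;
  oomScore: number;
  oomScoreAdj: number;
}

// Reads a small /proc file; returns undefined if the process exited or access is denied.
function readProcFile(pid: number, name: string): string | undefined {
  try {
    return readFileSync(`/proc/${pid}/${name}`, "utf8").trim();
  } catch {
    return undefined;
  }
}

// Hypothetical helper: processes sorted by oom_score, likeliest victim first.
function rankOomCandidates(limit = 10): OomCandidate[] {
  const rows: OomCandidate[] = [];
  for (const entry of readdirSync("/proc")) {
    if (!/^\d+$/.test(entry)) continue; // only numeric entries are processes
    const pid = Number(entry);
    const score = readProcFile(pid, "oom_score");
    const adj = readProcFile(pid, "oom_score_adj");
    const comm = readProcFile(pid, "comm");
    if (score === undefined || adj === undefined || comm === undefined) continue;
    rows.push({ pid, comm, oomScore: Number(score), oomScoreAdj: Number(adj) });
  }
  rows.sort((a, b) => b.oomScore - a.oomScore);
  return rows.slice(0, limit);
}

if (process.platform === "linux") {
  for (const row of rankOomCandidates()) {
    console.log(`${row.pid}\t${row.comm}\toom_score=${row.oomScore}\tadj=${row.oomScoreAdj}`);
  }
}
```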

The important constraint: lowering the gateway's OOM score, or having the parent process write a lower score into children, is capability-sensitive in hardened containers. The reliable unprivileged operation is the opposite: a Linux process may voluntarily increase its own OOM kill likelihood.

Fix

Add a shared Linux-only spawn helper that wraps eligible child commands in a short /bin/sh shim:

/bin/sh -c 'echo 1000 > /proc/self/oom_score_adj 2>/dev/null; exec "$0" "$@"' <cmd> <args...>

The shim runs in the post-fork child, raises that child's own oom_score_adj, then execs the real command. There is no extra long-lived shell process, and after exec the process identity, PID, stdio, exit, and kill semantics remain the target process.
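You can exercise the shim directly from Node to check that the raised score survives exec. The following is a minimal sketch, assuming Linux with /bin/sh and /proc; runWithOomBias is a hypothetical name. It runs cat /proc/self/oom_score_adj as the target command, so the printed value is the exec'd process's own score.

As a variant of the quoted script, the sketch wraps the write in a brace group: { echo 1000 > /proc/self/oom_score_adj; } 2>/dev/null. The reason is that sh applies redirections left to right. When 2>/dev/null follows the output redirection on the same command, an error from opening the file is printed before stderr is redirected.

```typescript
import { execFileSync } from "node:child_process";

// Variant of the PR's shim: raise this process's own oom_score_adj, then exec the target.
// The brace group sends errors from opening the file to /dev/null as well.
const SHIM = '{ echo 1000 > /proc/self/oom_score_adj; } 2>/dev/null; exec "$0" "$@"';

// Hypothetical helper: runs `command ...args` through the shim and returns its stdout.
// command and args become $0 and $@, so they are never parsed as shell source.
function runWithOomBias(command: string, args: string[]): string {
  return execFileSync("/bin/sh", ["-c", SHIM, command, ...args], { encoding: "utf8" });
}

if (process.platform === "linux") {
  // After exec, cat is the same process the shell was, so /proc/self is the target itself.
  const childAdj = Number(runWithOomBias("cat", ["/proc/self/oom_score_adj"]).trim());
  console.log(`target oom_score_adj: ${childAdj}`); // 1000 when the write succeeded
}
```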

Currently covered spawn surfaces:

  • src/process/supervisor/adapters/child.ts for regular supervisor-managed children.
  • src/process/supervisor/adapters/pty.ts for PTY-backed shell children.
  • src/agents/mcp-stdio-transport.ts for MCP stdio server children.
  • extensions/browser/src/browser/chrome.ts for launched browser/Chrome processes, through the public plugin SDK process-runtime seam.

The helper is a no-op when:

  • the platform is not Linux,
  • OPENCLAW_CHILD_OOM_SCORE_ADJ=0 / false / no / off is set in the child env,
  • /bin/sh is unavailable, so distroless/scratch images degrade to previous behavior instead of failing with ENOENT,
  • the argv is already wrapped,
  • the command name starts with -, because not every POSIX sh accepts exec -- (Debian's dash rejects it) and a leading-dash command could otherwise be parsed as an exec option.
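The no-op rules above fit naturally into a pure function over a spawn description, which keeps them easy to unit-test without spawning anything. The following is a minimal sketch. SpawnSpec, WrapContext, and wrapForOomBias are hypothetical names, not the API of src/process/linux-oom-score.ts. The sketch also strips BASH_ENV, ENV, and CDPATH from wrapped spawns, as described under Safety Notes. Two details are assumptions: opt-out values are matched case-insensitively, and an existing wrap is detected by comparing the script text.

```typescript
// Shim text as quoted in the PR description.
const OOM_SHIM = 'echo 1000 > /proc/self/oom_score_adj 2>/dev/null; exec "$0" "$@"';
const OPT_OUT_VALUES = new Set(["0", "false", "no", "off"]);
const SHELL_STARTUP_VARS = ["BASH_ENV", "ENV", "CDPATH"];

// Hypothetical description of one child spawn.
interface SpawnSpec {
  command: string;
  args: string[];
  env: Record<string, string | undefined>;
}

// Host facts passed in explicitly so the rules stay pure and testable.
interface WrapContext {
  platform: string; // e.g. process.platform
  hasBinSh: boolean; // e.g. existsSync("/bin/sh")
}

// Returns a wrapped spec, or the original spec unchanged when wrapping does not apply.
function wrapForOomBias(spec: SpawnSpec, ctx: WrapContext): SpawnSpec {
  if (ctx.platform !== "linux") return spec;
  const optOut = spec.env.OPENCLAW_CHILD_OOM_SCORE_ADJ;
  if (optOut !== undefined && OPT_OUT_VALUES.has(optOut.trim().toLowerCase())) return spec;
  if (!ctx.hasBinSh) return spec; // distroless/scratch: keep the direct spawn
  const alreadyWrapped =
    spec.command === "/bin/sh" && spec.args[0] === "-c" && spec.args[1] === OOM_SHIM;
  if (alreadyWrapped) return spec;
  if (spec.command.startsWith("-")) return spec; // could be parsed as an exec option
  const env = { ...spec.env };
  for (const name of SHELL_STARTUP_VARS) delete env[name]; // keep sh from sourcing startup files
  return { command: "/bin/sh", args: ["-c", OOM_SHIM, spec.command, ...spec.args], env };
}

const wrapped = wrapForOomBias(
  { command: "chrome", args: ["--headless"], env: { PATH: "/usr/bin", BASH_ENV: "/tmp/rc" } },
  { platform: "linux", hasBinSh: true },
);
// wrapped.command is /bin/sh; the original command and args follow the shim text.
console.log(wrapped.command, wrapped.args.slice(2));
```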

Safety Notes

  • Linux-only behavior. macOS, Windows, and other platforms keep their existing spawn shape.
  • Argument-safe execution. The wrapper script is fixed text. The real command and args are passed as shell positional parameters and executed with POSIX-compatible exec "$0" "$@", so user args are not re-parsed as shell source. Leading-dash command names are intentionally left on the original direct-spawn path.
  • Shell env hardening. Wrapped spawns strip BASH_ENV, ENV, and CDPATH so the /bin/sh -c shim cannot source caller-influenced startup files before exec.
  • Transparent failure mode. If /proc/self/oom_score_adj is unavailable or unwritable, stderr is suppressed and the child still runs normally. It just does not get the OOM bias.
  • Plugin boundary kept clean. Browser plugin code uses openclaw/plugin-sdk/process-runtime; it does not deep-import core internals.
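The argument-safety note can be checked with arguments that would run extra commands if the shell re-parsed them. The following is a minimal sketch, assuming a POSIX host with /bin/sh and printf on PATH; the hostile strings are illustrative. These strings reach exec "$0" "$@" as positional parameters. As a result, printf receives and prints each one verbatim, with no command substitution, word splitting, or globbing.

```typescript
import { execFileSync } from "node:child_process";

// Same argv shape as the shim: the script is fixed text, everything after it is data.
const SHIM = 'echo 1000 > /proc/self/oom_score_adj 2>/dev/null; exec "$0" "$@"';

// Arguments that would run extra commands if they were re-parsed as shell source.
const hostileArgs = ["$(echo injected)", "a; echo pwned", "`id`", "*", "it's \"quoted\""];

// printf reuses the format for each argument, printing one per line exactly as received.
const out = execFileSync("/bin/sh", ["-c", SHIM, "printf", "%s\\n", ...hostileArgs], {
  encoding: "utf8",
});
console.log(out.split("\n").slice(0, -1)); // the five strings above, unchanged
```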

Scope Boundary / Related Work

This PR is intentionally a kernel victim-selection fix. It does not try to solve every child-process OOM class.

Related issues/PRs that remain separate work:

  • #70400, #70389, #69145, #64169, #64984: MCP stdio/runtime lifecycle leaks. This PR makes leaked or transient MCP children better OOM victims than the gateway, but it does not replace proper runtime disposal and transport shutdown ordering.
  • #70270, #55698, #30130, #31504: browser/Chrome renderer cleanup and container hardening. This PR covers launched browser process trees with the OOM bias, but stale renderer cleanup/resource caps remain separate lifecycle work.
  • #23409, #28629: broader child resource controls such as cgroup v2 limits, systemd MemoryMax=, spawn caps, and watchdogs. Those are stronger resource-governance features and should not be folded into this focused fix.
  • #68680, #69242: SIGKILL observability. Once children are intentionally preferred OOM victims, surfacing signal-killed subprocesses clearly becomes more useful, but it is an independent reporting improvement.
  • #52205, #47776: process-group and orphan cleanup. The shim uses exec, so it preserves the existing process-tree cleanup model rather than changing it.

Documentation

Added Linux docs for OOM victim selection, covered child process surfaces, opt-out env values, and /proc/<pid>/oom_score_adj verification:

  • docs/platforms/linux.md
  • docs/vps.md

Live Linux Docker Validation

Ran on node:22-bookworm inside Docker and verified real /proc/<pid>/oom_score_adj values for all covered spawn paths:

  • direct shared helper wrapped spawn: 1000
  • direct helper opt-out with OPENCLAW_CHILD_OOM_SCORE_ADJ=0: 0
  • supervisor child adapter: 1000
  • PTY adapter: 1000
  • MCP stdio transport: 1000
  • browser launch path with a fake Chrome executable: 1000
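You can run the same kind of check on a live host by reading /proc/<pid>/oom_score_adj for every process whose command line matches a pattern. On a patched gateway, Chrome processes would be expected to show 1000 unless the opt-out is set. The following is a minimal sketch, assuming Linux with /proc; listOomScoreAdj and the chrome pattern are illustrative. /proc/<pid>/cmdline separates arguments with NUL bytes, which the sketch joins with spaces before matching.

```typescript
import { readdirSync, readFileSync } from "node:fs";

interface AdjRow {
  pid: number;
  oomScoreAdj: number;
  cmdline: string;
}

// Hypothetical helper: oom_score_adj for every process whose command line matches `pattern`.
function listOomScoreAdj(pattern: RegExp): AdjRow[] {
  const rows: AdjRow[] = [];
  for (const entry of readdirSync("/proc")) {
    if (!/^\d+$/.test(entry)) continue; // only numeric entries are processes
    try {
      // cmdline is NUL-separated; join the arguments with spaces for matching and display.
      const cmdline = readFileSync(`/proc/${entry}/cmdline`, "utf8").split("\0").filter(Boolean).join(" ");
      if (!pattern.test(cmdline)) continue;
      const adj = Number(readFileSync(`/proc/${entry}/oom_score_adj`, "utf8").trim());
      rows.push({ pid: Number(entry), oomScoreAdj: adj, cmdline });
    } catch {
      // The process exited between readdir and read, or access was denied.
    }
  }
  return rows;
}

if (process.platform === "linux") {
  for (const row of listOomScoreAdj(/chrome/)) {
    console.log(`${row.pid}\t${row.oomScoreAdj}\t${row.cmdline.slice(0, 80)}`);
  }
}
```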

Also ran a cgroup memory-pressure simulation with --memory=256m --memory-swap=256m, a gateway-like parent holding ~179 MB RSS, and a child allocating memory in 4 MB chunks:

  • baseline/no wrapper: child inherited oom_score_adj=0; the parent/container was killed with exit 137 while the child was around 141 MB RSS.
  • wrapper enabled: child had oom_score_adj=1000; the child was killed with SIGKILL while the parent stayed alive at ~179 MB RSS.
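You can approximate the child side of that simulation with a small allocator that grows in 4 MB steps. It fills each chunk so its pages actually become resident, because a zero-filled allocation may not touch them. The following is a minimal sketch; allocateInChunks and its maxBytes bound are hypothetical, and it uses 4 MiB as the chunk size. Start it inside a container limited with --memory=256m --memory-swap=256m and give it a bound above that limit. It then keeps allocating until the cgroup's OOM killer picks a victim.

```typescript
const CHUNK_BYTES = 4 * 1024 * 1024; // 4 MiB per step

// Allocates up to maxBytes in 4 MiB chunks, filling each so its pages become resident.
// Returns the retained chunks so the memory stays live.
function allocateInChunks(maxBytes: number, onStep?: (totalBytes: number) => void): Buffer[] {
  const chunks: Buffer[] = [];
  let total = 0;
  while (total + CHUNK_BYTES <= maxBytes) {
    chunks.push(Buffer.alloc(CHUNK_BYTES, 1)); // non-zero fill touches every page
    total += CHUNK_BYTES;
    onStep?.(total);
  }
  return chunks;
}

// Example: grow to 64 MiB and report RSS every 16 MiB.
const held = allocateInChunks(64 * 1024 * 1024, (total) => {
  if (total % (16 * 1024 * 1024) === 0) {
    const rssMiB = Math.round(process.memoryUsage().rss / (1024 * 1024));
    console.log(`allocated ${total / (1024 * 1024)} MiB, rss ~${rssMiB} MiB`);
  }
});
console.log(`holding ${held.length} chunks`);
```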

This live pass also caught a portability bug in the earlier wrapper: Debian's /bin/sh is dash and rejects exec --. The PR now uses portable exec "$0" "$@" and skips wrapping leading-dash command names.

Tests Run

  • pnpm docs:list
  • pnpm test src/process/linux-oom-score.test.ts src/process/supervisor/adapters/child.test.ts src/process/supervisor/adapters/pty.test.ts src/agents/mcp-stdio-transport.test.ts extensions/browser/src/browser/chrome.internal.test.ts
  • node scripts/run-vitest.mjs run --config test/vitest/vitest.extension-browser.config.ts extensions/browser/src/browser/chrome.internal.test.ts
  • pnpm tsgo:prod
  • pnpm plugin-sdk:check-exports
  • pnpm plugin-sdk:api:check
  • pnpm check:changed
  • Linux Docker live harness against node:22-bookworm verifying /proc/<pid>/oom_score_adj for helper, opt-out, supervisor child, PTY, MCP stdio, and browser launch paths.
  • Linux Docker cgroup memory-pressure simulation with --memory=256m --memory-swap=256m, confirming the wrapper changes victim selection from parent/container to child.

Note: after the full pnpm check:changed passed locally on the prior commit, later repeated pnpm check:changed / combined targeted test invocations hit a Vitest unit-fast process stuck at 0% CPU. The focused test lanes above were rerun split by lane and passed.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/.generated/plugin-sdk-api-baseline.sha256 (modified, +2/-2)
  • docs/platforms/linux.md (modified, +37/-0)
  • docs/vps.md (modified, +3/-0)
  • extensions/browser/src/browser/chrome.ts (modified, +6/-2)
  • src/agents/mcp-stdio-transport.test.ts (modified, +20/-3)
  • src/agents/mcp-stdio-transport.ts (modified, +12/-5)
  • src/plugin-sdk/process-runtime.ts (modified, +2/-0)
  • src/process/linux-oom-score.test.ts (added, +105/-0)
  • src/process/linux-oom-score.ts (added, +143/-0)
  • src/process/supervisor/adapters/child.test.ts (modified, +53/-1)
  • src/process/supervisor/adapters/child.ts (modified, +7/-2)
  • src/process/supervisor/adapters/pty.test.ts (modified, +55/-8)
  • src/process/supervisor/adapters/pty.ts (modified, +5/-2)

Code Example

$ ps aux | grep 'chrome.*renderer' | wc -l
46

$ ps aux | grep 'chrome.*renderer' | awk '{sum += $6} END {print sum " KB"}'
2611628 KB   # ≈ 2.6 GB
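These pipelines are quick but approximate. The pattern chrome.*renderer also matches the grep process's own command line, so the count can be one too high. Summing RSS also double-counts memory that renderer processes share. The following is a minimal sketch that counts only Chrome processes started with the --type=renderer flag, assuming procps ps (as on Debian); summarizeRenderers is a hypothetical helper.

```typescript
import { execFileSync } from "node:child_process";

interface RendererSummary {
  count: number;
  rssKiB: number; // summed per-process RSS; overstates real usage because renderers share pages
}

// Parses `ps -eo rss=,args=` output (RSS in KiB, then the full command line).
function summarizeRenderers(psOutput: string): RendererSummary {
  let count = 0;
  let rssKiB = 0;
  for (const line of psOutput.split("\n")) {
    const match = /^\s*(\d+)\s+(.*)$/.exec(line);
    if (!match) continue;
    const args = match[2] ?? "";
    // Chrome starts renderer children with --type=renderer; the grep above has no such flag.
    if (!/chrom(e|ium)/i.test(args) || !/--type=renderer\b/.test(args)) continue;
    count += 1;
    rssKiB += Number(match[1]);
  }
  return { count, rssKiB };
}

try {
  const summary = summarizeRenderers(execFileSync("ps", ["-eo", "rss=,args="], { encoding: "utf8" }));
  const gib = summary.rssKiB / (1024 * 1024);
  console.log(`${summary.count} renderer processes, ${gib.toFixed(2)} GiB RSS (summed)`);
} catch (err) {
  console.error("could not run ps:", (err as Error).message);
}
```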


Observed behavior


43 of the 46 renderer processes were stale: they had been running since the previous day (Apr 21), and none had been cleaned up after its web_fetch session completed.

Steps to reproduce

  1. Configure web_fetch with the browser provider
  2. Run several cron sessions that call web_fetch (e.g. daily briefings, market monitors)
  3. Wait 12–24 hours
  4. Observe Chrome renderer processes accumulating (ps aux | grep chrome | wc -l)

Environment

  • OpenClaw: 2026.4.15
  • OS: Debian 13 (Trixie), Linux 6.12.74
  • Node: 22.22.2
  • Chrome: /opt/google/chrome (headless), --no-sandbox --disable-dev-shm-usage
  • VPS: Hetzner CX22 (2 vCPU, 4 GB RAM)

Impact

The gateway requires a full restart to recover. On a 4 GB VPS, the accumulated renderers can cause an out-of-memory condition within 24 hours.


Expected behavior

Chrome renderer processes should be terminated after their web_fetch session completes, not left running indefinitely.

Extent analysis

TL;DR

The most likely fix is to implement a mechanism to terminate Chrome renderer processes after their web_fetch session completes, potentially through modifications to the web_fetch tool or its configuration.
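One general shape for such a mechanism is to tie each browser page or context opened for a fetch to a try/finally block. That way it is closed even when the fetch throws or times out. The following is a minimal sketch. BrowserSession, withSession, and the timeout handling are hypothetical stand-ins, not OpenClaw's web_fetch internals or any particular CDP library's API. Closing pages normally lets Chrome end a renderer once no page uses it, but the renderer count is still worth monitoring.

```typescript
// Hypothetical handle for the browser page/context used by one fetch.
interface BrowserSession {
  fetchText(url: string): Promise<string>;
  close(): Promise<void>;
}

// Opens a session, runs `work`, and always closes the session afterwards,
// including when `work` throws or the timeout fires first.
async function withSession<T>(
  openSession: () => Promise<BrowserSession>,
  work: (session: BrowserSession) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const session = await openSession();
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`fetch timed out after ${timeoutMs} ms`)), timeoutMs);
    });
    return await Promise.race([work(session), timeout]);
  } finally {
    clearTimeout(timer);
    await session.close().catch(() => undefined); // a failed close should not mask the outcome
  }
}

// Usage with a stand-in session; a real one would wrap a browser page or context.
const fakeSession = async (): Promise<BrowserSession> => ({
  fetchText: async (url) => `<html>${url}</html>`,
  close: async () => console.log("session closed"),
});
void withSession(fakeSession, (s) => s.fetchText("https://example.com"), 30_000).then(console.log);
```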

Guidance

  • Investigate the web_fetch tool's configuration and documentation to see if there are any options for automatically terminating Chrome renderer processes after use.
  • Consider implementing a script or cron job to periodically clean up stale Chrome renderer processes, in addition to the existing daily gateway restart workaround.
  • Make sure memory usage is monitored with alert thresholds, so that an approaching out-of-memory condition is noticed before the kernel kills the gateway.
  • Examine the Chrome launch flags (--no-sandbox --disable-dev-shm-usage) to determine if they may be contributing to the issue, and consider alternative flags or configurations.
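For the monitoring point, a simple host-level signal is the share of RAM that the kernel reports as not available. The following is a minimal sketch, assuming Linux's /proc/meminfo (values in kB; the MemAvailable field exists since Linux 3.14). parseMeminfo, memoryUsedPercent, and the 85% threshold are illustrative. A real setup would feed the value into existing alerting rather than print it.

```typescript
import { readFileSync } from "node:fs";

// Parses /proc/meminfo lines such as "MemAvailable:    123456 kB" into kB values by field name.
function parseMeminfo(text: string): Map<string, number> {
  const fields = new Map<string, number>();
  for (const line of text.split("\n")) {
    const match = /^(\w+(?:\(\w+\))?):\s+(\d+)/.exec(line);
    if (match) fields.set(match[1], Number(match[2]));
  }
  return fields;
}

// Percentage of RAM that is not available for new allocations without swapping.
function memoryUsedPercent(fields: Map<string, number>): number {
  const total = fields.get("MemTotal");
  const available = fields.get("MemAvailable");
  if (!total || available === undefined) throw new Error("MemTotal/MemAvailable missing");
  return 100 * (1 - available / total);
}

const THRESHOLD_PERCENT = 85; // illustrative alert threshold

if (process.platform === "linux") {
  const used = memoryUsedPercent(parseMeminfo(readFileSync("/proc/meminfo", "utf8")));
  const status = used >= THRESHOLD_PERCENT ? "ALERT" : "ok";
  console.log(`${status}: ${used.toFixed(1)}% of RAM in use`);
}
```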

Example

No specific code example is provided, as the issue is more related to system configuration and process management.

Notes

The provided workaround of daily gateway restarts via cron may not be suitable for all environments, and a more targeted solution to terminate Chrome renderer processes after use is desirable to prevent memory leaks and out-of-memory errors.

Recommendation

Apply a workaround, such as implementing a script to periodically clean up stale Chrome renderer processes, until a more permanent fix can be developed and implemented. This will help mitigate the memory leak issue and prevent out-of-memory errors.
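A periodic cleanup script could list renderer processes older than a cutoff and, once the list looks right, send them SIGTERM. The following is a minimal sketch, assuming procps ps with the etimes (elapsed seconds) output field, as on Debian. findStaleRenderers, the 6-hour cutoff, and the DRY_RUN switch are illustrative. Terminating a renderer that still backs an open page crashes that page, so the cutoff should comfortably exceed the longest expected web_fetch session.

```typescript
import { execFileSync } from "node:child_process";

interface StaleRenderer {
  pid: number;
  ageSeconds: number;
}

// Parses `ps -eo pid=,etimes=,args=` output; returns Chrome renderers older than maxAgeSeconds.
function findStaleRenderers(psOutput: string, maxAgeSeconds: number): StaleRenderer[] {
  const stale: StaleRenderer[] = [];
  for (const line of psOutput.split("\n")) {
    const match = /^\s*(\d+)\s+(\d+)\s+(.*)$/.exec(line);
    if (!match) continue;
    const args = match[3] ?? "";
    if (!/chrom(e|ium)/i.test(args) || !/--type=renderer\b/.test(args)) continue;
    const ageSeconds = Number(match[2]);
    if (ageSeconds > maxAgeSeconds) stale.push({ pid: Number(match[1]), ageSeconds });
  }
  return stale;
}

const MAX_AGE_SECONDS = 6 * 60 * 60; // illustrative cutoff: 6 hours
const DRY_RUN: boolean = true; // list only; set to false to send SIGTERM

try {
  const output = execFileSync("ps", ["-eo", "pid=,etimes=,args="], { encoding: "utf8" });
  for (const { pid, ageSeconds } of findStaleRenderers(output, MAX_AGE_SECONDS)) {
    console.log(`stale renderer pid=${pid} age=${Math.round(ageSeconds / 3600)}h`);
    if (!DRY_RUN) {
      try {
        process.kill(pid, "SIGTERM");
      } catch (err) {
        console.error(`could not signal ${pid}:`, (err as Error).message);
      }
    }
  }
} catch (err) {
  console.error("could not run ps:", (err as Error).message);
}
```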
