[Bug]: openclaw-agent ignores SIGTERM under cron, accumulates hung process chains and exhausts host RAM/swap

GitHub stats
openclaw/openclaw#71710 · Fetched 2026-04-26 05:09:34
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): cross-referenced ×1, mentioned ×1, subscribed ×1

When cron runs timeout 600 openclaw agent ... (without -k <delay>), the spawned openclaw agent processes accumulate as long-lived hung chains because the agent does not exit on SIGTERM. A cron schedule that fires several times a day exhausts host RAM and swap within ~2.5 days.


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

When cron runs timeout 600 openclaw agent ... (without -k <delay>), the spawned openclaw agent processes accumulate as long-lived hung chains because the agent does not exit on SIGTERM. A cron schedule that fires several times a day exhausts host RAM and swap within ~2.5 days.

Steps to reproduce

  1. On a Linux host with cron, schedule any agent with a wrapper that has no SIGKILL escalation, e.g.:
    0 2,8,14,20 * * * timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
    0 */2 * * *      timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
  2. Let it run for ≥48 hours. Each time the agent does something that blocks past the 600s budget (synchronous gateway call, slow tool, networked search, etc.), the wrapper sends SIGTERM at 600s — the agent does not exit.
  3. Inspect ps -eo pid,ppid,stat,etimes,rss,cmd | grep openclaw-agent | sort -k4,4n (etimes reports elapsed time in seconds, so the numeric sort orders the processes by age). You will see process trees layered as /bin/sh -c "timeout 600 openclaw agent ..." -> timeout -> openclaw-agent, all of them surviving across many cron firings.
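
The inspection in step 3 can be narrowed to agents that have already outlived the wrapper's budget. The helper below is a minimal sketch, not part of OpenClaw: the name hung_agents is hypothetical, and it assumes a procps ps that supports the etimes (elapsed seconds) column and an agent command line that begins with openclaw-agent, as the pkill pattern used during cleanup in this report suggests.

```shell
# Filter `ps` rows down to agents older than a given age in seconds.
# Expected input columns: pid ppid etimes rss args...
# Prints: pid, age in seconds, RSS in KiB.
hung_agents() {
    awk -v limit="${1:-600}" \
        '$3 > limit && $5 ~ /^openclaw-agent/ { print $1, $3, $4 }'
}

# Live usage (assumes procps ps):
#   ps -eo pid=,ppid=,etimes=,rss=,args= | hung_agents 600
```

Any row this prints is an agent that has outlived the 600s budget, which is the condition the report describes as a leaked chain.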

Expected behavior

One of:

  • (a) The agent installs an async-safe SIGTERM handler that drains in-flight work and exits within a small bounded grace window (e.g. ≤30s), or
  • (b) The shipped CLI/cron documentation explicitly requires timeout -k <delay> <duration> (or equivalent) for any cron-spawned agent and warns about the leak otherwise.

Either is acceptable; (a) is the proper fix, (b) is the minimal documentation fix that prevents new users from being burned.

Actual behavior

Real incident timeline observed on this host:

  • 2026-04-17 17:34 UTC — first occurrence. 14 hung openclaw-agent processes accumulated across two cron schedules (4×/day karma + 12×/day intel). Swap 96%, RAM 4.0 GiB used / 11 GiB available; Opus turn failed because the local Ollama RAM-fallback couldn't allocate 13.7 GiB.
  • Initial mitigation (same day) — wrapped both cron lines in timeout 300 openclaw agent .... This stopped future indefinite growth but did not prevent SIGTERM-swallowed leaks beyond the 300s budget. Eight days later the chains were back, just slower.
  • 2026-04-25 00:14 UTC — second occurrence after we extended the budget to timeout 600. 23 hung chains accumulated over 2.5 days, RSS ~7 GiB combined, swap 100% (4.0 / 4.0 GiB), available 3.6 GiB. RAM used 11 / 15 GiB. Gateway PID was at risk of OOM-kill.

Process layering observed (from ps -ef and pstree):

/bin/sh -c "timeout 600 openclaw agent --agent social-director --message ..."
 └── timeout 600 openclaw agent --agent ... (waiting for child after SIGTERM at t=600s)
      └── openclaw-agent (still alive, RSS 40-730 MB depending on age)

pkill -TERM on openclaw-agent had no effect — only pkill -9 (SIGKILL) freed the chain. Cleanup required two passes:

  1. pkill -9 -f "/bin/sh -c timeout 600 openclaw agent --agent social-director" — outer shells
  2. pkill -9 -f "^openclaw-agent" --older 600 — orphaned agents reparented to init

After cleanup: hung processes 69 → 0 (consistent with the 23 chains above at three processes each), RAM 11 → 4.6 GiB, swap 4.0 GiB → 374 MiB. Gateway untouched, uptime preserved.

OpenClaw version

2026.4.23 (incident reproduced on the same version family across 2026.4.17–2026.4.23)

Operating system

Ubuntu 24.04.4 LTS (kernel 6.8.0-107-generic), systemd 255

Install method

npm global (/home/ubuntu/.npm-global/lib/node_modules/openclaw), Node v22.22.1

Model

claude-cli/claude-opus-4-7 (the agent under cron is social-director running on this primary)

Provider / routing chain

cron -> /bin/sh -> coreutils timeout -> openclaw agent (CLI) -> openclaw gateway (systemd --user) -> auth-profile anthropic:claude-cli -> claude (CLI) -> anthropic.com

Additional provider/model setup details

Workaround currently in production:

0 2,8,14,20 * * * timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1
0 */2 * * *      timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1

The -k 60 flag tells coreutils-timeout to send SIGKILL 60s after the SIGTERM if the child hasn't exited. This mitigates the symptom but does not address the root cause — the agent still ignores SIGTERM, which means any other supervisor that doesn't escalate to SIGKILL has the same leak.
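
With -k in place, the wrapper's exit status shows whether the agent honoured SIGTERM or had to be killed. GNU coreutils timeout documents status 124 for a command that timed out and status 137 (128 + 9) when SIGKILL was sent. The sketch below turns that into a log line; the function name describe_timeout_status and the log format are illustrative assumptions, not part of OpenClaw or coreutils.

```shell
# Map the exit status of `timeout -k <delay> <duration> cmd` to a log message.
# 124 and 137 are the statuses documented by GNU coreutils timeout.
describe_timeout_status() {
    case "$1" in
        0)   echo "completed within the budget" ;;
        124) echo "timed out, exited after SIGTERM" ;;
        137) echo "killed with SIGKILL (escalation or an external kill)" ;;
        *)   echo "exited with status $1" ;;
    esac
}

# Hypothetical cron wrapper usage, with the same command as the workaround:
#   status=0
#   timeout -k 60 600 openclaw agent --agent social-director --message "..." || status=$?
#   echo "$(date -u +%FT%TZ) $(describe_timeout_status "$status")"
```

Logging 137 separately makes it visible how often the SIGKILL escalation actually fires. Note that 137 is also what a process killed by the kernel OOM killer reports, so the message deliberately covers both cases.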

Suggested direction

  • Primary: install an async-safe SIGTERM handler on the agent process that:
    • cancels any pending synchronous gateway HTTP call,
    • flushes the current message/log buffers,
    • exits within a bounded grace window (≤30s default, configurable).
  • Secondary: document, in the CLI/cron docs, that timeout -k <delay> (or systemd-run --on-active=... --wait with KillMode=mixed) is required for cron-spawned agents until the primary fix lands.
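
The agent is a Node program, so the following shell sketch only illustrates the shape of the primary fix: keep the blocking work somewhere a signal can interrupt it, cancel it on SIGTERM, and bound the drain with a watchdog. The function names, the grace parameter, and the 143 exit status (128 + 15) are illustrative assumptions, not OpenClaw code.

```shell
# Illustrative only: a shell stand-in for the handler shape, not OpenClaw code.
# Usage: run_with_graceful_term GRACE_SECONDS COMMAND [ARGS...]
run_with_graceful_term() {
    grace=$1
    shift
    "$@" &                      # blocking work runs as a child process
    work_pid=$!
    trap 'stop_work' TERM       # handler fires while the shell sits in `wait`
    wait "$work_pid"
}

stop_work() {
    kill -TERM "$work_pid" 2>/dev/null          # cancel the in-flight work
    # Watchdog: force-kill the work if it outlives the grace window.
    ( sleep "$grace"; kill -KILL "$work_pid" 2>/dev/null ) &
    watchdog=$!
    wait "$work_pid" 2>/dev/null                # returns once the work is gone
    kill -TERM "$watchdog" 2>/dev/null
    exit 143                                    # 128 + 15, the usual SIGTERM status
}
```

The point of the design is that the process receiving SIGTERM is never itself stuck in the blocking call, and that even work which ignores SIGTERM cannot hold the exit past the grace window.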

Related signal-handling / supervisor issues that touch the same surface but from different angles:

  • #66399 — Process supervisor: graceful signal escalation and drain timeout for exec tool — directly relevant; same direction of fix needed for the agent CLI entry-point.
  • #70026 — Supervisor sends SIGKILL instead of SIGTERM for long-running agents — causes session lock cascade — opposite end of the same problem (supervisor side).
  • #65650 — feat: wire SQLite message store into active gateway for SIGTERM resilience — orthogonal; addresses message persistence across restarts.

Reported by @nikolaykazakovvs-ux via Cognitor (claude-opus-4-7 substrate).

extent analysis

TL;DR

The openclaw agent process ignores SIGTERM, so when cron spawns it under timeout without -k, hung process chains accumulate until the host runs out of RAM and swap.

Guidance

  • Implement an async-safe SIGTERM handler in the openclaw agent process to drain in-flight work and exit within a bounded time frame (e.g., ≤30s).
  • Document the requirement for using timeout -k <delay> (or equivalent) in cron-spawned agents to prevent similar issues until the primary fix is implemented.
  • Verify the fix by sending SIGTERM to a running agent and confirming that it exits cleanly within the grace window.
  • Consider using systemd-run with KillMode=mixed as an alternative to timeout for managing agent processes.

Example

# Example of using timeout with SIGKILL escalation
timeout -k 60 600 openclaw agent --agent social-director --message "..."

Notes

The provided workaround using timeout -k 60 mitigates the symptom but does not address the root cause. A proper fix requires implementing an async-safe SIGTERM handler in the openclaw agent process.

Recommendation

Apply the workaround using timeout -k <delay> until the primary fix is implemented, as it provides a temporary solution to prevent resource exhaustion.
