
openclaw - [Bug]: openclaw-agent ignores SIGTERM under cron, accumulates hung process chains and exhausts host RAM/swap

nikolaykazakovvs-ux · 2026-04-25T18:41:23Z



openclaw/openclaw#71710 • Fetched 2026-04-26 05:09:34


openclaw agent processes spawned by cron with timeout 600 openclaw agent ... (without -k <delay>) accumulate as long-lived hung chains because the agent does not exit on SIGTERM. Within ~2.5 days a multi-firing-per-day cron schedule exhausts host RAM and swap.


Fix / Workaround


Workaround currently in production:

0 2,8,14,20 * * * timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1
0 */2 * * *      timeout -k 60 600 openclaw agent --agent social-director --message "..." >> ~/logs/social-director.log 2>&1

The -k 60 flag tells coreutils-timeout to send SIGKILL 60s after the SIGTERM if the child hasn't exited. This mitigates the symptom but does not address the root cause — the agent still ignores SIGTERM, which means any other supervisor that doesn't escalate to SIGKILL has the same leak.
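
The difference between the two wrappers can be reproduced without the agent. The following is a minimal sketch, assuming GNU coreutils timeout and a POSIX sh. The `stubborn` command is a hypothetical stand-in that ignores SIGTERM, not the real openclaw-agent, and the 1s/3s durations are shortened purely for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for a process that swallows SIGTERM.
# It ignores TERM and then finishes by itself after 3 seconds.
stubborn='trap "" TERM; sleep 3'

# 1) No escalation: SIGTERM at 1s is ignored, so timeout keeps waiting
#    until the child ends on its own.
start=$(date +%s)
plain_status=0
timeout 1 sh -c "$stubborn" || plain_status=$?
plain_elapsed=$(( $(date +%s) - start ))

# 2) With -k: SIGTERM at 1s, then SIGKILL 1s later if the child is still alive.
start=$(date +%s)
kill_status=0
timeout -k 1 1 sh -c "$stubborn" || kill_status=$?
kill_elapsed=$(( $(date +%s) - start ))

echo "plain:   status=$plain_status elapsed=${plain_elapsed}s"
echo "with -k: status=$kill_status elapsed=${kill_elapsed}s"
```

Per the coreutils documentation, status 124 means the command timed out and 137 means SIGKILL was delivered. The first run should therefore report 124 only once the child has finished by itself, while the second should report 137 about two seconds in. A child that never finishes, as observed with the agent, would keep the first wrapper waiting indefinitely.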


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

1. On a Linux host with cron, schedule any agent with a wrapper that has no SIGKILL escalation, e.g.:

0 2,8,14,20 * * * timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1
0 */2 * * *      timeout 600 openclaw agent --agent <X> --message "HEARTBEAT: ..." >> ~/logs/<X>.log 2>&1

2. Let it run for ≥48 hours. Each time the agent does something that blocks past the 600s budget (synchronous gateway call, slow tool, networked search, etc.), the wrapper sends SIGTERM at 600s, but the agent does not exit.
3. Inspect ps -eo pid,ppid,stat,etime,rss,cmd | grep openclaw-agent | sort -k4. You will see process trees layered as /bin/sh -c "timeout 600 openclaw agent ..." -> timeout -> openclaw-agent, all of them surviving across many cron firings.
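
Note that `sort -k4` on the `etime` column compares strings such as `2-03:10:05` and `45:12` lexically, so the ordering is not strictly by age. A possible refinement, sketched below, uses the procps-ng `etimes` column (elapsed seconds) and sorts numerically. The helper name `older_than` and its output column order are illustrative assumptions, not part of openclaw.

```shell
#!/bin/sh
# older_than PATTERN MIN_AGE_SECONDS
# Reads `ps -eo pid,ppid,stat,etimes,rss,args` output on stdin and prints
# "age_seconds pid ppid rss_kib command" for matching processes, oldest first.
older_than() {
  awk -v pat="$1" -v min="$2" '
    NR == 1 { next }                      # skip the ps header line
    $4 + 0 >= min + 0 {
      cmd = $6                            # rebuild the command from field 6 onward
      for (i = 7; i <= NF; i++) cmd = cmd " " $i
      if (cmd ~ pat) printf "%d %d %d %d %s\n", $4, $1, $2, $5, cmd
    }' | sort -k1,1nr
}

# Live usage: agents that have outlived the 600s budget.
ps -eo pid,ppid,stat,etimes,rss,args | older_than 'openclaw-agent' 600
```

Any line printed by the live command is a process that should already have been terminated by the wrapper.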

Expected behavior

One of:

(a) The agent installs an async-safe SIGTERM handler that drains in-flight work and exits within a small bounded grace window (e.g. ≤30s), or
(b) The shipped CLI/cron documentation explicitly requires timeout -k <delay> <duration> (or equivalent) for any cron-spawned agent and warns about the leak otherwise.

Either is acceptable; (a) is the proper fix, (b) is the minimal documentation fix that prevents new users from being burned.

Actual behavior

Real incident timeline observed on this host:

2026-04-17 17:34 UTC — first occurrence. 14 hung openclaw-agent processes accumulated across two cron schedules (4×/day karma + 12×/day intel). Swap 96%, RAM 4.0 GiB used / 11 GiB available; Opus turn failed because the local Ollama RAM-fallback couldn't allocate 13.7 GiB.
Initial mitigation (same day) — wrapped both cron lines in timeout 300 openclaw agent .... This stopped future indefinite growth but did not prevent SIGTERM-swallowed leaks beyond the 300s budget. Eight days later the chains were back, just slower.
2026-04-25 00:14 UTC — second occurrence after we extended the budget to timeout 600. 23 hung chains accumulated over 2.5 days, RSS ~7 GiB combined, swap 100% (4.0 / 4.0 GiB), available 3.6 GiB. RAM used 11 / 15 GiB. Gateway PID was at risk of OOM-kill.

Process layering observed (from ps -ef and pstree):

/bin/sh -c "timeout 600 openclaw agent --agent social-director --message ..."
 └── timeout 600 openclaw agent --agent ... (waiting for child after SIGTERM at t=600s)
      └── openclaw-agent (still alive, RSS 40-730 MB depending on age)

pkill -TERM on openclaw-agent had no effect — only pkill -9 (SIGKILL) freed the chain. Cleanup required two passes:

1. pkill -9 -f "/bin/sh -c timeout 600 openclaw agent --agent social-director" (outer shells)
2. pkill -9 -f "^openclaw-agent" --older 600 (orphaned agents reparented to init)

After cleanup: chains 69 → 0, RAM 11 → 4.6 GiB, swap 4.0 GiB → 374 MiB. Gateway untouched, uptime preserved.

OpenClaw version

2026.4.23 (incident reproduced on the same version family across 2026.4.17–2026.4.23)

Operating system

Ubuntu 24.04.4 LTS (kernel 6.8.0-107-generic), systemd 255

Install method

npm global (/home/ubuntu/.npm-global/lib/node_modules/openclaw), Node v22.22.1

Model

claude-cli/claude-opus-4-7 (the agent under cron is social-director running on this primary)

Provider / routing chain

cron -> /bin/sh -> coreutils timeout -> openclaw agent (CLI) -> openclaw gateway (systemd --user) -> auth-profile anthropic:claude-cli -> claude (CLI) -> anthropic.com


Suggested direction

Primary: install an async-safe SIGTERM handler on the agent process that:
- cancels any pending synchronous gateway HTTP call,
- flushes the current message/log buffers,
- exits within a bounded grace window (≤30s default, configurable).
Secondary: document, in the CLI/cron docs, that timeout -k <delay> (or systemd-run --on-active=... --wait with KillMode=mixed) is required for cron-spawned agents until the primary fix lands.

Related signal-handling / supervisor issues that touch the same surface but from different angles:

#66399 — Process supervisor: graceful signal escalation and drain timeout for exec tool — directly relevant; same direction of fix needed for the agent CLI entry-point.
#70026 — Supervisor sends SIGKILL instead of SIGTERM for long-running agents — causes session lock cascade — opposite end of the same problem (supervisor side).
#65650 — feat: wire SQLite message store into active gateway for SIGTERM resilience — orthogonal; addresses message persistence across restarts.

Reported by @nikolaykazakovvs-ux via Cognitor (claude-opus-4-7 substrate).

extent analysis

TL;DR

The openclaw agent process ignores SIGTERM signals, causing it to accumulate as a long-lived hung chain when spawned by cron with timeout, leading to resource exhaustion.

Guidance

Implement an async-safe SIGTERM handler in the openclaw agent process to drain in-flight work and exit within a bounded time frame (e.g., ≤30s).
Document the requirement for using timeout -k <delay> (or equivalent) in cron-spawned agents to prevent similar issues until the primary fix is implemented.
Verify the fix by testing the agent's response to SIGTERM signals and ensuring it exits cleanly within the specified time frame.
Consider using systemd-run with KillMode=mixed as an alternative to timeout for managing agent processes.
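
The verification step in the guidance could be scripted along these lines. This is a rough sketch, assuming GNU coreutils timeout; the helper name `exits_on_term` is hypothetical, and it relies on the documented exit status 137 to detect that SIGKILL was needed. A command that itself exits with 137 for unrelated reasons would be misreported.

```shell
#!/bin/sh
# exits_on_term GRACE CMD [ARGS...]
# Runs CMD, sends SIGTERM after 1 second, and allows GRACE more seconds
# before escalating to SIGKILL.
# Returns 0 if CMD went away without SIGKILL, 1 if SIGKILL was required.
exits_on_term() {
  grace=$1
  shift
  status=0
  timeout -k "$grace" 1 "$@" >/dev/null 2>&1 || status=$?
  [ "$status" -ne 137 ]
}

# Example (not executed here): check the agent against a 30s grace window.
# exits_on_term 30 openclaw agent --agent social-director --message "..."
```

A non-zero return from the helper would indicate that the process still swallows SIGTERM and that the `-k` workaround remains necessary.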

Example

# Example of using timeout with SIGKILL escalation
timeout -k 60 600 openclaw agent --agent social-director --message "..."

Notes

The provided workaround using timeout -k 60 mitigates the symptom but does not address the root cause. A proper fix requires implementing an async-safe SIGTERM handler in the openclaw agent process.

Recommendation

Apply the workaround using timeout -k <delay> until the primary fix is implemented, as it provides a temporary solution to prevent resource exhaustion.
