openclaw - 💡(How to fix) Fix Agent Health Monitoring & Automated Recovery (#41924) [1 participants]

GitHub stats
openclaw/openclaw#52701 · Fetched 2026-04-08 01:20:09
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Goal

Ensure OpenClaw agent (gateway + sessions) remains operational with minimal manual intervention. Detect and recover from common failure modes automatically.

Proposed Health Daemon

Triggers & Actions

1. Gateway Down

  • Detection: systemctl is-active openclaw.service returns inactive or failed.
  • Action: sudo systemctl restart openclaw.service.
  • Cooldown: 2 minutes between restarts to avoid thrash.
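
The detection, action, and cooldown rules above can be sketched as a small gate that decides whether a restart is allowed on a given check. This is only an illustrative sketch: the `RestartGate` class and its `clock` parameter are hypothetical names, not part of OpenClaw, and the caller is assumed to pass in the text printed by systemctl is-active openclaw.service.

```python
import time

RESTART_COOLDOWN = 120  # seconds, the 2-minute cooldown from the proposal


class RestartGate:
    """Allows a restart only if the cooldown has elapsed since the last one."""

    def __init__(self, cooldown=RESTART_COOLDOWN, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock          # injectable so the gate can be tested
        self.last_restart = None    # time of the last allowed restart

    def should_restart(self, state):
        # `state` is the output of `systemctl is-active openclaw.service`
        if state not in ("inactive", "failed"):
            return False
        now = self.clock()
        if self.last_restart is not None and now - self.last_restart < self.cooldown:
            return False  # still cooling down, skip this cycle
        self.last_restart = now
        return True
```

Keeping the cooldown as state rather than a blocking sleep lets the daemon continue its memory and disk checks while a restart is still cooling down.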

2. High Memory Usage

  • Threshold: memory usage above 90% of total RAM.
  • Action: Trigger proactive memory compaction (via ToolResultCompactor if available) and, if unresolved after 5 min, restart gateway.
  • Log: Rotate logs to free space.
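
The two-stage response above (compact first, restart only if the pressure persists for 5 minutes) could be tracked with a small state machine like the one below. This is a sketch under assumptions: `MemoryEscalation` and the action names it returns are hypothetical, and how compaction is actually triggered (for example via ToolResultCompactor) is not specified in this issue, so the sketch only reports which action is due.

```python
import time

MEMORY_THRESHOLD = 90.0   # percent of RAM in use
ESCALATION_DELAY = 300    # seconds (5 minutes) before falling back to a restart


class MemoryEscalation:
    """Decides between compaction and restart for successive memory readings."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.high_since = None  # when usage first crossed the threshold

    def next_action(self, percent_used):
        """Return 'none', 'compact', 'wait', or 'restart' for one reading."""
        if percent_used <= MEMORY_THRESHOLD:
            self.high_since = None  # pressure resolved, reset the timer
            return "none"
        now = self.clock()
        if self.high_since is None:
            self.high_since = now
            return "compact"        # first response: try compaction
        if now - self.high_since >= ESCALATION_DELAY:
            self.high_since = None
            return "restart"        # unresolved after 5 minutes
        return "wait"               # compaction requested, still within the window
```

A reading could come from `psutil.virtual_memory().percent`, which the Fix Plan script further down also uses.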

3. Low Disk Space

  • Threshold: <10% free.
  • Action: Rotate logs (compress old), clear old memory/tool_results beyond TTL, alert user.
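
As a rough illustration of the disk check and TTL-based cleanup, the sketch below uses only the Python standard library. The helper names (`free_percent`, `disk_is_low`, `expired_files`) are hypothetical, and the TTL value and the location of the memory/tool_results files are not given in this issue, so the caller would have to supply them.

```python
import shutil

MIN_FREE_PERCENT = 10.0  # threshold from the proposal


def free_percent(total_bytes, free_bytes):
    """Free space as a percentage of total capacity."""
    return 100.0 * free_bytes / total_bytes


def disk_is_low(path="/"):
    """True when the filesystem holding `path` has less than 10% free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return free_percent(usage.total, usage.free) < MIN_FREE_PERCENT


def expired_files(mtimes, now, ttl_seconds):
    """Given {path: mtime}, return paths older than the TTL, oldest first."""
    old = [(mtime, path) for path, mtime in mtimes.items()
           if now - mtime > ttl_seconds]
    return [path for _, path in sorted(old)]
```

Returning the candidates oldest first means a cleanup pass can stop deleting as soon as free space is back above the threshold.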

Sudoers Setup

Requires: /etc/sudoers.d/openclaw with line:

dockegumi ALL=(ALL) NOPASSWD: /bin/systemctl restart openclaw.service

Monitoring Frequency

Every 60 seconds via systemd timer or cron.

Integration with Agent Harness

Health daemon runs outside agent process (as systemd service). Can send internal message to agent before restart to allow graceful shutdown.

Alternatives

  • Use systemd's Restart=on-failure: already in use, but it may not handle hung states where the process is alive but unresponsive. A liveness probe (HTTP health endpoint) is needed in addition to the process check.
  • Containerized deployment (Docker) with restart policies: out of scope for the current install base.
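
A liveness probe of the kind mentioned in the first alternative might look like the sketch below. It is an assumption-heavy illustration: this issue does not say whether the gateway exposes a health endpoint, so the URL in `HEALTH_URL` is a hypothetical placeholder, as is the `gateway_is_responsive` name.

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical endpoint: the real host, port, and path are not specified here.
HEALTH_URL = "http://127.0.0.1:8080/health"


def gateway_is_responsive(url=HEALTH_URL, timeout=5.0):
    """Return True only if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or a garbled reply
        # all count as "not responsive".
        return False
```

A daemon could treat several consecutive `False` results as a hung gateway and restart it even though systemctl still reports the unit as active.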

Request for Feedback

  • Are the thresholds appropriate?
  • Should we add network liveness check (e.g., pinging a local endpoint)?
  • Should the daemon also handle log rotation more aggressively?
  • How to notify user after automated recovery? Via Discord message?

extent analysis

Fix Plan

To implement the proposed health daemon, follow these steps:

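  • Create a systemd service file for the health daemon:

```bash
sudo nano /etc/systemd/system/health-daemon.service
```

  Add the following content. The script below runs its own 60-second loop, so the unit is a long-running service and needs no [Timer] section (a timer would belong in a separate .timer unit). Adjust the interpreter path if python3 lives elsewhere on your system:

```ini
[Unit]
Description=OpenClaw Health Daemon
After=network.target

[Service]
User=dockegumi
ExecStart=/usr/bin/python3 /path/to/health-daemon.py
Restart=always

[Install]
WantedBy=multi-user.target
```

  • Create the health daemon script (health-daemon.py). It requires the psutil package. The compaction and cleanup steps are left as placeholders because this issue does not specify how to invoke them:

```python
import subprocess
import time

import psutil

# Thresholds and timings from the proposal
MEMORY_THRESHOLD = 90   # percent of RAM in use
DISK_THRESHOLD = 90     # percent of disk used, i.e. less than 10% free
COOLDOWN_PERIOD = 120   # 2 minutes between gateway restarts
MEMORY_GRACE = 300      # 5 minutes to let compaction take effect
CHECK_INTERVAL = 60     # 1 minute between checks

RESTART_CMD = ['sudo', 'systemctl', 'restart', 'openclaw.service']


def gateway_state():
    # `is-active` exits non-zero when the unit is not active, so
    # check_output would raise here; run() lets us read the state text.
    result = subprocess.run(['systemctl', 'is-active', 'openclaw.service'],
                            capture_output=True, text=True)
    return result.stdout.strip()


while True:
    # Check gateway status
    if gateway_state() in ('inactive', 'failed'):
        subprocess.run(RESTART_CMD, check=False)
        time.sleep(COOLDOWN_PERIOD)

    # Check memory usage
    if psutil.virtual_memory().percent > MEMORY_THRESHOLD:
        # Placeholder: trigger memory compaction here if a hook is available
        print('High memory usage: waiting before restart', flush=True)
        time.sleep(MEMORY_GRACE)
        if psutil.virtual_memory().percent > MEMORY_THRESHOLD:
            subprocess.run(RESTART_CMD, check=False)

    # Check disk space
    if psutil.disk_usage('/').percent > DISK_THRESHOLD:
        # Placeholder: rotate logs, clear old data, and alert the user
        print('Low disk space: less than 10% free on /', flush=True)

    time.sleep(CHECK_INTERVAL)
```

  • Configure sudoers to allow the health daemon to restart the OpenClaw service. Editing through visudo validates the syntax before saving:

```bash
sudo visudo -f /etc/sudoers.d/openclaw
```

Add the following line (if systemctl is installed at a different path on your system, use that path instead):

dockegumi ALL=(ALL) NOPASSWD: /bin/systemctl restart openclaw.service

  • Reload systemd and start the daemon: run sudo systemctl daemon-reload, then sudo systemctl enable --now health-daemon.service.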

Verification

To verify that the health daemon is working correctly:

  • Check the systemd service status: sudo systemctl status health-daemon
  • Check the health daemon logs: sudo journalctl -u health-daemon
  • Simulate a failure scenario (e.g., stop the OpenClaw service) and verify that the health daemon restarts it.

Extra Tips

  • Consider adding a network liveness check to the health daemon to detect cases where the gateway is unresponsive but still running.
  • Consider adding
