openclaw - ✅(Solved) Fix Feature Request: Gateway failure recovery and notification mechanism [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#53684 · Fetched 2026-04-08 01:24:53
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): cross-referenced ×1, labeled ×1, referenced ×1

Silence after a system crash

Error Message

  1. Startup failure diagnostics — When the gateway fails to start, write a human-readable error to a known location (e.g., ~/.openclaw/logs/startup-error.txt) explaining why it failed. Currently, failures are buried in log files with no surfacing.

Root Cause

Silence after a system crash

Fix Action

Fixed

PR fix notes

PR #53716: feat(gateway): add watchdog + startup error diagnostics (closes #53684)

Description (problem / solution / changelog)

Summary

Implements #53684 - Gateway failure recovery and notification mechanism.

The PR adds three complementary layers to ensure the OpenClaw Gateway can recover from crashes and notify operators:

1. Startup error diagnostics (src/infra/startup-error.ts)

When an uncaught exception or fatal unhandled rejection occurs before or during gateway startup, the error is written to ~/.openclaw/logs/startup-error.txt before the process exits. This makes post-mortem diagnosis reliable: the error file survives even when the terminal buffer is gone.
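
The idea of persisting a fatal startup error can be sketched as follows. This is a minimal illustration in Python, not the PR's TypeScript implementation; the function names `write_startup_error` and `install_startup_hook` and the report layout are assumptions made for the example.

```python
import sys
import traceback
from datetime import datetime, timezone
from pathlib import Path


def write_startup_error(exc: BaseException, log_dir: Path) -> Path:
    """Write a human-readable crash report and return its path (hypothetical helper)."""
    log_dir.mkdir(parents=True, exist_ok=True)
    target = log_dir / "startup-error.txt"
    report = "\n".join([
        f"Time: {datetime.now(timezone.utc).isoformat()}",
        f"Error: {type(exc).__name__}: {exc}",
        "",
        "Traceback:",
        "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
    ])
    # Write to a temporary file and rename it, so a reader never sees a partial report.
    tmp = target.with_suffix(".tmp")
    tmp.write_text(report, encoding="utf-8")
    tmp.replace(target)
    return target


def install_startup_hook(log_dir: Path) -> None:
    """Record any uncaught exception, then fall through to the default handler."""
    def hook(exc_type, exc, tb):
        try:
            write_startup_error(exc, log_dir)
        finally:
            sys.__excepthook__(exc_type, exc, tb)
    sys.excepthook = hook
```

Calling `install_startup_hook(Path.home() / ".openclaw" / "logs")` early in startup would leave a report at the location the issue suggests, even when nobody is watching the terminal.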

2. External watchdog (scripts/openclaw-watchdog.sh)

A lightweight bash watchdog designed to be run by a scheduler or service manager on macOS or Linux. It polls a gateway endpoint and:

  • Attempts a restart (up to 3 tries) if the gateway is down
  • Sends a macOS or Linux notification when things go wrong
  • Logs all events

Run it with:

openclaw gateway watchdog --interval 3
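
The check, restart, and notify cycle described above can be sketched as one function. This is an illustrative Python sketch, not the bash script from the PR; `watchdog_tick`, its parameters, the settle delay, and the choice to also notify after a successful recovery are all assumptions.

```python
import time
from typing import Callable


def watchdog_tick(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    max_attempts: int = 3,
    settle_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run one check; return True if the gateway is, or becomes, healthy."""
    if is_healthy():
        return True
    for attempt in range(1, max_attempts + 1):
        restart()
        sleep(settle_seconds)  # give the gateway time to come up before re-checking
        if is_healthy():
            notify(f"gateway was down; recovered after {attempt} restart attempt(s)")
            return True
    notify(f"gateway is down; {max_attempts} restart attempt(s) failed")
    return False
```

Passing the health check, restart, and notification steps in as callables keeps the retry logic testable without a running gateway.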

3. CLI integration

openclaw gateway watchdog --interval 5 spawns the watchdog script as a foreground child process, forwarding stdout/stderr and handling SIGINT/SIGTERM cleanly.
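
A foreground child process with signal forwarding might look like the sketch below. It is written in Python for illustration only (the PR implements this in src/cli/gateway-cli/run.ts); `run_foreground` is a hypothetical name.

```python
import signal
import subprocess
import sys
import threading


def run_foreground(argv: list[str]) -> int:
    """Run argv as a child that shares our stdout/stderr; return its exit code."""
    child = subprocess.Popen(argv)  # stdout and stderr are inherited by default

    def forward(signum, _frame):
        # Pass the signal on so the child can shut down on its own terms.
        if child.poll() is None:
            child.send_signal(signum)

    previous = {}
    # Signal handlers can only be installed from the main thread.
    if threading.current_thread() is threading.main_thread():
        for sig in (signal.SIGINT, signal.SIGTERM):
            previous[sig] = signal.signal(sig, forward)
    try:
        return child.wait()
    finally:
        # Restore whatever handlers were active before the child started.
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)
```

Returning the child's exit code lets the parent CLI exit with the same status, so a supervising service manager sees the real outcome.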


Files changed

File | Change
src/infra/startup-error.ts | New: writes startup errors to logs/startup-error.txt
src/infra/unhandled-rejections.ts | Wire writeStartupError into fatal unhandled rejections
src/cli/run-main.ts | Write startup errors for pre-gateway uncaught exceptions
src/cli/gateway-cli/run.ts | Add gateway watchdog subcommand
src/cli/gateway-cli/register.ts | Register watchdog command
scripts/openclaw-watchdog.sh | New: bash watchdog script
src/infra/startup-error.ts | Fix lint: unused format import removed

Testing

# Local build
pnpm install && pnpm run build

# Run the watchdog (foreground test)
openclaw gateway watchdog --interval 1 --max-attempts 1 --no-notify

# Check startup error file (after forcing a crash):
# ~/.openclaw/logs/startup-error.txt

cc @openclaw/maintainers

Changed files

  • scripts/openclaw-watchdog.sh (added, +163/-0)
  • src/cli/gateway-cli/register.ts (modified, +3/-1)
  • src/cli/gateway-cli/run.ts (modified, +88/-0)
  • src/cli/run-main.ts (modified, +7/-1)
  • src/infra/startup-error.ts (added, +81/-0)
  • src/infra/unhandled-rejections.ts (modified, +20/-3)

Summary

Silence after a system crash

Problem to solve

The Problem:

When the OpenClaw gateway fails to start or crashes at runtime, the failure is completely silent. There is:

  • No on-screen notification
  • No email or push alert
  • No health check that detects the outage
  • No automatic recovery attempt beyond the LaunchAgent's basic KeepAlive

The user's only indication that their assistant is down is... silence. No responses to messages. This can persist for hours or indefinitely, especially overnight or when the user isn't at the console.

Real-world impact:

In our deployment (4 OpenClaw instances on a home network), the primary instance has gone down overnight 3 times in the past 4 days due to various issues (script accidentally stopping the gateway, macOS auto-update reboot with FileVault blocking auto-login, TMPDIR path mismatch after migration). In each case:

  1. The gateway died silently
  2. No notification was sent through any channel
  3. The user discovered the outage only when they noticed the assistant wasn't responding — typically 5-8 hours later
  4. Recovery required console access and was diagnosed by a second OpenClaw instance running on a different machine

A less technical user (no Unix experience, single-instance deployment) would have had no way to diagnose or recover without contacting support.

Proposed solutions (any combination):

  1. Health watchdog LaunchAgent: A lightweight, separate process that pings the gateway every few minutes. If the gateway is unresponsive after N retries:
     • Attempt restart
     • If restart fails, send a macOS notification to the console
     • Optionally send an alert via a backup channel (email, webhook, or a pre-configured fallback)

  2. Startup failure diagnostics: When the gateway fails to start, write a human-readable error to a known location (e.g., ~/.openclaw/logs/startup-error.txt) explaining why it failed. Currently, failures are buried in log files with no surfacing.

  3. "Dead man's switch" for messaging channels: If the gateway has been down for more than X minutes and a channel (Signal, Telegram, etc.) has undelivered messages queued, send a notification through an alternate path.

  4. Multi-instance health monitoring: For users running multiple OpenClaw instances, allow instances to monitor each other's health and alert the user if a sibling goes down. (We've built this organically: our instance "Ember" has rescued the primary instance "Clawson" three times, but it should be a first-class feature.)
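
The "dead man's switch" condition above reduces to a small decision rule. The sketch below is one possible reading, under the assumption that something outside the gateway tracks when it was last healthy and how many messages each channel has queued; `ChannelState` and `channels_needing_alert` are hypothetical names, not part of OpenClaw.

```python
from dataclasses import dataclass


@dataclass
class ChannelState:
    name: str
    queued_messages: int  # undelivered messages waiting on this channel


def channels_needing_alert(
    last_healthy_at: float,
    now: float,
    channels: list[ChannelState],
    max_down_seconds: float,
) -> list[str]:
    """Return channels with queued messages once the outage exceeds the threshold."""
    if now - last_healthy_at <= max_down_seconds:
        return []  # the gateway has not been down long enough to alert
    return [c.name for c in channels if c.queued_messages > 0]
```

Timestamps are plain seconds here, so the caller can supply `time.time()` values or a fake clock in tests.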

The core principle: For a system designed to be a persistent personal assistant, silent failure is the worst possible failure mode. The assistant should fight to stay alive, and when it can't, it should scream for help through every available channel.

User context: This feedback comes from a deployment running 4 instances as a family of AI assistants (Clawson, Flint, Spark, Ember) on a home network. The primary user has a Ph.D. in CS and 40+ years of Unix experience, and still found these failures difficult to diagnose. A typical consumer user would be completely stuck.

Proposed solution

Multi-instance health monitoring — For users running multiple OpenClaw instances, allow instances to monitor each other's health and alert the user if a sibling goes down. (We've built this organically — our instance "Ember" has rescued the primary instance "Clawson" three times — but it should be a first-class feature.)
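
Sibling monitoring could be sketched as a per-instance failure counter. This is an assumption-laden illustration: `update_sibling_health`, the probe callables, and the consecutive-failure threshold are invented for the example and do not describe an existing OpenClaw feature.

```python
from typing import Callable, Mapping


def update_sibling_health(
    probes: Mapping[str, Callable[[], bool]],
    failures: dict[str, int],
    threshold: int = 3,
) -> list[str]:
    """Probe every sibling once; return those that just reached the failure threshold."""
    newly_down = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a probe that raises counts as a failed check
        if healthy:
            failures[name] = 0
            continue
        failures[name] = failures.get(name, 0) + 1
        # Report exactly once, on the check that crosses the threshold.
        if failures[name] == threshold:
            newly_down.append(name)
    return newly_down
```

Requiring several consecutive failures avoids alerting on a single dropped request, and reporting only on the crossing check avoids repeating the same alert on every poll.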

Alternatives considered

No response

Evidence/examples

No response

Additional information

No response

extent analysis

Fix Plan

To address the silent failure issue, we will implement a Health Watchdog LaunchAgent. This will involve creating a separate process that pings the gateway every few minutes. If the gateway is unresponsive after N retries, the watchdog will attempt to restart it and send notifications if necessary.

Here are the concrete steps:

  • Create a new LaunchAgent configuration file (com.openclaw.healthwatchdog.plist) with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>com.openclaw.healthwatchdog</string>
    <key>ProgramArguments</key>
    <array>
      <string>/usr/bin/python3</string>
      <string>/path/to/healthwatchdog.py</string>
    </array>
    <key>StartInterval</key>
    <integer>300</integer>
  </dict>
</plist>
  • Create a Python script (healthwatchdog.py) that will perform the health checks and send notifications:
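import os
import subprocess
import time
import urllib.request  # standard library, so the script runs under a bare system Python

# Configuration (placeholders; replace with the values for your setup)
GATEWAY_URL = 'http://localhost:8080'
GATEWAY_LABEL = 'com.openclaw.gateway'
RETRY_COUNT = 3
RETRY_DELAY_SECONDS = 60
NOTIFICATION_COMMAND = '/path/to/notification/script'

def check_gateway():
    """Return True if the gateway answers with HTTP 200 within 10 seconds."""
    try:
        with urllib.request.urlopen(GATEWAY_URL, timeout=10) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False

def gateway_down_after_retries():
    """Return True only if every one of RETRY_COUNT checks fails."""
    for attempt in range(RETRY_COUNT):
        if check_gateway():
            return False
        if attempt < RETRY_COUNT - 1:
            time.sleep(RETRY_DELAY_SECONDS)
    return True

def restart_gateway():
    # launchctl has no "restart" subcommand; "kickstart -k" stops and restarts the job.
    target = f'gui/{os.getuid()}/{GATEWAY_LABEL}'
    subprocess.run(['launchctl', 'kickstart', '-k', target], check=False)

def send_notification():
    subprocess.run([NOTIFICATION_COMMAND], check=False)

def main():
    if not gateway_down_after_retries():
        return
    restart_gateway()
    time.sleep(RETRY_DELAY_SECONDS)
    # Alert only if the restart did not bring the gateway back.
    if not check_gateway():
        send_notification()

if __name__ == '__main__':
    main()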
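  • Load the LaunchAgent configuration file using launchctl load /path/to/com.openclaw.healthwatchdog.plist (launchctl load is a legacy subcommand; on current macOS the equivalent is launchctl bootstrap gui/$(id -u) /path/to/com.openclaw.healthwatchdog.plist)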
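
The NOTIFICATION_COMMAND placeholder in the script could be filled by a desktop notification. The sketch below builds the argument list for `osascript` on macOS or `notify-send` on Linux. Both tools exist, but choosing them here is an assumption, since the source does not say which commands the watchdog uses. `notification_argv` is a hypothetical helper.

```python
import sys


def notification_argv(title: str, message: str, platform: str = sys.platform) -> list[str]:
    """Build the command line for a desktop notification on the given platform."""
    if platform == "darwin":
        def esc(s: str) -> str:
            # AppleScript string literals are double-quoted; escape backslashes and quotes.
            return s.replace("\\", "\\\\").replace('"', '\\"')
        script = f'display notification "{esc(message)}" with title "{esc(title)}"'
        return ["osascript", "-e", script]
    if platform.startswith("linux"):
        # notify-send takes the summary and body as positional arguments.
        return ["notify-send", title, message]
    raise ValueError(f"no notification command known for platform {platform!r}")
```

The result can be passed straight to `subprocess.run(...)`. Building the list separately keeps the quoting logic testable on any machine.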

Verification

To verify that the fix worked, you can simulate a gateway failure by stopping the gateway process and checking if the watchdog restarts it and sends a notification.
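
The verification step can be made measurable by timing how long the gateway takes to come back after it is stopped. This is a minimal sketch; `wait_until_healthy` is a hypothetical helper, and `is_healthy` stands in for whatever health check your setup uses.

```python
import time
from typing import Callable, Optional


def wait_until_healthy(
    is_healthy: Callable[[], bool],
    timeout_seconds: float,
    interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until healthy; return elapsed seconds, or None if the timeout passes."""
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_seconds:
            return None
        sleep(interval_seconds)
```

After stopping the gateway by hand, a result of `None` means the watchdog did not recover it within the timeout. Any other value is the observed recovery time in seconds.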

Extra Tips

  • Make sure to replace the placeholders in the configuration files and scripts with the actual values for your setup.
  • You can customize the notification command to send alerts through different channels, such as email or webhook.
  • Consider implementing additional features, such as logging and error handling, to make the watchdog more robust.
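
As one way to route alerts through a webhook, the sketch below posts a small JSON body using only the standard library. The payload shape `{"text": ...}` and the function names are assumptions; match them to whatever your webhook receiver expects.

```python
import json
import urllib.request


def build_webhook_request(url: str, text: str) -> urllib.request.Request:
    """Build a JSON POST request carrying the alert text."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_webhook(url: str, text: str, timeout: float = 10.0) -> bool:
    """Send the alert; return True on a 2xx response, False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(build_webhook_request(url, text), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and timeouts are all OSError subclasses.
        return False
```

Returning a boolean instead of raising lets the watchdog fall back to another channel when the webhook itself is unreachable.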
