openclaw - ✅(Solved) Fix Feature Request: Gateway failure recovery and notification mechanism [1 pull requests, 1 participants]

dlturock · 2026-03-24T12:56:11Z

openclaw · 2026-03-24 12:56:11

GitHub stats

openclaw/openclaw#53684•Fetched 2026-04-08 01:24:53

Author

dlturock

Participants

dlturock

Timeline (top)

cross-referenced ×1, labeled ×1, referenced ×1

Silence after a system crash

Error Message

Startup failure diagnostics — When the gateway fails to start, write a human-readable error to a known location (e.g., ~/.openclaw/logs/startup-error.txt) explaining why it failed. Currently, failures are buried in log files with no surfacing.

Root Cause

Silence after a system crash

Fix Action

Fixed

Fixed by PR: feat(gateway): add watchdog + startup error diagnostics (closes #53684) (https://github.com/openclaw/openclaw/pull/53716)

PR fix notes

PR #53716: feat(gateway): add watchdog + startup error diagnostics (closes #53684)

Repository: openclaw/openclaw
Author: rin259
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/53716

Description (problem / solution / changelog)

Summary

Implements #53684 - Gateway failure recovery and notification mechanism.

The PR adds three complementary layers to ensure the OpenClaw Gateway can recover from crashes and notify operators:

1. Startup error diagnostics

When an uncaught exception or fatal unhandled rejection occurs before or during gateway startup, the error is written to a startup error file (`logs/startup-error.txt`, per the file list) before the process exits. This makes post-mortem diagnosis reliable: the error file survives even when the terminal buffer is gone.
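The idea of persisting a fatal error before the process exits can be illustrated with a small Python analog. This is not the PR's TypeScript code in `src/infra/startup-error.ts`; the helper names `write_startup_error` and `install_crash_hook`, the report layout, and the use of `sys.excepthook` are assumptions made for illustration only.

```python
import sys
import tempfile
import traceback
from datetime import datetime, timezone
from pathlib import Path


def write_startup_error(exc: BaseException, log_dir: Path) -> Path:
    """Write a human-readable crash report and return its path (hypothetical helper)."""
    log_dir.mkdir(parents=True, exist_ok=True)
    report = log_dir / "startup-error.txt"
    details = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    report.write_text(
        f"time: {datetime.now(timezone.utc).isoformat()}\n"
        f"error: {type(exc).__name__}: {exc}\n\n"
        f"{details}",
        encoding="utf-8",
    )
    return report


def install_crash_hook(log_dir: Path) -> None:
    """Persist any uncaught exception to disk before the default handler runs."""
    default_hook = sys.excepthook

    def hook(exc_type, exc, tb):
        try:
            write_startup_error(exc, log_dir)
        finally:
            default_hook(exc_type, exc, tb)

    sys.excepthook = hook


# Demonstration with a temporary directory instead of ~/.openclaw/logs.
demo_dir = Path(tempfile.mkdtemp())
try:
    raise RuntimeError("TMPDIR path mismatch")
except RuntimeError as err:
    demo_report = write_startup_error(err, demo_dir)
```

Writing the file synchronously inside the hook matters: the report has to be on disk before the interpreter tears down, otherwise the crash is silent again.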

2. External watchdog

A lightweight bash watchdog designed to be run on macOS or Linux. It polls the gateway's endpoint and:

Attempts a restart (up to 3 tries) if the gateway is down
Sends a macOS or Linux notification when things go wrong
Logs all events

Run it with:

openclaw gateway watchdog --interval 3
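The poll, restart, notify sequence described above can be sketched as a pure function with injected callables, which keeps the control flow testable without a real gateway. This is not the contents of `scripts/openclaw-watchdog.sh`; the names `run_watchdog_cycle` and `WatchdogResult` are hypothetical, and notifying only after the final failed restart is an assumption about the intended behavior.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WatchdogResult:
    healthy: bool
    restart_attempts: int
    events: List[str] = field(default_factory=list)


def run_watchdog_cycle(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    max_attempts: int = 3,
) -> WatchdogResult:
    """One polling cycle: probe, restart up to max_attempts times, notify on failure."""
    result = WatchdogResult(healthy=False, restart_attempts=0)
    if is_healthy():
        result.healthy = True
        result.events.append("gateway healthy")
        return result

    result.events.append("gateway down")
    for attempt in range(1, max_attempts + 1):
        restart()
        result.restart_attempts = attempt
        result.events.append(f"restart attempt {attempt}")
        if is_healthy():
            result.healthy = True
            result.events.append("gateway recovered")
            return result

    message = f"gateway still down after {max_attempts} restart attempts"
    result.events.append(message)
    notify(message)
    return result


# Demonstration with fakes: the gateway comes back on the second restart.
probe_results = iter([False, False, True])
sent: List[str] = []
demo = run_watchdog_cycle(
    is_healthy=lambda: next(probe_results),
    restart=lambda: None,
    notify=sent.append,
)
```

The `events` list stands in for the watchdog's log; a real script would append the same messages to a file with timestamps.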

3. CLI integration

openclaw gateway watchdog --interval 5 spawns the watchdog script as a foreground child process, forwarding stdout/stderr and handling SIGINT/SIGTERM cleanly.
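Running a script as a foreground child while relaying termination signals is a common pattern, shown here as a Python sketch. It is only an analog of what the PR does in `src/cli/gateway-cli/run.ts`; the function name `run_foreground` and the stand-in child command are assumptions.

```python
import signal
import subprocess
import sys
from typing import List


def run_foreground(command: List[str]) -> int:
    """Run a child in the foreground, relay SIGINT/SIGTERM to it, return its exit code."""
    # stdout/stderr are inherited by default, so the child's output appears
    # directly in the parent's terminal.
    child = subprocess.Popen(command)

    def forward(signum, _frame):
        # Pass the signal on so the child can shut down cleanly.
        if child.poll() is None:
            child.send_signal(signum)

    previous = {
        sig: signal.signal(sig, forward) for sig in (signal.SIGINT, signal.SIGTERM)
    }
    try:
        return child.wait()
    finally:
        # Restore the original handlers once the child has exited.
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)


# Demonstration: a stand-in child process instead of the real watchdog script.
exit_code = run_foreground([sys.executable, "-c", "import sys; sys.exit(3)"])
```

Returning the child's exit code lets the parent exit with the same status, so a supervising service manager sees the real outcome. Note that `signal.signal` must be called from the main thread.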

Files changed

| File | Change |
|------|--------|
| `src/infra/startup-error.ts` | New; writes startup errors to `logs/startup-error.txt` |
| `src/infra/unhandled-rejections.ts` | Wire `writeStartupError` into fatal unhandled rejections |
| `src/cli/run-main.ts` | Write startup errors for pre-gateway uncaught exceptions |
| `src/cli/gateway-cli/run.ts` | Add `gateway watchdog` subcommand |
| `src/cli/gateway-cli/register.ts` | Register watchdog command |
| `scripts/openclaw-watchdog.sh` | New; bash watchdog script |
| `src/infra/startup-error.ts` | Fix lint: unused `format` import removed |

Testing

# Local build
pnpm install && pnpm run build

# Run the watchdog (foreground test)
openclaw gateway watchdog --interval 1 --max-attempts 1 --no-notify

# Check startup error file (after forcing a crash):
# ~/.openclaw/logs/startup-error.txt

cc @openclaw/maintainers

Changed files

scripts/openclaw-watchdog.sh (added, +163/-0)
src/cli/gateway-cli/register.ts (modified, +3/-1)
src/cli/gateway-cli/run.ts (modified, +88/-0)
src/cli/run-main.ts (modified, +7/-1)
src/infra/startup-error.ts (added, +81/-0)
src/infra/unhandled-rejections.ts (modified, +20/-3)


Summary

Silence after a system crash

Problem to solve

The Problem:

When the OpenClaw gateway fails to start or crashes at runtime, the failure is completely silent. There is:

No on-screen notification
No email or push alert
No health check that detects the outage
No automatic recovery attempt beyond the LaunchAgent's basic KeepAlive

The user's only indication that their assistant is down is... silence. No responses to messages. This can persist for hours or indefinitely, especially overnight or when the user isn't at the console.

Real-world impact:

In our deployment (4 OpenClaw instances on a home network), the primary instance has gone down overnight 3 times in the past 4 days due to various issues (script accidentally stopping the gateway, macOS auto-update reboot with FileVault blocking auto-login, TMPDIR path mismatch after migration). In each case:

1. The gateway died silently
2. No notification was sent through any channel
3. The user discovered the outage only when they noticed the assistant wasn't responding, typically 5-8 hours later
4. Recovery required console access and was diagnosed by a second OpenClaw instance running on a different machine

A less technical user (no Unix experience, single-instance deployment) would have had no way to diagnose or recover without contacting support.

Proposed solutions (any combination):

1. Health watchdog LaunchAgent: A lightweight, separate process that pings the gateway every few minutes. If the gateway is unresponsive after N retries:
   - Attempt restart
   - If restart fails, send a macOS notification to the console
   - Optionally send an alert via a backup channel (email, webhook, or a pre-configured fallback)
2. Startup failure diagnostics: When the gateway fails to start, write a human-readable error to a known location (e.g., ~/.openclaw/logs/startup-error.txt) explaining why it failed. Currently, failures are buried in log files with no surfacing.
3. "Dead man's switch" for messaging channels: If the gateway has been down for more than X minutes and a channel (Signal, Telegram, etc.) has undelivered messages queued, send a notification through an alternate path.
4. Multi-instance health monitoring: For users running multiple OpenClaw instances, allow instances to monitor each other's health and alert the user if a sibling goes down. (We've built this organically: our instance "Ember" has rescued the primary instance "Clawson" three times, but it should be a first-class feature.)
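The multi-instance proposal amounts to each instance probing its siblings and reporting the ones that do not answer. A minimal sketch follows, assuming each sibling exposes an HTTP URL that returns 200 when healthy. The function names, hostnames, port, and path are hypothetical placeholders, not part of OpenClaw.

```python
from typing import Callable, Dict, List
from urllib.error import URLError
from urllib.request import urlopen


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError, ValueError):
        return False


def find_down_siblings(
    siblings: Dict[str, str], probe: Callable[[str], bool] = http_probe
) -> List[str]:
    """Return the names of sibling instances whose health URL does not respond."""
    return sorted(name for name, url in siblings.items() if not probe(url))


# Demonstration with a fake probe; the URLs are hypothetical placeholders.
siblings = {
    "Clawson": "http://clawson.local:8080/",
    "Flint": "http://flint.local:8080/",
    "Spark": "http://spark.local:8080/",
}
down = find_down_siblings(siblings, probe=lambda url: "clawson" not in url)
```

Injecting the probe keeps the comparison logic separate from the network call, so the alerting path can be exercised without taking a real instance down.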

The core principle: For a system designed to be a persistent personal assistant, silent failure is the worst possible failure mode. The assistant should fight to stay alive, and when it can't, it should scream for help through every available channel.

User context: This feedback comes from a deployment running 4 instances as a family of AI assistants (Clawson, Flint, Spark, Ember) on a home network. The primary user has a Ph.D. in CS and 40+ years of Unix experience, and still found these failures difficult to diagnose. A typical consumer user would be completely stuck.

Proposed solution

Multi-instance health monitoring — For users running multiple OpenClaw instances, allow instances to monitor each other's health and alert the user if a sibling goes down. (We've built this organically — our instance "Ember" has rescued the primary instance "Clawson" three times — but it should be a first-class feature.)

Alternatives considered

No response

Impact

The gateway died silently
No notification was sent through any channel
The user discovered the outage only when they noticed the assistant wasn't responding — typically 5-8 hours later
Recovery required console access and was diagnosed by a second OpenClaw instance running on a different machine

A less technical user (no Unix experience, single-instance deployment) would have had no way to diagnose or recover without contacting support.

Evidence/examples

No response

Additional information

No response

extent analysis

Fix Plan

To address the silent failure issue, we will implement a Health Watchdog LaunchAgent. This will involve creating a separate process that pings the gateway every few minutes. If the gateway is unresponsive after N retries, the watchdog will attempt to restart it and send notifications if necessary.

Here are the concrete steps:

Create a new LaunchAgent configuration file (com.openclaw.healthwatchdog.plist) with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>com.openclaw.healthwatchdog</string>
    <!-- Run the script through an interpreter; the python3 path is an assumption, adjust it for your system. -->
    <key>ProgramArguments</key>
    <array>
      <string>/usr/bin/python3</string>
      <string>/path/to/healthwatchdog.py</string>
    </array>
    <!-- Launch the watchdog every 300 seconds. -->
    <key>StartInterval</key>
    <integer>300</integer>
  </dict>
</plist>

Create a Python script (healthwatchdog.py) that will perform the health checks and send notifications:

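import os
import subprocess
import time

import requests  # third-party: pip install requests

# Configuration (placeholders; replace with the values for your setup)
GATEWAY_URL = 'http://localhost:8080'
RETRY_COUNT = 3
RETRY_DELAY_SECONDS = 60
GATEWAY_LABEL = 'com.openclaw.gateway'
NOTIFICATION_COMMAND = '/path/to/notification/script'

def check_gateway():
    """Return True if the gateway answers with HTTP 200."""
    try:
        # A timeout keeps a hung gateway from blocking the watchdog forever.
        response = requests.get(GATEWAY_URL, timeout=10)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

def gateway_is_up_after_retries():
    """Poll up to RETRY_COUNT times, sleeping between failed attempts."""
    for attempt in range(RETRY_COUNT):
        if check_gateway():
            return True
        if attempt < RETRY_COUNT - 1:
            time.sleep(RETRY_DELAY_SECONDS)
    return False

def restart_gateway():
    # launchctl has no 'restart' subcommand; 'kickstart -k' stops the
    # service if it is running and starts it again.
    target = f'gui/{os.getuid()}/{GATEWAY_LABEL}'
    subprocess.run(['launchctl', 'kickstart', '-k', target], check=False)

def send_notification():
    subprocess.run([NOTIFICATION_COMMAND], check=False)

def main():
    if gateway_is_up_after_retries():
        return
    restart_gateway()
    # Give the gateway time to come up, then notify only if it is still down.
    time.sleep(RETRY_DELAY_SECONDS)
    if not check_gateway():
        send_notification()

if __name__ == '__main__':
    main()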

Load the LaunchAgent configuration file using launchctl load /path/to/com.openclaw.healthwatchdog.plist

Verification

To verify that the fix worked, you can simulate a gateway failure by stopping the gateway process and checking if the watchdog restarts it and sends a notification.

Extra Tips

Make sure to replace the placeholders in the configuration files and scripts with the actual values for your setup.
You can customize the notification command to send alerts through different channels, such as email or webhook.
Consider implementing additional features, such as logging and error handling, to make the watchdog more robust.
