openclaw - ✅ (Solved) Fix Gateway crash loop: no backoff or circuit-breaker on auto-restart [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#60142 · Fetched 2026-04-08 02:35:49
Comments: 1 · Participants: 2 · Timeline events: 9 · Reactions: 0
Timeline (top): referenced ×6, cross-referenced ×2, commented ×1

When a config schema change causes startup validation to fail, the gateway exits with code 1 and systemd restarts it indefinitely — no backoff, no circuit breaker, no user notification.

Root Cause

When a config schema change causes startup validation to fail, the gateway exits with code 1 and systemd restarts it indefinitely — no backoff, no circuit breaker, no user notification.

Fix Action

Fixed

PR fix notes

PR #60170: fix: add crash loop circuit breaker to prevent infinite restart loops

Description (problem / solution / changelog)

Summary

Adds two-layer protection against gateway crash loops caused by config errors.

Layer 1 (app-level): checkCrashLoopAndAbort() — if 3+ starts in 60s, exit with code 78 (EX_CONFIG) and print diagnostic.

Layer 2 (systemd): StartLimitBurst=5, StartLimitIntervalSec=60, RestartPreventExitStatus=78 in generated unit.
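A minimal sketch of what the Layer 2 directives might look like when rendered into a unit file. The helper name `renderCrashLoopGuards` and its options type are hypothetical; the actual change in src/daemon/systemd-unit.ts is not shown in this summary. The three directive names and values come from the PR description. On current systemd versions, `StartLimitBurst=` and `StartLimitIntervalSec=` are `[Unit]` options and `RestartPreventExitStatus=` is a `[Service]` option, so the sketch emits them under separate sections.

```typescript
// Hypothetical helper: renders the crash-loop guard directives named in PR #60170.
// The directive names are real systemd options; the function and type are assumptions.
interface CrashLoopGuards {
  startLimitBurst: number;
  startLimitIntervalSec: number;
  restartPreventExitStatus: number;
}

const defaultGuards: CrashLoopGuards = {
  startLimitBurst: 5,
  startLimitIntervalSec: 60,
  restartPreventExitStatus: 78, // EX_CONFIG
};

function renderCrashLoopGuards(g: CrashLoopGuards = defaultGuards): string {
  return [
    "[Unit]",
    `StartLimitBurst=${g.startLimitBurst}`,
    `StartLimitIntervalSec=${g.startLimitIntervalSec}`,
    "",
    "[Service]",
    `RestartPreventExitStatus=${g.restartPreventExitStatus}`,
  ].join("\n");
}

console.log(renderCrashLoopGuards());
```

With these settings systemd gives up after 5 starts within 60 seconds, and it does not restart the service at all once the gateway has exited with code 78.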

Real incident

After upgrading 2026.3.28 → 2026.4.1, a config schema change caused 6,198 restarts over 12.5 hours.

Apr 02 20:20:46 node[2645786]: Config invalid
Apr 02 20:20:46 node[2645786]:   - tools.web.search: Unrecognized key: "brave"
Apr 02 20:20:46 systemd[332]: openclaw-gateway.service: Main process exited, status=1/FAILURE
Apr 02 20:20:52 systemd[332]: Scheduled restart job, restart counter is at 1.
[... 6,198 times over 12.5 hours ...]
Apr 03 08:49:38 systemd[332]: Scheduled restart job, restart counter is at 6198.

Closes #60142

Changed files

  • src/cli/gateway-cli/run.option-collisions.test.ts (modified, +13/-0)
  • src/cli/gateway-cli/run.ts (modified, +14/-0)
  • src/daemon/systemd-unit.test.ts (modified, +20/-0)
  • src/daemon/systemd-unit.ts (modified, +3/-0)
  • src/infra/crash-loop-sentinel.test.ts (added, +84/-0)
  • src/infra/crash-loop-sentinel.ts (added, +90/-0)


Description

When a config schema change causes startup validation to fail, the gateway exits with code 1 and systemd restarts it indefinitely — no backoff, no circuit breaker, no user notification.

Real-World Incident

After upgrading from 2026.3.28 → 2026.4.1, the tools.web.search config schema changed. The old key path was now invalid, causing a hard config validation failure on every startup. systemd restarted the gateway 6,198 times over 12.5 hours (20:20 Apr 2 → 08:55 Apr 3) before the user manually intervened.

Actual Log (journalctl)

Apr 02 20:20:44 node[1768617]: [gateway] signal SIGTERM received
Apr 02 20:20:44 node[1768617]: [gateway] received SIGTERM; shutting down
Apr 02 20:20:46 node[2645786]: Config invalid
Apr 02 20:20:46 node[2645786]:   - tools.web.search: Unrecognized key: "brave"
Apr 02 20:20:46 systemd[332]: openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 20:20:46 systemd[332]: openclaw-gateway.service: Failed with result 'exit-code'.
Apr 02 20:20:52 systemd[332]: openclaw-gateway.service: Scheduled restart job, restart counter is at 1.
Apr 02 20:20:53 node[2645810]: Config invalid
Apr 02 20:20:53 node[2645810]:   - tools.web.search: Unrecognized key: "brave"
Apr 02 20:20:54 systemd[332]: openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 20:20:54 systemd[332]: openclaw-gateway.service: Failed with result 'exit-code'.
Apr 02 20:20:59 systemd[332]: openclaw-gateway.service: Scheduled restart job, restart counter is at 2.
[... same pattern every ~5-7 seconds ...]
Apr 03 08:49:30 systemd[332]: openclaw-gateway.service: Scheduled restart job, restart counter is at 6197.
Apr 03 08:49:38 systemd[332]: openclaw-gateway.service: Scheduled restart job, restart counter is at 6198.

Duration: ~12.5 hours. Restarts: 6,198. Interval: ~7 seconds each.
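
The figures above can be cross-checked from the first and last "Scheduled restart job" lines in the log. This is only a worked calculation: the journal lines carry no year or time zone, so 2026 and UTC are assumed here, which does not affect the difference between the two timestamps.

```typescript
// Timestamps copied from the journalctl excerpt (year and UTC are assumptions).
const firstRestart = Date.UTC(2026, 3, 2, 20, 20, 52); // restart counter at 1
const lastRestart = Date.UTC(2026, 3, 3, 8, 49, 38);   // restart counter at 6198
const restarts = 6198;

const elapsedSec = (lastRestart - firstRestart) / 1000;
// 6198 restarts span 6197 intervals between consecutive restarts.
const avgIntervalSec = elapsedSec / (restarts - 1);

console.log(`elapsed: ${(elapsedSec / 3600).toFixed(2)} h`);
console.log(`average interval: ${avgIntervalSec.toFixed(2)} s`);
```

The result agrees with the reported values of roughly 12.5 hours and about 7 seconds between restarts.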

Steps to Reproduce

  1. Upgrade openclaw from 2026.3.28 → 2026.4.1
  2. Have tools.web.search.brave key in openclaw.json (old schema path)
  3. Restart gateway — immediately enters crash loop

Expected Behavior

After 3 consecutive startup failures within 60 seconds:

  1. Stop auto-restarting
  2. Write a clear message to logs: "Gateway failed to start 3 times in 60s — possible config error. Check ~/.openclaw/openclaw.json. Run 'openclaw doctor' to diagnose."
  3. Exit cleanly (let the user fix and manually restart)
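
The "3 failures within 60 seconds" rule can be expressed as a small pure function, kept separate from any file I/O so it is easy to test. This is a sketch only: `isCrashLooping` and its parameters are hypothetical names, and the thresholds are taken from the expected behavior described above.

```typescript
// Hypothetical pure helper: should the current start trip the circuit breaker?
// previousStarts holds the timestamps (ms) of earlier start attempts.
const MAX_STARTS = 3;
const WINDOW_MS = 60000;

function isCrashLooping(
  previousStarts: number[],
  now: number,
  maxStarts: number = MAX_STARTS,
  windowMs: number = WINDOW_MS,
): boolean {
  // Keep only attempts that fall inside the sliding window ending at `now`.
  const recent = previousStarts.filter((t) => now - t >= 0 && now - t <= windowMs);
  // The current attempt counts as one more start.
  return recent.length + 1 >= maxStarts;
}

const t0 = 1000000;
console.log(isCrashLooping([], t0));                        // false: first start
console.log(isCrashLooping([t0 - 14000, t0 - 7000], t0));   // true: third start in 14 s
console.log(isCrashLooping([t0 - 120000, t0 - 7000], t0));  // false: one start is outside the window
```

A caller would log the diagnostic and exit when the function returns true. Successful starts would also need to clear the recorded timestamps, otherwise a few deliberate manual restarts in quick succession would trip the breaker.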

Suggested Fix

Option A (application-level): On startup, check a restart-sentinel file (already exists: src/infra/restart-sentinel.ts). If 3+ restarts within 60s, write error and exit with a distinct code (e.g. 78 = EX_CONFIG) that systemd's RestartPreventExitStatus can catch.

Option B (systemd unit): Document recommended StartLimitBurst=3 + StartLimitIntervalSec=60 in the generated systemd unit file (src/daemon/node-service.ts).

Both options should be implemented — A for user-visible feedback, B as a safety net.

Environment

  • OpenClaw: 2026.4.1
  • OS: Linux (WSL2, systemd user session)
  • Supervisor: systemd

extent analysis

TL;DR

Track recent starts in a restart sentinel so the gateway can stop itself after repeated startup failures, and add systemd start limits so the supervisor stops restarting it.

Guidance

  • Implement Option A (application-level) by modifying the src/infra/restart-sentinel.ts file to check for 3+ restarts within 60s and exit with a distinct code (e.g., 78 = EX_CONFIG) that systemd's RestartPreventExitStatus can catch.
  • Configure systemd to prevent excessive restarts by adding StartLimitBurst=3 and StartLimitIntervalSec=60 to the generated systemd unit file (src/daemon/node-service.ts).
  • Verify that the gateway exits cleanly after 3 consecutive startup failures within 60 seconds and writes a clear error message to the logs.
  • Test the implementation by intentionally introducing a config error and checking the gateway's behavior.

Example

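// src/infra/restart-sentinel.ts
// Illustrative sketch only; the real module's contents are not shown in the issue.
import fs from 'fs';

const restartSentinelFile = 'restart-sentinel.txt';
const maxRestarts = 3;
const restartInterval = 60 * 1000; // 60 seconds

// Start times must be persisted to disk: every restart is a new process,
// so an in-memory counter would reset to zero each time and never reach the limit.
function readRecentStarts(now: number): number[] {
  try {
    const parsed: unknown = JSON.parse(fs.readFileSync(restartSentinelFile, 'utf8'));
    if (!Array.isArray(parsed)) return [];
    // Keep only start times that fall inside the 60-second window.
    return parsed.filter((t): t is number => typeof t === 'number' && now - t <= restartInterval);
  } catch {
    return []; // missing or corrupt sentinel: treat this as the first start
  }
}

export function checkRestartSentinel(now: number = Date.now()): void {
  const starts = [...readRecentStarts(now), now];
  fs.writeFileSync(restartSentinelFile, JSON.stringify(starts));

  // A full implementation would also clear the sentinel after a successful start.
  if (starts.length >= maxRestarts) {
    console.error('Gateway failed to start 3 times in 60s — possible config error. Check ~/.openclaw/openclaw.json. Run \'openclaw doctor\' to diagnose.');
    process.exit(78); // EX_CONFIG
  }
}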

Notes

The example assumes that src/infra/restart-sentinel.ts already exists, as the issue states, and only needs the restart-count and interval checks added. The systemd changes would go into the unit generator, which the issue identifies as src/daemon/node-service.ts. The fix in PR #60170 took a slightly different route: it added a separate src/infra/crash-loop-sentinel.ts, changed src/daemon/systemd-unit.ts, and set StartLimitBurst=5 rather than 3.

Recommendation

Apply both options. Option A (application-level) gives user-visible feedback: after 3 startup failures within 60 seconds the gateway logs a diagnostic and exits with a distinct code. Option B (systemd unit) is the safety net that caps restarts even if the application-level check never runs.
