openclaw - ✅(Solved) Fix Cron scheduler silently stops firing after ~2.5 days of gateway uptime [2 pull requests, 1 participants]

SkywingsWang · 2026-04-28T02:14:00Z

A cron job (schedule `0 9 * * *` with `tz: Asia/Shanghai`) silently failed to trigger after ~2.5 days of continuous gateway uptime. No error was logged; the job simply did not execute at its scheduled time. Manual `openclaw cron run <id>` worked immediately, and a gateway restart restored automatic scheduling.


GitHub stats

openclaw/openclaw#73166 • Fetched 2026-04-29 06:22:43

Author: SkywingsWang
Participants: SkywingsWang
Timeline (top): cross-referenced ×2

Error Message

No error was logged when the scheduled run was missed. The log statements referenced in the issue and the PRs are:

state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
log.warn({ nextAt, now: Date.now() }, "cron: watchdog detected missed timer, re-arming");
log.error({ err: String(err) }, "cron: watchdog-triggered tick failed");

Workaround

openclaw gateway restart immediately restores cron scheduling. Users can add a launchd-based periodic gateway restart or a watchdog script as a workaround.

PR fix notes

PR #7: fix(cron): re-arm timer when onTimer rejects unexpectedly

Repository: suboss87/openclaw
Author: suboss87
State: open | merged: False
Link: https://github.com/suboss87/openclaw/pull/7

Description (problem / solution / changelog)

Closes openclaw/openclaw#73166

Problem

When onTimer rejects unexpectedly (e.g. a transient error thrown from inside the finally block's armTimer call due to Node.js internals or GC pressure), the .catch() handler in armTimer's setTimeout callback only logs the error. No new timer is registered, permanently breaking the scheduler chain with no recovery path until the next gateway restart.

Root cause

src/cron/service/timer.ts — the setTimeout callback inside armTimer:

state.timer = setTimeout(() => {
  void onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
    // Missing: armTimer(state) re-arm
  });
}, clampedDelay);

If onTimer rejects, the catch block logs but does not re-arm. state.timer is left as null (set to null at the top of armTimer before the throw).

Fix

Call armTimer(state) inside the .catch() handler so the scheduler chain survives an unexpected rejection.
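The fix can be sketched in isolation as follows. This is a minimal illustration, not the code in `timer.ts`: `SketchState`, `armTimerSketch`, the injected `schedule` function, and clearing `state.timer` when the callback fires are all assumptions made so the behaviour can be exercised without real timers.

```typescript
// Hypothetical, simplified stand-ins; the real CronServiceState has more fields.
type TimerHandle = number;

interface SketchState {
  timer: TimerHandle | null;
  loggedErrors: string[];
  // Injected so the sketch does not depend on real wall-clock timers.
  schedule: (fn: () => void, delayMs: number) => TimerHandle;
  onTimer: (state: SketchState) => Promise<void>;
}

const SKETCH_DELAY_MS = 60_000;

function armTimerSketch(state: SketchState): void {
  state.timer = state.schedule(() => {
    state.timer = null; // assumption: the pending handle is cleared once it fires
    void state.onTimer(state).catch((err: unknown) => {
      state.loggedErrors.push(String(err)); // stands in for log.error(...)
      armTimerSketch(state); // re-arm so one rejection cannot end the chain
    });
  }, SKETCH_DELAY_MS);
}
```

With the re-arm line removed, a rejecting `onTimer` leaves `state.timer` as `null` and nothing is scheduled again, which is the failure described above.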

Regression test

Added to src/cron/service.armtimer-tight-loop.test.ts: makes nowMs() throw on the 4th call (inside the finally block's armTimer), which causes onTimer to reject. Verifies that log.error is called and state.timer is non-null after the .catch() re-arm. All 4 tests in the file pass.
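The "throw on the 4th call" technique can be illustrated with a small helper. `makeThrowingClock` is a hypothetical name, not a function from the test file; it only shows how an injected `nowMs` can be made to fail at a chosen call.

```typescript
// Hypothetical test helper: a clock that throws on exactly its Nth call
// and returns a fixed timestamp on every other call.
function makeThrowingClock(throwOnCall: number, fixedNowMs: number): () => number {
  let calls = 0;
  return () => {
    calls += 1;
    if (calls === throwOnCall) {
      throw new Error(`nowMs failed on call ${calls}`);
    }
    return fixedNowMs;
  };
}

// Usage: pass this as the service's clock so the failure lands on call 4.
const nowMsForTest = makeThrowingClock(4, 1_000);
```

Which call number lands inside the `finally` block depends on how many times the code under test reads the clock, so the `4` here simply mirrors the number quoted in the PR.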

Generated by Claude Code

Changed files

src/cron/service.armtimer-tight-loop.test.ts (modified, +112/-0)
src/cron/service/timer.ts (modified, +3/-0)

PR #73355: fix(cron): add .catch() re-arm and watchdog to prevent runtime timer chain death

Repository: openclaw/openclaw
Author: SkywingsWang
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/73355

Description (problem / solution / changelog)

Summary

When the cron setTimeout chain breaks at runtime, the scheduler silently stops firing and never recovers until a gateway restart. This was observed after ~2.5 days of continuous gateway uptime on macOS (Apple Silicon) with no errors logged.

Root cause: The .catch() handler in armTimer() and armRunningRecheckTimer() logs the error but does NOT call armTimer(state), so if onTimer() ever rejects without reaching its finally block, the timer chain is permanently broken.

Contributing factor: On macOS, background processes (like the gateway running as a launchd agent) are subject to App Nap and timer coalescing. While MAX_TIMER_DELAY_MS = 60s should be short enough for the clamped setTimeout to survive, the OS can still defer callbacks in edge cases, and if the chain breaks once, there is no recovery mechanism.

Changes

1. Re-arm in `.catch()` handler (`timer.ts`)

Both armTimer() and armRunningRecheckTimer() now call armTimer(state) inside .catch(), ensuring the timer chain is never permanently broken:

void onTimer(state).catch((err) => {
  state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
  armTimer(state); // ensure chain is never broken
});

2. Independent watchdog `setInterval` (`timer.ts`)

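Added startCronWatchdog(), a setInterval-based watchdog (every 5 minutes) that detects when nextWakeAtMs is past-due by more than MAX_TIMER_DELAY_MS, force-triggers onTimer, and re-arms the chain. The PR argues that setInterval is more resilient to OS-level timer deferral than chained setTimeout because libuv treats it as a persistent/repeating timer; that rationale is not demonstrated in the issue. What the watchdog does provide is a recovery path that does not depend on the setTimeout chain re-arming itself.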
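A rough sketch of the watchdog described above, assuming the check is split into a pure predicate and a single-check function so it can be tested without waiting five minutes. `WatchdogDeps`, `isTimerStalled`, `watchdogCheck`, and `startCronWatchdogSketch` are illustrative names; only the constants, the warning text, and the "returns a cleanup function" behaviour come from the PR description.

```typescript
const MAX_TIMER_DELAY_MS = 60_000; // value quoted in the issue
const WATCHDOG_INTERVAL_MS = 5 * 60_000; // 5 minutes, as in the PR description

// Hypothetical minimal view of what the watchdog needs from the service.
interface WatchdogDeps {
  cronEnabled: boolean;
  nowMs: () => number;
  nextWakeAtMs: () => number | undefined;
  forceTick: () => void; // stands in for "run onTimer, then re-arm"
  warn: (msg: string) => void;
}

// True when the next wake time is overdue by more than the grace period.
function isTimerStalled(nextAt: number | undefined, now: number): boolean {
  return nextAt !== undefined && now - nextAt > MAX_TIMER_DELAY_MS;
}

// One watchdog check; returns true when it forced a tick.
function watchdogCheck(deps: WatchdogDeps): boolean {
  if (!deps.cronEnabled) return false;
  if (!isTimerStalled(deps.nextWakeAtMs(), deps.nowMs())) return false;
  deps.warn("cron: watchdog detected missed timer, re-arming");
  deps.forceTick();
  return true;
}

// Starts the periodic check and returns a cleanup function.
function startCronWatchdogSketch(deps: WatchdogDeps): () => void {
  const handle = setInterval(() => watchdogCheck(deps), WATCHDOG_INTERVAL_MS);
  return () => clearInterval(handle);
}
```

Keeping the grace period equal to `MAX_TIMER_DELAY_MS` means a healthy chain, which ticks at least once every 60 seconds, should not trip the check.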

3. Lifecycle integration (`ops.ts`, `state.ts`)

- Watchdog is started in start() and stopped in stop()
- Added _stopWatchdog field to CronServiceState
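The lifecycle wiring might look roughly like this. The state shape is reduced to the one field named in the PR, and `startSketch`, `stopSketch`, and the injected `startWatchdog` parameter are assumptions; the real `start()` and `stop()` in `ops.ts` do considerably more.

```typescript
// Hypothetical, reduced state shape; only the field named in the PR is shown.
interface LifecycleState {
  _stopWatchdog: (() => void) | null;
}

// `startWatchdog` is injected here; in the PR it would be startCronWatchdog().
function startSketch(state: LifecycleState, startWatchdog: () => () => void): void {
  state._stopWatchdog?.(); // avoid leaking a previous watchdog on repeated start()
  state._stopWatchdog = startWatchdog();
}

function stopSketch(state: LifecycleState): void {
  state._stopWatchdog?.(); // no-op when the watchdog was never started
  state._stopWatchdog = null;
}
```

Storing the cleanup function on the state keeps `stop()` idempotent: calling it twice clears the interval only once.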

4. Tests (`timer.watchdog.test.ts`)

- Watchdog detects stalled timer and logs warning
- Watchdog does not false-positive on healthy timers
- Watchdog respects cronEnabled: false
- Cleanup function stops the watchdog

Relationship to other PRs

#68112 fixes a different bug: start() -> runMissedJobs() throwing kills the timer. That is a startup-time issue.
This PR fixes a runtime issue: the timer chain dying during normal operation after hours/days of uptime.

The two fixes are complementary and non-overlapping.

Fixes #73166

Changed files

src/cron/service/ops.ts (modified, +7/-0)
src/cron/service/state.ts (modified, +6/-0)
src/cron/service/timer.ts (modified, +55/-0)
src/cron/service/timer.watchdog.test.ts (added, +188/-0)



Cron scheduler silently stops firing after ~2.5 days of gateway uptime

Summary

Environment

OpenClaw version: 2026.4.14 (323493f) — also verified identical timer code in 2026.4.26
OS: macOS 15 (Apple Silicon)
Node.js: bundled with OpenClaw
Gateway mode: launchd agent (ai.openclaw.gateway)
Cron jobs: 3 enabled (two daily, one weekly)

Timeline

| Event | Timestamp | Gateway age |
| --- | --- | --- |
| Gateway started | T+0h | — |
| Daily cron job A ✅ | T+15h | 15h 34m |
| Daily cron job A ✅ | T+39h | 1d 15h |
| Daily cron job B ✅ | T+52h | 2d 4h |
| Daily cron job A ❌ MISSED | T+63h | 2d 15h |
| Manual `openclaw cron run` ✅ | T+64h | 2d 16h |
| Gateway restart → timer reset ✅ | T+64h | — (fresh) |

Observations

- No run was attempted: The cron run log (cron/runs/<job-id>.jsonl) has no entry between the last successful run and the manual trigger ~36 hours later. The timer simply stopped invoking onTimer().
- cron status showed stale nextWakeAtMs: openclaw cron status returned a nextWakeAtMs value 35 minutes in the past, confirming the scheduler knew the next wake time but failed to act on it.
- Gateway process was alive and active: The process was running with RSS ~958MB. Other periodic plugin activity was firing normally every 30 minutes. Only the cron setTimeout chain appears broken.
- No cron errors in logs: gateway.log and gateway.err.log contain no cron: timer tick failed or similar entries around the scheduled time. The timer callback was simply never invoked.
- No macOS full sleep detected: pmset -g log shows no Sleep/Wake transitions around the missed window. Display sleep may have occurred but not system sleep.

Analysis of `src/cron/service/timer.ts`

The timer implementation uses MAX_TIMER_DELAY_MS = 60000 (60 seconds), so the cron scheduler ticks every ≤60 seconds rather than setting one long-duration timeout. This design should be resilient to timer drift.
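As a small worked example of the clamping, assuming the delay is derived from the next wake time and the current time. `clampedDelayMs` is a hypothetical helper, and the lower bound of zero for overdue wake times is an assumption; the issue only shows the `Math.min` cap.

```typescript
const MAX_TIMER_DELAY_MS = 60_000; // value quoted in the issue

// Delay until the next tick: never negative, never longer than the cap.
function clampedDelayMs(nextWakeAtMs: number, nowMs: number): number {
  const untilDue = Math.max(nextWakeAtMs - nowMs, 0);
  return Math.min(untilDue, MAX_TIMER_DELAY_MS);
}

// A job due in 15 hours still wakes the scheduler after 60 seconds.
const farAway = clampedDelayMs(15 * 60 * 60 * 1_000, 0); // 60000
// A job due in 20 seconds is scheduled for exactly that long.
const soon = clampedDelayMs(20_000, 0); // 20000
// An overdue job fires on the next turn of the event loop.
const overdue = clampedDelayMs(0, 5_000); // 0
```

Because every wait is capped at 60 seconds, the chain has to re-arm itself successfully on every tick, which is why a single missed re-arm is enough to stop scheduling.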

The armTimer → setTimeout → onTimer → finally { armTimer } chain appears correct:

function armTimer(state) {
  // ...
  const clampedDelay = Math.min(delay, MAX_TIMER_DELAY_MS); // max 60s
  state.timer = setTimeout(() => {
    onTimer(state).catch(err => {
      state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
      // ⚠️ No re-arm here — if onTimer rejects without reaching finally, chain breaks
    });
  }, clampedDelay);
}

async function onTimer(state) {
  if (state.running) { armRunningRecheckTimer(state); return; }
  state.running = true;
  armRunningRecheckTimer(state); // backup timer before try
  try {
    // ... execute due jobs ...
  } finally {
    state.running = false;
    armTimer(state); // re-arm
  }
}

Potential failure modes

- macOS timer coalescing / App Nap: Even though the timer is 60s, macOS can aggressively defer setTimeout callbacks for background processes that appear idle. The gateway has no incoming network activity between cron ticks, making it a candidate for App Nap. The setInterval-based plugin refresh (30 min cycle) may be handled differently by libuv and not subject to the same coalescing.
- Missing re-arm in .catch(): If onTimer() rejects in a way that bypasses the finally block (theoretically impossible for normal async, but Node.js internals have edge cases with unhandled abort signals, V8 GC pressure at ~1GB RSS, etc.), the .catch() handler logs but does NOT call armTimer(state), permanently breaking the chain.
- Event loop stall: At 958MB RSS, a major GC pause could delay timer callbacks. If a GC pause coincides with the critical window, and the subsequent timer fires into a state where nextRunAtMs is now stale, the recompute logic might skip to tomorrow.
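One concrete way the second failure mode can arise without skipping `finally` at all is a throw from inside the `finally` block itself, which is the scenario PR #7 tests. The sketch below reduces `onTimer` to that shape; `tickWithFinally` and `tickRejected` are hypothetical names used only for this illustration.

```typescript
// Hypothetical reduction of onTimer: the re-arm step lives in `finally`.
async function tickWithFinally(rearm: () => void): Promise<void> {
  try {
    await Promise.resolve(); // due jobs would run here
  } finally {
    rearm(); // if this throws, the promise returned by the function rejects
  }
}

// Reports whether the tick rejected; deliberately does not re-arm.
async function tickRejected(rearm: () => void): Promise<boolean> {
  return tickWithFinally(rearm).then(
    () => false,
    () => true,
  );
}
```

When `rearm` throws, no new timer has been registered and the caller only sees a rejected promise, so a `.catch()` that merely logs leaves the scheduler with nothing pending.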

Suggested Fix

Option A: Add a watchdog `setInterval` (safest, backward-compatible)

Add a periodic watchdog that checks whether nextWakeAtMs is past-due, independent of the setTimeout chain:

// In cron service initialization:
const WATCHDOG_INTERVAL_MS = 5 * 60_000; // 5 minutes

setInterval(() => {
  const nextAt = nextWakeAtMs(state);
  if (nextAt && Date.now() >= nextAt + 60_000) {
    log.warn({ nextAt, now: Date.now() }, "cron: watchdog detected missed timer, re-arming");
    onTimer(state).catch(err => {
      log.error({ err: String(err) }, "cron: watchdog-triggered tick failed");
      armTimer(state); // ensure re-arm even on failure
    });
  }
}, WATCHDOG_INTERVAL_MS);

Option B: Re-arm in `.catch()` handler

state.timer = setTimeout(() => {
  onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
+   armTimer(state); // ensure chain is never broken
  });
}, clampedDelay);

Option C: Use `setInterval` instead of chained `setTimeout`

Replace the setTimeout chain with a single setInterval(onTimer, 60_000) that unconditionally checks for due jobs every 60 seconds. This eliminates the chain-breaking risk entirely.
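Option C could be sketched as below, assuming a `running` guard like the one in `onTimer` is kept so that a slow job run does not overlap the next interval tick. `IntervalState`, `intervalTick`, `startIntervalScheduler`, and `runDueJobs` are illustrative names, not identifiers from the codebase.

```typescript
const TICK_INTERVAL_MS = 60_000;

// Hypothetical state for the sketch; `running` mirrors the guard in onTimer.
interface IntervalState {
  running: boolean;
  skippedTicks: number;
}

// One interval tick: skip if the previous tick is still executing jobs.
async function intervalTick(
  state: IntervalState,
  runDueJobs: () => Promise<void>,
): Promise<void> {
  if (state.running) {
    state.skippedTicks += 1;
    return;
  }
  state.running = true;
  try {
    await runDueJobs();
  } finally {
    state.running = false; // nothing to re-arm: setInterval keeps firing
  }
}

// Starts the unconditional 60-second tick and returns a cleanup function.
function startIntervalScheduler(
  state: IntervalState,
  runDueJobs: () => Promise<void>,
): () => void {
  const handle = setInterval(() => {
    void intervalTick(state, runDueJobs).catch(() => {
      // A failed tick is dropped here; a real service would log it.
    });
  }, TICK_INTERVAL_MS);
  return () => clearInterval(handle);
}
```

The trade-off is that the process wakes every 60 seconds even when no job is due, and a job can start up to one interval late, in exchange for having no re-arm step that can be skipped.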

Reproduction

Difficult to reproduce on demand — appears to be a timing-dependent issue related to macOS background process scheduling. Running the gateway continuously for 2+ days with only cron activity (no interactive sessions) may increase the likelihood.

Workaround

openclaw gateway restart immediately restores cron scheduling. Users can add a launchd-based periodic gateway restart or a watchdog script as a workaround.

extent analysis

TL;DR

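Both proposed fixes add a recovery path for a broken timer chain: a watchdog that detects a past-due nextWakeAtMs, and a re-arm in the .catch() handler. Both PRs were still open and unmerged when this page was fetched.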

Guidance

Implement a watchdog setInterval to periodically check if nextWakeAtMs is past-due and re-arm the timer if necessary.
Consider re-arming the timer in the .catch() handler to prevent the chain from breaking in case of errors.
Evaluate replacing the setTimeout chain with a single setInterval to eliminate the risk of chain-breaking.

Example

// Watchdog implementation
const WATCHDOG_INTERVAL_MS = 5 * 60_000; // 5 minutes
setInterval(() => {
  const nextAt = nextWakeAtMs(state);
  if (nextAt && Date.now() >= nextAt + 60_000) {
    log.warn({ nextAt, now: Date.now() }, "cron: watchdog detected missed timer, re-arming");
    onTimer(state).catch(err => {
      log.error({ err: String(err) }, "cron: watchdog-triggered tick failed");
      armTimer(state); // ensure re-arm even on failure
    });
  }
}, WATCHDOG_INTERVAL_MS);

Notes

The issue appears to be related to macOS background process scheduling and timer coalescing. The watchdog mechanism provides a safe and backward-compatible solution to detect and re-arm missed timers.

Recommendation

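Apply the watchdog (Option A) first: it is the least invasive change and stays backward-compatible. Assuming the 5-minute interval and 60-second threshold from the example, a stalled chain would be detected within roughly six minutes. Adding the .catch() re-arm (Option B) alongside it, as PR #73355 does, also covers the case where onTimer rejects.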
