openclaw - ✅ (Solved) Fix: Cron scheduler silently stops firing after ~2.5 days of gateway uptime [2 pull requests, 1 participant]

GitHub stats
openclaw/openclaw#73166 (fetched 2026-04-29 06:22:43)
Comments: 0 · Participants: 1 · Timeline: 2 (cross-referenced ×2) · Reactions: 0

A cron job (schedule 0 9 * * * with tz: Asia/Shanghai) silently failed to trigger after ~2.5 days of continuous gateway uptime. No error was logged; the job simply did not execute at its scheduled time. Manual openclaw cron run <id> worked immediately, and a gateway restart restored automatic scheduling.

Error Message

No error was logged when the job was missed. The log statements on the cron timer path, from the existing code and from the proposed watchdog, are:

state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
log.warn({ nextAt, now: Date.now() }, "cron: watchdog detected missed timer, re-arming");
log.error({ err: String(err) }, "cron: watchdog-triggered tick failed");

Root Cause

The linked PRs attribute the failure to the .catch() handler in armTimer() and armRunningRecheckTimer() (src/cron/service/timer.ts), which logs the error but does not call armTimer(state). If onTimer() ever rejects without re-arming in its finally block, no new timer is registered and the scheduler stays stopped until the gateway is restarted. macOS App Nap and timer coalescing are named as a possible contributing factor.

Fix Action

Workaround

openclaw gateway restart immediately restores cron scheduling. Users can add a launchd-based periodic gateway restart or a watchdog script as a workaround.

PR fix notes

PR #7: fix(cron): re-arm timer when onTimer rejects unexpectedly

Description (problem / solution / changelog)

Closes openclaw/openclaw#73166

Problem

When onTimer rejects unexpectedly (e.g. a transient error thrown from inside the finally block's armTimer call due to Node.js internals or GC pressure), the .catch() handler in armTimer's setTimeout callback only logs the error. No new timer is registered, permanently breaking the scheduler chain with no recovery path until the next gateway restart.

Root cause

src/cron/service/timer.ts — the setTimeout callback inside armTimer:

state.timer = setTimeout(() => {
  void onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
    // Missing: armTimer(state) re-arm
  });
}, clampedDelay);

If onTimer rejects, the catch block logs but does not re-arm. state.timer is left as null (set to null at the top of armTimer before the throw).

Fix

Call armTimer(state) inside the .catch() handler so the scheduler chain survives an unexpected rejection.

Regression test

Added to src/cron/service.armtimer-tight-loop.test.ts: makes nowMs() throw on the 4th call (inside the finally block's armTimer), which causes onTimer to reject. Verifies that log.error is called and state.timer is non-null after the .catch() re-arm. All 4 tests in the file pass.
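The failure mode and the fix described in this PR can be modeled in a few lines. The following is a minimal, synchronous sketch, not the code from timer.ts: the real chain is promise-based and uses setTimeout, while here a stored callback stands in for the timer and try/catch stands in for .catch(). The State shape, the rearmOnFailure flag, and the simulate helper are hypothetical names introduced for illustration.

```typescript
type State = {
  timer: (() => void) | null; // armed callback, or null when nothing is armed
  nowMs: () => number;        // injectable clock, as in the regression test
  errors: string[];           // stands in for log.error calls
};

function armTimer(state: State, rearmOnFailure: boolean): void {
  state.timer = null; // cleared first, as the PR describes
  state.nowMs();      // may throw before a new timer is registered
  state.timer = () => {
    try {
      onTimer(state, rearmOnFailure);
    } catch (err) {
      state.errors.push(String(err));
      // The fix: re-arm so one unexpected failure does not end the chain.
      if (rearmOnFailure) armTimer(state, rearmOnFailure);
    }
  };
}

function onTimer(state: State, rearmOnFailure: boolean): void {
  try {
    // ... due jobs would run here ...
  } finally {
    armTimer(state, rearmOnFailure); // normal re-arm; throws if the clock throws
  }
}

// Hypothetical driver: the clock throws on call number `throwOnCall`,
// then the armed callback is fired up to `ticks` times.
function simulate(rearmOnFailure: boolean, throwOnCall: number, ticks: number): State {
  let calls = 0;
  const state: State = {
    timer: null,
    errors: [],
    nowMs: () => {
      calls += 1;
      if (calls === throwOnCall) throw new Error("clock failure");
      return calls;
    },
  };
  armTimer(state, rearmOnFailure);
  for (let i = 0; i < ticks; i++) {
    const fire = state.timer;
    if (fire === null) break; // chain is broken, nothing left to fire
    fire();
  }
  return state;
}
```

With a clock that throws on its third call, simulate(false, 3, 5) ends with state.timer === null, while simulate(true, 3, 5) ends with a timer still armed. Both record exactly one error.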


Generated by Claude Code

Changed files

  • src/cron/service.armtimer-tight-loop.test.ts (modified, +112/-0)
  • src/cron/service/timer.ts (modified, +3/-0)

PR #73355: fix(cron): add .catch() re-arm and watchdog to prevent runtime timer chain death

Description (problem / solution / changelog)

Summary

When the cron setTimeout chain breaks at runtime, the scheduler silently stops firing and never recovers until a gateway restart. This was observed after ~2.5 days of continuous gateway uptime on macOS (Apple Silicon) with no errors logged.

Root cause: The .catch() handler in armTimer() and armRunningRecheckTimer() logs the error but does NOT call armTimer(state), so if onTimer() ever rejects without reaching its finally block, the timer chain is permanently broken.

Contributing factor: On macOS, background processes (like the gateway running as a launchd agent) are subject to App Nap and timer coalescing. While MAX_TIMER_DELAY_MS = 60s should be short enough for the clamped setTimeout to survive, the OS can still defer callbacks in edge cases, and if the chain breaks once, there is no recovery mechanism.

Changes

1. Re-arm in .catch() handler (timer.ts)

Both armTimer() and armRunningRecheckTimer() now call armTimer(state) inside .catch(), ensuring the timer chain is never permanently broken:

void onTimer(state).catch((err) => {
  state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
  armTimer(state); // ensure chain is never broken
});

2. Independent watchdog setInterval (timer.ts)

Added startCronWatchdog(), a setInterval-based watchdog (every 5 minutes) that detects when nextWakeAtMs is past due by more than MAX_TIMER_DELAY_MS, force-triggers onTimer, and re-arms the chain. The stated rationale is that setInterval is more resilient to OS-level timer deferral than chained setTimeout because libuv treats it as a persistent, repeating timer.

3. Lifecycle integration (ops.ts, state.ts)

  • Watchdog is started in start() and stopped in stop()
  • Added _stopWatchdog field to CronServiceState
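The watchdog and its lifecycle hook can be sketched as a start function that returns its own cleanup function, which is what a field such as _stopWatchdog would hold. This is an illustration under assumptions, not the PR's implementation: the WatchdogDeps shape, the watchdogCheck helper, and the onStall callback are hypothetical, and only the names startCronWatchdog, nextWakeAtMs, cronEnabled, MAX_TIMER_DELAY_MS, and the 5-minute interval come from the PR text.

```typescript
const MAX_TIMER_DELAY_MS = 60_000;
const WATCHDOG_INTERVAL_MS = 5 * 60_000;

// Hypothetical dependency bundle so the check can be tested without real timers.
type WatchdogDeps = {
  cronEnabled: boolean;
  nextWakeAtMs: () => number | undefined;
  nowMs: () => number;
  onStall: () => void; // e.g. log a warning, run onTimer, re-arm the chain
};

// Returns true when a stall was detected and onStall was invoked.
function watchdogCheck(deps: WatchdogDeps): boolean {
  if (!deps.cronEnabled) return false;
  const nextAt = deps.nextWakeAtMs();
  if (nextAt === undefined) return false; // nothing scheduled
  if (deps.nowMs() < nextAt + MAX_TIMER_DELAY_MS) return false; // still healthy
  deps.onStall();
  return true;
}

// Starts the periodic check and returns the function that stops it.
function startCronWatchdog(deps: WatchdogDeps): () => void {
  const handle = setInterval(() => {
    watchdogCheck(deps);
  }, WATCHDOG_INTERVAL_MS);
  return () => clearInterval(handle);
}
```

Under this design, start() would store the returned function and stop() would call it, so the interval cannot outlive the service. Keeping the check separate from the interval also lets tests cover it without fake timers.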

4. Tests (timer.watchdog.test.ts)

  • Watchdog detects stalled timer and logs warning
  • Watchdog does not false-positive on healthy timers
  • Watchdog respects cronEnabled: false
  • Cleanup function stops the watchdog

Relationship to other PRs

  • #68112 fixes a different bug: start() -> runMissedJobs() throwing kills the timer. That is a startup-time issue.
  • This PR fixes a runtime issue: the timer chain dying during normal operation after hours/days of uptime.

The two fixes are complementary and non-overlapping.

Fixes #73166

Changed files

  • src/cron/service/ops.ts (modified, +7/-0)
  • src/cron/service/state.ts (modified, +6/-0)
  • src/cron/service/timer.ts (modified, +55/-0)
  • src/cron/service/timer.watchdog.test.ts (added, +188/-0)


Cron scheduler silently stops firing after ~2.5 days of gateway uptime

Summary

A cron job (schedule 0 9 * * * with tz: Asia/Shanghai) silently failed to trigger after ~2.5 days of continuous gateway uptime. No error was logged; the job simply did not execute at its scheduled time. Manual openclaw cron run <id> worked immediately, and a gateway restart restored automatic scheduling.

Environment

  • OpenClaw version: 2026.4.14 (323493f) — also verified identical timer code in 2026.4.26
  • OS: macOS 15 (Apple Silicon)
  • Node.js: bundled with OpenClaw
  • Gateway mode: launchd agent (ai.openclaw.gateway)
  • Cron jobs: 3 enabled (two daily, one weekly)

Timeline

| Event | Timestamp | Gateway age |
| --- | --- | --- |
| Gateway started | T+0h | |
| Daily cron job A ✅ | T+15h | 15h 34m |
| Daily cron job A ✅ | T+39h | 1d 15h |
| Daily cron job B ✅ | T+52h | 2d 4h |
| Daily cron job A ❌ MISSED | T+63h | 2d 15h |
| Manual openclaw cron run | T+64h | 2d 16h |
| Gateway restart → timer reset ✅ | T+64h | - (fresh) |

Observations

  1. No run was attempted: The cron run log (cron/runs/<job-id>.jsonl) has no entry between the last successful run and the manual trigger ~36 hours later. The timer simply stopped invoking onTimer().

  2. cron status showed stale nextWakeAtMs: openclaw cron status returned a nextWakeAtMs value 35 minutes in the past, confirming the scheduler knew the next wake time but failed to act on it.

  3. Gateway process was alive and active: The process was running with RSS ~958MB. Other periodic plugin activity was firing normally every 30 minutes. Only the cron setTimeout chain appears broken.

  4. No cron errors in logs: gateway.log and gateway.err.log contain no cron: timer tick failed or similar entries around the scheduled time. The timer callback was simply never invoked.

  5. No macOS full sleep detected: pmset -g log shows no Sleep/Wake transitions around the missed window. Display sleep may have occurred but not system sleep.

Analysis of src/cron/service/timer.ts

The timer implementation uses MAX_TIMER_DELAY_MS = 60000 (60 seconds), so the cron scheduler ticks every ≤60 seconds rather than setting one long-duration timeout. This design should be resilient to timer drift.
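To make the clamping concrete, here is a small sketch of how a delay capped at MAX_TIMER_DELAY_MS turns one long wait into many short hops. The clampDelay and hopsUntil helpers are hypothetical, and treating a past-due wake time as a delay of 0 is an assumption rather than something the issue states.

```typescript
const MAX_TIMER_DELAY_MS = 60_000;

// Delay for the next setTimeout call: never negative, never above the cap.
function clampDelay(nextWakeAtMs: number, nowMs: number): number {
  const delay = Math.max(0, nextWakeAtMs - nowMs); // assumption: past-due means "fire now"
  return Math.min(delay, MAX_TIMER_DELAY_MS);
}

// Counts how many clamped timer hops it takes to reach a wake time,
// assuming every timer fires exactly on time.
function hopsUntil(nextWakeAtMs: number, startMs: number): number {
  let now = startMs;
  let hops = 0;
  while (now < nextWakeAtMs) {
    now += clampDelay(nextWakeAtMs, now);
    hops += 1;
  }
  return hops;
}
```

A job due 24 hours from now takes 1440 hops of 60 seconds each, so a single late callback costs at most one hop. The same arithmetic shows why the chain is fragile: every one of those hops has to re-arm the next.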

The armTimer → setTimeout → onTimer → finally { armTimer } chain appears correct:

function armTimer(state) {
  // ...
  const clampedDelay = Math.min(delay, MAX_TIMER_DELAY_MS); // max 60s
  state.timer = setTimeout(() => {
    onTimer(state).catch(err => {
      state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
      // ⚠️ No re-arm here — if onTimer rejects without reaching finally, chain breaks
    });
  }, clampedDelay);
}

async function onTimer(state) {
  if (state.running) { armRunningRecheckTimer(state); return; }
  state.running = true;
  armRunningRecheckTimer(state); // backup timer before try
  try {
    // ... execute due jobs ...
  } finally {
    state.running = false;
    armTimer(state); // re-arm
  }
}

Potential failure modes

  1. macOS timer coalescing / App Nap: Even though the timer is 60s, macOS can aggressively defer setTimeout callbacks for background processes that appear idle. The gateway has no incoming network activity between cron ticks, making it a candidate for App Nap. The setInterval-based plugin refresh (30 min cycle) may be handled differently by libuv and not subject to the same coalescing.

  2. Missing re-arm in .catch(): If onTimer() rejects in a way that bypasses the finally block (theoretically impossible for normal async, but Node.js internals have edge cases with unhandled abort signals, V8 GC pressure at ~1GB RSS, etc.), the .catch() handler logs but does NOT call armTimer(state), permanently breaking the chain.

  3. Event loop stall: At 958MB RSS, a major GC pause could delay timer callbacks. If a GC pause coincides with the critical window, and the subsequent timer fires into a state where nextRunAtMs is now stale, the recompute logic might skip to tomorrow.
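Failure mode 3 depends on the order of two steps in a late tick: checking whether the stored run time is due, and recomputing the next run time. The sketch below illustrates that hypothesis only and is not taken from the OpenClaw source. All function names are hypothetical, and the schedule is hard-coded to 0 9 * * * in Asia/Shanghai, which is 01:00 UTC year-round because that zone has no DST.

```typescript
const DAY_MS = 24 * 60 * 60_000;
// 09:00 Asia/Shanghai is 01:00 UTC all year (UTC+8, no DST).
const SLOT_OFFSET_MS = 60 * 60_000;

// Next daily slot strictly after refMs.
function nextDailyRunAfter(refMs: number): number {
  const dayStart = Math.floor(refMs / DAY_MS) * DAY_MS;
  const todaySlot = dayStart + SLOT_OFFSET_MS;
  return todaySlot > refMs ? todaySlot : todaySlot + DAY_MS;
}

type TickResult = { ran: boolean; nextRunAtMs: number };

// Hypothetical buggy ordering: recompute from "now" first, then test for due-ness.
// A tick that arrives late never sees the slot it missed.
function recomputeFirstTick(nowMs: number): TickResult {
  const nextRunAtMs = nextDailyRunAfter(nowMs);
  return { ran: nextRunAtMs <= nowMs, nextRunAtMs };
}

// Due-first ordering: compare the stored run time with "now" before recomputing.
function dueFirstTick(storedNextRunAtMs: number, nowMs: number): TickResult {
  const ran = storedNextRunAtMs <= nowMs;
  return { ran, nextRunAtMs: ran ? nextDailyRunAfter(nowMs) : storedNextRunAtMs };
}
```

For a tick that fires 90 seconds after the 09:00 slot, recomputeFirstTick reports ran: false and schedules the following day, while dueFirstTick runs the job and then schedules the following day.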

Suggested Fix

Option A: Add a watchdog setInterval (safest, backward-compatible)

Add a periodic watchdog that checks whether nextWakeAtMs is past-due, independent of the setTimeout chain:

// In cron service initialization:
const WATCHDOG_INTERVAL_MS = 5 * 60_000; // 5 minutes

setInterval(() => {
  const nextAt = nextWakeAtMs(state);
  if (nextAt && Date.now() >= nextAt + 60_000) {
    log.warn({ nextAt, now: Date.now() }, "cron: watchdog detected missed timer, re-arming");
    onTimer(state).catch(err => {
      log.error({ err: String(err) }, "cron: watchdog-triggered tick failed");
      armTimer(state); // ensure re-arm even on failure
    });
  }
}, WATCHDOG_INTERVAL_MS);

Option B: Re-arm in .catch() handler

state.timer = setTimeout(() => {
  onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
+   armTimer(state); // ensure chain is never broken
  });
}, clampedDelay);

Option C: Use setInterval instead of chained setTimeout

Replace the setTimeout chain with a single setInterval(onTimer, 60_000) that unconditionally checks for due jobs every 60 seconds. This eliminates the chain-breaking risk entirely.
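Option C needs one extra piece that the chained design gets for free: a guard against overlapping ticks, because setInterval keeps firing while a slow tick is still running. A minimal sketch, assuming a hypothetical runDueJobs callback and TickState shape (neither is from the OpenClaw source):

```typescript
type TickState = { running: boolean; skipped: number };

// Wraps the job runner so overlapping interval callbacks are skipped, not stacked.
function makeTick(
  state: TickState,
  runDueJobs: () => Promise<unknown>,
  logError: (message: string) => void,
): () => Promise<void> {
  return async () => {
    if (state.running) {
      state.skipped += 1; // previous tick still in progress
      return;
    }
    state.running = true;
    try {
      await runDueJobs();
    } catch (err) {
      logError(String(err)); // a failed tick cannot stop the interval
    } finally {
      state.running = false;
    }
  };
}

// Starts the unconditional 60-second check and returns a stop function.
function startIntervalScheduler(tick: () => Promise<void>, intervalMs = 60_000): () => void {
  const handle = setInterval(() => {
    void tick();
  }, intervalMs);
  return () => clearInterval(handle);
}
```

Because the tick swallows and logs its own errors, nothing here needs re-arming. The trade-off is that the scheduler wakes every 60 seconds even when no job is due.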

Reproduction

Difficult to reproduce on demand — appears to be a timing-dependent issue related to macOS background process scheduling. Running the gateway continuously for 2+ days with only cron activity (no interactive sessions) may increase the likelihood.

Workaround

openclaw gateway restart immediately restores cron scheduling. Users can add a launchd-based periodic gateway restart or a watchdog script as a workaround.

extent analysis

TL;DR

The cron scheduler issue can be fixed by introducing a watchdog mechanism to detect and re-arm missed timers.

Guidance

  • Implement a watchdog setInterval to periodically check if nextWakeAtMs is past-due and re-arm the timer if necessary.
  • Consider re-arming the timer in the .catch() handler to prevent the chain from breaking in case of errors.
  • Evaluate replacing the setTimeout chain with a single setInterval to eliminate the risk of chain-breaking.

Notes

The issue appears to be related to macOS background process scheduling and timer coalescing. The watchdog mechanism provides a safe and backward-compatible solution to detect and re-arm missed timers.

Recommendation

Apply the watchdog workaround (Option A) as it is the safest and most backward-compatible solution. This approach ensures that the timer is re-armed even in case of errors, preventing the chain from breaking.
