openclaw - ✅(Solved) Fix [Bug]: one-shot cron jobs silently lost after gateway restart [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#63657 · Fetched 2026-04-10 03:42:22
Comments: 1 · Participants: 2 · Timeline: 5 · Reactions: 0
Timeline (top): labeled ×2, commented ×1, cross-referenced ×1, referenced ×1

Root Cause

In src/cron/service/ops.ts (startup logic):

  1. On startup, any job with runningAtMs set gets cleared and added to startupInterruptedJobIds
  2. runMissedJobs is then called with skipJobIds: startupInterruptedJobIds
  3. For one-shot (at:) jobs this means: the job existed, started, gateway died, and on recovery it is skipped instead of retried
  4. Result: runningAtMs cleared, 0 entries in run history, delivery never happens
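The four steps above can be reproduced in a small toy model. This is a sketch only: the types and the functions startToy and runMissedJobsToy are simplified stand-ins invented for illustration, not the real openclaw code. Only runningAtMs, skipJobIds and startupInterruptedJobIds are names taken from the issue.

```typescript
// Toy model of the startup path described in the issue. Every type and
// function here is a simplified stand-in, not the real openclaw source.
type ScheduleKind = "at" | "cron" | "every";

interface ToyJob {
  id: string;
  schedule: { kind: ScheduleKind };
  state: { runningAtMs?: number | undefined; nextRunAtMs: number; runs: number };
}

// Stand-in for runMissedJobs: runs every overdue job that is not skipped.
function runMissedJobsToy(jobs: ToyJob[], nowMs: number, skipJobIds: Set<string>): string[] {
  const ran: string[] = [];
  for (const job of jobs) {
    if (skipJobIds.has(job.id)) continue; // step 3: the interrupted job is skipped
    if (nowMs >= job.state.nextRunAtMs) {
      job.state.runs += 1;
      ran.push(job.id);
    }
  }
  return ran;
}

// Stand-in for the buggy startup recovery (steps 1 and 2).
function startToy(jobs: ToyJob[], nowMs: number): string[] {
  const startupInterruptedJobIds = new Set<string>();
  for (const job of jobs) {
    if (typeof job.state.runningAtMs === "number") {
      job.state.runningAtMs = undefined; // step 1: clear the stale marker
      startupInterruptedJobIds.add(job.id);
    }
  }
  return runMissedJobsToy(jobs, nowMs, startupInterruptedJobIds); // step 2
}

// A one-shot reminder that was mid-execution when the gateway died.
const lostReminder: ToyJob = {
  id: "reminder-1",
  schedule: { kind: "at" },
  state: { runningAtMs: 900, nextRunAtMs: 900, runs: 0 },
};
const ranOnStartup = startToy([lostReminder], 1_000);
console.log(ranOnStartup.length, lostReminder.state.runs, lostReminder.state.runningAtMs);
```

In this model the reminder is overdue, yet it ends startup with its marker cleared and zero runs, which matches step 4.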

Workaround

Recurring jobs survive restarts correctly (they compute next run via missed-jobs logic). One-shots do not. Current workaround: a recurring reconciliation job that compares tasks-store.json cron_ids against the active cron list and recreates orphaned one-shots after each restart.
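A minimal sketch of the reconciliation idea follows. The issue does not show the format of tasks-store.json or of the active cron list, so the StoredTask shape, the function names, and the recreate callback are all assumptions made for illustration.

```typescript
// Hypothetical shape: the issue only says tasks-store.json holds cron_ids,
// so everything else about the record is left to the caller.
interface StoredTask {
  cron_id: string;
}

// Returns the stored tasks whose cron_id is missing from the active cron list.
function findOrphanedTasks<T extends StoredTask>(tasks: T[], activeCronIds: Iterable<string>): T[] {
  const active = new Set(activeCronIds);
  return tasks.filter((task) => !active.has(task.cron_id));
}

// Recreates every orphan through a caller-supplied callback and reports how
// many were recreated. `recreate` stands in for whatever API creates a job.
function reconcileOneShots<T extends StoredTask>(
  tasks: T[],
  activeCronIds: Iterable<string>,
  recreate: (task: T) => void,
): number {
  const orphans = findOrphanedTasks(tasks, activeCronIds);
  orphans.forEach((task) => recreate(task));
  return orphans.length;
}

// Example: job-b is in the store but no longer in the active cron list.
const storedTasks = [{ cron_id: "job-a" }, { cron_id: "job-b" }, { cron_id: "job-c" }];
const recreatedIds: string[] = [];
const recreatedCount = reconcileOneShots(storedTasks, ["job-a", "job-c"], (task) => {
  recreatedIds.push(task.cron_id);
});
```

This sketch assumes that tasks are removed from the store once they complete. If they are not, the orphan check would also need a completion flag to avoid recreating finished one-shots.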

PR fix notes

PR #63675: fix(cron): replay interrupted one-shot jobs on startup recovery

Description (problem / solution / changelog)

Summary

One-shot cron jobs (schedule.kind === "at") that were mid-execution when the gateway restarted were silently lost: runningAtMs got cleared but the job was then skipped instead of retried, with 0 runs recorded and no notification delivered. Reminders that happened to fire during a model switch, container restart, or update disappeared without any user-visible signal.

Fixes #63657

Root cause

In src/cron/service/ops.ts start(), the startup recovery path cleared any stale runningAtMs marker, collected the affected jobs into a local interruptedOneShotIds set, and passed that set as skipJobIds to runMissedJobs. The collection was gated on schedule.kind === "at", so only one-shot jobs were skipped:

```ts
// Before (ops.ts:106-134)
const interruptedOneShotIds = new Set<string>();
// ...
if (typeof job.state.runningAtMs === "number") {
  // ...
  job.state.runningAtMs = undefined;
  // One-shot jobs are not retried after interruption; recurring jobs
  // (cron/every) are eligible for startup catch-up so they don't
  // require a second restart to recover (#60495).
  if (job.schedule.kind === "at") {
    interruptedOneShotIds.add(job.id);
  }
}
// ...
await runMissedJobs(state, {
  skipJobIds: interruptedOneShotIds.size > 0 ? interruptedOneShotIds : undefined,
});
```

The existing comment acknowledged recurring jobs are eligible via #60583 and one-shots were deliberately excluded — but the excluded path leaves reminders irrecoverable. A reminder that fires exactly at docker restart, openclaw update, or a model-switch reload is lost forever with no log line beyond the generic "clearing stale running marker".

Fix

Drop the one-shot skip entirely and let runMissedJobs decide. The existing skipAtIfAlreadyRan guard in runMissedJobs -> isRunnableJob (src/cron/service/timer.ts:850-866) already distinguishes "interrupted before completion" from "already completed" by checking job.state.lastStatus:

  • runningAtMs and lastStatus are written in the same atomic block at timer.ts:408-411. Either both are updated (successful completion) or neither is touched (crash mid-execution).
  • Interrupted one-shot: lastStatus === undefined → skipAtIfAlreadyRan check at timer.ts:850 falls through to the normal nowMs >= nextRunAtMs overdue check → retries ✅
  • Completed one-shot (defensive, e.g. a crash between settle and flush leaves a stale runningAtMs): lastStatus === "ok" → timer.ts:850-866 returns false → skipped ✅
  • Failed one-shot with retry eligibility: lastStatus === "error" + nextRun > lastRun → retries per timer.ts:856-864 ✅
  • Failed one-shot without retry: lastStatus === "error" + no retry window → skipped ✅

This preserves the invariant protected by PR #56509 (no double-delivery of already-completed one-shots) while fixing the symmetric data-loss bug.
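The four cases listed under Fix can be sketched as a single decision function. This is a reconstruction from the PR description, not the real isRunnableJob: the function name, the OneShotState type, and the assumption that a retry-eligible failed job then goes through the same overdue check are all illustrative.

```typescript
type LastStatus = "ok" | "error" | undefined;

// Assumed subset of job.state that the guard looks at.
interface OneShotState {
  nextRunAtMs?: number | undefined;
  lastRunAtMs?: number | undefined;
  lastStatus?: LastStatus;
}

// Sketch of the decision for an "at" job when skipAtIfAlreadyRan is true.
function isOneShotRunnableSketch(state: OneShotState, nowMs: number): boolean {
  const { nextRunAtMs, lastRunAtMs, lastStatus } = state;
  if (lastStatus === "ok") return false; // already completed: never replay
  if (lastStatus === "error") {
    // Retry only when a later run was scheduled after the failed attempt.
    const hasRetryWindow =
      typeof nextRunAtMs === "number" &&
      typeof lastRunAtMs === "number" &&
      nextRunAtMs > lastRunAtMs;
    if (!hasRetryWindow) return false;
  }
  // Interrupted (no lastStatus) or retry-eligible: plain overdue check.
  return typeof nextRunAtMs === "number" && nowMs >= nextRunAtMs;
}

const nowForSketch = 1_000;
const guardDecisions = {
  interrupted: isOneShotRunnableSketch({ nextRunAtMs: 900 }, nowForSketch),
  completed: isOneShotRunnableSketch({ nextRunAtMs: 900, lastRunAtMs: 900, lastStatus: "ok" }, nowForSketch),
  failedWithRetry: isOneShotRunnableSketch({ nextRunAtMs: 950, lastRunAtMs: 900, lastStatus: "error" }, nowForSketch),
  failedNoRetry: isOneShotRunnableSketch({ nextRunAtMs: 900, lastRunAtMs: 900, lastStatus: "error" }, nowForSketch),
};
console.log(guardDecisions);
```

The interrupted and retry-eligible cases come out runnable, while the completed and non-retryable cases do not, which is the split the bullets describe.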

Tests

Two new cases in src/cron/service/ops.test.ts, mirroring the existing #60495 recurring-interrupted startup test:

  1. replays interrupted one-shot jobs on startup recovery (#63657)
    • Setup: one-shot at: job with runningAtMs set, nextRunAtMs in the past, lastStatus unset
    • Action: start(state)
    • Asserts: stale marker cleared, enqueueSystemEvent called, lastStatus === "ok", lastRunAtMs === now
  2. does not re-run a completed one-shot with stale runningAtMs on startup (#63657) (the safety guard)
    • Setup: one-shot at: job with both runningAtMs AND lastStatus: "ok" (simulates a crash between settle and flush)
    • Action: start(state)
    • Asserts: stale marker cleared, enqueueSystemEvent NOT called, lastStatus and lastRunAtMs preserved

The second test is the critical regression guard: it proves the fix doesn't reintroduce the #56509-class of one-shot double-delivery bugs.
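The two scenarios can be illustrated against a toy version of the post-fix startup path. This is not the code in ops.test.ts: startAfterFix, runScenario, and the AtJob type are hypothetical, and the enqueue callback only stands in for enqueueSystemEvent.

```typescript
// Assumed minimal one-shot job shape for the two scenarios.
interface AtJob {
  id: string;
  state: {
    runningAtMs?: number | undefined;
    nextRunAtMs: number;
    lastStatus?: "ok" | "error" | undefined;
    lastRunAtMs?: number | undefined;
  };
}

// Toy post-fix start(): clear stale markers, then replay overdue one-shots
// unless they already completed.
function startAfterFix(jobs: AtJob[], nowMs: number, enqueue: (jobId: string) => void): void {
  for (const job of jobs) {
    job.state.runningAtMs = undefined; // stale marker is always cleared
    const alreadyRan = job.state.lastStatus === "ok";
    if (!alreadyRan && nowMs >= job.state.nextRunAtMs) {
      enqueue(job.id);
      job.state.lastStatus = "ok";
      job.state.lastRunAtMs = nowMs;
    }
  }
}

// Runs one job through startup and returns the ids that were delivered.
function runScenario(job: AtJob, nowMs: number): string[] {
  const delivered: string[] = [];
  startAfterFix([job], nowMs, (id) => {
    delivered.push(id);
  });
  return delivered;
}

// Scenario 1: interrupted before completion, lastStatus unset.
const interruptedJob: AtJob = { id: "interrupted", state: { runningAtMs: 900, nextRunAtMs: 900 } };
// Scenario 2: completed, but a stale marker survived the crash.
const completedJob: AtJob = {
  id: "completed",
  state: { runningAtMs: 900, nextRunAtMs: 900, lastStatus: "ok", lastRunAtMs: 905 },
};

const interruptedDeliveries = runScenario(interruptedJob, 1_000);
const completedDeliveries = runScenario(completedJob, 1_000);
```

The first scenario delivers once and records the run. The second delivers nothing and keeps its original lastRunAtMs.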

Precedent

  • #60583 (merged, joelnishanth): fix(cron): resume interrupted recurring jobs on first restart — the identical fix pattern for recurring jobs. Same catch-up path, same runMissedJobs entry point. Referenced in the existing ops.ts comment.
  • #56509 (open, claygeo): fix(cron): prevent one-shot at jobs from re-triggering after completion — orthogonal to this PR. #56509 fixes completed one-shots re-triggering from onTimer/runDueJobs paths that don't pass skipAtIfAlreadyRan. This PR works through runMissedJobs which DOES pass skipAtIfAlreadyRan: true (timer.ts:970), so both fixes coexist.

Scope

  • Files: src/cron/service/ops.ts (-10, +10), src/cron/service/ops.test.ts (+143)
  • LOC: <20 production changes, no new functions, no new abstractions
  • No changes to timer.ts, to isRunnableJob, or to the settle path — the safety comes from existing guards
  • oxlint clean

cc @steipete — cron service startup recovery, same area as #60583.

Credit to @myradon for the precise RCA in #63657 — the issue body named the exact file, the exact skip logic, and the exact symptom, which made this a ~20 LOC fix with a 143-line test suite.

Changed files

  • src/cron/service/ops.test.ts (modified, +143/-0)
  • src/cron/service/ops.ts (modified, +10/-10)

Raw issue body (openclaw/openclaw#63657)

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

Bug: one-shot cron jobs silently lost after gateway restart

Version: 2026.4.5
Platform: Docker (Linux)

What happens

When the gateway restarts while a one-shot cron job is mid-execution (runningAtMs is set), the job is permanently lost. No notification is delivered, no run is recorded, and the job is never retried.

Observed symptoms

  • Scheduled reminder had runningAtMs set in cron/jobs.json
  • After gateway restart: runningAtMs cleared, state.runs = 0
  • No notification delivered to any channel
  • Job not rescheduled, not flagged, silently gone
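To check for the first symptom, the parsed contents of cron/jobs.json can be scanned for one-shot jobs that still carry a running marker. The sketch below assumes a minimal entry shape: only id, schedule.kind and state.runningAtMs come from the issue and PR text, and the function name and sample data are hypothetical. How the file wraps its job list is not shown in the issue, so the function takes an already extracted array.

```typescript
// Assumed minimal shape of a persisted job entry.
interface PersistedJob {
  id: string;
  schedule: { kind: string };
  state: { runningAtMs?: number | undefined };
}

// Lists one-shot jobs that still carry a running marker, i.e. the jobs that
// would be affected if the gateway restarted at this moment.
function findAtRiskOneShots(jobs: PersistedJob[]): string[] {
  return jobs
    .filter((job) => job.schedule.kind === "at" && typeof job.state.runningAtMs === "number")
    .map((job) => job.id);
}

// Sample data for illustration only.
const sampleJobs: PersistedJob[] = [
  { id: "reminder-1", schedule: { kind: "at" }, state: { runningAtMs: 1_700_000_000_000 } },
  { id: "digest", schedule: { kind: "every" }, state: {} },
  { id: "reminder-2", schedule: { kind: "at" }, state: {} },
];
const atRiskIds = findAtRiskOneShots(sampleJobs);
```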

Impact

Any one-shot reminder or scheduled task that fires exactly when the gateway is restarting (e.g. during a model switch, container restart, or update) is permanently lost without any user-visible signal.

Steps to reproduce

When the gateway restarts while a one-shot cron job is mid-execution (runningAtMs is set), the job is permanently lost. No notification is delivered, no run is recorded, and the job is never retried.

Root cause in src/cron/service/ops.ts (startup):

  1. Jobs with runningAtMs set get cleared and added to startupInterruptedJobIds
  2. runMissedJobs is called with skipJobIds: startupInterruptedJobIds
  3. One-shot (at:) jobs are skipped instead of retried
  4. Result: runningAtMs cleared, 0 runs in history, delivery never happens

Symptoms:

  • Scheduled reminder had runningAtMs set in cron/jobs.json
  • After restart: runningAtMs cleared, state.runs = 0
  • No notification delivered, job not rescheduled, silently gone

Expected behavior

An interrupted one-shot job should be retried on restart, not skipped. Options:

  • Re-execute it immediately (treat as overdue)
  • Flag it with a state.interruptedAt marker and surface it to the user
  • At minimum: do not silently discard it — log a warning or deliver a failure notification
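The second and third options could look roughly like the sketch below. state.interruptedAt is the marker proposed in this issue and does not exist in the codebase as far as the issue shows. The FlaggableJob type, the function name, and the warn callback are assumptions for illustration.

```typescript
// Assumed minimal job shape, extended with the proposed interruptedAt marker.
interface FlaggableJob {
  id: string;
  schedule: { kind: "at" | "cron" | "every" };
  state: { runningAtMs?: number | undefined; interruptedAt?: number | undefined };
}

// Marks interrupted one-shots and reports them instead of discarding them
// silently. Returns the ids that were flagged.
function flagInterruptedOneShots(
  jobs: FlaggableJob[],
  nowMs: number,
  warn: (message: string) => void,
): string[] {
  const flagged: string[] = [];
  for (const job of jobs) {
    if (job.schedule.kind !== "at" || typeof job.state.runningAtMs !== "number") continue;
    const startedAtMs = job.state.runningAtMs;
    job.state.runningAtMs = undefined; // clear the stale marker
    job.state.interruptedAt = nowMs; // proposed marker, surfaced to the user
    warn(`cron: one-shot job ${job.id} was interrupted (started at ${startedAtMs})`);
    flagged.push(job.id);
  }
  return flagged;
}

const flaggableJobs: FlaggableJob[] = [
  { id: "reminder-1", schedule: { kind: "at" }, state: { runningAtMs: 900 } },
  { id: "digest", schedule: { kind: "every" }, state: { runningAtMs: 950 } },
];
const warnings: string[] = [];
const flaggedIds = flagInterruptedOneShots(flaggableJobs, 1_000, (message) => {
  warnings.push(message);
});
```

Only the one-shot job is flagged here. The recurring job is left for the existing catch-up path.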

Actual behavior

runningAtMs is cleared on restart, job is added to startupInterruptedJobIds, and runMissedJobs skips it entirely. No retry, no warning, no failure notification. The job disappears silently with 0 runs recorded.

OpenClaw version

2026.4.5

Operating system

Manjaro Linux

Install method

docker

Model

N/A

Provider / routing chain

N/A

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

Workaround: a recurring reconciliation job that compares cron_ids in tasks-store.json against the active cron list and recreates orphaned one-shots after each gateway restart.

Impact: any one-shot reminder that fires exactly when the gateway restarts (model switch, container restart, update) is permanently lost without any user-visible signal.

extent analysis

TL;DR

Update the startup recovery path in start() (src/cron/service/ops.ts) so that interrupted one-shot jobs are no longer passed to runMissedJobs as skipJobIds. They are then retried after a gateway restart instead of being skipped.

Guidance

  • Review the startup recovery logic in start() in src/cron/service/ops.ts, which handles jobs that still have runningAtMs set.
  • Stop passing interrupted one-shot jobs as skipJobIds so that runMissedJobs treats them as overdue and replays them.
  • Per PR #63675, runMissedJobs itself needs no change: its existing skipAtIfAlreadyRan guard already skips one-shots whose lastStatus is "ok", which prevents double delivery.
  • Optionally, add a state.interruptedAt marker to flag interrupted jobs and surface them to the user.

Example

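// src/cron/service/ops.ts (illustrative sketch, not the actual source)
// The types and the runMissedJobs callback are simplified stand-ins.

type Job = {
  id: string;
  schedule: { kind: "at" | "cron" | "every" };
  state: { runningAtMs?: number | undefined };
};

type RunMissedJobs = (opts: { skipJobIds?: Set<string> }) => Promise<void>;

// Startup recovery: clear stale running markers, then let the missed-jobs
// pass decide what to replay. Interrupted one-shots are no longer skipped.
const recoverOnStartup = async (jobs: Job[], runMissedJobs: RunMissedJobs) => {
  for (const job of jobs) {
    if (typeof job.state.runningAtMs === "number") {
      // The gateway died mid-run; drop the stale marker so the job is runnable.
      job.state.runningAtMs = undefined;
    }
  }
  // No skipJobIds: one-shot ("at") jobs get the same catch-up as recurring ones.
  await runMissedJobs({});
};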

Notes

The reconciliation-job workaround mitigates the issue, but a proper fix requires changing the startup recovery in start() so that interrupted one-shot jobs are no longer skipped, as PR #63675 does.

Recommendation

Apply the recurring reconciliation job as a workaround until the startup recovery fix from PR #63675 is available in your build. This helps prevent permanent loss of one-shot reminders and scheduled tasks during gateway restarts.
