
openclaw - ✅ (Solved) Fix [Bug]: one-shot cron jobs silently lost after gateway restart [1 pull request, 1 comment, 2 participants]

myradon · 2026-04-09T09:20:32Z




GitHub stats

openclaw/openclaw#63657 • Fetched 2026-04-10 03:42:22

Author: myradon
Participants: Artyomkun, myradon
Timeline (top): labeled ×2, commented ×1, cross-referenced ×1, referenced ×1

Root Cause

In src/cron/service/ops.ts (startup logic):

On startup, any job with runningAtMs set gets cleared and added to startupInterruptedJobIds
runMissedJobs is then called with skipJobIds: startupInterruptedJobIds
For one-shot (at:) jobs this means: the job existed, started, gateway died, and on recovery it is skipped instead of retried
Result: runningAtMs cleared, 0 entries in run history, delivery never happens

Workaround

Recurring jobs survive restarts correctly (they compute next run via missed-jobs logic). One-shots do not. Current workaround: a recurring reconciliation job that compares tasks-store.json cron_ids against the active cron list and recreates orphaned one-shots after each restart.
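
The reconciliation workaround could look roughly like the sketch below. It is only an illustration under stated assumptions: the issue does not document the layout of tasks-store.json, so the `StoredTask` fields (`cron_id`, `fireAtMs`, `delivered`), the `ActiveCronJob` type, and the `findOrphanedOneShots` helper are all hypothetical.

```ts
// Hypothetical shapes: the real tasks-store.json and cron list formats are not shown in the issue.
interface StoredTask {
  cron_id: string; // ID of the one-shot cron job created for this task (assumed field name)
  fireAtMs: number; // when the reminder was supposed to fire (assumed)
  delivered: boolean; // whether the reminder was already delivered (assumed)
}

interface ActiveCronJob {
  id: string;
}

// Return tasks whose one-shot job is gone from the active list without having been delivered.
function findOrphanedOneShots(
  tasks: StoredTask[],
  activeJobs: ActiveCronJob[],
): StoredTask[] {
  const activeIds = new Set(activeJobs.map((job) => job.id));
  return tasks.filter((task) => !task.delivered && !activeIds.has(task.cron_id));
}

// Example: "a" was delivered, "c" is still scheduled, "b" was lost during a restart.
const orphans = findOrphanedOneShots(
  [
    { cron_id: "a", fireAtMs: 1000, delivered: true },
    { cron_id: "b", fireAtMs: 2000, delivered: false },
    { cron_id: "c", fireAtMs: 3000, delivered: false },
  ],
  [{ id: "c" }],
);
console.log(orphans.map((task) => task.cron_id));
```

The recurring job would then recreate a one-shot for every task returned here. Tracking delivery separately is what lets the check tell a lost job apart from one that completed and was removed normally.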

PR fix notes

PR #63675: fix(cron): replay interrupted one-shot jobs on startup recovery

Repository: openclaw/openclaw
Author: hclsys
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/63675

Description (problem / solution / changelog)

Summary

One-shot cron jobs (schedule.kind === "at") that were mid-execution when the gateway restarted were silently lost: runningAtMs got cleared but the job was then skipped instead of retried, with 0 runs recorded and no notification delivered. Reminders that happened to fire during a model switch, container restart, or update disappeared without any user-visible signal.

Fixes #63657

Root cause

In src/cron/service/ops.ts start(), the startup recovery path used to collect any job with a stale runningAtMs marker into a local interruptedOneShotIds set and pass that set as skipJobIds to runMissedJobs, but the collection filter was gated on schedule.kind === "at":

```ts
// Before (ops.ts:106-134)
const interruptedOneShotIds = new Set<string>();
// ...
if (typeof job.state.runningAtMs === "number") {
  // ...
  job.state.runningAtMs = undefined;
  // One-shot jobs are not retried after interruption; recurring jobs
  // (cron/every) are eligible for startup catch-up so they don't
  // require a second restart to recover (#60495).
  if (job.schedule.kind === "at") {
    interruptedOneShotIds.add(job.id);
  }
}
// ...
await runMissedJobs(state, {
  skipJobIds: interruptedOneShotIds.size > 0 ? interruptedOneShotIds : undefined,
});
```

The existing comment acknowledged recurring jobs are eligible via #60583 and one-shots were deliberately excluded — but the excluded path leaves reminders irrecoverable. A reminder that fires exactly at docker restart, openclaw update, or a model-switch reload is lost forever with no log line beyond the generic "clearing stale running marker".

Fix

Drop the one-shot skip entirely and let runMissedJobs decide. The existing skipAtIfAlreadyRan guard in runMissedJobs -> isRunnableJob (src/cron/service/timer.ts:850-866) already distinguishes "interrupted before completion" from "already completed" by checking job.state.lastStatus:

- runningAtMs and lastStatus are written in the same atomic block at timer.ts:408-411. Either both are cleared (successful completion) or neither is touched (crash mid-execution).
- Interrupted one-shot: lastStatus === undefined → skipAtIfAlreadyRan check at timer.ts:850 falls through to the normal nowMs >= nextRunAtMs overdue check → retries ✅
- Completed one-shot (defensive, e.g. a crash between settle and flush leaves a stale runningAtMs): lastStatus === "ok" → timer.ts:850-866 returns false → skipped ✅
- Failed one-shot with retry eligibility: lastStatus === "error" + nextRun > lastRun → retries per timer.ts:856-864 ✅
- Failed one-shot without retry: lastStatus === "error" + no retry window → skipped ✅
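
The four cases above can be modelled with a small decision function. This is a hypothetical reconstruction based only on the description in this PR, not the actual `isRunnableJob` from timer.ts; the `OneShotState` type and the `isOneShotRunnable` name are assumptions made for illustration.

```ts
// Hypothetical reconstruction of the guard; the real isRunnableJob may differ.
interface OneShotState {
  nextRunAtMs?: number;
  lastRunAtMs?: number;
  lastStatus?: "ok" | "error";
}

function isOneShotRunnable(
  state: OneShotState,
  nowMs: number,
  opts: { skipAtIfAlreadyRan: boolean },
): boolean {
  const next = state.nextRunAtMs;
  if (typeof next !== "number") return false;
  if (opts.skipAtIfAlreadyRan && state.lastStatus !== undefined) {
    // Completed successfully: never replay.
    if (state.lastStatus === "ok") return false;
    // Failed: replay only if a retry was scheduled after the last run.
    const last = state.lastRunAtMs;
    if (typeof last !== "number" || next <= last) return false;
  }
  // Interrupted (no lastStatus) or retry-eligible: normal overdue check.
  return nowMs >= next;
}

const nowMs = 1000;
const guard = { skipAtIfAlreadyRan: true };
const interrupted = isOneShotRunnable({ nextRunAtMs: 500 }, nowMs, guard);
const completed = isOneShotRunnable(
  { nextRunAtMs: 500, lastRunAtMs: 500, lastStatus: "ok" },
  nowMs,
  guard,
);
const failedWithRetry = isOneShotRunnable(
  { nextRunAtMs: 800, lastRunAtMs: 500, lastStatus: "error" },
  nowMs,
  guard,
);
const failedNoRetry = isOneShotRunnable(
  { nextRunAtMs: 500, lastRunAtMs: 500, lastStatus: "error" },
  nowMs,
  guard,
);
console.log({ interrupted, completed, failedWithRetry, failedNoRetry });
```

Only the interrupted job and the failed job with a later retry time come back as runnable, which matches the ✅ outcomes listed above.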

This preserves the invariant protected by PR #56509 (no double-delivery of already-completed one-shots) while fixing the symmetric data-loss bug.

Tests

Two new cases in src/cron/service/ops.test.ts, mirroring the existing #60495 recurring-interrupted startup test:

1. replays interrupted one-shot jobs on startup recovery (#63657)
   - Setup: one-shot at: job with runningAtMs set, nextRunAtMs in the past, lastStatus unset
   - Action: start(state)
   - Asserts: stale marker cleared, enqueueSystemEvent called, lastStatus === "ok", lastRunAtMs === now
2. does not re-run a completed one-shot with stale runningAtMs on startup (#63657) (the safety guard)
   - Setup: one-shot at: job with both runningAtMs AND lastStatus: "ok" (simulates a crash between settle and flush)
   - Action: start(state)
   - Asserts: stale marker cleared, enqueueSystemEvent NOT called, lastStatus and lastRunAtMs preserved

The second test is the critical regression guard: it proves the fix doesn't reintroduce the #56509-class of one-shot double-delivery bugs.
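
To make the two scenarios concrete, here is a simplified, self-contained model of startup recovery after the fix. It is a sketch under assumptions, not the real `start()` or the real test file: `recoverOnStartup`, the `Job` shape, and the `deliver` callback (standing in for `enqueueSystemEvent`) are hypothetical.

```ts
// Hypothetical, simplified model of startup recovery after the fix.
interface Job {
  id: string;
  schedule: { kind: "at" | "cron" | "every" };
  state: {
    runningAtMs?: number;
    nextRunAtMs?: number;
    lastRunAtMs?: number;
    lastStatus?: "ok" | "error";
  };
}

// Clears stale running markers, then replays overdue jobs that never settled.
function recoverOnStartup(jobs: Job[], nowMs: number, deliver: (job: Job) => void): void {
  for (const job of jobs) {
    if (typeof job.state.runningAtMs === "number") {
      delete job.state.runningAtMs; // stale marker left by the previous process
    }
    const overdue =
      typeof job.state.nextRunAtMs === "number" && nowMs >= job.state.nextRunAtMs;
    const alreadyRan = job.schedule.kind === "at" && job.state.lastStatus === "ok";
    if (overdue && !alreadyRan) {
      deliver(job);
      job.state.lastStatus = "ok";
      job.state.lastRunAtMs = nowMs;
    }
  }
}

// Scenario 1: interrupted before completion (no lastStatus recorded).
const interrupted: Job = {
  id: "interrupted",
  schedule: { kind: "at" },
  state: { runningAtMs: 900, nextRunAtMs: 900 },
};
// Scenario 2: completed, but a stale runningAtMs survived the crash.
const completed: Job = {
  id: "completed",
  schedule: { kind: "at" },
  state: { runningAtMs: 900, nextRunAtMs: 900, lastRunAtMs: 950, lastStatus: "ok" },
};

const delivered: string[] = [];
recoverOnStartup([interrupted, completed], 1000, (job) => {
  delivered.push(job.id);
});
console.log(delivered);
```

In this model only the interrupted job is delivered, and the completed job keeps its original lastRunAtMs, which is the behaviour the two test cases assert.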

Precedent

#60583 (merged, joelnishanth): fix(cron): resume interrupted recurring jobs on first restart — the identical fix pattern for recurring jobs. Same catch-up path, same runMissedJobs entry point. Referenced in the existing ops.ts comment.
#56509 (open, claygeo): fix(cron): prevent one-shot at jobs from re-triggering after completion — orthogonal to this PR. #56509 fixes completed one-shots re-triggering from onTimer/runDueJobs paths that don't pass skipAtIfAlreadyRan. This PR works through runMissedJobs which DOES pass skipAtIfAlreadyRan: true (timer.ts:970), so both fixes coexist.

Scope

Files: src/cron/service/ops.ts (-10, +10), src/cron/service/ops.test.ts (+143)
LOC: <20 production changes, no new functions, no new abstractions
No changes to timer.ts, to isRunnableJob, or to the settle path — the safety comes from existing guards
oxlint clean

cc @steipete — cron service startup recovery, same area as #60583.

Credit to @myradon for the precise RCA in #63657 — the issue body named the exact file, the exact skip logic, and the exact symptom, which made this a ~20 LOC fix with a 143-line test suite.

Changed files

src/cron/service/ops.test.ts (modified, +143/-0)
src/cron/service/ops.ts (modified, +10/-10)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Bug: one-shot cron jobs silently lost after gateway restart

Version: 2026.4.5
Platform: Docker (Linux)

What happens

When the gateway restarts while a one-shot cron job is mid-execution (runningAtMs is set), the job is permanently lost. No notification is delivered, no run is recorded, and the job is never retried.

Root cause

In src/cron/service/ops.ts (startup logic):

On startup, any job with runningAtMs set gets cleared and added to startupInterruptedJobIds
runMissedJobs is then called with skipJobIds: startupInterruptedJobIds
For one-shot (at:) jobs this means: the job existed, started, gateway died, and on recovery it is skipped instead of retried
Result: runningAtMs cleared, 0 entries in run history, delivery never happens

Observed symptoms

Scheduled reminder had runningAtMs set in cron/jobs.json
After gateway restart: runningAtMs cleared, state.runs = 0
No notification delivered to any channel
Job not rescheduled, not flagged, silently gone

Workaround

Impact

Any one-shot reminder or scheduled task that fires exactly when the gateway is restarting (e.g. during a model switch, container restart, or update) is permanently lost without any user-visible signal.

Steps to reproduce

When the gateway restarts while a one-shot cron job is mid-execution (runningAtMs is set), the job is permanently lost. No notification is delivered, no run is recorded, and the job is never retried.

Root cause in src/cron/service/ops.ts (startup):

Jobs with runningAtMs set get cleared and added to startupInterruptedJobIds
runMissedJobs is called with skipJobIds: startupInterruptedJobIds
One-shot (at:) jobs are skipped instead of retried
Result: runningAtMs cleared, 0 runs in history, delivery never happens

Symptoms:

Scheduled reminder had runningAtMs set in cron/jobs.json
After restart: runningAtMs cleared, state.runs = 0
No notification delivered, job not rescheduled, silently gone

Expected behavior

An interrupted one-shot job should be retried on restart, not skipped. Options:

Re-execute it immediately (treat as overdue)
Flag it with a state.interruptedAt marker and surface it to the user
At minimum: do not silently discard it — log a warning or deliver a failure notification
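
The second and third options could be combined along the lines of the sketch below. It assumes the `state.interruptedAt` marker proposed above (it is not an existing field), and the `flagInterruptedJobs` helper and the warning text are hypothetical.

```ts
// Hypothetical sketch of the "flag and surface" option; `interruptedAt` is the
// marker proposed in this issue, not an existing field.
interface FlaggableJob {
  id: string;
  state: {
    runningAtMs?: number;
    interruptedAt?: number;
  };
}

// Clears stale running markers, records when the interruption was detected,
// and returns one warning per interrupted job for the caller to log or send.
function flagInterruptedJobs(jobs: FlaggableJob[], nowMs: number): string[] {
  const warnings: string[] = [];
  for (const job of jobs) {
    if (typeof job.state.runningAtMs !== "number") continue;
    const startedAtMs = job.state.runningAtMs;
    delete job.state.runningAtMs;
    job.state.interruptedAt = nowMs;
    warnings.push(
      `cron job ${job.id} was interrupted (started at ${startedAtMs}) and has not been delivered`,
    );
  }
  return warnings;
}

const jobs: FlaggableJob[] = [
  { id: "reminder-1", state: { runningAtMs: 900 } },
  { id: "reminder-2", state: {} },
];
const warnings = flagInterruptedJobs(jobs, 1000);
for (const warning of warnings) console.warn(warning);
```

Returning the warnings instead of logging inside the helper leaves the choice between a log line and a user-facing failure notification to the caller.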

Actual behavior

runningAtMs is cleared on restart, job is added to startupInterruptedJobIds, and runMissedJobs skips it entirely. No retry, no warning, no failure notification. The job disappears silently with 0 runs recorded.

OpenClaw version

2026.4.5

Operating system

Manjaro Linux

Install method

docker

Model

N/A

Provider / routing chain

N/A

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

Workaround: a recurring reconciliation job that compares cron_ids in tasks-store.json against the active cron list and recreates orphaned one-shots after each gateway restart.

Impact: any one-shot reminder that fires exactly when the gateway restarts (model switch, container restart, update) is permanently lost without any user-visible signal.

extent analysis

TL;DR

Change the startup recovery path in src/cron/service/ops.ts so that one-shot jobs interrupted by a gateway restart are retried instead of being passed to runMissedJobs as jobs to skip.

Guidance

Review the startup logic in src/cron/service/ops.ts that clears runningAtMs and decides which jobs runMissedJobs should skip.
Consider adding a state.interruptedAt marker to flag interrupted jobs and surface them to the user.
Retry interrupted one-shot jobs, either by re-executing them immediately as overdue or by scheduling them for a later retry.
Make sure interrupted one-shot jobs no longer end up in the skipJobIds set passed to runMissedJobs, while one-shots that already completed (lastStatus === "ok") are still not re-run.

Example

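// src/cron/service/ops.ts (illustrative sketch, not the actual source)
type Job = {
  id: string;
  schedule: { kind: "at" | "cron" | "every" };
  state: { nextRunAtMs?: number; lastStatus?: "ok" | "error" };
};

const runMissedJobs = (
  jobs: Job[],
  nowMs: number,
  skipJobIds: Set<string>,
  retryJob: (job: Job) => void, // re-execute the job or schedule it for later
): void => {
  for (const job of jobs) {
    if (skipJobIds.has(job.id)) continue;
    const overdue =
      typeof job.state.nextRunAtMs === "number" && nowMs >= job.state.nextRunAtMs;
    // A one-shot job that is overdue and has no recorded result was
    // interrupted before completion: retry it instead of skipping it.
    if (job.schedule.kind === "at" && overdue && job.state.lastStatus === undefined) {
      retryJob(job);
    }
    // ... recurring jobs go through the existing catch-up logic
  }
};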

Notes

The recurring reconciliation workaround mitigates the issue, but a proper fix requires changing the startup recovery path so that interrupted one-shot jobs are retried rather than skipped.

Recommendation

Apply the recurring reconciliation workaround until a fix that retries interrupted one-shot jobs is available. This helps prevent permanent loss of one-shot reminders and scheduled tasks during gateway restarts.


