openclaw - 💡(How to fix) Fix Bug: `saveCronStore` overwrites jobs.json from partial in-memory state after restart, causing silent job loss [1 participants]

openclaw · 2026-03-24 14:52:13

GitHub stats

openclaw/openclaw#53746•Fetched 2026-04-08 01:24:01

Author

dustinprojectcoordinator-alt

Participants

dustinprojectcoordinator-alt

Timeline (top)

cross-referenced ×2

When the gateway restarts and a cron fires before full state is loaded, saveCronStore writes a partial in-memory job list over the full on-disk job list — silently wiping all jobs not present in memory at that moment. We lost 86 jobs twice in 24 hours from this.

Error Message

All other jobs are gone — no error, no warning

Root Cause

File: src/cron/store.ts (dist: config-runtime-BYNizC50.js)
Function: saveCronStore(storePath, store, opts)

Current write path:

in-memory state → write to .tmp → rename to jobs.json

The function writes whatever is currently in memory as the complete canonical job list. If an isolated cron session fires 1–2 seconds after a gateway restart, only its own jobs exist in its memory scope. When it writes, it replaces all 50+ other jobs with its 1–2 job state.

OpenClaw already uses atomic writes (tmp → rename) which prevents file corruption — but atomic writes of the wrong data still cause silent data loss.
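
To make the failure mode concrete, here is a minimal sketch that models jobs.json as a plain array and applies a full-replacement save. The names `Job` and `naiveSave` and the sample data are illustrative assumptions, not OpenClaw's actual code.

```typescript
// Illustrative model only: `Job` and `naiveSave` are hypothetical names,
// not taken from src/cron/store.ts.
interface Job {
  id: string;
  schedule: string;
}

// Full-replacement save: whatever the caller holds in memory becomes the
// entire canonical job list, regardless of what was on disk.
function naiveSave(_diskJobs: Job[], memoryJobs: Job[]): Job[] {
  return memoryJobs.slice();
}

// 52 jobs on disk before the restart.
const diskBefore: Job[] = Array.from({ length: 52 }, (_, i) => ({
  id: `job-${i}`,
  schedule: "0 * * * *",
}));

// An isolated session that has loaded only its own job so far.
const partialMemory: Job[] = diskBefore.slice(0, 1);

const diskAfter = naiveSave(diskBefore, partialMemory);
console.log(`${diskBefore.length} jobs before, ${diskAfter.length} after`);
// prints "52 jobs before, 1 after"
```

The write itself succeeds and the file is well-formed, which is why nothing surfaces as an error.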

Reproduction

1. Create 50+ cron jobs
2. Restart the gateway
3. Within ~5 seconds of restart, a cron fires in an isolated session
4. That session has only 1–2 jobs in memory
5. saveCronStore writes those 1–2 jobs to disk
6. All other jobs are gone, with no error and no warning

This is amplified by any crash loop or rapid restart cycle (e.g., watchdog, config changes).

Proposed Fix: Read-Merge-Write Pattern

Instead of writing in-memory state directly, saveCronStore should:

1. Read: load current jobs.json from disk
2. Merge: apply only the in-memory delta (add/modify/remove the specific job that changed)
3. Backup: copy current jobs.json → jobs.json.bak (already done, keep this)
4. Write: write merged result to .tmp, then rename to jobs.json

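// Proposed change to saveCronStore.
// readJobsFromDisk, mergeJobStates, and backupAndAtomicWrite are helper names
// used by this proposal; they are not confirmed to exist in the codebase.
async function saveCronStore(storePath, store, opts) {
  // Read current disk state; fall back to an empty list if nothing is returned
  const diskJobs = (await readJobsFromDisk(storePath)) ?? [];

  // Merge: apply delta from in-memory store onto disk state
  const merged = mergeJobStates(diskJobs, store.jobs);

  // Backup + atomic write (existing behavior, preserved)
  await backupAndAtomicWrite(storePath, merged);
}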

This ensures:

- A session with 1 job in memory cannot wipe 51 jobs from disk
- Add/modify/delete operations apply as deltas, not full replacements
- Behavior is identical to current for the normal (non-restart-race) case
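
One way to make "apply as deltas" explicit is to have the caller name exactly what changed, so a job that is merely absent from memory is never treated as deleted. The sketch below assumes a hypothetical `JobDelta` shape and `applyDelta` helper; neither is part of OpenClaw, and the real job type likely has more fields.

```typescript
// Hypothetical shapes: OpenClaw's real job and store types may differ.
interface Job {
  id: string;
  schedule: string;
}

// A delta lists only what changed. Anything not mentioned is left as it
// is on disk.
interface JobDelta {
  upsert: Job[];
  removeIds: string[];
}

function applyDelta(diskJobs: Job[], delta: JobDelta): Job[] {
  const byId = new Map<string, Job>();
  for (const job of diskJobs) byId.set(job.id, job);
  // Added or modified jobs overwrite matching fields of the disk copy.
  for (const job of delta.upsert) byId.set(job.id, { ...byId.get(job.id), ...job });
  // Only explicitly named ids are removed.
  for (const id of delta.removeIds) byId.delete(id);
  return Array.from(byId.values());
}

const onDisk: Job[] = [
  { id: "a", schedule: "0 * * * *" },
  { id: "b", schedule: "0 0 * * *" },
  { id: "c", schedule: "0 12 * * *" },
];

const afterDelta = applyDelta(onDisk, {
  upsert: [{ id: "b", schedule: "30 0 * * *" }],
  removeIds: ["c"],
});
console.log(afterDelta.map((j) => j.id).join(","));
// prints "a,b"
```

With this shape, a session that knows about one job can only ever touch that one job.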

Current Workaround

External watchdog (cron-guardian.sh via launchd) running every 5 minutes:

- Detects count regression (< 10 jobs)
- Auto-restores from rotating timestamped backups
- Sends Telegram alert
- Preserves forensics file

This mitigates impact but does not prevent the race. Restoration window is ~5 minutes worst-case.
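
The detection and restore decision can be sketched as a pure function. This is not the contents of cron-guardian.sh, which is a shell script not shown in the issue. The `Backup` type, the `pickRestoreCandidate` name, and the "newest healthy backup wins" rule are assumptions made for illustration. Only the threshold of 10 jobs comes from the description above.

```typescript
// Hypothetical types and names; the real watchdog is a shell script whose
// contents are not shown in the issue.
interface Backup {
  path: string;
  takenAtMs: number;
  jobCount: number;
}

// Returns the backup to restore from, or null when no restore is needed
// or no healthy backup exists.
function pickRestoreCandidate(
  currentJobCount: number,
  backups: Backup[],
  minJobs: number,
): Backup | null {
  if (currentJobCount >= minJobs) return null; // no count regression
  const healthy = backups.filter((b) => b.jobCount >= minJobs);
  if (healthy.length === 0) return null;
  // Assumption: prefer the most recent backup that still looks complete.
  return healthy.reduce((best, b) => (b.takenAtMs > best.takenAtMs ? b : best));
}

const rotating: Backup[] = [
  { path: "jobs.1000.json", takenAtMs: 1000, jobCount: 52 },
  { path: "jobs.1500.json", takenAtMs: 1500, jobCount: 52 },
  { path: "jobs.2000.json", takenAtMs: 2000, jobCount: 2 },
];

const candidate = pickRestoreCandidate(2, rotating, 10);
console.log(candidate ? candidate.path : "nothing to restore");
// prints "jobs.1500.json"
```

Skipping backups that are themselves below the threshold matters here, because a backup taken after the bad write would otherwise restore the truncated list.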

Environment

OpenClaw version: 2026.3.23-2
OS: macOS 15.x (Darwin arm64)
Gateway: launchd-managed, self-heal watchdog enabled
Cron jobs at time of loss: ~86 (first incident), ~52 (second incident)

Related

Issue #53481: cron.onChange webhook (filed separately; a registry hook would also help detect this faster)

extent analysis

Fix Plan

To address the issue of silent data loss when the gateway restarts and a cron fires before full state is loaded, we will implement a read-merge-write pattern in the saveCronStore function.

Here are the concrete steps:

1. Read the current jobs.json from disk using readJobsFromDisk.
2. Merge the in-memory job list with the disk state using mergeJobStates.
3. Back up the current jobs.json to jobs.json.bak.
4. Write the merged job list to a temporary file, then rename it to jobs.json. In the example code, backupAndAtomicWrite performs steps 3 and 4.

Example code:

async function saveCronStore(storePath, store, opts) {
  // Read current disk state
  const diskJobs = await readJobsFromDisk(storePath) ?? [];
  
  // Merge: apply delta from in-memory store onto disk state
  const merged = mergeJobStates(diskJobs, store.jobs);
  
  // Backup + atomic write (existing behavior, preserved)
  await backupAndAtomicWrite(storePath, merged);
}

// Helper function to merge job states.
// Assumes each job has a unique `id` and that deletions are marked with a
// `deleted` flag; a job that is merely absent from memory is kept.
function mergeJobStates(diskJobs, memoryJobs) {
  // Index disk jobs by id, copying each so the inputs are not mutated
  const byId = new Map(diskJobs.map((job) => [job.id, { ...job }]));

  for (const job of memoryJobs) {
    if (job.deleted) {
      // Remove deleted jobs
      byId.delete(job.id);
    } else {
      // Add or update jobs from memory
      byId.set(job.id, { ...byId.get(job.id), ...job });
    }
  }

  return [...byId.values()];
}

Verification

To verify that the fix worked, you can:

1. Restart the gateway and trigger a cron job within 5 seconds of restart.
2. Check the jobs.json file to ensure that all jobs are still present.
3. Verify that the jobs.json.bak file is updated correctly.
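
The restart race can also be checked without a running gateway by simulating it against the merge logic. The sketch below restates a typed version of the merge helper so it is self-contained. The `Job` shape and the `deleted` flag are the same assumptions as in the example code, and the job ids and schedules are made up.

```typescript
// Hypothetical job shape; `deleted` is the flag assumed by the merge helper.
interface Job {
  id: string;
  schedule: string;
  deleted?: boolean;
}

function mergeJobStates(diskJobs: Job[], memoryJobs: Job[]): Job[] {
  const byId = new Map<string, Job>();
  for (const job of diskJobs) byId.set(job.id, { ...job });
  for (const job of memoryJobs) {
    if (job.deleted) byId.delete(job.id);
    else byId.set(job.id, { ...byId.get(job.id), ...job });
  }
  return Array.from(byId.values());
}

// Simulated restart race: 52 jobs on disk, 1 job loaded in memory.
const disk: Job[] = Array.from({ length: 52 }, (_, i) => ({
  id: `job-${i}`,
  schedule: "0 * * * *",
}));
const memory: Job[] = [{ id: "job-7", schedule: "15 * * * *" }];

const merged = mergeJobStates(disk, memory);

if (merged.length !== 52) {
  throw new Error(`lost jobs: ${merged.length} of 52 remain`);
}
if (merged.find((j) => j.id === "job-7")?.schedule !== "15 * * * *") {
  throw new Error("in-memory update was not applied");
}
console.log("all 52 jobs survived the partial-state save");
```

A second case worth adding is an explicit deletion, to confirm that only the flagged job disappears.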

Extra Tips

- Make sure to test the mergeJobStates function thoroughly to ensure it handles all edge cases correctly.
- Consider adding additional logging or monitoring to detect any issues with the new implementation.
- Review the cron-guardian.sh script to ensure it is still necessary and effective with the new implementation.
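
As one possible form of the logging suggested above, a save path could compare job counts before and after a write and warn on a sharp drop. The `checkForShrink` name, the `ShrinkCheck` type, and the 50% default threshold are all assumptions chosen for illustration, not existing OpenClaw behavior.

```typescript
// Hypothetical guard; the 50% default threshold is an arbitrary assumption.
interface ShrinkCheck {
  suspicious: boolean;
  message: string;
}

function checkForShrink(
  previousCount: number,
  nextCount: number,
  maxLossRatio: number = 0.5,
): ShrinkCheck {
  const lost = previousCount - nextCount;
  // Flag the write when it would drop more than the allowed share of jobs.
  const suspicious = lost > previousCount * maxLossRatio;
  const message = suspicious
    ? `cron store would shrink from ${previousCount} to ${nextCount} jobs`
    : "ok";
  return { suspicious, message };
}

const check = checkForShrink(52, 1);
if (check.suspicious) console.warn(check.message);
// warns "cron store would shrink from 52 to 1 jobs"
```

Whether a flagged write should be blocked or only logged is a policy choice; logging alone would at least turn the silent loss into a visible one.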
