openclaw - ✅(Solved) Fix Cron runningAtMs zombie state regression — cleared state returns after gateway restart [1 pull requests, 1 comments, 2 participants]

openclaw2026-04-01 14:05:58

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#59056•Fetched 2026-04-08 02:29:18

View on GitHub

Comments

Participants

Timeline

Reactions

Author

ponchoooPenguin

Participants

ponchoooPenguin

sharkqwy

Timeline (top)

commented ×1cross-referenced ×1

Error Message

9 cron jobs simultaneously stuck with runningAtMs set to the exact same timestamp (1775045523045 = 2026-04-01 08:12:03 ET), even though their last runs completed successfully (status: ok or error with proper lastDurationMs values).

Root Cause

Downstream cron jobs (e.g. revenue-loop) stop firing because the scheduler appears to be at concurrency limit or skips scheduling when too many jobs are in running state.

Fix Action

Fix / Workaround

Workaround Applied

PR fix notes

PR #68112: fix(cron): prevent scheduler death when startup catch-up fails

Repository: openclaw/openclaw
Author: alexanderxfgl-bit
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/68112

Description (problem / solution / changelog)

Summary

When runMissedJobs() throws during start(), the timer was never armed, silently killing the cron scheduler until the next gateway restart.

Root Cause

The start() function in src/cron/service/ops.ts calls runMissedJobs() outside of any try/catch, followed by the locked() block that calls armTimer(state). If runMissedJobs() throws (e.g. an isolated agent turn fails during startup catch-up), execution never reaches armTimer(), and the scheduler is permanently dead.

Changes

Guard armTimer in start() - Wrap runMissedJobs() in try/catch so the timer is always armed
Zombie detection in onTimer() - Clear stale runningAtMs markers at runtime (threshold: 2x job timeout or 2 min) so jobs recover without restart
Tests - Verify timer stays armed on catch-up failure and stale markers are cleared

Fixes #67854 Related: #59056, #60799

Changed files

src/cron/service.start-error-resilience.test.ts (added, +409/-0)
src/cron/service/ops.ts (modified, +86/-5)
src/cron/service/timer.ts (modified, +78/-1)

RAW_BUFFERClick to expand / collapse

Problem

Regression of #18120 (closed as fixed). runningAtMs is still not reliably clearing on job completion.

Environment

OpenClaw 2026.3.22 (4dcc39c)
macOS (arm64), Node v22.22.0 (gateway service) / v25.5.0 (CLI)
45 cron jobs configured, ~14 scraper crons running frequently (every 1-2 hours)

Observed Behavior

The stuck jobs included a mix of:

Jobs that last ran 3-12 hours ago and completed normally
Jobs with different schedules and timeouts (2700s, 3600s, null)
Both haiku and sonnet46 model jobs

The identical runningAtMs across all 9 suggests a batch state write (possibly on scheduler wake/catch-up) that sets runningAtMs without a corresponding session actually starting.

Impact

Downstream cron jobs (e.g. revenue-loop) stop firing because the scheduler appears to be at concurrency limit or skips scheduling when too many jobs are in running state.

Workaround Applied

Edited ~/.openclaw/cron/jobs.json to set runningAtMs: null for all stuck jobs
Restarted gateway to pick up clean state
Jobs resumed firing normally

Hypothesis

The fix in #18120 handles the case where a session completes/errors through applyJobResult. But there appears to be a path where runningAtMs gets set without a real session starting — possibly during catch-up scheduling when overdue jobs are detected. The catch-up path may set runningAtMs but then skip actual execution (e.g. due to stale delivery), leaving the flag permanently set.

Evidence: gateway log shows skipping stale delivery warnings for some of these same jobs around the time the zombie state appeared.

Possibly Related

Two different Node.js versions observed in gateway logs (22.22.0 for the service, 25.5.0 for CLI). Unknown if this contributes to state inconsistency.

extent analysis

TL;DR

Manually resetting runningAtMs to null for stuck jobs and restarting the gateway service may temporarily resolve the issue, but a permanent fix likely requires addressing the underlying issue with catch-up scheduling.

Guidance

Investigate the catch-up scheduling mechanism to determine why runningAtMs is being set without a corresponding session start, potentially due to stale delivery warnings.
Review gateway logs for patterns or correlations between skipping stale delivery warnings and the appearance of stuck jobs.
Consider implementing a retry or timeout mechanism for catch-up scheduling to prevent runningAtMs from being permanently set.
Verify that the Node.js version discrepancy between the gateway service and CLI does not contribute to the state inconsistency.

Example

No code snippet is provided due to the lack of specific implementation details in the issue.

Notes

The provided workaround may not be scalable or reliable for production environments, and a more permanent solution is needed to prevent recurrence. The discrepancy in Node.js versions may or may not be related to the issue, but it is worth investigating.

Recommendation

Apply workaround: Manually reset runningAtMs to null for stuck jobs and restart the gateway service, while simultaneously investigating the underlying cause of the issue to implement a permanent fix.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#embedding generation #cache error #pipeline error #runtime error #dependency conflict

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

openclaw - ✅(Solved) Fix Cron runningAtMs zombie state regression — cleared state returns after gateway restart [1 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Workaround Applied

PR fix notes

PR #68112: fix(cron): prevent scheduler death when startup catch-up fails

Description (problem / solution / changelog)

Summary

Root Cause

Changes

Changed files

Problem

Environment

Observed Behavior

Impact

Workaround Applied

Hypothesis

Possibly Related

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING