openclaw - ✅(Solved) Fix [Bug] Cron spin loop via MIN_REFIRE_GAP_MS fallback when computeJobNextRunAtMs returns undefined (incomplete fix for #17821) [3 pull requests, 2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#66019Fetched 2026-04-14 05:39:14
View on GitHub
Comments
2
Participants
2
Timeline
9
Reactions
0
Author
Participants
Timeline (top)
cross-referenced ×3referenced ×3commented ×2closed ×1

applyJobResult in src/cron/service/timer.ts conflates two semantics on the MIN_REFIRE_GAP_MS constant:

  1. Lower bound for the next cron fire, protecting against the case where computeJobNextRunAtMs returns a value in the same second as the run just finished (the #17821 fix).
  2. Fallback value for when computeJobNextRunAtMs returns undefined entirely.

The second usage is wrong: when the schedule cannot be computed at all, falling back to endedAt + 2s causes the scheduler to re-fire the job 2 seconds later. That run then fails/skips for the same reason, and we're in a tight loop. The structurally identical bug exists in the error-backoff branch.

#17821 is the closed predecessor and covers the "landed in the same second" case. It does not cover the undefined case, and the existing regression tests (timer.regression.test.ts: spin-loop-17821 and spin-gap-17821) do not exercise it.

Error Message

The second usage is wrong: when the schedule cannot be computed at all, falling back to endedAt + 2s causes the scheduler to re-fire the job 2 seconds later. That run then fails/skips for the same reason, and we're in a tight loop. The structurally identical bug exists in the error-backoff branch. If croner's cron.nextRun() returns null (observed for certain tz/ICU combinations — see environment below), the function silently returns undefined. No exception is thrown.

4. Same shape in the error-backoff branch

} else if (result.status === "error" && isJobEnabled(job)) { state.deps.log.warn( The error-backoff branch should get a parallel treatment. Additionally the catch blocks around computeJobNextRunAtMs could early-return after recordScheduleComputeError (or set a flag checked below) so the subsequent assignment doesn't race with the error handler. An analogous test for the error-backoff branch would prevent regression in both paths. I've been running the success-path fix above on the shipped bundle since the bug was diagnosed, with cron jobs behaving correctly. The new warn log line has not fired in my environment, consistent with the note above that the undefined path is rare on my specific setup — the patch functions as a safety net rather than a hot path.

Root Cause

Root cause walkthrough

Fix Action

Fix / Workaround

Local patch (for reference)

I've been running the success-path fix above on the shipped bundle since the bug was diagnosed, with cron jobs behaving correctly. The new warn log line has not fired in my environment, consistent with the note above that the undefined path is rare on my specific setup — the patch functions as a safety net rather than a hot path.

PR fix notes

PR #66076: fix(cron): promote silent-undefined schedule to error, break spin loop (#66019)

Description (problem / solution / changelog)

Summary

`applyJobResult` in `src/cron/service/timer.ts` conflated two semantics on `MIN_REFIRE_GAP_MS`:

  1. Lower bound for the next cron fire (the #17821 fix — protects against "next run lands in the same second as the run that just finished").
  2. Fallback value for when `computeJobNextRunAtMs` returns `undefined`.

The second usage caused a spin loop: `computeNextRunAtMs` and the 4-attempt staggered loop in `src/cron/schedule.ts` can silently return `undefined` when croner's `cron.nextRun()` returns `null` on certain tz/ICU edge cases. When that happened, `nextRunAtMs` got pinned to `endedAt + 2s`, the job re-fired 2s later, computation silently failed again, and the scheduler tight-looped.

The throw path had the same structural issue: `recordScheduleComputeError` set `nextRunAtMs = undefined`, but the subsequent unconditional assignment in the same branch overwrote it back to `minNext`, defeating the intended auto-disable.

Credit to #66019 for a precise root-cause trace through `computeNextRunAtMs`, `computeStaggeredCronNextRunAtMs`, and both branches of `applyJobResult`.

Closes #66019.

Fix

  • Track schedule-compute failure via a local `scheduleComputeFailed` flag.
  • Treat silent `undefined` returns as a schedule error by calling `recordScheduleComputeError` with a synthetic `Error("schedule computation returned undefined")` — matches the existing throw-path behavior (increments `scheduleErrorCount`, sets `nextRunAtMs = undefined`, auto-disables after `MAX_SCHEDULE_ERRORS = 3`).
  • Skip the `minNext`/`backoffNext` fallback assignment when the compute failed, so the auto-disable state is preserved.

Same fix applied to both the cron success branch and the error-backoff branch — the reporter identified both as structurally affected.

Changes

  • `src/cron/service/timer.ts` — promote silent undefined to schedule error in both branches of `applyJobResult`

Test plan

  • Mirrors the existing throw-path semantics via `recordScheduleComputeError` — no new auto-disable logic
  • Preserves the #17821 `MIN_REFIRE_GAP_MS` safety net on the normal path (only skipped when compute failed)
  • Existing `timer.regression.test.ts` tests (`spin-loop-17821`, `spin-gap-17821`) continue to cover the "same-second" case
  • Ideally needs a new regression test for the silent-undefined path — happy to add in a follow-up once the approach is confirmed

🤖 Generated with Claude Code

Changed files

  • src/cli/update-cli.test.ts (modified, +15/-0)
  • src/cron/service/jobs.schedule-error-isolation.test.ts (modified, +53/-1)
  • src/cron/service/jobs.ts (modified, +74/-6)
  • src/cron/service/ops.regression.test.ts (modified, +126/-0)
  • src/cron/service/ops.ts (modified, +34/-5)
  • src/cron/service/timer.regression.test.ts (modified, +287/-1)
  • src/cron/service/timer.ts (modified, +102/-36)
  • src/infra/update-runner.test.ts (modified, +6/-0)

PR #66083: fix(cron): stop unresolved next-run refire loops

Description (problem / solution / changelog)

Summary

  • fix the cron scheduler path where computeJobNextRunAtMs returning undefined was treated as a short retry instead of an unresolved schedule
  • keep the #17821 lower-bound guard for same-second refires, but stop inventing synthetic retries for unschedulable cron runs
  • keep a periodic maintenance wake armed for enabled jobs with no nextRunAtMs so the scheduler does not go fully idle after clearing an unresolved schedule
  • add focused regression coverage for both the completion path and the cron error-backoff path

Root cause

src/cron/service/timer.ts used MIN_REFIRE_GAP_MS and backoff delays for two different meanings:

  1. lower bounds when a valid next run exists
  2. fallback schedule values when cron next-run computation returned undefined

That second meaning was wrong. An unschedulable cron run could be re-armed a few seconds later and refire forever.

Scope

In scope:

  • #66019

Explicitly out of scope:

  • #66016, #65916, #65193: missing job.state startup-crash family
  • #65981: isolated cron-agent execution / cron-tool mismatch
  • #65987: task timestamp audit noise

Validation

  • pnpm test -- src/cron/service/timer.regression.test.ts src/cron/service.armtimer-tight-loop.test.ts

Notes

  • pnpm check is currently failing on unrelated latest-main TypeScript errors outside this slice (Discord, Feishu, Nextcloud Talk, WhatsApp, MCP, wizard setup, and one existing cron isolated-agent test type issue). I did not broaden this PR into those unrelated failures.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/cron/service.armtimer-tight-loop.test.ts (modified, +37/-0)
  • src/cron/service.issue-66019-unresolved-next-run.test.ts (added, +114/-0)
  • src/cron/service/timer.regression.test.ts (modified, +86/-3)
  • src/cron/service/timer.ts (modified, +47/-3)

PR #66122: test(cron): fix #66019 maintenance regression coverage

Description (problem / solution / changelog)

Summary

  • make the #66019 regression test force maintenance to repair nextRunAtMs
  • keep the PR scoped to the existing test only

Why

The previous mock sequence let applyJobResult consume the second computeNextRunAtMs result during the built-in next-second retry, so the new maintenance path was not actually what made the test pass.

Testing

  • pnpm test -- src/cron/service.issue-66019-unresolved-next-run.test.ts

Changed files

  • src/cron/service.issue-66019-unresolved-next-run.test.ts (modified, +1/-1)

Code Example

const cron = resolveCronFromSchedule(schedule);
if (!cron) return;
let next = cron.nextRun(new Date(nowMs));
if (!next) return;

---

for (let attempt = 0; attempt < 4; attempt += 1) {
  const baseNext = computeNextRunAtMs(job.schedule, cursorMs);
  if (baseNext === undefined) return;
  const shifted = baseNext + offsetMs;
  if (shifted > nowMs) return shifted;
  cursorMs = Math.max(cursorMs + 1, baseNext + 1000);
}
// implicit undefined return

---

} else if (isJobEnabled(job)) {
  let naturalNext: number | undefined;
  try {
    naturalNext = opts?.preserveSchedule && job.schedule.kind === "every"
      ? computeNextWithPreservedLastRun(result.endedAt)
      : computeJobNextRunAtMs(job, result.endedAt);
  } catch (err) {
    recordScheduleComputeError({ state, job, err });
    // no return; execution continues
  }
  if (job.schedule.kind === "cron") {
    const minNext = result.endedAt + MIN_REFIRE_GAP_MS;
    job.state.nextRunAtMs =
      naturalNext !== undefined ? Math.max(naturalNext, minNext) : minNext;
    //                                                           ^^^^^^^ bug
  } else {
    job.state.nextRunAtMs = naturalNext;
  }
}

---

} else if (result.status === "error" && isJobEnabled(job)) {
  // ...
  const backoffNext = result.endedAt + backoff;
  job.state.nextRunAtMs =
    normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext;
  //                                                             ^^^^^^^^^^^^ same issue
}

---

if (job.schedule.kind === "cron") {
  if (naturalNext !== undefined) {
    const minNext = result.endedAt + MIN_REFIRE_GAP_MS;
    job.state.nextRunAtMs = Math.max(naturalNext, minNext);
  } else {
    // Schedule computation failed; clearing nextRunAtMs avoids a
    // respin loop. The job will be rescheduled on the next heartbeat /
    // timer arm when the schedule can be recomputed from scratch.
    job.state.nextRunAtMs = undefined;
    state.deps.log.warn(
      { jobId: job.id, jobName: job.name },
      "cron: nextRun unresolved — clearing to avoid respin loop",
    );
  }
} else {
  job.state.nextRunAtMs = naturalNext;
}

---

it("clears nextRunAtMs when schedule compute returns undefined (does not spin)", async () => {
  // Arrange: enabled cron job whose computeJobNextRunAtMs path returns undefined
  // (mock via a stubbed croner/schedule helper, or a test hook).
  // ...
  applyJobResult(state, job, { status: "ok", startedAt, endedAt });
  expect(job.state.nextRunAtMs).toBeUndefined();
});
RAW_BUFFERClick to expand / collapse

Summary

applyJobResult in src/cron/service/timer.ts conflates two semantics on the MIN_REFIRE_GAP_MS constant:

  1. Lower bound for the next cron fire, protecting against the case where computeJobNextRunAtMs returns a value in the same second as the run just finished (the #17821 fix).
  2. Fallback value for when computeJobNextRunAtMs returns undefined entirely.

The second usage is wrong: when the schedule cannot be computed at all, falling back to endedAt + 2s causes the scheduler to re-fire the job 2 seconds later. That run then fails/skips for the same reason, and we're in a tight loop. The structurally identical bug exists in the error-backoff branch.

#17821 is the closed predecessor and covers the "landed in the same second" case. It does not cover the undefined case, and the existing regression tests (timer.regression.test.ts: spin-loop-17821 and spin-gap-17821) do not exercise it.

Root cause walkthrough

1. Silent undefined return from computeNextRunAtMs

src/cron/schedule.tscomputeNextRunAtMs:

const cron = resolveCronFromSchedule(schedule);
if (!cron) return;
let next = cron.nextRun(new Date(nowMs));
if (!next) return;

If croner's cron.nextRun() returns null (observed for certain tz/ICU combinations — see environment below), the function silently returns undefined. No exception is thrown.

2. Silent fall-through in computeStaggeredCronNextRunAtMs

src/cron/schedule.tscomputeStaggeredCronNextRunAtMs:

for (let attempt = 0; attempt < 4; attempt += 1) {
  const baseNext = computeNextRunAtMs(job.schedule, cursorMs);
  if (baseNext === undefined) return;
  const shifted = baseNext + offsetMs;
  if (shifted > nowMs) return shifted;
  cursorMs = Math.max(cursorMs + 1, baseNext + 1000);
}
// implicit undefined return

The 4-attempt loop can exit without returning, implicitly giving undefined.

3. applyJobResult bad fallback in success path

src/cron/service/timer.ts, cron success branch:

} else if (isJobEnabled(job)) {
  let naturalNext: number | undefined;
  try {
    naturalNext = opts?.preserveSchedule && job.schedule.kind === "every"
      ? computeNextWithPreservedLastRun(result.endedAt)
      : computeJobNextRunAtMs(job, result.endedAt);
  } catch (err) {
    recordScheduleComputeError({ state, job, err });
    // no return; execution continues
  }
  if (job.schedule.kind === "cron") {
    const minNext = result.endedAt + MIN_REFIRE_GAP_MS;
    job.state.nextRunAtMs =
      naturalNext !== undefined ? Math.max(naturalNext, minNext) : minNext;
    //                                                           ^^^^^^^ bug
  } else {
    job.state.nextRunAtMs = naturalNext;
  }
}

When naturalNext === undefined, nextRunAtMs becomes result.endedAt + 2000. Next tick (2 seconds later) the scheduler re-fires the job.

There's also a subtle secondary issue: recordScheduleComputeError already sets job.state.nextRunAtMs = undefined for the THROW case, but because the catch does not return, the subsequent assignment in the same branch overwrites it back to minNext.

4. Same shape in the error-backoff branch

} else if (result.status === "error" && isJobEnabled(job)) {
  // ...
  const backoffNext = result.endedAt + backoff;
  job.state.nextRunAtMs =
    normalNext !== undefined ? Math.max(normalNext, backoffNext) : backoffNext;
  //                                                             ^^^^^^^^^^^^ same issue
}

Here the fallback is errorBackoffMs(consecutiveErrors) rather than MIN_REFIRE_GAP_MS, which mitigates the tight loop via exponential backoff — but for the first few errors the interval is still short enough to cause noticeable channel spam, and the structural confusion is the same.

Observed symptom

A single cron job fires ~10 times in ~10 minutes. Each run completes in 4–6 seconds (agent turn duration); between runs the scheduler wakes up on its newly-set short nextRunAtMs and re-fires. Visible as channel spam (Telegram/WeChat messages repeating). On macOS specifically; same version on Ubuntu appears unaffected.

I don't have a deterministic minimal repro for naturalNext === undefined — a test * * * * * cron in Asia/Shanghai tz fired cleanly every 60s in my environment during verification. The undefined path appears to need a specific croner/ICU edge case that's not reliably triggered by a simple test job, but the structural bug is present in the code regardless of how often the path is hit.

Proposed fix

Treat undefined as "unknown, clear the schedule" rather than as "use the minimum gap". Success path:

if (job.schedule.kind === "cron") {
  if (naturalNext !== undefined) {
    const minNext = result.endedAt + MIN_REFIRE_GAP_MS;
    job.state.nextRunAtMs = Math.max(naturalNext, minNext);
  } else {
    // Schedule computation failed; clearing nextRunAtMs avoids a
    // respin loop. The job will be rescheduled on the next heartbeat /
    // timer arm when the schedule can be recomputed from scratch.
    job.state.nextRunAtMs = undefined;
    state.deps.log.warn(
      { jobId: job.id, jobName: job.name },
      "cron: nextRun unresolved — clearing to avoid respin loop",
    );
  }
} else {
  job.state.nextRunAtMs = naturalNext;
}

The error-backoff branch should get a parallel treatment. Additionally the catch blocks around computeJobNextRunAtMs could early-return after recordScheduleComputeError (or set a flag checked below) so the subsequent assignment doesn't race with the error handler.

Suggested regression test

In src/cron/service/timer.regression.test.ts, alongside spin-loop-17821 / spin-gap-17821:

it("clears nextRunAtMs when schedule compute returns undefined (does not spin)", async () => {
  // Arrange: enabled cron job whose computeJobNextRunAtMs path returns undefined
  // (mock via a stubbed croner/schedule helper, or a test hook).
  // ...
  applyJobResult(state, job, { status: "ok", startedAt, endedAt });
  expect(job.state.nextRunAtMs).toBeUndefined();
});

An analogous test for the error-backoff branch would prevent regression in both paths.

Environment

  • OpenClaw: 2026.4.11 (installed via Homebrew, /opt/homebrew/lib/node_modules/openclaw/)
  • OS: macOS 26.4.1 (Darwin 25.4.0), arm64, MacBook Pro 14" (M5)
  • Node: v25.9.0
  • Affected job combinations observed: sessionTarget: isolated + payload.kind: agentTurn and sessionTarget: main + payload.kind: systemEvent + wakeMode: next-heartbeat
  • Not observed on Ubuntu running the same OpenClaw version, suggesting croner/ICU/tz edge case is platform-specific at the trigger layer — though the structural bug above is platform-independent.

Related

  • #17821 — closed; added the MIN_REFIRE_GAP_MS safety net for the same-second case. This issue reports the undefined case that fix missed.
  • #52097 — open; "Cron job runs repeatedly after manual trigger". Symptom similar, root cause not confirmed the same but worth cross-linking for triage.

Local patch (for reference)

I've been running the success-path fix above on the shipped bundle since the bug was diagnosed, with cron jobs behaving correctly. The new warn log line has not fired in my environment, consistent with the note above that the undefined path is rare on my specific setup — the patch functions as a safety net rather than a hot path.

extent analysis

TL;DR

The proposed fix involves treating undefined as "unknown, clear the schedule" rather than as "use the minimum gap" in the success path of applyJobResult to prevent a respin loop.

Guidance

  • Identify and update the applyJobResult function in src/cron/service/timer.ts to handle undefined values from computeJobNextRunAtMs by clearing nextRunAtMs instead of using MIN_REFIRE_GAP_MS.
  • Apply a parallel treatment to the error-backoff branch to prevent similar issues.
  • Consider adding early returns or flags in catch blocks around computeJobNextRunAtMs to prevent racing with error handlers.
  • Create regression tests for both the success and error-backoff paths to ensure the fix works as expected and to prevent future regressions.

Example

The proposed fix for the success path is provided in the issue:

if (job.schedule.kind === "cron") {
  if (naturalNext !== undefined) {
    const minNext = result.endedAt + MIN_REFIRE_GAP_MS;
    job.state.nextRunAtMs = Math.max(naturalNext, minNext);
  } else {
    job.state.nextRunAtMs = undefined;
    state.deps.log.warn(
      { jobId: job.id, jobName: job.name },
      "cron: nextRun unresolved — clearing to avoid respin loop",
    );
  }
}

Notes

The issue seems to be platform-specific due to croner/ICU/tz edge cases, but the structural bug is platform-independent. The proposed fix should be applied regardless of the environment.

Recommendation

Apply the proposed fix to treat undefined as "unknown, clear the schedule" to prevent respin loops and create regression tests to ensure the fix works as expected. This approach addresses the root cause of the issue and provides a safe and reliable solution.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING