openclaw - ✅(Solved) Fix cron: transient 'lost' marker on long-running manual runs before sweeper recovery [2 pull requests, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#78233 (fetched 2026-05-07 03:39:21)
Comments: 1 · Participants: 2 · Timeline events: 6 · Reactions: 2
Timeline (top): cross-referenced ×2, commented ×1, mentioned ×1, referenced ×1

Manual cron runs that exceed TASK_RECONCILE_GRACE_MS (5 min) surface a transient lost task-registry status and a Background task lost system message in the grace window before resolveDurableCronTaskRecovery (1fae716a04) reconciles. The final result is correct, but the user who triggered the run sees an inaccurate intermediate state.

Root Cause

Why filing as issue (not PR)

Closed prior PR #71040 because its primary scenario (isolated agentTurn startup catchup) was superseded by `deferAgentTurnJobs:true` (7877182b6f) and `1fae716a04`. This narrower scope deserves maintainer triage before another PR.

Fix Action

Fixed

PR fix notes

PR #78243: fix(cron): mark active-jobs on manual-run path to suppress transient lost marker

Description (problem / solution / changelog)

Summary

  • Problem: Manual cron runs (openclaw cron run <id> --force and the matching RPC / agent-tool surface) that exceed TASK_RECONCILE_GRACE_MS (5 min) surface a transient lost task-registry status plus a Background task lost system message in the grace window before resolveDurableCronTaskRecovery (1fae716a04, merged 2026-04-26) reconciles. Final state is correct, but the user triggering the run sees an inaccurate intermediate state.
  • Why it matters: Force-mode agentTurn manual runs can legitimately run up to AGENT_TURN_SAFETY_TIMEOUT_MS (60 min); the default-mode timeout is 10 min — both well past the 5-minute grace.
  • What changed: Mirror the markCronJobActive/clearCronJobActive pair (introduced for runDueJob/executeJob by #60310) into prepareManualRun + finishPreparedManualRun, so task-registry.maintenance.ts hasBackingSession returns true for the duration of the manual run.
  • What did NOT change: runStartupCatchupCandidate is intentionally untouched — deferAgentTurnJobs:true (7877182b6f) reroutes long-running startup catchups to runDueJob/executeJob (already wired), and the merged 1fae716a04 covers the residual non-agentTurn case from a different axis. No public API surface change.
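The mark/clear mirroring described under "What changed" can be sketched as follows. This is a minimal illustration, not the code from ops.ts: the Set-backed tracker and the `runManualWithActiveMarker` wrapper are hypothetical stand-ins, and only the names `markCronJobActive`, `clearCronJobActive`, and `isCronJobActive` come from the PR text.

```typescript
// Hypothetical stand-in for the process-local active-jobs tracker.
// The real module is not shown in this PR description.
const activeJobIds = new Set<string>();

function markCronJobActive(jobId: string): void {
  activeJobIds.add(jobId); // adding an existing id is a no-op
}

function clearCronJobActive(jobId: string): void {
  activeJobIds.delete(jobId);
}

function isCronJobActive(jobId: string): boolean {
  return activeJobIds.has(jobId);
}

// Hypothetical wrapper: keeps the job marked active for the whole
// duration of `execute`, whether it resolves or rejects.
async function runManualWithActiveMarker<T>(
  jobId: string,
  execute: () => Promise<T>,
): Promise<T> {
  markCronJobActive(jobId);
  try {
    return await execute();
  } finally {
    // Runs on both the success path and the inner-throw path.
    clearCronJobActive(jobId);
  }
}
```

The try/finally shape is what guarantees the marker cannot outlive the run, which is the property the two regression cases in this PR assert.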

Change Type

  • Bug fix

Scope

  • Touched: src/cron/service/ops.ts (3 lines added — import + markCronJobActive + try/finally wrap with clearCronJobActive)
  • Test: src/cron/active-jobs-manual-run.test.ts (new, 160 lines, 2 cases)

Linked Issue

Fixes #78233

Root Cause

task-registry.maintenance.ts hasBackingSession for the cron branch under isCronRuntimeAuthoritative()=true (production default) depends solely on isCronJobActive(jobId). prepareManualRun (ops.ts:654) creates a manual tryCreateManualTaskRun task but never calls markCronJobActive; finishPreparedManualRun (ops.ts:702) likewise never clears it. With TASK_RECONCILE_GRACE_MS = 5 min and force-mode manual runs reaching up to 60 min, the sweeper marks the still-running task lost and emits the Background task lost system message before the run completes.
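To make the failure mode concrete, here is a simplified model of the sweeper decision described above. It is a sketch under stated assumptions: the `CronTaskRecord` shape, the `wouldMarkLost` helper, and measuring the grace window from the task start time are all hypothetical; only `TASK_RECONCILE_GRACE_MS`, `hasBackingSession`, and `isCronJobActive` are names taken from the PR text.

```typescript
// 5 minutes, as stated in the issue.
const TASK_RECONCILE_GRACE_MS = 5 * 60 * 1000;

// Hypothetical, reduced task shape for illustration only.
interface CronTaskRecord {
  jobId: string;
  startedAtMs: number;
}

// Simplified cron branch: under the authoritative cron runtime the
// result depends solely on the active-jobs tracker.
function hasBackingSession(
  task: CronTaskRecord,
  isCronJobActive: (jobId: string) => boolean,
): boolean {
  return isCronJobActive(task.jobId);
}

// Hypothetical sweeper predicate: a task is flagged lost once it is
// past the grace window and has no backing session.
function wouldMarkLost(
  task: CronTaskRecord,
  nowMs: number,
  isCronJobActive: (jobId: string) => boolean,
): boolean {
  const pastGrace = nowMs - task.startedAtMs > TASK_RECONCILE_GRACE_MS;
  return pastGrace && !hasBackingSession(task, isCronJobActive);
}
```

In this model, a manual run that never calls markCronJobActive is flagged at the first check after the grace window even though it is still executing, while a marked run is never flagged.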

Regression Test Plan

  • New test file: src/cron/active-jobs-manual-run.test.ts — two cases against the production hot-path (cron.run("<id>", "force") direct invocation, no internal-API rerouting):
    1. Success path: assert isCronJobActive transitions true (mid-run) → false (after run resolves).
    2. Inner-throw path: assert isCronJobActive is also cleared by the finally block when the inner agent run rejects.
  • Existing tests (src/cron/service.test.ts, src/cron/service.restart-catchup.test.ts) untouched.
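The two cases in the plan above share one shape: hold the inner run open, probe the active flag, release the run, then probe again. The harness below is a hedged sketch of that shape and is not the content of active-jobs-manual-run.test.ts; `deferred` and `observeActiveFlag` are hypothetical helpers, and the run under test is passed in as plain functions so the sketch does not assume any OpenClaw API.

```typescript
// Small helper that exposes a promise's resolve/reject handles.
function deferred<T>() {
  let resolve!: (value: T) => void;
  let reject!: (reason: unknown) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Hypothetical harness: `startRun` must invoke `inner` as the body of
// the run; `isActive` probes the active-jobs tracker for that job.
async function observeActiveFlag(
  startRun: (inner: () => Promise<void>) => Promise<unknown>,
  isActive: () => boolean,
  outcome: "resolve" | "reject",
): Promise<{ midRun: boolean; afterRun: boolean }> {
  const entered = deferred<void>();
  const gate = deferred<void>();

  const run = startRun(async () => {
    entered.resolve(); // signal that the inner run has started
    await gate.promise; // stay "running" until the gate is released
  });

  await entered.promise;
  const midRun = isActive(); // expected: true while the run is open

  if (outcome === "resolve") {
    gate.resolve();
  } else {
    gate.reject(new Error("inner agent run rejected"));
  }
  await run.catch(() => undefined); // swallow the expected rejection

  return { midRun, afterRun: isActive() }; // expected afterRun: false
}
```

A test would call this once with "resolve" and once with "reject", expecting `midRun` to be true and `afterRun` to be false in both cases.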

Security Impact

None. Internal cron service state only — no auth, secrets, network, or cross-process surfaces. No CODEOWNERS-restricted paths.

Repro + Verification

  • Local: pnpm vitest run src/cron/active-jobs-manual-run.test.ts → 2/2 pass
  • Type check: pnpm tsgo:core:test → exit 0
  • Pre-fix behaviour reproducible by reverting the ops.ts change in this PR; the new test then fails because isCronJobActive returns false mid-run.

Evidence

Failing test before fix (locally verified by reverting just the ops.ts change): both new cases fail at the mid-run expect(isCronJobActive(...)).toBe(true) assertion. Passing after fix: both green.

Human Verification

Code paths traced: prepareManualRun → tryCreateManualTaskRun → cron.run → finishPreparedManualRun → applyJobResult. Confirmed against task-registry.maintenance.ts hasBackingSession (cron branch) and resolveCronJobStateRecovery (lastRunAtMs matching).

Review Conversations

This is a narrower follow-up to closed PR #71040 (which also touched runStartupCatchupCandidate). The full PR was closed after a 5-agent pre-PR cross-review found that 1fae716a04 had already addressed the same axis from the sweeper side, and deferAgentTurnJobs:true removed the agentTurn startup-catchup scenario entirely. This PR retains only the producer-side gap that 1fae716a04 does not close: the transient lost marker during the 5-minute grace window before sweeper recovery reconciles.

Compatibility / Migration

None. Internal-only API mirroring an existing contract.

Risks and Mitigations

  • Concurrent manual runs: prepareManualRun already prevents re-entry (runningAtMs guard); even if mark/clear were called twice for the same jobId the activeJobIds set is idempotent.
  • Inner throws: covered by try/finally and the dedicated regression test.
  • Interaction with 1fae716a04: orthogonal — sweeper recovery still applies; this fix only suppresses the transient lost surface inside the grace window.
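The concurrency point above can be illustrated with a small model of a re-entry guard sitting in front of a Set-backed marker. Everything here is an assumption made for illustration: the `JobState` shape and the `tryBeginManualRun` / `endManualRun` helpers are hypothetical, and the actual `runningAtMs` guard in prepareManualRun is not shown in this PR description.

```typescript
// Hypothetical stand-in for the active-jobs tracker.
const activeJobIds = new Set<string>();

// Hypothetical, reduced job state.
interface JobState {
  runningAtMs: number | null;
}

// Refuses to start a second run while one is in flight, so mark and
// clear stay paired one-to-one per job.
function tryBeginManualRun(jobId: string, state: JobState, nowMs: number): boolean {
  if (state.runningAtMs !== null) {
    return false; // re-entry guard: a run is already in progress
  }
  state.runningAtMs = nowMs;
  activeJobIds.add(jobId); // a repeated add would be a no-op
  return true;
}

function endManualRun(jobId: string, state: JobState): void {
  state.runningAtMs = null;
  activeJobIds.delete(jobId); // deleting a missing id is also a no-op
}
```

Because a Set holds each id at most once, a duplicate mark cannot leave a stale entry behind; the guard matters because a single clear would otherwise drop the marker for every overlapping run of the same job.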

[AI-assisted, fully tested]

Real behavior proof

  • Behavior or issue addressed: Without this patch, manual cron runs that exceed TASK_RECONCILE_GRACE_MS (5 min) are marked lost by task-registry.maintenance.ts hasBackingSession (cron branch under isCronRuntimeAuthoritative()=true) because no producer-side markCronJobActive is called from prepareManualRun. With this patch, the same long-running manual run keeps isCronJobActive=true for the duration and is never marked lost by the sweeper.

  • Real environment tested: macOS 25.4 (darwin arm64), Node 23.9, OpenClaw 2026.5.5 from this branch, locally built. Local gateway (openclaw gateway run, loopback, port 18789) configured with openclaw onboard --non-interactive --accept-risk --mode local --auth-choice skip, then OpenAI Codex OAuth bound via openclaw models auth login --provider openai-codex (ChatGPT Plus subscription, profile openai-codex:<email>, agentRuntime: codex, agents.defaults.model.primary: openai-codex/gpt-5.5). Same gateway, same cron job, same codex auth across both builds; only src/cron/service/ops.ts differs.

  • Exact steps or command run after this patch:

    $ pnpm build                                       # build this branch (with mark/clear)
    $ openclaw cron add --name agentturn-mark-clear-demo \
        --every 1h \
        --message "Run the shell command 'sleep 420 && echo done' using your exec tool, then reply 'completed'." \
        --session isolated --thinking high --timeout-seconds 900 \
        --tools exec --model openai-codex/gpt-5.5
    $ openclaw cron run <job-id>
    # then wait > 5 min and observe sqlite task_runs
    $ sqlite3 ~/.openclaw/tasks/runs.sqlite \
        "SELECT task_id, status, started_at, ended_at, error FROM task_runs \
         WHERE source_id='<job-id>' ORDER BY started_at DESC LIMIT 1"

    For the without-fix comparison, the same steps were run on a build where src/cron/service/ops.ts was reverted to the base commit (git checkout ea391c6df2 -- src/cron/service/ops.ts) before pnpm build.

  • Evidence after fix: live task_runs rows from ~/.openclaw/tasks/runs.sqlite (same cron job, two builds).

    # Build A — without this patch (ops.ts at base, no markCronJobActive in prepareManualRun)
    task_id     : 3084baa3-b848-4f6c-be8a-9e1a21e925f3
    status      : lost
    started_at  : 1778047111346  (14:58:30)
    ended_at    : 1778047469142  (15:04:29, T0+5m59s)
    last_event_at: 1778047469142
    error       : backing session missing
    
    # Build B — with this patch (ops.ts with markCronJobActive + try/finally clearCronJobActive)
    task_id     : c583e334-c863-4da7-a637-acd11cdb8393
    status      : failed                       <-- not 'lost'
    started_at  : 1778046431855  (14:47:11)
    ended_at    : 1778046871009  (14:54:31, T0+7m20s)
    last_event_at: 1778046871009
    error       : Channel is required (no configured channels detected)...   <-- unrelated delivery config, the run itself finalized normally

    Gateway log excerpts (cron + agent activity):

    # Build B (with patch) — codex agent stalled at the ChatGPT rate limit, sweeper grace exceeded
    [diagnostic] stalled session: ... age=131s ... lastProgress=codex_app_server:notification:account/rateLimits/updated cronJob="agentturn-mark-clear-demo"
    [diagnostic] stalled session: ... age=161s
    [diagnostic] stalled session: ... age=191s
    [diagnostic] stalled session: ... age=221s
    [diagnostic] stalled session: ... age=251s
    [diagnostic] stalled session: ... age=281s
    # No "Background task lost" message emitted. Run finalized as 'failed' (delivery channel unrelated) without ever flipping to 'lost'.
  • Observed result after fix: With the patch, task_runs.status for a 7-minute manual cron run stays out of lost and finalizes as failed/succeeded directly. Without the patch, task-registry.maintenance.ts hasBackingSession returns false at the first sweeper tick after TASK_RECONCILE_GRACE_MS (5 min) and marks the still-running cron task lost with error="backing session missing". The only difference between the two runs above is the four-line ops.ts change in this PR; cron job, codex auth, gateway config, and ChatGPT subscription are identical.

  • What was not tested: The full chain from Background task lost system message to its retroactive reconciliation by resolveDurableCronTaskRecovery (1fae716a04) was not visually exercised in this setup, since the Codex-driven runs hit the ChatGPT subscription rate limit before normal completion. The producer-side mark/clear gap that this PR closes is fully exercised: the lost status flip happens (without patch) or is suppressed (with patch) deterministically at TASK_RECONCILE_GRACE_MS, which is the user-visible step the patch targets.

Changed files

  • src/cron/active-jobs-manual-run.test.ts (added, +160/-0)
  • src/cron/service/ops.ts (modified, +82/-76)

PR #78245: fix: mark manual cron runs active

Description (problem / solution / changelog)

Closes #78233

Summary

  • mark manual cron run executions active in the process-local cron runtime tracker after the durable running marker is persisted
  • clear the active marker in finishPreparedManualRun with finally, including missing-job or persist-failure exits
  • add a regression test covering manual-run active tracking while the isolated job is still running

Verification

  • PATH="/tank/development/linus/openclaw/node_modules/.bin:/tmp/openclaw-pnpm-shim:$PATH" pnpm format:check src/cron/service/ops.ts src/cron/service/ops.regression.test.ts
  • git diff --check
  • PATH="/tank/development/linus/openclaw/node_modules/.bin:/tmp/openclaw-pnpm-shim:$PATH" node scripts/check-changed.mjs ⚠️ blocked by unrelated existing core typecheck errors, primarily missing @openclaw/fs-safe/* packages plus strictness diagnostics outside this patch
  • targeted vitest for src/cron/service/ops.regression.test.ts -t "tracks manual cron.run as an active cron job" ⚠️ attempted with parent node_modules; Vitest produced no test output and was killed by the 120s timeout before running/reporting tests in this worktree

Changed files

  • src/cron/service/ops.regression.test.ts (modified, +38/-0)
  • src/cron/service/ops.ts (modified, +82/-76)
Issue body (raw)

Where

  • src/cron/service/ops.ts prepareManualRun / finishPreparedManualRun — task is created via tryCreateManualTaskRun but markCronJobActive is never called, so task-registry.maintenance.ts hasBackingSession returns false for the cron branch under isCronRuntimeAuthoritative()=true.
  • Compare with runDueJob / executeJob (timer.ts), which were wired to markCronJobActive/clearCronJobActive by #60310.

Repro

  1. Schedule a cron job with payload.kind: "agentTurn" (force mode → up to 60-min timeout) or any manual-run path that takes longer than 5 minutes.
  2. Trigger via `openclaw cron run <id> --force` (or RPC equivalent).
  3. After ~5 min: task-registry sweeper marks the active task `lost`, `Background task lost` system message is emitted to the session.
  4. Once the run completes, `applyJobResult` updates `lastRunStatus`; the next maintenance tick reconciles via `resolveDurableCronTaskRecovery`.

Why this is narrower than #68191

#68191 covers the broader durable-recovery story already addressed by 1fae716a04. This issue is the residual UX gap on the manual-run path: producer-side `markCronJobActive`/`clearCronJobActive` would prevent the transient lost state entirely, complementing the sweeper-side recovery already in main.

Possible directions (not prescriptive — happy to defer to maintainer preference)

  • Mirror the `markCronJobActive` / `clearCronJobActive` pair into `prepareManualRun` + `finishPreparedManualRun` (try/finally), matching the contract on `runDueJob` / `executeJob`.
  • Or treat as wontfix if the transient lost surface is acceptable post-`1fae716a04`.

Severity

Low — final state is correct; only intermediate UX noise on long manual runs.

[AI-assisted analysis]
