openclaw - 💡(How to fix) Fix Cron jobs show stale/partial failure state after model recovery; fallback errors hide successful reruns [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#62424Fetched 2026-04-08 03:04:27
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0

After Anthropic quota exhaustion and later recovery/model fallback changes, cron troubleshooting became misleading because different surfaces showed different realities:

  • openclaw cron list --all --json exposed stale state.lastError / state.lastRunStatus for many jobs.
  • Some jobs had already recovered on newer runs, but the last visible state we first checked made them look still broken.
  • openclaw cron runs --id <job> gave the real picture, but the latest successful manual rerun was not obvious unless explicitly inspected.
  • Fallback chains also fail noisily with mixed provider errors, which makes it harder to distinguish "current failure" from "historic failure".

Error Message

  • 1775553942278error (FallbackSummaryError across anthropic/google/zai)

3) Fallback error composition is confusing

Typical latest error: When all fallbacks fail, the aggregated error is preserved, but when a later run succeeds on a different provider, it is too easy to keep looking at stale failure narratives.

  • latest=ok (gpt-5.4, 09:36 UTC), previous=error (fallback_summary)

Root Cause

When debugging production automations, the key question is not just “did this ever fail?” but “is it still failing right now?”. In a mixed recovery scenario, the current surfaces make that harder than it should be.

RAW_BUFFERClick to expand / collapse

Summary

After Anthropic quota exhaustion and later recovery/model fallback changes, cron troubleshooting became misleading because different surfaces showed different realities:

  • openclaw cron list --all --json exposed stale state.lastError / state.lastRunStatus for many jobs.
  • Some jobs had already recovered on newer runs, but the last visible state we first checked made them look still broken.
  • openclaw cron runs --id <job> gave the real picture, but the latest successful manual rerun was not obvious unless explicitly inspected.
  • Fallback chains also fail noisily with mixed provider errors, which makes it harder to distinguish "current failure" from "historic failure".

Environment

  • OpenClaw 2026.4.5 (3e72c03)
  • Gateway local, WhatsApp + Slack enabled
  • Default interactive sessions currently on gpt-5.4

What I observed

1) cron list vs cron runs mismatch during recovery

For job ops-monitor (40893c57-2658-4f3f-9073-3b645c6a7763):

Recent run history:

  • 1775556202217ok on gpt-5.4 / openai-codex
  • 1775553942278error (FallbackSummaryError across anthropic/google/zai)
  • 1775550327550ok on gemini-2.5-flash

This means the job recovered, but during debugging the surfaced job state and prior assistant-visible context still strongly suggested the job was failing.

2) Other jobs still genuinely failing

For example:

  • vault-sentinel-watch latest run: FallbackSummaryError
  • vault-curator-extract latest run: FallbackSummaryError
  • vault-scribe-sweep latest run: FallbackSummaryError
  • vault-scribe-LOCAL-experiment latest run: FallbackSummaryError
  • comms-health-check latest run: direct anthropic quota failure

So the system had a mixed state: some jobs recovered, others did not.

3) Fallback error composition is confusing

Typical latest error:

FallbackSummaryError: All models failed (4): anthropic/claude-sonnet-4-6: ... out of extra usage ... | google/gemini-2.5-flash: API rate limit reached | google/gemini-2.5-pro: API rate limit reached | zai/glm-5: HTTP 404: Not Found (model_not_found)

This is useful, but operationally it obscures:

  • whether the job is currently failing or only failed on a previous run,
  • which provider/model was actually selected for the next attempt,
  • whether a newer successful run superseded the bad state.

Suspected root causes

  1. Job state surfacing is too coarse cron list appears to summarize the persisted job state, but that state is not enough to understand mixed recovery scenarios.

  2. Latest successful rerun is not prominent enough Operators need an easy way to see: “latest run OK, previous run failed”. Right now that requires separate cron runs inspection.

  3. Fallback state/reporting is lossy When all fallbacks fail, the aggregated error is preserved, but when a later run succeeds on a different provider, it is too easy to keep looking at stale failure narratives.

  4. Potential model alias/config drift Jobs configured with anthropic models later ran on Gemini and then on gpt-5.4 / openai-codex, likely due to runtime override/reversion behavior. That may be expected, but it is not surfaced clearly in job/state views.

Repro-ish path

  1. Configure cron jobs with Anthropic primary and Google/ZAI fallbacks.
  2. Exhaust Anthropic quota.
  3. Let multiple cron jobs fail.
  4. Restore service / allow a different default runtime model / manual rerun.
  5. Compare:
    • openclaw cron list --all --json
    • openclaw cron runs --id <job> --limit 5
  6. Notice some jobs have recovered while others still fail, but top-level state/debug flow remains ambiguous.

Expected

  • cron list should make the latest run outcome unambiguous, especially after recovery.
  • If the latest run succeeded, stale lastError should either be cleared or clearly marked as historical.
  • Operator-facing views should show both:
    • latest run status/model/provider
    • previous failure summary when relevant

Suggested fixes

CLI / state UX

  • Add lastSuccessAtMs, lastSuccessModel, lastSuccessProvider to job state.
  • In cron list, show lastRunStatus, lastRunAt, and if latest is OK, suppress or annotate prior lastError as historical.
  • Add a compact status line like:
    • latest=ok (gpt-5.4, 09:36 UTC), previous=error (fallback_summary)

Run history UX

  • Add openclaw cron doctor <id> or openclaw cron explain <id> to summarize mixed-state recovery.

Fallback diagnostics

  • In FallbackSummaryError, include whether the failing run was superseded by a later successful run.
  • Surface configured fallback chain + actual chosen model/provider for each run more prominently.

Why this matters

When debugging production automations, the key question is not just “did this ever fail?” but “is it still failing right now?”. In a mixed recovery scenario, the current surfaces make that harder than it should be.

extent analysis

TL;DR

Enhance the cron list command to display the latest run outcome unambiguously, including the status, model, and provider, and differentiate between current and historical failures.

Guidance

  • Modify the cron list output to include lastSuccessAtMs, lastSuccessModel, and lastSuccessProvider fields to provide a clear picture of the latest run.
  • Introduce a compact status line that shows the latest run status, model, and provider, and annotates prior errors as historical if the latest run was successful.
  • Consider adding a cron doctor or cron explain command to summarize mixed-state recovery and provide a clear overview of job status.
  • Improve the FallbackSummaryError to include information on whether the failing run was superseded by a later successful run.

Example

A possible implementation of the enhanced cron list output could be:

{
  "jobId": "40893c57-2658-4f3f-9073-3b645c6a7763",
  "lastRunStatus": "ok",
  "lastRunAt": 1775556202217,
  "lastSuccessAtMs": 1775556202217,
  "lastSuccessModel": "gpt-5.4",
  "lastSuccessProvider": "openai-codex",
  "statusLine": "latest=ok (gpt-5.4, 09:36 UTC), previous=error (fallback_summary)"
}

Notes

The proposed changes aim to address the ambiguity in the current cron list output and provide a clearer picture of job status, especially in mixed recovery scenarios. However, the implementation details may vary depending on the specific requirements and constraints of the OpenClaw system.

Recommendation

Apply the suggested fixes to enhance the cron list command and improve the overall debugging experience for production automations. This will provide operators with a clearer understanding of job status and help them identify and resolve issues more efficiently.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING