openclaw - 💡(How to fix) Fix Cron jobs show stale/partial failure state after model recovery; fallback errors hide successful reruns [1 participants]

upendradevsingh · 2026-04-07T10:06:38Z

[openclaw] After Anthropic quota exhaustion and later recovery/model fallback changes, cron troubleshooting became misleading because different surfaces showed… After Anthropic quota exhaustion and later recovery/model fallback changes, cron troubleshooting became misleading because different surfaces showed different realities: - `openclaw cron list --all --json` exposed stale `state.lastError` / `state.lastRunStatus` for many jobs. - Some jobs had already recovered on newer runs, but the last visible state we first checked made them look still broken. - `openclaw cron runs --id ` gave the real picture, but the latest successful manual rerun was not obvious unless explicitly inspected. - Fallback chains also fail noisily with mixed provider errors, which makes it harder to distinguish "current failure" from "historic failure". ## Summary After Anthropic quota exhaustion and later recovery/model fallback changes, cron troubleshooting became misleading because different surfaces showed different realities: - `openclaw cron list --all --json` exposed stale `state.lastError` / `state.lastRunStatus` for many jobs. - Some jobs had already recovered on newer runs, but the last visible state we first checked made them look still broken. - `openclaw cron runs --id ` gave the real picture, but the latest successful manual rerun was not obvious unless explicitly inspected. - Fallback chains also fail noisily with mixed provider errors, which makes it harder to distinguish "current failure" from "historic failure". ## Environment - OpenClaw `2026.4.5 (3e72c03)` - Gateway local, WhatsApp + Slack enabled - Default interactive sessions currently on `gpt-5.4` ## What I observed ### 1) `cron list` vs `cron runs` mismatch during recovery For job `ops-monitor` (`40893c57-2658-4f3f-9073-3b645c6a7763`): Recent run history: - `1775556202217` → `ok` on `gpt-5.4` / `openai-codex` - `1775553942278` → `error` (`FallbackSummaryError` across anthropic/google/zai) - `1775550327550` → `ok` on `gemini-2.5-flash` This means the job recovered, but during debugging the surfaced job state and prior assistant-visible context still strongly suggested the job was failing. ### 2) Other jobs still genuinely failing For example: - `vault-sentinel-watch` latest run: `FallbackSummaryError` - `vault-curator-extract` latest run: `FallbackSummaryError` - `vault-scribe-sweep` latest run: `FallbackSummaryError` - `vault-scribe-LOCAL-experiment` latest run: `FallbackSummaryError` - `comms-health-check` latest run: direct anthropic quota failure So the system had a mixed state: some jobs recovered, others did not. ### 3) Fallback error composition is confusing Typical latest error: `FallbackSummaryError: All models failed (4): anthropic/claude-sonnet-4-6: ... out of extra usage ... | google/gemini-2.5-flash: API rate limit reached | google/gemini-2.5-pro: API rate limit reached | zai/glm-5: HTTP 404: Not Found (model_not_found)` This is useful, but operationally it obscures: - whether the job is currently failing or only failed on a previous run, - which provider/model was actually selected for the next attempt, - whether a newer successful run superseded the bad state. ## Suspected root causes 1. **Job state surfacing is too coarse** `cron list` appears to summarize the persisted job state, but that state is not enough to understand mixed recovery scenarios. 2. **Latest successful rerun is not prominent enough** Operators need an easy way to see: “latest run OK, previous run failed”. Right now that requires separate `cron runs` inspection. 3. **Fallback state/reporting is lossy** When all fallbacks fail, the aggregated error is preserved, but when a later run succeeds on a different provider, it is too easy to keep looking at stale failure narratives. 4. **Potential model alias/config drift** Jobs configured with anthropic models later ran on Gemini and then on `gpt-5.4` / `openai-codex`, likely due to runtime override/reversion behavior. That may be expected, but it is not surfaced clearly in job/state views. ## Repro-ish path 1. Configure cron jobs with Anthropic primary and Google/ZAI fallbacks. 2. Exhaust Anthropic quota. 3. Let multiple cron jobs fail. 4. Restore service / allow a different default runtime model / manual rerun. 5. Compare: - `openclaw cron list --all --json` - `openclaw cron runs --id --limit 5` 6. Notice some jobs have recovered while others still fail, but top-level state/debug flow remains ambiguous. ## Expected - `cron list` should make the latest run outcome unambiguous, especially after recovery. - If the latest run succeeded, stale `lastError` should either be cleared or clearly marked as historical. - Operator-facing views should show both: - latest run status/model/provider - previous failure summary when relevant ## Suggested fixes ### CLI / state UX - Add `lastSuccessAtMs`, `lastSuccessModel`, `lastSuccessProvider` to job state. - In `cron list`, show `lastRunStatus`, `lastRunAt`, and if latest is OK, supp

openclaw2026-04-07 10:06:38

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#62424•Fetched 2026-04-08 03:04:27

View on GitHub

Comments

Participants

Timeline

Reactions

Author

upendradevsingh

Participants

upendradevsingh

After Anthropic quota exhaustion and later recovery/model fallback changes, cron troubleshooting became misleading because different surfaces showed different realities:

openclaw cron list --all --json exposed stale state.lastError / state.lastRunStatus for many jobs.
Some jobs had already recovered on newer runs, but the last visible state we first checked made them look still broken.
openclaw cron runs --id <job> gave the real picture, but the latest successful manual rerun was not obvious unless explicitly inspected.
Fallback chains also fail noisily with mixed provider errors, which makes it harder to distinguish "current failure" from "historic failure".

Error Message

1775553942278 → error (FallbackSummaryError across anthropic/google/zai)

3) Fallback error composition is confusing

Typical latest error: When all fallbacks fail, the aggregated error is preserved, but when a later run succeeds on a different provider, it is too easy to keep looking at stale failure narratives.

latest=ok (gpt-5.4, 09:36 UTC), previous=error (fallback_summary)

Root Cause

When debugging production automations, the key question is not just “did this ever fail?” but “is it still failing right now?”. In a mixed recovery scenario, the current surfaces make that harder than it should be.

RAW_BUFFERClick to expand / collapse

Summary

After Anthropic quota exhaustion and later recovery/model fallback changes, cron troubleshooting became misleading because different surfaces showed different realities:

openclaw cron list --all --json exposed stale state.lastError / state.lastRunStatus for many jobs.
Some jobs had already recovered on newer runs, but the last visible state we first checked made them look still broken.
openclaw cron runs --id <job> gave the real picture, but the latest successful manual rerun was not obvious unless explicitly inspected.
Fallback chains also fail noisily with mixed provider errors, which makes it harder to distinguish "current failure" from "historic failure".

Environment

OpenClaw 2026.4.5 (3e72c03)
Gateway local, WhatsApp + Slack enabled
Default interactive sessions currently on gpt-5.4

What I observed

1) `cron list` vs `cron runs` mismatch during recovery

For job ops-monitor (40893c57-2658-4f3f-9073-3b645c6a7763):

Recent run history:

1775556202217 → ok on gpt-5.4 / openai-codex
1775553942278 → error (FallbackSummaryError across anthropic/google/zai)
1775550327550 → ok on gemini-2.5-flash

This means the job recovered, but during debugging the surfaced job state and prior assistant-visible context still strongly suggested the job was failing.

2) Other jobs still genuinely failing

For example:

vault-sentinel-watch latest run: FallbackSummaryError
vault-curator-extract latest run: FallbackSummaryError
vault-scribe-sweep latest run: FallbackSummaryError
vault-scribe-LOCAL-experiment latest run: FallbackSummaryError
comms-health-check latest run: direct anthropic quota failure

So the system had a mixed state: some jobs recovered, others did not.

3) Fallback error composition is confusing

Typical latest error:

FallbackSummaryError: All models failed (4): anthropic/claude-sonnet-4-6: ... out of extra usage ... | google/gemini-2.5-flash: API rate limit reached | google/gemini-2.5-pro: API rate limit reached | zai/glm-5: HTTP 404: Not Found (model_not_found)

This is useful, but operationally it obscures:

whether the job is currently failing or only failed on a previous run,
which provider/model was actually selected for the next attempt,
whether a newer successful run superseded the bad state.

Suspected root causes

Job state surfacing is too coarse cron list appears to summarize the persisted job state, but that state is not enough to understand mixed recovery scenarios.
Latest successful rerun is not prominent enough Operators need an easy way to see: “latest run OK, previous run failed”. Right now that requires separate cron runs inspection.
Fallback state/reporting is lossy When all fallbacks fail, the aggregated error is preserved, but when a later run succeeds on a different provider, it is too easy to keep looking at stale failure narratives.
Potential model alias/config drift Jobs configured with anthropic models later ran on Gemini and then on gpt-5.4 / openai-codex, likely due to runtime override/reversion behavior. That may be expected, but it is not surfaced clearly in job/state views.

Repro-ish path

Configure cron jobs with Anthropic primary and Google/ZAI fallbacks.
Exhaust Anthropic quota.
Let multiple cron jobs fail.
Restore service / allow a different default runtime model / manual rerun.
Compare:
- openclaw cron list --all --json
- openclaw cron runs --id <job> --limit 5
Notice some jobs have recovered while others still fail, but top-level state/debug flow remains ambiguous.

Expected

cron list should make the latest run outcome unambiguous, especially after recovery.
If the latest run succeeded, stale lastError should either be cleared or clearly marked as historical.
Operator-facing views should show both:
- latest run status/model/provider
- previous failure summary when relevant

Suggested fixes

CLI / state UX

Add lastSuccessAtMs, lastSuccessModel, lastSuccessProvider to job state.
In cron list, show lastRunStatus, lastRunAt, and if latest is OK, suppress or annotate prior lastError as historical.
Add a compact status line like:
- latest=ok (gpt-5.4, 09:36 UTC), previous=error (fallback_summary)

Run history UX

Add openclaw cron doctor <id> or openclaw cron explain <id> to summarize mixed-state recovery.

Fallback diagnostics

In FallbackSummaryError, include whether the failing run was superseded by a later successful run.
Surface configured fallback chain + actual chosen model/provider for each run more prominently.

Why this matters

extent analysis

TL;DR

Enhance the cron list command to display the latest run outcome unambiguously, including the status, model, and provider, and differentiate between current and historical failures.

Guidance

Modify the cron list output to include lastSuccessAtMs, lastSuccessModel, and lastSuccessProvider fields to provide a clear picture of the latest run.
Introduce a compact status line that shows the latest run status, model, and provider, and annotates prior errors as historical if the latest run was successful.
Consider adding a cron doctor or cron explain command to summarize mixed-state recovery and provide a clear overview of job status.
Improve the FallbackSummaryError to include information on whether the failing run was superseded by a later successful run.

Example

A possible implementation of the enhanced cron list output could be:

{
  "jobId": "40893c57-2658-4f3f-9073-3b645c6a7763",
  "lastRunStatus": "ok",
  "lastRunAt": 1775556202217,
  "lastSuccessAtMs": 1775556202217,
  "lastSuccessModel": "gpt-5.4",
  "lastSuccessProvider": "openai-codex",
  "statusLine": "latest=ok (gpt-5.4, 09:36 UTC), previous=error (fallback_summary)"
}

Notes

The proposed changes aim to address the ambiguity in the current cron list output and provide a clearer picture of job status, especially in mixed recovery scenarios. However, the implementation details may vary depending on the specific requirements and constraints of the OpenClaw system.

Recommendation

Apply the suggested fixes to enhance the cron list command and improve the overall debugging experience for production automations. This will provide operators with a clearer understanding of job status and help them identify and resolve issues more efficiently.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #mixed precision #training loop #device allocation #API rate limit

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix Cron jobs show stale/partial failure state after model recovery; fallback errors hide successful reruns [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

3) Fallback error composition is confusing

Root Cause

Summary

Environment

What I observed

1) `cron list` vs `cron runs` mismatch during recovery

2) Other jobs still genuinely failing

3) Fallback error composition is confusing

Suspected root causes

Repro-ish path

Expected

Suggested fixes

CLI / state UX

Run history UX

Fallback diagnostics

Why this matters

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix Cron jobs show stale/partial failure state after model recovery; fallback errors hide successful reruns [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

3) Fallback error composition is confusing

Root Cause

Summary

Environment

What I observed

1) cron list vs cron runs mismatch during recovery

2) Other jobs still genuinely failing

3) Fallback error composition is confusing

Suspected root causes

Repro-ish path

Expected

Suggested fixes

CLI / state UX

Run history UX

Fallback diagnostics

Why this matters

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING

1) `cron list` vs `cron runs` mismatch during recovery