hermes - Feature: Automatic retry for failed cron jobs

Problem

When a cron job fails (provider hiccup, rate limit, transient error), it records the error and waits for the next scheduled run. For daily jobs, that means waiting 24 hours because of a one-time blip.

Real-world example: A daily ParentVUE grade monitor failed because ZAI rate-limited it while another cron job was running at the same time. The job fell back to the DeepSeek fallback model and then hit a thinking-mode context compaction bug. It won't retry until the same time tomorrow: a full day of missed grade monitoring caused by a transient concurrency spike.

Current Behavior

  • Failed jobs record last_status = "error" and advance next_run_at normally
  • No retry_on_failure, max_retries, retry_delay, or backoff config exists
  • The job schema has no retry-related fields
  • Only manual retry via cronjob action=run or gateway restart (one-shot only)

Proposed Feature

Add configurable retry-on-failure for cron jobs:

Schema additions:

retry:
  enabled: true
  max_retries: 3        # Max retry attempts before giving up
  delay: 5m             # Initial delay before first retry
  backoff: exponential  # One of "exponential", "linear", or "fixed"
  max_delay: 30m        # Cap on retry delay
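
To make the proposed schema concrete, here is a minimal Python sketch of how such a block might be parsed and validated. The names `RetryConfig` and `parse_duration` are hypothetical and do not exist in Hermes today. The supported duration suffixes (`s`, `m`, `h`) and the fallback values for omitted keys (copied from the example above) are also assumptions; the proposal only specifies that retry is disabled by default.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical helper: the proposal only shows values like "5m" and "30m",
# so the set of supported suffixes here is an assumption.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a duration string such as "5m" into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical in-memory form of the proposed `retry:` block."""

    enabled: bool = False          # disabled by default, as proposed
    max_retries: int = 3
    delay_seconds: int = 300       # "5m"
    backoff: str = "exponential"
    max_delay_seconds: int = 1800  # "30m"

    @classmethod
    def from_dict(cls, raw: Optional[dict]) -> "RetryConfig":
        """Build a config from the parsed YAML mapping (or None if absent)."""
        raw = raw or {}
        backoff = raw.get("backoff", "exponential")
        if backoff not in ("exponential", "linear", "fixed"):
            raise ValueError(f"unknown backoff strategy: {backoff!r}")
        return cls(
            enabled=bool(raw.get("enabled", False)),
            max_retries=int(raw.get("max_retries", 3)),
            delay_seconds=parse_duration(str(raw.get("delay", "5m"))),
            backoff=backoff,
            max_delay_seconds=parse_duration(str(raw.get("max_delay", "30m"))),
        )
```

A job with no `retry:` block would yield `RetryConfig.from_dict(None)`, which has `enabled` set to `False` and so leaves existing jobs unchanged.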

Behavior:

  • On failure, schedule a retry after delay (with optional backoff)
  • If retry also fails, try again up to max_retries times
  • Each retry attempt is logged in the job record
  • If all retries exhausted, record final error and advance next_run_at as normal
  • Successful retry clears the error state
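
The behavior listed above can be sketched as a small state transition over a job record. This is an illustration under stated assumptions, not the Hermes implementation. The `record_outcome` function, the `retry_attempts` and `retry_log` fields, and the `"ok"` status value are all hypothetical; only `last_status = "error"` and `next_run_at` come from the current behavior described earlier. "Exponential" is assumed to mean doubling the delay on each attempt, and "linear" to mean adding one base delay per attempt.

```python
from datetime import datetime, timedelta
from typing import Optional


def retry_delay(attempt: int, base: int, backoff: str, max_delay: int) -> int:
    """Delay in seconds before retry number `attempt` (1-based), capped at max_delay."""
    if backoff == "exponential":
        delay = base * 2 ** (attempt - 1)  # assumed: doubles each attempt
    elif backoff == "linear":
        delay = base * attempt             # assumed: grows by `base` each attempt
    elif backoff == "fixed":
        delay = base
    else:
        raise ValueError(f"unknown backoff strategy: {backoff!r}")
    return min(delay, max_delay)


def record_outcome(job: dict, ok: bool, error: Optional[str], now: datetime,
                   regular_next_run: datetime, retry: dict) -> dict:
    """Return an updated copy of a (hypothetical) job record after one run.

    `retry` holds enabled, max_retries, delay, backoff, and max_delay,
    with both delays already converted to seconds.
    """
    job = dict(job)
    attempts = job.get("retry_attempts", 0)
    if ok:
        # A successful run or retry clears the error state.
        job.update(last_status="ok", last_error=None,
                   retry_attempts=0, next_run_at=regular_next_run)
        return job

    job["last_status"] = "error"
    job["last_error"] = error
    # Each failed attempt is logged in the job record.
    job["retry_log"] = job.get("retry_log", []) + [(now, error)]
    if retry.get("enabled") and attempts < retry["max_retries"]:
        delay = retry_delay(attempts + 1, retry["delay"],
                            retry["backoff"], retry["max_delay"])
        job["retry_attempts"] = attempts + 1
        job["next_run_at"] = now + timedelta(seconds=delay)
    else:
        # Retries exhausted or disabled: keep the final error and
        # advance next_run_at as normal.
        job["retry_attempts"] = 0
        job["next_run_at"] = regular_next_run
    return job
```

With the example configuration (`delay: 5m`, `backoff: exponential`, `max_delay: 30m`, `max_retries: 3`), the three retries would be scheduled 5, 10, and 20 minutes after each failure. The 30-minute cap would only take effect from a fourth attempt onward.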

Default: retry: enabled: false — backward compatible, no behavior change for existing jobs

Why This Can't Be a Plugin

The cron scheduler's tick cycle (tick() → _process_job() → run_job() → mark_job_run()) is core engine code. Retry scheduling requires modifying mark_job_run() and advance_next_run() to support re-queuing failed jobs with a computed delay. This is fundamental scheduler behavior, not something a plugin can intercept.

Alternative Approaches Considered

  1. Wrapper script with retry logic — works for no_agent=True script jobs but not for LLM-driven agent cron jobs
  2. External watchdog — defeats the purpose of built-in scheduling
  3. More granular scheduling — running a daily job every hour "in case it fails" wastes resources and produces duplicate output

Environment

  • Hermes Agent v0.14.x
  • Multiple daily cron jobs (homework, grade monitor, weather, journal, email)
  • Flat-rate provider (ZAI) with occasional concurrency rate limiting that triggers fallback to thinking-mode models
