hermes - 💡(How to fix) Fix [Feature]: Cron delivery retry mechanism for transient network failures [2 comments, 2 participants]

NousResearch/hermes-agent#13566 · fetched 2026-04-22 08:05:49 · 2 comments, 2 participants, 4 timeline events (commented ×2, labeled ×2), 0 reactions

Summary

The cron scheduler (cron/scheduler.py) currently has no retry mechanism for delivery failures. When a scheduled job's result cannot be delivered (e.g. transient Telegram API timeout, mobile network instability), the error is recorded in last_delivery_error and the result is silently dropped until the next scheduled run.

For users running Hermes on laptops or other intermittently-connected devices, this can result in multiple missed notifications in a row without the user being aware until they manually check ~/.hermes/cron/output/.

This issue proposes adding a retry mechanism. I'd like upstream input on which of the three approaches below best fits the project direction before preparing a PR.

Current behavior

From cron/scheduler.py:938-977 and cron/scheduler.py:295-363 (approximate positions; line numbers shift with upstream churn):

1. advance_next_run()       — next run time is advanced first (crash-safe)
2. run_job()                — agent executes
3. save_job_output()        — always saved to ~/.hermes/cron/output/{job_id}/
4. [SILENT] check           — if [SILENT], delivery is skipped
5. _deliver_result()        — Telegram delivery
   - live adapter path      — future.result(timeout=60)
   - standalone fallback    — asyncio.run(), timeout=30
6. mark_job_run()           — last_delivery_error is recorded

Key observations:

  • Output is always persisted on disk regardless of delivery success, so the information itself is not lost.
  • last_delivery_error is set on failure and cleared on success, but nothing consumes it — the next tick proceeds with fresh jobs and the failed delivery is never retried.
  • There is no delivery queue, no backoff, no retry loop anywhere in cron/scheduler.py.
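The gap described in these observations can be shown with a toy model of one scheduled run. This is a sketch only: `tick_once`, the `state` dictionary, and the rule that an output equal to "[SILENT]" counts as silent are all assumptions made for illustration and are not taken from cron/scheduler.py.

```python
def tick_once(run_job, deliver, state):
    """Toy model of one scheduled run. Every name is a stand-in, not Hermes code."""
    state["runs_advanced"] += 1                   # 1. schedule is advanced first
    output = run_job()                            # 2. agent executes
    state["saved"].append(output)                 # 3. output is always persisted
    if output.strip() == "[SILENT]":              # 4. assumed form of the silent check
        return
    try:
        deliver(output)                           # 5. a single delivery attempt
        state["last_delivery_error"] = None       # 6. cleared on success
    except TimeoutError as exc:
        state["last_delivery_error"] = str(exc)   # 6. recorded, but nothing reads it


def failing_delivery(output):
    raise TimeoutError("timed out")


state = {"runs_advanced": 0, "saved": [], "last_delivery_error": None}
delivered = []

tick_once(lambda: "release v1.2 found", failing_delivery, state)
tick_once(lambda: "nothing new", delivered.append, state)

# The first output is on disk (state["saved"]) but was never delivered,
# and the second run's success has already cleared the error.
print(delivered, state["last_delivery_error"])
```

After the second run, the only remaining evidence of the failed delivery is the saved output file.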

Reproduction scenario

  1. Configure a cron job with Telegram delivery on a mobile laptop.
  2. Put the laptop on an unstable network (train commute, moving between Wi-Fi/cellular, etc.).
  3. A scheduled run completes; _deliver_result()'s live-adapter future.result(timeout=60) times out; the standalone fallback also times out.
  4. last_delivery_error = "timed out" is set; the output remains in ~/.hermes/cron/output/{job_id}/ but is never delivered.
  5. Network recovers 2 minutes later. The user sees nothing until the next scheduled run (N minutes/hours later), and the earlier output is never surfaced.

The user cannot tell whether "no notification" means "job found nothing" or "job found something but delivery failed".

Proposed approaches

I see three approaches, presented in order of my preference. I'm aware option C is the most invasive; I'm raising this issue to learn which direction fits the project best before committing to an implementation.

Option C (preferred): in-scheduler retry queue

Add a persistent retry queue to the scheduler itself, covering all cron jobs regardless of the skill that produced them.

Sketch:

  • Persist a queue at ~/.hermes/cron/delivery_queue.json with entries {job_id, output_path, retry_count, next_retry_at, last_error}.
  • On delivery failure in _deliver_result(), enqueue an entry instead of just recording last_delivery_error.
  • On each _tick(), check the queue first and retry entries whose next_retry_at has passed.
  • Exponential backoff (e.g. 2m → 5m → 15m → 30m, max 4 attempts), then give up and record the final error in last_delivery_error.
  • Unique entry IDs to prevent duplicate sends if a retry succeeds concurrently with a new tick.
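A minimal sketch of the queue entry and backoff schedule from the list above. The field names follow the proposed entry layout; `entry_id`, `QueueEntry`, `new_entry`, `record_failure`, and `is_due` are hypothetical names introduced here, and the sketch assumes "max 4 attempts" means four retries after the initial failed delivery.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

# Delay before each retry: 2m, 5m, 15m, 30m (one slot per allowed retry).
BACKOFF_SECONDS = (120, 300, 900, 1800)


@dataclass
class QueueEntry:
    job_id: str
    output_path: str
    last_error: str
    retry_count: int = 0          # number of retries that have failed so far
    next_retry_at: float = 0.0    # Unix timestamp of the next attempt
    # Hypothetical unique ID so a retry and a concurrent tick can be deduplicated.
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def new_entry(job_id: str, output_path: str, error: str,
              now: Optional[float] = None) -> QueueEntry:
    """Create an entry after the initial delivery failure."""
    now = time.time() if now is None else now
    return QueueEntry(job_id=job_id, output_path=output_path, last_error=error,
                      next_retry_at=now + BACKOFF_SECONDS[0])


def is_due(entry: QueueEntry, now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    return entry.next_retry_at <= now


def record_failure(entry: QueueEntry, error: str,
                   now: Optional[float] = None) -> bool:
    """Update the entry after a failed retry.

    Returns True if another retry was scheduled, False if retries are
    exhausted and the caller should record the final error.
    """
    now = time.time() if now is None else now
    entry.retry_count += 1
    entry.last_error = error
    if entry.retry_count >= len(BACKOFF_SECONDS):
        return False
    entry.next_retry_at = now + BACKOFF_SECONDS[entry.retry_count]
    return True
```

Keeping the schedule in a single tuple means the backoff curve and the attempt limit, both of which need an upstream decision, can be changed in one place.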

Trade-offs:

  • ✅ Single root fix for all cron jobs and all skills.
  • ✅ Deterministic, LLM-independent behavior.
  • ✅ Consistent UX: "if the job fires, you eventually hear about it."
  • ⚠️ Introduces a new persisted state file that needs schema design, versioning, and compatibility with concurrent ticks (relevant after #13021 parallelization).
  • ⚠️ Adds retry-policy parameters that require upstream decisions (backoff curve, max attempts, how to surface "gave up" to the user).
  • ⚠️ Testing cost is higher (state machine).

Option A (fallback): per-skill "check last delivery status" step

Extend skills/monitoring/github-release-watch/scripts/check_releases.py (or any skill that cares about delivery reliability) with a --check-delivery-status subcommand that reads ~/.hermes/cron/jobs.json, detects a non-null last_delivery_error for its own job_id, and prepends the previous output to the current run's report.

The SKILL.md adds a Step 0 that calls this subcommand before running the main procedure.
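A rough sketch of what the status check could look like. The layout of jobs.json is not described in this issue, so the code assumes either a top-level list of job objects or an object with a "jobs" list, each job carrying "id" and "last_delivery_error" keys. It also assumes the newest file in the job's output directory is the undelivered one. All function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def previous_delivery_error(jobs_file: Path, job_id: str) -> Optional[str]:
    """Return the recorded delivery error for job_id, or None if there is none."""
    try:
        data = json.loads(jobs_file.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    # Assumed layout: {"jobs": [...]} or a bare list of job objects.
    jobs = data.get("jobs", []) if isinstance(data, dict) else data
    for job in jobs:
        if job.get("id") == job_id:
            return job.get("last_delivery_error") or None
    return None


def latest_output(output_dir: Path) -> Optional[Path]:
    """Return the most recently modified file in output_dir, if any."""
    if not output_dir.is_dir():
        return None
    files = [p for p in output_dir.iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def undelivered_report(jobs_file: Path, output_root: Path,
                       job_id: str) -> Optional[str]:
    """Build text to prepend to the current report, or None if nothing was missed."""
    error = previous_delivery_error(jobs_file, job_id)
    if error is None:
        return None
    path = latest_output(output_root / job_id)
    if path is None:
        return None
    body = path.read_text(encoding="utf-8")
    return f"[Previous run was not delivered: {error}]\n\n{body}"
```

A skill would call `undelivered_report(Path.home() / ".hermes/cron/jobs.json", Path.home() / ".hermes/cron/output", job_id)` in its Step 0 and prepend any returned text to the new report.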

Trade-offs:

  • ✅ No Hermes core changes. Scoped to one skill.
  • ✅ Safe to ship as a skill-level improvement first; can be promoted to core later if the pattern proves useful.
  • ⚠️ Relies on the LLM reliably executing the added Step 0 every time. If the model skips it (fatigue, hallucination, modified prompt), the failure is silently missed again.
  • ⚠️ Only helps skills that have been updated to use this pattern. Other cron jobs (agent-driven, other skills) remain exposed.

Option B (not recommended): external recovery daemon

A separate delivery_recovery.py launched via launchd that polls jobs.json for non-null last_delivery_error, reads the most recent output file, and sends it directly through python-telegram-bot — bypassing Hermes entirely.

Trade-offs:

  • ⚠️ Requires duplicating the Telegram bot token outside Hermes config.
  • ⚠️ Independent state file (delivery_recovery_state.json) that must stay consistent with Hermes's view.
  • ⚠️ Feels foreign to Hermes's in-tree philosophy.

I mention this for completeness but do not propose it as an upstream contribution.

Design-alignment concern: interaction with #13021

PR #13021 recently parallelized tick() via ThreadPoolExecutor. Option C needs to be designed with this in mind:

  • The retry queue's read-modify-write cycle must be protected by the same threading.Lock pattern used for jobs.json in #13021 (see advance_next_run, mark_job_run).
  • ContextVars-based session/delivery state (gateway/session_context.py, introduced in #13021) must be re-established when a retry is dispatched from a later tick — the original job's delivery target must be preserved with the queue entry, not re-resolved at retry time.
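Both constraints can be sketched together, under assumptions. How #13021 actually implements its lock is not shown in this issue, so the module-level `threading.Lock` below only stands in for that pattern. `update_queue`, `enqueue`, and the `delivery_target` field are hypothetical names.

```python
import json
import os
import tempfile
import threading
from pathlib import Path

_QUEUE_LOCK = threading.Lock()  # stand-in for the jobs.json lock pattern


def update_queue(path: Path, mutate):
    """Run one read-modify-write cycle under the lock and write atomically.

    mutate takes the current list of entries and returns the new list.
    """
    with _QUEUE_LOCK:
        try:
            entries = json.loads(path.read_text(encoding="utf-8"))
        except FileNotFoundError:
            entries = []
        entries = mutate(entries)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then swap it in, so a
        # crash mid-write cannot leave a truncated queue file behind.
        fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(entries, f)
        os.replace(tmp, path)
        return entries


def enqueue(path: Path, job_id: str, output_path: str,
            delivery_target: dict, error: str):
    """Append an entry that carries its own delivery target.

    Storing the target (assumed here to be a small JSON-serializable dict)
    lets a later tick re-establish delivery context from the entry itself
    instead of re-resolving it at retry time.
    """
    entry = {
        "job_id": job_id,
        "output_path": output_path,
        "delivery_target": delivery_target,
        "retry_count": 0,
        "last_error": error,
    }
    return update_queue(path, lambda entries: entries + [entry])
```

Because the whole cycle runs inside one lock acquisition, two worker threads that fail at the same moment cannot overwrite each other's entries.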

Happy to prototype either Option A or Option C depending on upstream direction.

Related

  • PR #13495 (this author): fix(cron): cancel orphan coroutine on delivery timeout before standalone fallback — addresses duplicate delivery on the cron path.
  • PR #13542 (this author): fix(gateway): prevent duplicate final send when only cosmetic edit failed — addresses duplicate delivery on the gateway/Telegram path.

These two PRs close the "same message twice" failure mode. This issue addresses the inverse failure mode: "message never arrives".

Question for maintainers

Is Option C (in-scheduler retry queue) a direction the project would accept, given the added complexity and the recent churn in the cron subsystem from #13021? If not, would Option A (per-skill delivery-status step) be a welcome contribution as a narrower, lower-risk stepping stone?

Happy to prepare the corresponding PR once the direction is clear.

extent analysis

TL;DR

Implementing an in-scheduler retry queue, as described in Option C, is likely the most effective solution to ensure reliable delivery of cron job results.

Guidance

  1. Evaluate the trade-offs: Consider the benefits of a single, deterministic fix for all cron jobs against the added complexity and potential testing costs of Option C.
  2. Assess compatibility with #13021: Ensure that the retry queue's implementation is compatible with the parallelized tick() functionality introduced in #13021, using appropriate synchronization mechanisms like threading.Lock.
  3. Design a robust retry policy: Define a suitable backoff curve, maximum attempts, and error handling strategy to balance reliability with resource usage and user experience.
  4. Consider incremental implementation: If Option C is deemed too invasive, Option A (per-skill delivery-status step) could serve as a lower-risk, narrower stepping stone towards improving delivery reliability.

Example

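A basic example of how the retry queue might be implemented in cron/scheduler.py. The delivery and bookkeeping functions are passed in as parameters because their real signatures are not shown in the issue; the calls below are assumptions.

import json
import os
import threading
import time

# open() does not expand "~", so resolve the home directory explicitly.
RETRY_QUEUE_FILE = os.path.expanduser('~/.hermes/cron/delivery_queue.json')
# Re-entrant, so a full read-modify-write cycle can hold the lock while
# calling the load/save helpers, which take it as well.
RETRY_LOCK = threading.RLock()
# Delay before each retry: 2m, 5m, 15m, 30m (max 4 attempts).
BACKOFF_SECONDS = [120, 300, 900, 1800]

def load_retry_queue():
    with RETRY_LOCK:
        try:
            with open(RETRY_QUEUE_FILE, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            return []

def save_retry_queue(queue):
    with RETRY_LOCK:
        os.makedirs(os.path.dirname(RETRY_QUEUE_FILE), exist_ok=True)
        with open(RETRY_QUEUE_FILE, 'w') as f:
            json.dump(queue, f)

def enqueue_retry(job_id, output_path, retry_count, next_retry_at, last_error):
    # Hold the lock across load and save so concurrent ticks cannot lose entries.
    with RETRY_LOCK:
        queue = load_retry_queue()
        queue.append({
            'job_id': job_id,
            'output_path': output_path,
            'retry_count': retry_count,
            'next_retry_at': next_retry_at,
            'last_error': last_error
        })
        save_retry_queue(queue)

def process_retry_queue(deliver_result, mark_job_run):
    # For simplicity the lock is held during delivery; a real implementation
    # would probably send outside the lock and merge the results afterwards.
    with RETRY_LOCK:
        current_time = time.time()
        remaining = []  # build a new list instead of removing while iterating
        for entry in load_retry_queue():
            if entry['next_retry_at'] > current_time:
                remaining.append(entry)  # not due yet
                continue
            try:
                # Attempt to deliver the result again; on success the entry is dropped
                deliver_result(entry['job_id'], entry['output_path'])
            except Exception as e:
                entry['last_error'] = str(e)
                entry['retry_count'] += 1
                if entry['retry_count'] >= len(BACKOFF_SECONDS):
                    # Retries exhausted: drop the entry and record the final error
                    mark_job_run(entry['job_id'], last_delivery_error=entry['last_error'])
                else:
                    # Schedule the next retry with the next backoff step
                    entry['next_retry_at'] = current_time + BACKOFF_SECONDS[entry['retry_count']]
                    remaining.append(entry)
        save_retry_queue(remaining)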
