hermes - 💡(How to fix) Fix [Feature]: Cron delivery retry mechanism for transient network failures [2 comments, 2 participants]

hermes, 2026-04-21 14:39:42

GitHub stats

NousResearch/hermes-agent#13566•Fetched 2026-04-22 08:05:49

Author

VTRiot

Participants

alt-glitch

VTRiot

Timeline (top)

commented ×2, labeled ×2

The cron scheduler (cron/scheduler.py) currently has no retry mechanism for delivery failures. When a scheduled job's result cannot be delivered (e.g. transient Telegram API timeout, mobile network instability), the error is recorded in last_delivery_error and the result is silently dropped until the next scheduled run.

For users running Hermes on laptops or other intermittently-connected devices, this can result in multiple missed notifications in a row without the user being aware until they manually check ~/.hermes/cron/output/.

This issue proposes adding a retry mechanism. I'd like upstream input on which of the three approaches below best fits the project direction before preparing a PR.

Current behavior

From cron/scheduler.py:938-977 and cron/scheduler.py:295-363 (approximate positions; line numbers shift with upstream churn):

1. advance_next_run()       — next run time is advanced first (crash-safe)
2. run_job()                — agent executes
3. save_job_output()        — always saved to ~/.hermes/cron/output/{job_id}/
4. [SILENT] check           — if [SILENT], delivery is skipped
5. _deliver_result()        — Telegram delivery
   - live adapter path      — future.result(timeout=60)
   - standalone fallback    — asyncio.run(), timeout=30
6. mark_job_run()           — last_delivery_error is recorded

Key observations:

Output is always persisted on disk regardless of delivery success, so the information itself is not lost.
last_delivery_error is set on failure and cleared on success, but nothing consumes it — the next tick proceeds with fresh jobs and the failed delivery is never retried.
There is no delivery queue, no backoff, no retry loop anywhere in cron/scheduler.py.
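The ordering and the observations above can be sketched as follows. This is a minimal illustration, not the scheduler's actual code: the `tick_once` name, the `job` dictionary, the stub callables, and the substring test for `[SILENT]` are all assumptions introduced here. Only the step order and the `last_delivery_error` field come from the description above.

```python
from typing import Callable, Optional


def tick_once(job: dict, run: Callable[[], str],
              deliver: Callable[[str], None]) -> Optional[str]:
    """Run one job in the order listed above; return the delivery error, if any."""
    job["next_run"] += job["interval"]      # 1. advance first (crash-safe)
    output = run()                          # 2. agent executes
    job["saved_output"] = output            # 3. output is always persisted
    error: Optional[str] = None
    if "[SILENT]" not in output:            # 4. assumed form of the [SILENT] check
        try:
            deliver(output)                 # 5. delivery attempt
        except Exception as exc:
            error = str(exc)
    job["last_delivery_error"] = error      # 6. recorded, but nothing reads it later
    return error


def flaky_deliver(_text: str) -> None:
    # Stand-in for a delivery call that hits a transient network failure.
    raise TimeoutError("timed out")


job = {"next_run": 0, "interval": 3600, "last_delivery_error": None}
err = tick_once(job, run=lambda: "new release found", deliver=flaky_deliver)
```

After this call the output is still held in `job["saved_output"]` and the error is recorded, but no later step picks the failed delivery up again, which is the gap this issue describes.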

Reproduction scenario

1. Configure a cron job with Telegram delivery on a mobile laptop.
2. Put the laptop on an unstable network (train commute, moving between Wi-Fi/cellular, etc.).
3. A scheduled run completes; _deliver_result()'s live-adapter future.result(timeout=60) times out; the standalone fallback also times out.
4. last_delivery_error = "timed out" is set; the output remains in ~/.hermes/cron/output/{job_id}/ but is never delivered.
5. Network recovers 2 minutes later. The user sees nothing until the next scheduled run (N minutes/hours later), and the earlier output is never surfaced.

The user cannot tell whether "no notification" means "job found nothing" or "job found something but delivery failed".

Proposed approaches

I see three approaches, presented in order of my preference. I'm aware option C is the most invasive; I'm raising this issue to learn which direction fits the project best before committing to an implementation.

Option C (preferred): in-scheduler retry queue

Add a persistent retry queue to the scheduler itself, covering all cron jobs regardless of the skill that produced them.

Sketch:

Persist a queue at ~/.hermes/cron/delivery_queue.json with entries {job_id, output_path, retry_count, next_retry_at, last_error}.
On delivery failure in _deliver_result(), enqueue an entry instead of just recording last_delivery_error.
On each _tick(), check the queue first and retry entries whose next_retry_at has passed.
Exponential backoff (e.g. 2m → 5m → 15m → 30m, max 4 attempts), then give up and record the final error in last_delivery_error.
Give each entry a unique ID to prevent duplicate sends if a retry succeeds concurrently with a new tick.
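One possible shape for a queue entry and the backoff bookkeeping is sketched below. The field names follow the entry layout listed above; `entry_id`, `QueueEntry`, `new_entry`, and `record_failure` are hypothetical names used only for illustration, and the delays are the example schedule from the sketch (2m, 5m, 15m, 30m), not a decided policy.

```python
import uuid
from dataclasses import dataclass, field

# Delays before retries 1..4, taken from the example schedule (2m, 5m, 15m, 30m).
BACKOFF_SECONDS = (120, 300, 900, 1800)


@dataclass
class QueueEntry:
    job_id: str
    output_path: str
    last_error: str
    next_retry_at: float
    retry_count: int = 0
    # Hypothetical field: a unique ID so a send can be de-duplicated.
    entry_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def new_entry(job_id: str, output_path: str, error: str, now: float) -> QueueEntry:
    """Create an entry after the initial delivery failed; first retry is 2 minutes out."""
    return QueueEntry(job_id, output_path, error, next_retry_at=now + BACKOFF_SECONDS[0])


def record_failure(entry: QueueEntry, error: str, now: float) -> bool:
    """Update the entry after a failed retry. Return False once retries are exhausted."""
    entry.retry_count += 1
    entry.last_error = error
    if entry.retry_count >= len(BACKOFF_SECONDS):
        return False  # caller gives up and records last_delivery_error
    entry.next_retry_at = now + BACKOFF_SECONDS[entry.retry_count]
    return True


entry = new_entry("job-1", "/tmp/output.md", "timed out", now=0.0)
```

With this bookkeeping an entry is retried at most four times, and the fourth failed retry returns `False` so the caller can record the final error.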

Trade-offs:

✅ Single root fix for all cron jobs and all skills.
✅ Deterministic, LLM-independent behavior.
✅ Consistent UX: "if the job fires, you eventually hear about it."
⚠️ Introduces a new persisted state file that needs schema design, versioning, and compatibility with concurrent ticks (relevant after #13021 parallelization).
⚠️ Adds retry-policy parameters that require upstream decisions (backoff curve, max attempts, how to surface "gave up" to the user).
⚠️ Testing cost is higher (state machine).

Option A (fallback): per-skill "check last delivery status" step

Extend skills/monitoring/github-release-watch/scripts/check_releases.py (or any skill that cares about delivery reliability) with a --check-delivery-status subcommand that reads ~/.hermes/cron/jobs.json, detects a non-null last_delivery_error for its own job_id, and prepends the previous output to the current run's report.

The SKILL.md adds a Step 0 that calls this subcommand before running the main procedure.
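A rough sketch of what the status check might look like is given below. The layout of jobs.json is not described in this issue, so the structure assumed here (a top-level `"jobs"` list of objects with `"id"` and `"last_delivery_error"` keys) is a guess and would need to be checked against the real file. The function names are hypothetical as well.

```python
import json
from pathlib import Path
from typing import Optional


def pending_delivery_error(jobs_file: Path, job_id: str) -> Optional[str]:
    """Return the job's last_delivery_error, or None if no error is recorded."""
    try:
        data = json.loads(jobs_file.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    # ASSUMPTION: {"jobs": [{"id": ..., "last_delivery_error": ...}, ...]}
    if not isinstance(data, dict):
        return None
    for job in data.get("jobs", []):
        if job.get("id") == job_id:
            return job.get("last_delivery_error")
    return None


def build_report(current: str, previous: str, error: Optional[str]) -> str:
    """Prepend the previously undelivered output when an error was recorded."""
    if error is None:
        return current
    return f"[Previous run was not delivered: {error}]\n{previous}\n\n{current}"
```

A skill's Step 0 could call `pending_delivery_error()` with its own job ID and pass the result to `build_report()`. The reliability caveat in the trade-offs below still applies, because the step runs only if the model actually executes it.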

Trade-offs:

✅ No Hermes core changes. Scoped to one skill.
✅ Safe to ship as a skill-level improvement first; can be promoted to core later if the pattern proves useful.
⚠️ Relies on the LLM reliably executing the added Step 0 every time. If the model skips it (fatigue, hallucination, modified prompt), the failure is silently missed again.
⚠️ Only helps skills that have been updated to use this pattern. Other cron jobs (agent-driven, other skills) remain exposed.

Option B (not recommended): external recovery daemon

A separate delivery_recovery.py launched via launchd that polls jobs.json for non-null last_delivery_error, reads the most recent output file, and sends it directly through python-telegram-bot — bypassing Hermes entirely.

Trade-offs:

⚠️ Requires duplicating the Telegram bot token outside Hermes config.
⚠️ Independent state file (delivery_recovery_state.json) that must stay consistent with Hermes's view.
⚠️ Feels foreign to Hermes's in-tree philosophy.

I mention this for completeness but do not propose it as an upstream contribution.

Design-alignment concern: interaction with #13021

PR #13021 recently parallelized tick() via ThreadPoolExecutor. Option C needs to be designed with this in mind:

The retry queue's read-modify-write cycle must be protected by the same threading.Lock pattern used for jobs.json in #13021 (see advance_next_run, mark_job_run).
ContextVars-based session/delivery state (gateway/session_context.py, introduced in #13021) must be re-established when a retry is dispatched from a later tick — the original job's delivery target must be preserved with the queue entry, not re-resolved at retry time.
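Both constraints can be illustrated with a small sketch. It assumes a module-level lock and a JSON list on disk; `_QUEUE_LOCK`, `update_queue`, `enqueue`, and the `delivery_target` key are hypothetical names, not identifiers from #13021. Note that `threading.Lock` only serializes threads within one process.

```python
import json
import os
import tempfile
import threading
from typing import Callable, List

_QUEUE_LOCK = threading.Lock()  # hypothetical; mirrors the jobs.json lock pattern


def update_queue(path: str, mutate: Callable[[List[dict]], None]) -> None:
    """Apply mutate() to the queue while holding the lock for the whole cycle."""
    with _QUEUE_LOCK:
        try:
            with open(path, "r", encoding="utf-8") as f:
                queue = json.load(f)
        except FileNotFoundError:
            queue = []
        mutate(queue)
        # Write to a temp file in the same directory, then rename atomically.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(queue, f)
        os.replace(tmp, path)


def enqueue(path: str, job_id: str, output_path: str, target: dict) -> None:
    # The delivery target is stored with the entry, so a retry dispatched
    # from a later tick does not need to re-resolve it.
    entry = {"job_id": job_id, "output_path": output_path, "delivery_target": target}
    update_queue(path, lambda q: q.append(entry))
```

Because the load, the mutation, and the write all happen under one lock acquisition, concurrent workers in the thread pool cannot overwrite each other's entries.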

Happy to prototype either Option A or Option C depending on upstream direction.

Related

PR #13495 (this author): fix(cron): cancel orphan coroutine on delivery timeout before standalone fallback — addresses duplicate delivery on the cron path.
PR #13542 (this author): fix(gateway): prevent duplicate final send when only cosmetic edit failed — addresses duplicate delivery on the gateway/Telegram path.

These two PRs close the "same message twice" failure mode. This issue addresses the inverse failure mode: "message never arrives".

Question for maintainers

Is Option C (in-scheduler retry queue) a direction the project would accept, given the added complexity and the recent churn in the cron subsystem from #13021? If not, would Option A (per-skill delivery-status step) be a welcome contribution as a narrower, lower-risk stepping stone?

Happy to prepare the corresponding PR once the direction is clear.

extent analysis

TL;DR

Implementing an in-scheduler retry queue, as described in Option C, is likely the most effective solution to ensure reliable delivery of cron job results.

Guidance

Evaluate the trade-offs: Consider the benefits of a single, deterministic fix for all cron jobs against the added complexity and potential testing costs of Option C.
Assess compatibility with #13021: Ensure that the retry queue's implementation is compatible with the parallelized tick() functionality introduced in #13021, using appropriate synchronization mechanisms like threading.Lock.
Design a robust retry policy: Define a suitable backoff curve, maximum attempts, and error handling strategy to balance reliability with resource usage and user experience.
Consider incremental implementation: If Option C is deemed too invasive, Option A (per-skill delivery-status step) could serve as a lower-risk, narrower stepping stone towards improving delivery reliability.

Example

A basic example of how the retry queue might be implemented in cron/scheduler.py:

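import json
import os
import threading
import time

# expanduser() is required: open() does not expand '~' on its own.
RETRY_QUEUE_FILE = os.path.expanduser(
    os.path.join('~', '.hermes', 'cron', 'delivery_queue.json'))
RETRY_LOCK = threading.Lock()
BACKOFF_SECONDS = [120, 300, 900, 1800]  # 2m, 5m, 15m, 30m, as in Option C
MAX_ATTEMPTS = len(BACKOFF_SECONDS)

def _load_retry_queue():
    # Caller must hold RETRY_LOCK.
    try:
        with open(RETRY_QUEUE_FILE, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        return []

def _save_retry_queue(queue):
    # Caller must hold RETRY_LOCK. Write a temp file, then rename atomically.
    os.makedirs(os.path.dirname(RETRY_QUEUE_FILE), exist_ok=True)
    tmp_path = RETRY_QUEUE_FILE + '.tmp'
    with open(tmp_path, 'w') as f:
        json.dump(queue, f)
    os.replace(tmp_path, RETRY_QUEUE_FILE)

def enqueue_retry(job_id, output_path, last_error):
    # Hold the lock across the whole read-modify-write cycle.
    with RETRY_LOCK:
        queue = _load_retry_queue()
        queue.append({
            'job_id': job_id,
            'output_path': output_path,
            'retry_count': 0,
            'next_retry_at': time.time() + BACKOFF_SECONDS[0],
            'last_error': last_error,
        })
        _save_retry_queue(queue)

def process_retry_queue(deliver_result, mark_job_run):
    # deliver_result and mark_job_run stand in for the scheduler's own
    # functions; the signatures used here are assumptions.
    # Delivery runs under the lock for simplicity; a real implementation
    # would likely claim due entries, release the lock, then deliver.
    with RETRY_LOCK:
        queue = _load_retry_queue()
        now = time.time()
        remaining = []  # build a new list instead of removing while iterating
        for entry in queue:
            if entry['next_retry_at'] > now:
                remaining.append(entry)  # not due yet
                continue
            try:
                # Attempt to deliver the result again; on success the entry is dropped
                deliver_result(entry['job_id'], entry['output_path'])
            except Exception as e:
                entry['last_error'] = str(e)
                entry['retry_count'] += 1
                if entry['retry_count'] >= MAX_ATTEMPTS:
                    # Give up and record the final error
                    mark_job_run(entry['job_id'], last_delivery_error=entry['last_error'])
                else:
                    # Schedule the next retry using the backoff table
                    entry['next_retry_at'] = now + BACKOFF_SECONDS[entry['retry_count']]
                    remaining.append(entry)
        _save_retry_queue(remaining)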
