hermes - Fix cron: tick lock held for full job duration causes scheduler starvation and missed runs on long-running jobs

Problem

When any cron job runs a long-running task (e.g. an Opus delegation lasting 2–4 min), tick() holds an exclusive fcntl.LOCK_EX lock for the entire duration of all jobs in the batch — not just the scheduling decision. This causes every subsequent 60s ticker attempt to hit the lock, skip, and return 0.

Combined with the grace-window logic in compute_next_run (half-period, capped 2min–2hr), when the lock finally releases, the missed-run window has been exceeded and all overdue jobs fast-forward to now + interval instead of catching up. Missed runs are silently dropped.
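The interaction between the grace window and a long lock hold can be modelled in a few lines. This is a minimal sketch built only from the description above (half-period, clamped to 2 min and 2 h); the function names `grace_window` and `next_run_after_delay` are hypothetical and are not taken from cron/jobs.py.

```python
from datetime import datetime, timedelta

MIN_GRACE = timedelta(minutes=2)
MAX_GRACE = timedelta(hours=2)


def grace_window(interval: timedelta) -> timedelta:
    """Half the period, clamped to [2 min, 2 h], as described in the issue."""
    return max(MIN_GRACE, min(interval / 2, MAX_GRACE))


def next_run_after_delay(scheduled: datetime, interval: timedelta, now: datetime) -> datetime:
    """Hypothetical model of the fall-through behaviour, not the real compute_next_run."""
    if now - scheduled <= grace_window(interval):
        return scheduled       # still inside the grace window: the overdue run fires
    return now + interval      # grace exceeded: the overdue run is dropped


# A 5m job has a 2m30s grace window, so a 4-minute lock hold is enough to drop a run.
scheduled = datetime(2024, 1, 1, 16, 27)
late = scheduled + timedelta(minutes=4)
print(grace_window(timedelta(minutes=5)))
print(next_run_after_delay(scheduled, timedelta(minutes=5), late))
```

Under this model, any batch that holds the lock longer than the grace window of the shortest-interval job causes that job to skip ahead rather than run late.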

Observed impact

Production setup running 4 pdp-v1 epic crons (15m + 5m + 5m + 15m intervals). During an active interactive session with several Opus delegations:

  • A 5m-interval cron had a 68-minute gap between runs (last: 16:22, next computed: 17:33)
  • An autonomous coding epic made zero progress for 2+ hours while human oversight was live
  • No error logged, no alert — completely silent failure

Root cause (code refs)

  1. cron/scheduler.py::tick(): fcntl.LOCK_EX is acquired at entry and held until ThreadPoolExecutor.__exit__ returns (i.e. until all jobs complete). The lock is not needed during job execution, only during the get_due_jobs() + advance_next_run() critical section.

  2. cron/jobs.py::compute_next_run(): the grace window for interval-kind jobs is half the period (min 2m, max 2h). When the grace window is exceeded, it falls through to now + interval with no catch-up.

Proposed fixes

Fix 1 — Release lock after dispatch, not after completion (~30 LOC)

# Acquire lock
with tick_lock:
    due_jobs = get_due_jobs()
    for job in due_jobs:
        advance_next_run(job["id"])  # at-most-once preserved
# Lock released here — jobs run outside the critical section

with ThreadPoolExecutor(max_workers=_max_workers) as pool:
    futures = [pool.submit(run_job, job) for job in due_jobs]
    ...

advance_next_run() already sets next_run_at before any job starts, so at-most-once semantics are preserved without holding the lock during execution.
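A self-contained sketch of the same pattern with a real fcntl lock is shown here for illustration. It assumes a Unix host; `LOCK_PATH`, `try_lock`, and the `tick` signature are hypothetical stand-ins, and copying the job list stands in for the real get_due_jobs() + advance_next_run() step.

```python
import fcntl
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical lock file; the real scheduler's lock path is not shown in the issue.
LOCK_PATH = os.path.join(tempfile.gettempdir(), f"tick_demo_{os.getpid()}.lock")


def try_lock():
    """Return an open, exclusively locked file object, or None if the lock is busy."""
    fh = open(LOCK_PATH, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh


def tick(due_jobs, run_job, max_workers=2):
    """Hold the lock only while deciding what to run, then execute outside it."""
    fh = try_lock()
    if fh is None:
        return 0  # another tick is inside the critical section
    try:
        batch = list(due_jobs)  # stand-in for get_due_jobs() + advance_next_run()
    finally:
        fcntl.flock(fh, fcntl.LOCK_UN)
        fh.close()
    # Jobs run here, after the lock has been released.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(run_job, batch))
    return len(batch)


if __name__ == "__main__":
    print(tick(["job-a", "job-b"], lambda job: time.sleep(0.1)))
```

With this shape, a tick that fires while a slow job is still running can take the lock and schedule newly due jobs instead of skipping.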

Fix 2 — Better grace / catch-up for interval jobs (~10 LOC)

For kind=interval, advance to the smallest last_run + N×interval > now rather than now + interval. This preserves cadence without accumulating missed runs.

Alternatively: cap grace at 1×period instead of 0.5×period so a 15m job tolerates a 15m delay.
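The first option in Fix 2 comes down to a small amount of date arithmetic. The sketch below is one possible formulation; `next_aligned_run` is a hypothetical helper name, and the 17:28 wake-up time in the example is an assumed value chosen for illustration.

```python
from datetime import datetime, timedelta


def next_aligned_run(last_run: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the smallest last_run + N*interval that is strictly after now (N >= 1)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    elapsed = now - last_run
    # Floor division of two timedeltas yields an int: the number of whole periods elapsed.
    n = max(1, elapsed // interval + 1)
    return last_run + n * interval


# 5m job, last run at 16:22, scheduler wakes up at an assumed 17:28.
last_run = datetime(2024, 1, 1, 16, 22)
now = datetime(2024, 1, 1, 17, 28)
print(next_aligned_run(last_run, timedelta(minutes=5), now))  # 2024-01-01 17:32:00
print(now + timedelta(minutes=5))                             # 2024-01-01 17:33:00
```

The aligned result stays on the original 5-minute grid (16:22, 16:27, ...), whereas now + interval shifts the grid by however late the scheduler woke up.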

Fix 3 — Cap max_parallel_jobs default (~1 LOC)

max_parallel_jobs is currently unbounded (empty). Default it to 2 or 4 so that N concurrent heavy jobs cannot hold the lock indefinitely.

Workaround

Set cron.max_parallel_jobs: 2 in config.yaml. This limits the blast radius but does not fix the root cause (lock held during execution).

Notes

  • A separate scheduler process does not fix this — same lock semantics apply, and Discord delivery has no standalone-process adapter.
  • The hermes cron run manual trigger resets next_run_at to now + interval, causing further schedule drift. Avoid using it to "unstick" a stalled scheduler.
  • Confirmed on: Linux (6.8.0), gateway mode (Discord), 4 active interval crons, Anthropic/OpenRouter provider.
