hermes - 💡(How to fix) Fix cron: tick lock held for full job duration causes scheduler starvation and missed runs on long-running jobs

StepCodex · 2026-05-17T15:57:08Z


Problem

When any cron job runs a long-running task (e.g. an Opus delegation lasting 2–4 min), `tick()` holds an exclusive `fcntl.LOCK_EX` lock for the **entire duration** of all jobs in the batch, not just for the scheduling decision. Every subsequent 60s ticker attempt then hits the lock, skips, and returns 0.

Combined with the grace-window logic in `compute_next_run` (half the period, clamped to between 2 min and 2 h), by the time the lock is released the missed-run window has been exceeded, and all overdue jobs fast-forward to `now + interval` instead of catching up. **Missed runs are silently dropped.**
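To make the skip behaviour concrete, here is a minimal sketch, not the hermes implementation. It assumes the lock is taken with `fcntl.flock` in non-blocking mode (the issue only names `fcntl.LOCK_EX`); `try_tick` and the lock path are hypothetical names. While one holder keeps the lock, every other attempt returns 0 without logging anything.

```python
import fcntl
import os
import tempfile


def try_tick(lock_path: str) -> int:
    """Hypothetical tick: returns 0 when the lock is held, else 1."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            # Assumption: exclusive, non-blocking flock on a shared lock file.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return 0  # lock busy: skip this tick silently
        try:
            return 1  # stand-in for "scheduled and ran one batch"
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


lock_path = os.path.join(tempfile.mkdtemp(), "tick.lock")

# Simulate a long-running batch that still holds the lock.
holder = os.open(lock_path, os.O_CREAT | os.O_RDWR)
fcntl.flock(holder, fcntl.LOCK_EX)
skipped = try_tick(lock_path)  # the 60s ticker fires while the batch runs

# The batch finishes and releases the lock.
fcntl.flock(holder, fcntl.LOCK_UN)
os.close(holder)
ran = try_tick(lock_path)
```

In this sketch the skipped tick is indistinguishable from a tick with nothing to do, which matches the silent failure described above.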

Observed impact

Production setup running 4 `pdp-v1` epic crons (15m + 5m + 5m + 15m intervals). During an active interactive session with several Opus delegations:

- A 5m-interval cron had a **68-minute gap** between runs (last: 16:22, next computed: 17:33)
- An autonomous coding epic made **zero progress for 2+ hours** while human oversight was live
- No error logged, no alert: a completely silent failure

Root cause (code refs)

1. **`cron/scheduler.py::tick()`**: `fcntl.LOCK_EX` is acquired at entry and held until `ThreadPoolExecutor.__exit__` (i.e. until all jobs complete). The lock is not needed during job execution, only during the `get_due_jobs()` + `advance_next_run()` critical section.
2. **`cron/jobs.py::compute_next_run()`**: the grace window for `interval`-kind jobs is half the period (min 2m, max 2h). When grace is exceeded, it falls through to `now + interval` with no catch-up.
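The grace rule described for `compute_next_run()` can be sketched as follows. This is an illustration built only from the description above (half the period, clamped to between 2 minutes and 2 hours, then `now + interval`). The function names and the exact comparisons are assumptions, not the code in `cron/jobs.py`.

```python
from datetime import datetime, timedelta

MIN_GRACE = timedelta(minutes=2)
MAX_GRACE = timedelta(hours=2)


def grace_window(interval: timedelta) -> timedelta:
    """Half the period, clamped to [2 min, 2 h]."""
    return max(MIN_GRACE, min(interval / 2, MAX_GRACE))


def next_run_sketch(next_run_at: datetime, interval: timedelta, now: datetime) -> datetime:
    """Hypothetical model of the fall-through described in the issue."""
    overdue = now - next_run_at
    if overdue <= grace_window(interval):
        # Not yet due, or late but still inside grace: keep the scheduled time.
        return next_run_at
    # Grace exceeded: the missed run is dropped and the job fast-forwards.
    return now + interval
```

Under these assumptions a 5m job has a 2.5-minute grace window, so a lock held for 3 minutes is already enough to push it past grace.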

Proposed fixes

Fix 1 — Release lock after dispatch, not after completion (~30 LOC)

```python
# Acquire lock
with tick_lock:
    due_jobs = get_due_jobs()
    for job in due_jobs:
        advance_next_run(job["id"])  # at-most-once preserved
# Lock released here; jobs run outside the critical section

with ThreadPoolExecutor(max_workers=_max_workers) as pool:
    futures = [pool.submit(run_job, job) for job in due_jobs]
    ...
```

`advance_next_run()` already sets `next_run_at` before any job starts, so at-most-once semantics are preserved without holding the lock during execution.
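A self-contained version of the same idea is sketched below so it can be run in isolation. Everything in it is a stand-in. The `tick_lock` context manager, the in-memory `jobs` dict, and the `tick` signature are hypothetical, and the use of `fcntl.flock` is an assumption about how the real lock is taken.

```python
import fcntl
import os
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def tick_lock(path):
    """Try to take an exclusive, non-blocking lock; yield True on success."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    acquired = False
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            pass  # another tick is inside the critical section
        yield acquired
    finally:
        if acquired:
            fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def tick(lock_path, jobs, now, run_job, max_workers=2):
    """jobs maps job id -> {"next_run_at": float, "interval": float} (seconds)."""
    with tick_lock(lock_path) as acquired:
        if not acquired:
            return 0
        # Critical section: pick due jobs and advance them before any job starts.
        due = [job_id for job_id, job in jobs.items() if job["next_run_at"] <= now]
        for job_id in due:
            jobs[job_id]["next_run_at"] = now + jobs[job_id]["interval"]
    # Lock is released here; long-running jobs no longer block later ticks.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(run_job, due))  # list() surfaces exceptions raised by jobs
    return len(due)
```

Because the lock is released before the pool starts, a tick that fires while `run_job` is still running can acquire the lock. It simply finds nothing due, since `next_run_at` was already advanced.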

Fix 2 — Better grace / catch-up for interval jobs (~10 LOC)

For `kind=interval`, advance to the smallest `last_run + N×interval` that is greater than `now`, rather than to `now + interval`. This preserves cadence without accumulating missed runs.

Alternatively, cap grace at `1×period` instead of `0.5×period` so that a 15m job tolerates a 15m delay.
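The cadence-preserving option can be written as a short calculation. The sketch below is one possible reading of "smallest `last_run + N×interval > now`"; the function name is hypothetical and it is not taken from `cron/jobs.py`.

```python
from datetime import datetime, timedelta


def next_on_cadence(last_run: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the smallest last_run + N*interval (N >= 1) strictly after now."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    elapsed = now - last_run
    # Floor division of two timedeltas yields the number of whole periods elapsed.
    n = max(1, elapsed // interval + 1)
    return last_run + n * interval


# Illustrative values: a 5m job last ran at 16:22 and the scheduler resumes at 17:30.
last_run = datetime(2026, 5, 17, 16, 22)
now = datetime(2026, 5, 17, 17, 30)
on_cadence = next_on_cadence(last_run, timedelta(minutes=5), now)  # 17:32
fast_forward = now + timedelta(minutes=5)                          # 17:35
```

With these illustrative values the job resumes at 17:32, on its original 5-minute grid, instead of drifting to 17:35. Exactly one run is scheduled no matter how many periods were missed.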

Fix 3 — Cap max_parallel_jobs default (~1 LOC)

`max_parallel_jobs:` is currently unbounded (empty). Default to `4` or `2` to prevent N concurrent heavy jobs from holding the lock indefinitely.
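One way to apply such a default is to resolve the setting in a single place before the pool is created. The helper below is a hypothetical sketch. The function name, the shape of the config mapping, and the choice of `4` over `2` are assumptions for illustration.

```python
from typing import Any, Mapping

DEFAULT_MAX_PARALLEL_JOBS = 4  # hypothetical default; the issue suggests 4 or 2


def resolve_max_parallel_jobs(cron_config: Mapping[str, Any]) -> int:
    """Return a bounded worker count, falling back to the default when unset."""
    raw = cron_config.get("max_parallel_jobs")
    if raw is None or raw == "":
        # An empty `max_parallel_jobs:` key typically loads as None from YAML.
        return DEFAULT_MAX_PARALLEL_JOBS
    value = int(raw)
    if value < 1:
        raise ValueError("max_parallel_jobs must be at least 1")
    return value
```

The result would then be passed as `max_workers` to `ThreadPoolExecutor`, so an empty setting can no longer mean an unbounded pool.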

Workaround

Set `cron.max_parallel_jobs: 2` in `config.yaml`. This limits the blast radius but does not fix the root cause (lock held during execution).

Notes

- A separate scheduler process does **not** fix this: the same lock semantics apply, and Discord delivery has no standalone-process adapter.
- The `hermes cron run` manual trigger resets `next_run_at = now + interval`, causing further schedule drift. Avoid using it to "unstick" a stalled scheduler.
- Confirmed on: Linux (6.8.0), gateway mode (Discord), 4 active interval crons, Anthropic/OpenRouter provider.
