hermes - ✅(Solved) Fix Cron jobs.json updates can be lost under concurrent create/update/remove and scheduler writes [1 pull requests, 2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#18003Fetched 2026-05-01 05:54:29
View on GitHub
Comments
2
Participants
2
Timeline
7
Reactions
0
Timeline (top)
labeled ×3commented ×2cross-referenced ×1referenced ×1

Cron job metadata is stored in jobs.json using load-modify-save operations. Some scheduler write paths use locking, but create/update/remove paths can load a stale snapshot and later save it over another writer's changes. Atomic file replacement prevents torn writes, but it does not prevent lost updates.

Root Cause

Cron job metadata is stored in jobs.json using load-modify-save operations. Some scheduler write paths use locking, but create/update/remove paths can load a stale snapshot and later save it over another writer's changes. Atomic file replacement prevents torn writes, but it does not prevent lost updates.

Fix Action

Fixed

PR fix notes

PR #18087: fix(cron): hold _jobs_file_lock across all jobs.json load-modify-save paths (#18003)

Description (problem / solution / changelog)

What does this PR do?

create_job, update_job, remove_job, and get_due_jobs all perform load-modify-save against jobs.json without holding _jobs_file_lock for the full transaction. mark_job_run and advance_next_run already lock, so a scheduler write can race with a concurrent user/tool write — both load the same snapshot, mutate disjoint entries, and one writer's save_jobs silently overwrites the other's mutations. Atomic file replacement prevents torn writes; it does not prevent lost updates.

This PR wraps the four load-modify-save paths in with _jobs_file_lock:, matching the existing pattern used by mark_job_run (line ~692) and advance_next_run (line ~766).

This is intentionally a minimal-diff fix: the root cause is "lock does not cover load → save"; the smallest change that captures the diagnosis is to extend the lock over the full transaction at the four remaining call sites. Per-process serialization is sufficient to fix the documented impact (lost jobs, missed/duplicate runs, resurrected pause/remove). Cross-process serialization remains the scheduler tick's existing responsibility and is out of scope here.

Related Issue

Fixes #18003

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • cron/jobs.py: wrap load-modify-save in _jobs_file_lock at four sites:
    • create_job (line ~546): protect load_jobs() + append + save_jobs().
    • update_job (line ~573): protect the entire body.
    • remove_job (line ~671): protect the entire body.
    • get_due_jobs (line ~792): protect the load → mutate-raw_jobs → optional save_jobs(raw_jobs) window. Without this, the fast-forward / one-shot-recovery branch can resurrect a job that was removed (or undo an update) between the function's load and its save.
  • tests/cron/test_jobs.py: add TestJobsConcurrency with three regression tests:
    • test_concurrent_updates_do_not_lose_writes — racing update_job threads must not lose writes.
    • test_concurrent_create_does_not_lose_writes — racing create_job threads must not lose writes.
    • test_get_due_jobs_does_not_resurrect_removed_job — captures the reproduction shape where, while get_due_jobs holds a stale snapshot, a concurrent remove_job is silently undone.

No deadlock risk: _jobs_file_lock is non-reentrant, but no locked block in cron/jobs.py or cron/scheduler.py calls into the four newly-locked public functions.

How to Test

  1. With this PR applied:

    pytest tests/cron/test_jobs.py::TestJobsConcurrency -xvs -o "addopts="

    All 3 tests pass.

  2. Verify the tests catch each lock site individually. Revert any one with _jobs_file_lock: line in cron/jobs.py (and re-dedent its body) — the corresponding test fails with a clear lost-update message.

  3. Full cron suite still green (serial mode, the test suite has known xdist parallel-test flakes unrelated to this PR):

    pytest tests/cron/ -q -o "addopts="

    269 passed on macOS / Python 3.14.2; the 3 TestSilentDelivery failures reproduce on stock main and are not caused by this PR.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(cron): ...)
  • I searched for existing PRs (#5199 is a 266-line refactor of the same area, open since 2026-04-05; this PR takes the minimal-diff approach instead)
  • My PR contains only changes related to this fix
  • I've run pytest tests/cron/ -q and all non-flaky tests pass
  • I've added tests for my changes (TestJobsConcurrency, three cases)
  • I've tested on my platform: macOS 15.6 (Darwin 24.6.0), Python 3.14.2

Documentation & Housekeeping

  • I've updated relevant documentation — N/A (internal locking refinement, no API change)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md — N/A
  • I've considered cross-platform impact — threading.Lock and the existing atomic_replace flow are platform-neutral
  • I've updated tool descriptions/schemas — N/A

Changed files

  • cron/jobs.py (modified, +120/-116)
  • tests/cron/test_jobs.py (modified, +138/-0)
RAW_BUFFERClick to expand / collapse

Summary

Cron job metadata is stored in jobs.json using load-modify-save operations. Some scheduler write paths use locking, but create/update/remove paths can load a stale snapshot and later save it over another writer's changes. Atomic file replacement prevents torn writes, but it does not prevent lost updates.

Impact

Concurrent scheduler activity and user/tool operations can silently discard durable cron state changes, including newly created jobs, removed jobs, pause/resume or schedule edits, and scheduler-maintained fields such as next_run_at / last_run_at. Depending on the overwritten state, this can cause missed runs, duplicate runs, or re-enabled jobs that a user thought were removed or paused.

Evidence

In cron/jobs.py, these paths perform load-modify-save without holding the same lock across the full transaction:

  • create_job(): load_jobs(), append, save_jobs()
  • update_job(): load_jobs(), mutate selected job, save_jobs()
  • remove_job(): load_jobs(), filter, save_jobs()

By contrast, scheduler-maintained run-state updates such as mark_job_run() use _jobs_file_lock. The scheduler tick also has its own cross-process lock, but that does not protect all user/tool modifications to jobs.json.

Expected behavior

All jobs.json read-modify-write paths should use one consistent locking strategy that covers the entire transaction across scheduler, gateway, CLI, and tool entry points.

Suggested direction

Use a single cross-process lock for every jobs.json load-modify-save operation, or move cron state updates to a transactional store. Ensure create/update/remove and scheduler state writes cannot overwrite each other from stale snapshots.

extent analysis

TL;DR

Implement a single cross-process lock for all jobs.json load-modify-save operations to prevent lost updates and ensure data consistency.

Guidance

  • Identify all jobs.json load-modify-save paths in the codebase, including create_job(), update_job(), and remove_job(), to ensure they use the same locking strategy.
  • Consider using a transactional store for cron state updates as an alternative to a cross-process lock, to provide stronger consistency guarantees.
  • Review the existing _jobs_file_lock used in mark_job_run() and scheduler tick to determine if it can be adapted for use in all jobs.json modification paths.
  • Verify that the chosen locking strategy prevents concurrent modifications from overwriting each other, ensuring that all changes are properly persisted.

Example

import threading

# Create a cross-process lock
jobs_lock = threading.Lock()

def create_job():
    with jobs_lock:
        jobs = load_jobs()
        # modify jobs
        save_jobs(jobs)

def update_job():
    with jobs_lock:
        jobs = load_jobs()
        # modify jobs
        save_jobs(jobs)

def remove_job():
    with jobs_lock:
        jobs = load_jobs()
        # modify jobs
        save_jobs(jobs)

Notes

The suggested solution assumes that a cross-process lock can be effectively implemented to protect jobs.json modifications. However, the choice of locking strategy may depend on the specific requirements and constraints of the system, such as performance and scalability considerations.

Recommendation

Apply a cross-process lock to all jobs.json load-modify-save operations, as it provides a straightforward solution to prevent lost updates and ensure data consistency.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

All jobs.json read-modify-write paths should use one consistent locking strategy that covers the entire transaction across scheduler, gateway, CLI, and tool entry points.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix Cron jobs.json updates can be lost under concurrent create/update/remove and scheduler writes [1 pull requests, 2 comments, 2 participants]