hermes - Fix cron scheduler: profile-job context bleeds into concurrent non-profile job (script not found)

Symptom

A non-profile cron job (profile: null) intermittently fails with

Script not found: /home/<user>/.hermes/profiles/<other-profile>/scripts/<script>.py

even though the script is at ~/.hermes/scripts/<script>.py and the job has no profile configured. Subsequent ticks of the same job often succeed.

In my case (live since at least 2026-06-04):

  • Job b380708f7fa9 (Daily X Post Draft), profile: null, script: daily_x_post.py, no_agent: true
  • Was firing successfully for ~9 days, then suddenly failed once with the wrong-profile script path
  • The tick that failed fired ~95 seconds after a different job in the same tick had profile: corpusiq-agent

Root cause

cron/scheduler.py::_job_profile_context mutates a module-global:

global _hermes_home
prior_override = _hermes_home
...
_hermes_home = profile_home   # ← module-global mutation

_get_hermes_home() reads that module-global first:

def _get_hermes_home() -> Path:
    return _hermes_home or get_hermes_home()

Meanwhile, tick() partitions due jobs into a sequential pool (jobs with profile or workdir) and a parallel pool (jobs without). The two pools execute concurrently in separate ThreadPoolExecutors. Each future runs in its own contextvars.copy_context(), which isolates the contextvar-based set_hermes_home_override but does NOT isolate the module-global _hermes_home: a module global is ordinary process-wide state, visible to every thread.

Race window:

t=0   sequential pool: profile job starts → _hermes_home = corpusiq-agent
t=0+ε parallel pool:   no-profile job runs → _get_hermes_home() returns corpusiq-agent
                       _run_job_script() resolves scripts_dir to the WRONG profile
                       → "Script not found"
t=N   sequential pool: profile job's finally → _hermes_home = None

The _job_profile_context finally-block restoration only fires when the sequential job finishes. Any parallel job that fires inside that window sees the bleed.
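
The distinction between the two carriers can be shown in isolation. The following is a minimal sketch, not the scheduler's code: `_module_home`, `_ctx_home`, and both job functions are hypothetical stand-ins, and the finally-block restore is deliberately left out so the window stays open. It assumes only what is described above, namely that each future runs inside its own copied context.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two carriers of "current home".
_module_home = None  # plays the role of the module-global _hermes_home
_ctx_home = contextvars.ContextVar("ctx_home", default=None)


def profile_job_sets_state():
    """Simulates the profile job entering its context (no restore, window open)."""
    global _module_home
    _module_home = "profiles/corpusiq-agent"
    _ctx_home.set("profiles/corpusiq-agent")


def non_profile_job_reads_state():
    """Simulates a no-profile job on the other pool reading both carriers."""
    return _module_home, _ctx_home.get()


with ThreadPoolExecutor(max_workers=1) as sequential, \
        ThreadPoolExecutor(max_workers=1) as parallel:
    # Each future runs inside its own copied context.
    sequential.submit(
        contextvars.copy_context().run, profile_job_sets_state
    ).result()
    seen_global, seen_ctxvar = parallel.submit(
        contextvars.copy_context().run, non_profile_job_reads_state
    ).result()

print(seen_global)  # the profile's home, leaked through the module global
print(seen_ctxvar)  # None: the contextvar write stayed in the other context
```

The contextvar write never leaves the context it was made in, while the plain global is shared by every thread in the process.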

Reproduction sketch

  1. Run the default-profile gateway with at least one job A configured with profile=<other> and another job B configured with profile=null and script=<some_default_profile_script>.py
  2. Configure their schedules so both are due in the same minute, with A slightly earlier
  3. Job A opens the profile context → mutates _hermes_home
  4. Within the next 1-2 seconds (before A returns), B is dispatched on the parallel pool and calls _run_job_script → fails with Script not found pointing at <other> profile's scripts dir

It's intermittent because:

  • The bleed only happens if the parallel job hits _get_hermes_home() during the profile job's window
  • A short profile job that finishes quickly closes the race window before the next parallel job fires
  • A profile job with no_agent: false (LLM-driven) can stay in the context for many seconds → wide window
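
The timing dependence above can be made deterministic in a small model. This is a sketch under stated assumptions: the paths, `get_home`, `job_profile_context`, and the two job functions are hypothetical and only mimic the set-then-restore pattern described in this report. Two events force job B to read the home directory while job A is still inside its context.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

DEFAULT_HOME = "/home/user/.hermes"              # hypothetical default home
PROFILE_HOME = DEFAULT_HOME + "/profiles/other"  # hypothetical profile home

_home_override = None  # stand-in for the module-global _hermes_home


def get_home():
    return _home_override or DEFAULT_HOME


@contextmanager
def job_profile_context(profile_home):
    """Mimics the reported pattern: set a module global, restore in finally."""
    global _home_override
    prior = _home_override
    _home_override = profile_home
    try:
        yield
    finally:
        _home_override = prior


inside_window = threading.Event()
b_finished = threading.Event()


def job_a():
    with job_profile_context(PROFILE_HOME):
        inside_window.set()         # A is now inside its profile context
        b_finished.wait(timeout=5)  # models a long-running, agent-driven job


def job_b():
    inside_window.wait(timeout=5)   # B fires while A is still in its context
    scripts_dir = get_home() + "/scripts"
    b_finished.set()
    return scripts_dir


with ThreadPoolExecutor(1) as sequential, ThreadPoolExecutor(1) as parallel:
    fut_a = sequential.submit(job_a)
    fut_b = parallel.submit(job_b)
    during = fut_b.result()
    fut_a.result()

after = get_home() + "/scripts"
print(during)  # resolved against the profile's scripts dir
print(after)   # back to the default once A's finally block has run
```

Without the events, whether B lands inside the window depends on how long A runs, which matches the intermittent failures described above.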

Suggested fix

Replace the module-global mutation in _job_profile_context with the contextvar that's already in use (set_hermes_home_override from hermes_constants). Specifically, remove the global _hermes_home declaration and the _hermes_home = profile_home assignment, and rely on set_hermes_home_override(profile_home) exclusively. Then update _get_hermes_home() to read the contextvar first:

def _get_hermes_home() -> Path:
    # Read contextvar first; preserves test monkeypatch via module global as fallback
    from hermes_constants import get_hermes_home_override
    override = get_hermes_home_override()
    return override or _hermes_home or get_hermes_home()

Keep _hermes_home as the test/monkeypatch hook (it's documented as such) but stop using it as the per-job profile carrier — that's what contextvars are for, and contextvars.copy_context() already isolates them across the parallel/sequential pool threads.
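
To illustrate why the contextvar closes the window, here is a self-contained sketch of the proposed lookup order. It does not use the real hermes_constants module: `_override_var`, `DEFAULT_HOME`, the local `get_hermes_home_override`, and this `_job_profile_context` are hypothetical stand-ins, and the real helpers may have different signatures. The events hold the profile job inside its context while the plain job reads the home directory.

```python
import contextvars
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from pathlib import Path

# Hypothetical stand-ins for hermes_constants; the real module may differ.
_override_var = contextvars.ContextVar("hermes_home_override", default=None)
DEFAULT_HOME = Path("/home/user/.hermes")

_hermes_home = None  # kept only as a test/monkeypatch hook


def get_hermes_home_override():
    return _override_var.get()


def _get_hermes_home() -> Path:
    # Contextvar first, then the monkeypatch hook, then the default.
    return get_hermes_home_override() or _hermes_home or DEFAULT_HOME


@contextmanager
def _job_profile_context(profile_home: Path):
    """Carries the profile via the contextvar only; no module-global write."""
    token = _override_var.set(profile_home)
    try:
        yield
    finally:
        _override_var.reset(token)


entered = threading.Event()
release = threading.Event()


def profile_job(profile_home: Path) -> Path:
    with _job_profile_context(profile_home):
        entered.set()             # the profile context is now open
        release.wait(timeout=5)   # stay inside it until the plain job has read
        return _get_hermes_home()


def plain_job() -> Path:
    entered.wait(timeout=5)       # read while the profile job is mid-context
    home = _get_hermes_home()
    release.set()
    return home


profile_home = DEFAULT_HOME / "profiles" / "corpusiq-agent"
with ThreadPoolExecutor(1) as sequential, ThreadPoolExecutor(1) as parallel:
    fut_profile = sequential.submit(
        contextvars.copy_context().run, profile_job, profile_home
    )
    fut_plain = parallel.submit(contextvars.copy_context().run, plain_job)
    seen_by_plain_job = fut_plain.result()
    seen_by_profile_job = fut_profile.result()

print(seen_by_profile_job)  # the profile home
print(seen_by_plain_job)    # the default home, even mid-window
```

In this model the plain job resolves the default home even though it reads while the profile job's context is open, because the override lives only in the profile job's copied context.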

Workaround

I'm copying affected scripts into BOTH ~/.hermes/scripts/ and ~/.hermes/profiles/<profile>/scripts/ so the path-traversal guard passes from either profile context. Symlinks don't work because _run_job_script resolves the symlink target with .resolve() and then enforces path.relative_to(scripts_dir_resolved), which rejects cross-profile symlinks.
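
The symlink rejection can be reproduced with the standard library alone. The guard below is a hypothetical reconstruction of the check described above (resolve the path, then require it to sit under the resolved scripts directory); `is_allowed` and the file names are illustrative and not taken from _run_job_script. It assumes a platform where creating symlinks is permitted, as on Linux.

```python
import os
import tempfile
from pathlib import Path


def is_allowed(script: Path, scripts_dir: Path) -> bool:
    """Hypothetical guard: resolve, then require the result under scripts_dir."""
    resolved = script.resolve()
    try:
        resolved.relative_to(scripts_dir.resolve())
        return True
    except ValueError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    default_scripts = root / "scripts"
    profile_scripts = root / "profiles" / "other" / "scripts"
    default_scripts.mkdir(parents=True)
    profile_scripts.mkdir(parents=True)

    real = default_scripts / "daily.py"
    real.write_text("print('hi')\n")

    link = profile_scripts / "daily.py"
    os.symlink(real, link)             # cross-profile symlink

    copy = profile_scripts / "daily_copy.py"
    copy.write_text(real.read_text())  # plain copy, as in the workaround

    link_ok = is_allowed(link, profile_scripts)
    copy_ok = is_allowed(copy, profile_scripts)

print(link_ok)  # False: the symlink resolves outside the profile's scripts dir
print(copy_ok)  # True: the copy really lives under the profile's scripts dir
```

Because resolve() follows the link back to the default scripts directory, only a real copy satisfies the containment check from both profile contexts.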

Severity

The bug is silent until it bites. The cron output file says Script not found: <wrong-path>, but if the user has Outlook auto-archiving or doesn't compare Sent Items against Inbox, they may not notice that emails are missing for days. The cron's last_status: error does surface in cronjob action='list', which is how I caught it.

Environment

  • Hermes commit (at time of report): 30412a977 (NousResearch/hermes-agent main)
  • Linux, Python 3.11.15, systemd-managed hermes-gateway + hermes-gateway-<profile> services
  • Single-host setup with default + one named profile (corpusiq-agent)
