hermes - 💡(How to fix) Fix Agent can schedule gateway-restart cron job that kills its own runtime, creating respawn loop with launchctl/systemd KeepAlive

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

A Hermes agent running inside the gateway can call hermes cron create to schedule a hermes gateway restart (or a script that invokes launchctl kickstart -k gui/<uid>/ai.hermes.gateway / systemctl restart hermes-gateway). When the cron fires, the gateway receives SIGTERM. If the gateway is under a process supervisor with auto-restart (launchctl KeepAlive, systemd Restart=), it is immediately revived. Auto-resume then picks up the very session that scheduled the cron, and the resumed turn re-runs the same restart logic. The result is a tight SIGTERM-respawn loop, every ~10 seconds, for as long as the supervisor keeps reviving the process.

The agent is effectively able to schedule its own euthanasia, with the supervisor configured to bring it back from the dead to do it again.

Root Cause

Three independent capabilities combine into a foot-gun no single one is:

  1. hermes cron create accepts shell command bodies / arbitrary scripts with no validation against gateway-lifecycle calls.
  2. hermes gateway restart and the supervisor-restart equivalents work just fine when invoked from inside the gateway process they target.
  3. Auto-resume + supervisor KeepAlive ensure the offending session comes back online and re-executes the same logic.

Any one of these is reasonable in isolation; together they make agent self-termination loops trivial to trigger.

Fix Action

Fix / Workaround

  • Agent's actual upstream-update work (git rebase + conflict resolution against local Anthropic-stream patch) was correct — the loop was triggered AFTER the update succeeded, by the cron-based restart it scheduled to apply the update.
  • No data loss, no corrupted state — just ~10 minutes of pegged CPU and a very expensive context-compression loop on the 158k-token toxic session.
  • Recovery required: launchctl unload the plist, manually delete the two cron jobs from jobs.json, clear resume_pending in sessions.json, quarantine the toxic session JSON, remove the helper scripts in ~/.hermes/scripts/.
RAW_BUFFERClick to expand / collapse

Summary

A Hermes agent running inside the gateway can call hermes cron create to schedule a hermes gateway restart (or a script that invokes launchctl kickstart -k gui/<uid>/ai.hermes.gateway / systemctl restart hermes-gateway). When the cron fires, the gateway receives SIGTERM. If the gateway is under a process supervisor with auto-restart (launchctl KeepAlive, systemd Restart=), it is immediately revived. Auto-resume then picks up the very session that scheduled the cron, and the resumed turn re-runs the same restart logic. The result is a tight SIGTERM-respawn loop, every ~10 seconds, for as long as the supervisor keeps reviving the process.

The agent is effectively able to schedule its own euthanasia, with the supervisor configured to bring it back from the dead to do it again.

Real-world reproduction (2026-05-22)

User on macOS asked their agent in Telegram "can you upgrade yourself?". The agent legitimately ran git fetch + git rebase origin/main (good), then scheduled two recurring/one-shot cron jobs to perform the post-update restart:

  • restart-hermes-gateway-once — ran bash ~/.hermes/scripts/restart_hermes_gateway_once.sh (which calls hermes gateway restart).
  • restart-gateway-after-openai-primary — ran launchctl kickstart -k gui/501/ai.hermes.gateway directly.

Once the first cron fired:

  1. Gateway received SIGTERM.
  2. launchctl plist KeepAlive.SuccessfulExit=false revived it.
  3. New gateway boot triggered "Scheduled auto-resume for 1 restart-interrupted session(s)" — auto-resume picked up the same Telegram session that had scheduled the cron.
  4. The resumed session re-ran the same cron-creation / restart logic.
  5. GOTO 1.

errors.log excerpt — SIGTERM landed every ~10 seconds:

``` 2026-05-22 20:55:14,894 WARNING gateway.run: Shutdown context: signal=SIGTERM ... 2026-05-22 20:55:25,486 WARNING gateway.run: Shutdown context: signal=SIGTERM ... 2026-05-22 20:55:36,743 WARNING gateway.run: Shutdown context: signal=SIGTERM ... 2026-05-22 20:55:46,977 WARNING gateway.run: Shutdown context: signal=SIGTERM ... 2026-05-22 20:55:56,579 WARNING gateway.run: Shutdown context: signal=SIGTERM ... 2026-05-22 20:56:07,266 WARNING gateway.run: Shutdown context: signal=SIGTERM ... 2026-05-22 20:56:17,931 WARNING gateway.run: Shutdown context: signal=SIGTERM ... 2026-05-22 20:56:28,535 WARNING gateway.run: Shutdown context: signal=SIGTERM ... ```

This continued for ~10 minutes before manual intervention (stop gateway + launchctl unload the plist + quarantine the toxic session JSON).

The toxic session had grown to 176 messages / ~158k tokens / 401 KB — every revival triggered a preflight compression pass on this context, which is what made the loop both expensive and impossible to break with a quick recovery.

Root cause

Three independent capabilities combine into a foot-gun no single one is:

  1. hermes cron create accepts shell command bodies / arbitrary scripts with no validation against gateway-lifecycle calls.
  2. hermes gateway restart and the supervisor-restart equivalents work just fine when invoked from inside the gateway process they target.
  3. Auto-resume + supervisor KeepAlive ensure the offending session comes back online and re-executes the same logic.

Any one of these is reasonable in isolation; together they make agent self-termination loops trivial to trigger.

Proposed defenses (defense-in-depth)

1. Refuse self-targeting gateway restart calls. At the entry to hermes gateway restart / hermes gateway stop / equivalent, detect whether the caller is itself inside the gateway runtime. The gateway already sets an env var when it spawns subprocesses (HERMES_RUNNING_AS_GATEWAY=1 or similar — pick a stable name); if that env var is set, exit non-zero with a clear message:

``` Refusing to restart the gateway from inside the gateway process. Use hermes gateway restart from a shell outside the running gateway. ```

2. Filter hermes cron create payloads. Reject job specs whose command/script contains a gateway-lifecycle call:

  • hermes gateway (restart|stop|start)
  • launchctl (kickstart|unload|load|stop) .*hermes
  • systemctl (restart|stop) .*hermes
  • direct kill/pkill against the gateway PID

A simple regex pre-flight is enough; let the user override with a flag like --allow-lifecycle if they really need this for a one-off migration.

3. Loop detector. Track gateway shutdowns within a rolling window. If we observe \u22653 SIGTERM-then-respawn cycles within 60 seconds, refuse the next auto-resume and write a recovery note to the log ("loop detected — auto-resume disabled, manual restart required"). This is the safety net for novel ways an agent finds to kill its own runtime.

4. Don't auto-resume on shutdown_timeout when the previous turn's outgoing tool calls touched gateway/launchctl/systemctl/cron. This is the narrowest fix and probably the highest ROI: it doesn't disable any feature, it just refuses to silently re-run the exact tool sequence that killed us last time.

Damage assessment in the wild

  • Agent's actual upstream-update work (git rebase + conflict resolution against local Anthropic-stream patch) was correct — the loop was triggered AFTER the update succeeded, by the cron-based restart it scheduled to apply the update.
  • No data loss, no corrupted state — just ~10 minutes of pegged CPU and a very expensive context-compression loop on the 158k-token toxic session.
  • Recovery required: launchctl unload the plist, manually delete the two cron jobs from jobs.json, clear resume_pending in sessions.json, quarantine the toxic session JSON, remove the helper scripts in ~/.hermes/scripts/.

Happy to send a PR for defenses 1 + 2 + 4 (they're small surface-area). Defense 3 is bigger; might want a maintainer's design call on where the bookkeeping lives.

Related issue: #28161 (Anthropic stream-stall — also encountered during the same recovery).

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - 💡(How to fix) Fix Agent can schedule gateway-restart cron job that kills its own runtime, creating respawn loop with launchctl/systemd KeepAlive