[Bug] `hermes gateway run --replace` / `gateway restart --profile X` from shell leaks orphan dispatcher → silent concurrent kanban.db writer → corruption

hermes · 2026-05-30 08:22:16

The orphan process is invisible to systemctl status (its PID isn't MainPID), is re-parented to PID 1 (PPID=1), and has no time-based supervisor, so it can live for days. I caught two of them on a production Hermes droplet. They had been running, quietly misbehaving, for about 1d4h and 2.5h respectively before the second concurrent dispatcher tick collided badly enough to corrupt the kanban DB.

Closely related to #27041 (hermes-web-ui spawning orphan gateways on macOS launchd) and #31485 (RFC for production-grade multi-profile supervisor with adopt-orphan), but distinct: the producer here is the CLI itself (hermes gateway restart|run --replace), not the web-UI Node process and not the /restart slash command (#12875).

Error Message

hermes gateway restart --profile <name> and hermes gateway run --replace, when invoked from a shell on a Linux systemd host, spawn long-lived Python processes that escape the systemd service cgroup and survive subsequent systemctl restart hermes-gateway.service cycles. The orphans become silent concurrent writers on ~/.hermes/kanban.db. Combined with the systemd-managed gateway also writing, this produces classic multi-writer SQLite WAL corruption: disk I/O error, then database disk image is malformed, then dispatcher quarantine.

Root Cause

The orphan dispatcher is the silent root cause of two confirmed kanban DB corruption incidents I had on 2026-05-28 and 2026-05-30. Each time, the failure sequence was "disk I/O error → database disk image is malformed → dispatcher quarantine on all boards". Investigation initially pointed at agents running raw sqlite3 REINDEX on the live DB during a transient cloud-volume hiccup. That was a real contributor, and I closed that hole locally, but the first underlying contributor was the orphan gateway dispatcher writing in parallel with the systemd-managed one.

Fix Action

Workaround

After confirming an orphan is alive, stop it with kill -TERM <pid> (no shell loop needed). To prevent recurrence, never run hermes gateway run --replace or hermes gateway restart --profile X from a shell on a systemd-managed host. Always use systemctl restart hermes-gateway.service (or the per-profile service) instead.

This is brittle — there's no warning when a user does the wrong thing, and the failure mode is silent until the kanban DB corrupts hours/days later.
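One way to surface a warning instead of failing silently is a pre-flight check before the CLI spawns anything. The sketch below is hypothetical and is not Hermes code: `unit_main_pid`, `should_refuse_manual_start`, and the `force` parameter are illustrative names. It relies on two systemd behaviors: `systemctl show -p MainPID --value <unit>` prints `0` when the unit has no main process, and systemd sets `INVOCATION_ID` in the environment of processes it starts for a service.

```python
import subprocess


def unit_main_pid(unit: str) -> int:
    """Return the unit's MainPID as reported by systemctl, or 0 if unavailable."""
    try:
        out = subprocess.run(
            ["systemctl", "show", "-p", "MainPID", "--value", unit],
            capture_output=True, text=True, timeout=5, check=False,
        ).stdout.strip()
    except (OSError, subprocess.SubprocessError):
        return 0  # no systemctl on this host, or it did not answer in time
    return int(out) if out.isdigit() else 0


def should_refuse_manual_start(main_pid: int, environ: dict, force: bool = False) -> bool:
    """Refuse when a supervised gateway is running and we were not started by systemd.

    main_pid: MainPID of the (assumed) supervising unit, 0 if none is running.
    environ:  the caller's environment; INVOCATION_ID marks a systemd-started process.
    force:    hypothetical override flag.
    """
    started_by_systemd = "INVOCATION_ID" in environ
    return main_pid > 0 and not started_by_systemd and not force
```

A caller could evaluate `should_refuse_manual_start(unit_main_pid("hermes-gateway.service"), dict(os.environ))` and, when it returns true, exit with a hint to use systemctl restart instead. Mapping a profile to its unit name is left out because the source does not state the naming scheme.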

Code Example

$ ps -o pid,ppid,etime,cmd -p 2361366 3982251
    PID    PPID     ELAPSED CMD
2361366       1  1-04:09:41 /usr/local/lib/.../python -m hermes_cli.main gateway run --replace
3982251       1     02:25:29 /usr/local/lib/.../python -m hermes_cli.main gateway restart --profile argus

$ systemctl show -p MainPID --value hermes-gateway.service
4154204    # ≠ either orphan

$ lsof /root/.hermes/kanban.db
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
python  3982251 root   20u   REG  253,1   864256 550038 /root/.hermes/kanban.db
python  3982251 root   32u   REG  253,1   864256 550038 /root/.hermes/kanban.db
python  3982251 root   35u   REG  253,1   864256 550038 /root/.hermes/kanban.db
# (plus the systemd-managed gateway's handles — TWO independent writers)

$ strace -f -e openat,write -p 2361366 -o /tmp/orphan.strace ; grep kanban /tmp/orphan.strace
... openat("/root/.hermes/kanban.db", O_RDONLY|O_CLOEXEC) = 30
... openat("/root/.hermes/kanban.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 30
... openat("/root/.hermes/kanban.db-wal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 31
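The manual comparison in the transcript above (PPID=1 and PID different from MainPID) can be automated. This is a minimal sketch, not part of Hermes: `find_orphan_gateways` and `read_ps` are hypothetical helpers, the match string `hermes_cli.main gateway` is taken from the ps output above, and the command line shown for the supervised PID in `SAMPLE` is an assumption. Because a systemd service's main process also has PPID 1, the check has to exclude the supervised PIDs explicitly.

```python
import subprocess


def read_ps() -> str:
    """Return `pid ppid args` for every process, without a header line."""
    return subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout


def find_orphan_gateways(ps_output: str, supervised_pids: set[int]) -> list[int]:
    """Return PIDs of gateway processes re-parented to PID 1 that no unit supervises."""
    orphans = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip headers and malformed lines
        pid, ppid, args = int(parts[0]), int(parts[1]), parts[2]
        if "hermes_cli.main gateway" in args and ppid == 1 and pid not in supervised_pids:
            orphans.append(pid)
    return orphans


# Sample modelled on the transcript; the last gateway line is an assumed command line.
SAMPLE = """\
2361366       1 /usr/local/lib/.../python -m hermes_cli.main gateway run --replace
3982251       1 /usr/local/lib/.../python -m hermes_cli.main gateway restart --profile argus
4154204       1 /usr/local/lib/.../python -m hermes_cli.main gateway run
   1234    1200 bash
"""

print(find_orphan_gateways(SAMPLE, {4154204}))
```

On a host with several per-profile services, `supervised_pids` would need the MainPID of each unit.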

Environment

Linux 6.8.0-117-generic (Ubuntu)
systemd 255
Hermes Agent v0.15.1 (2026.5.29) — confirmed after a 312-commit update; the bug pre-existed in v0.14.0 and still reproduces
hermes-gateway.service installed via hermes gateway install, Restart=always
Single droplet, root-only install (HERMES_HOME=/root/.hermes, multiple --profile per-profile services for Aion + Lanista)

Reproduction

1. Install hermes-gateway.service under systemd (Restart=always).
2. From a shell (NOT via systemd), run any of:
   - hermes gateway restart --profile argus (the case I caught; meant to be a one-shot restart of the argus per-profile gateway)
   - hermes gateway run --replace (the case the older zombie originated from; it looks like a manual gateway run from a shell)
3. The command appears to complete; control returns to the shell.
4. ps -o pid,ppid,etime,cmd -p <new pid> shows the child Python process still running, PPID=1, not in any systemd unit's cgroup.
5. Run systemctl restart hermes-gateway.service later: the orphan is not affected (it isn't in the unit's cgroup).
6. lsof /root/.hermes/kanban.db shows the orphan holding multiple read/write fds on the shared kanban DB.
7. strace -e openat,write -p <orphan> confirms the orphan is actively running its own kanban dispatcher tick and opening kanban.db and kanban.db-wal with O_RDWR|O_CREAT.
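The cgroup claim in the steps above can be checked directly from /proc rather than inferred from ps. The following is a minimal sketch for Linux; the helper names are hypothetical, and the cgroup paths used in the example calls are illustrative, not captured from the affected host. Each line of /proc/<pid>/cgroup has the form `hierarchy-ID:controller-list:cgroup-path`.

```python
from pathlib import Path


def cgroup_paths(proc_cgroup_text: str) -> list[str]:
    """Parse the content of /proc/<pid>/cgroup into its cgroup paths."""
    paths = []
    for line in proc_cgroup_text.splitlines():
        fields = line.split(":", 2)
        if len(fields) == 3:
            paths.append(fields[2])
    return paths


def in_unit_cgroup(proc_cgroup_text: str, unit: str) -> bool:
    """True if any cgroup path has the unit name as one of its components."""
    return any(unit in path.split("/") for path in cgroup_paths(proc_cgroup_text))


def pid_in_unit(pid: int, unit: str) -> bool:
    """Read /proc/<pid>/cgroup and test membership in the unit's cgroup."""
    return in_unit_cgroup(Path(f"/proc/{pid}/cgroup").read_text(), unit)


# Illustrative inputs: a service-managed process and one started from a login session.
print(in_unit_cgroup("0::/system.slice/hermes-gateway.service\n", "hermes-gateway.service"))
print(in_unit_cgroup("0::/user.slice/user-0.slice/session-3.scope\n", "hermes-gateway.service"))
```

A process for which `pid_in_unit(pid, "hermes-gateway.service")` is false is outside the set of processes that systemctl restart stops for that unit.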

Impact

The May-30 incident lost the full task_runs table (header destroyed), required a row-by-row rebuild with a permissive Python reader (the build's sqlite3 lacks sqlite_dbpage, so .recover doesn't help), and dropped one task entirely.

The closed kanban-concurrency fixes (#32424, #30445, #31618, #32532) and the SQLite busy_timeout PRAGMA help against one extra writer hitting a momentarily locked DB, but don't help against two long-lived dispatchers that each genuinely believe they own the file — both pass busy_timeout happily and then race on WAL frames.

Suggested fix directions

In rough order of robustness:

1. CLI guard: Before hermes gateway run --replace or hermes gateway restart --profile X actually spawns, detect that the same (HERMES_HOME, profile) is already supervised by a systemd (or launchd) unit and refuse with a clear hint: use 'systemctl restart hermes-gateway.service' instead. Allow --force to override.
2. Adopt-orphan on startup: When gateway run --replace does proceed, before forking, scan for any other process matching the (HERMES_HOME, profile) signature (e.g., via a pid file at $HERMES_HOME/gateway.pid or matching argv + env) and either kill it or refuse to start. This is the spirit of #31485 narrowed to "kill the same-identity sibling, don't accumulate."
3. Pid-file + flock-based single-writer guarantee: Wrap the kanban dispatcher loop in a flock on $HERMES_HOME/kanban.dispatch.lock so even if a second process spawns, only one of them gets to write. This is defense in depth and would have caught the corruption regardless of how the orphan got there.
4. Periodic supervisor watchdog (already proposed in #32574): A platform-liveness loop that notices "I'm not MainPID and a same-identity process is" and self-exits.
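The flock-based direction could look roughly like the sketch below. It is an illustration under assumptions, not the Hermes implementation: `single_dispatcher` and `DispatcherAlreadyRunning` are hypothetical names, and the lock path is supplied by the caller (for example the kanban.dispatch.lock path suggested above). It uses `fcntl.flock` with `LOCK_EX | LOCK_NB`, so a second dispatcher fails immediately instead of waiting.

```python
import fcntl
import os
from contextlib import contextmanager


class DispatcherAlreadyRunning(RuntimeError):
    """Raised when another open file description already holds the lock."""


@contextmanager
def single_dispatcher(lock_path: str):
    """Hold an exclusive, non-blocking flock for the duration of the block."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            # Someone else owns the dispatcher role; do not touch the DB.
            raise DispatcherAlreadyRunning(lock_path) from exc
        yield fd
    finally:
        # Closing the descriptor releases the lock, unless a forked child
        # still holds a copy of the same open file description.
        os.close(fd)
```

A dispatcher would wrap its whole tick loop in `with single_dispatcher(path):` and exit, or idle, on `DispatcherAlreadyRunning`. Because the kernel drops the lock when the holder dies, a crashed dispatcher does not leave a stale lock behind, which is the main advantage over a bare pid file.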

Happy to provide more reproduction detail or test a candidate patch on the same droplet.

Related: #27041, #12875, #31485, #32574, #34966, and the closed kanban-corruption cluster #32424 / #30445 / #31618 / #32532.
