hermes: Fix kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning

Summary

The kanban dispatcher embedded in hermes gateway run wedges after 3-4 verifier-subprocess completions, requiring a gateway restart to recover. The evidence points to a race between the gateway's multi-threaded SQLite connection cycling and verifier-subprocess connection lifecycles in WAL mode. Switching the kanban DB to journal_mode=DELETE eliminates the wedge in our environment; we've validated the patch under load. Submitting as an issue first so the maintainer can decide on the right fix shape — a more surgical WAL-preserving fix may be preferable.

Environment

  • Hermes Agent v0.14.0 (2026.5.16) — commit d61785889
  • Linux 6.17.0-29-generic, Ubuntu 24.04
  • Filesystem: local XFS (rw,noatime) — not NFS/SMB
  • SQLite via Python 3.11 sqlite3 module
  • Affected: any board with multi-process write access (gateway dispatcher + verifier subprocesses)

Symptom (measured)

The kanban dispatcher embedded in hermes gateway run ticks every 60s. After 3-4 verifier-subprocess completions on a single board, every subsequent tick fails with sqlite3.OperationalError: disk I/O error in release_stale_claims (hermes_cli/kanban_db.py:2399). Failures continue indefinitely until hermes gateway restart.

Repro

  1. Run hermes gateway start against a board with active dispatch (kanban.dispatch_in_gateway: true).
  2. Dispatch ~4 worker tasks whose verifier subprocesses each open their own kanban DB connection (e.g., a custom verifier profile that calls kanban_block from inside a skill).
  3. After 3-4 completions, dispatcher ticks start failing with the I/O error.
  4. Every subsequent tick fails until hermes gateway restart.
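The access pattern in the repro steps can be approximated outside Hermes with a small harness. This is only a sketch of the shape of the workload: several threads cycling short-lived connections while short-lived subprocesses open their own connection and exit. The `events` table, the `VERIFIER_SNIPPET` stand-in, and the thread and pause parameters are all hypothetical, and we have not confirmed that this harness reproduces the wedge.

```python
import sqlite3
import subprocess
import sys
import threading
import time
from pathlib import Path

# Hypothetical stand-in for a verifier subprocess: opens its own connection,
# writes one row, closes, and exits.
VERIFIER_SNIPPET = (
    "import sqlite3, sys\n"
    "c = sqlite3.connect(sys.argv[1], timeout=5)\n"
    "c.execute('INSERT INTO events(note) VALUES (?)', ('verifier',))\n"
    "c.commit()\n"
    "c.close()\n"
)


def init_db(path: Path) -> None:
    """Create a WAL-mode database with one illustrative table."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("CREATE TABLE IF NOT EXISTS events(note TEXT)")
        conn.commit()
    finally:
        conn.close()


def cycle_connections(path, stop, errors, pause):
    """Mimic one gateway thread: connect, read, close, sleep, repeat."""
    while not stop.is_set():
        try:
            conn = sqlite3.connect(path, timeout=5)
            try:
                conn.execute("SELECT count(*) FROM events").fetchall()
            finally:
                conn.close()
        except sqlite3.OperationalError as exc:
            errors.append(str(exc))  # e.g. "disk I/O error" if the wedge appears
        time.sleep(pause)


def run_harness(path, threads=7, verifier_runs=4, pause=0.01):
    """Run the thread + subprocess mix and return any OperationalError messages."""
    init_db(path)
    stop = threading.Event()
    errors = []
    workers = [
        threading.Thread(target=cycle_connections, args=(path, stop, errors, pause))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    try:
        for _ in range(verifier_runs):
            subprocess.run(
                [sys.executable, "-c", VERIFIER_SNIPPET, str(path)], check=True
            )
    finally:
        stop.set()
        for worker in workers:
            worker.join()
    return errors
```

A non-empty return value containing `disk I/O error` would match the reported symptom. An empty list only means this run did not hit the window.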

Direct evidence (measured)

lsof on wedged gateway:

python <PID> orion DEL-r REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-shm
python <PID> orion  26u  REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted)
python <PID> orion  27ur REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-shm (deleted)
python <PID> orion  31u  REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted)  [different inode]
python <PID> orion  35u  REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted)  [different inode]

Three different deleted -wal inodes in one gateway lifetime. Meanwhile:

  • PRAGMA integrity_check and quick_check: both ok throughout
  • Fresh sqlite3.connect() from a different process: same SELECT runs cleanly
  • hermes kanban dispatch --dry-run from a fresh CLI process: works
  • Only the gateway's long-lived process's connections wedge
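The lsof check above can be scripted for monitoring. The following is a minimal sketch, assuming Linux and permission to read `/proc/<pid>/fd`; the function names are our own and not part of Hermes. It only sees open file descriptors. A deleted mapping (the `DEL` row in lsof) shows up in `/proc/<pid>/maps` instead and is not covered here.

```python
import os
from pathlib import Path

DELETED_SUFFIX = " (deleted)"


def deleted_wal_shm(targets):
    """Return the link targets that name an unlinked -wal or -shm file."""
    hits = []
    for target in targets:
        if not target.endswith(DELETED_SUFFIX):
            continue
        name = target[: -len(DELETED_SUFFIX)]
        if name.endswith(("-wal", "-shm")):
            hits.append(target)
    return hits


def fd_targets(pid):
    """Read the symlink target of every fd under /proc/<pid>/fd (Linux only)."""
    fd_dir = Path("/proc") / str(pid) / "fd"
    targets = []
    for entry in fd_dir.iterdir():
        try:
            targets.append(os.readlink(entry))
        except OSError:
            continue  # fd was closed between listing and readlink
    return targets
```

Calling `deleted_wal_shm(fd_targets(gateway_pid))` on a wedged gateway should list the same deleted `-wal` / `-shm` entries that lsof reports as open FDs.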

Root cause (our analysis — measured wedge + lsof evidence + inferred mechanism)

The gateway has ~7 threads. All 7 kanban DB connect sites use proper try/finally: close() patterns — no individual connection leak (we checked all of them):

  • gateway/run.py:4590 — notifier-watcher (~5s loop)
  • gateway/run.py:4842, 4857, 4878 — sub/unsub/advance helpers
  • gateway/run.py:5164 — dispatcher tick (60s)
  • gateway/run.py:5241 — spawn-budget watcher
  • gateway/run.py:9321 — auto-subscribe handler

The evidence is consistent with this mechanism: in WAL mode, SQLite shares a per-process unixShmNode and SHM mmap across all connections to the same DB within one process. The lsof output above shows the gateway holding multiple deleted-inode FDs on -wal / -shm — three different deleted -wal inodes per gateway lifetime, matching the 3-4 verifier-subprocess completions per wedge cycle.

Our inferred mechanism: when (a) a verifier subprocess opens its own connection to call kanban_block, (b) all gateway threads happen to be momentarily between connect()s, and (c) the verifier exits as the last DB-holder, SQLite's "last connection cleanup" unlinks -wal and -shm. New gateway connections see fresh files on disk, but the gateway's still-cached unixShmNode references the deleted inode. Subsequent SELECTs in any gateway connection fail with what we believe is SQLITE_IOERR_SHMMAP surfacing as the visible disk I/O error.

The wedge, the deleted-inode FDs, and the fix's effectiveness are measured. The specific unixShmNode poisoning mechanism is inferred from those measurements + SQLite's documented per-process WAL state semantics. We did not directly capture the SQLite internal error code (would require strace + a recompile to expose).
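The "last connection cleanup" step that this mechanism relies on is standard SQLite behavior and is easy to observe in isolation. The sketch below uses a throwaway database (the table `t` is illustrative) and reports whether the `-wal` and `-shm` files exist while a connection is open and after the only connection closes. It shows the cleanup only, not the cross-process race or the stale per-process state inferred above.

```python
import sqlite3
from pathlib import Path


def wal_lifecycle(db_path):
    """Report (-wal exists, -shm exists) while open and after the last close."""
    wal = Path(str(db_path) + "-wal")
    shm = Path(str(db_path) + "-shm")

    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE t(x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    conn.commit()
    while_open = (wal.exists(), shm.exists())

    # This is the only connection, so closing it checkpoints the WAL and
    # unlinks both sidecar files.
    conn.close()
    after_close = (wal.exists(), shm.exists())

    return {"while_open": while_open, "after_close": after_close}
```

On a local filesystem with a stock SQLite build, both files should exist while the connection is open and both should be gone after the close.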

Fix we applied (works in our environment, may not be the right one for upstream)

Small change in hermes_cli/kanban_db.py (around line 1049), replacing the apply_wal_with_fallback call with a single PRAGMA:

         with _INIT_LOCK:
-            # WAL doesn't work on network filesystems (NFS/SMB/FUSE). Shared helper
-            # falls back to DELETE with one WARNING so kanban stays usable there.
-            from hermes_state import apply_wal_with_fallback
-            apply_wal_with_fallback(conn, db_label=f"kanban.db ({path.name})")
+            conn.execute("PRAGMA journal_mode=DELETE")
             conn.execute("PRAGMA synchronous=NORMAL")
             conn.execute("PRAGMA foreign_keys=ON")

Other Hermes DBs that use apply_wal_with_fallback (state.db, memory_store.db, response_store.db) are unaffected — only kanban DBs switch.
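One caveat with a bare `PRAGMA journal_mode=DELETE`: leaving WAL mode needs exclusive access to the database, so the switch may not take effect while another connection still has the DB open. Depending on the situation SQLite may raise or may report the mode actually in effect, so a caller can check the result instead of assuming it. A minimal sketch, with a hypothetical helper name and logger that are not part of Hermes:

```python
import logging
import sqlite3

log = logging.getLogger("kanban_db_sketch")  # hypothetical logger name


def ensure_delete_mode(conn):
    """Try to switch to DELETE journaling; return True only if it took effect."""
    try:
        row = conn.execute("PRAGMA journal_mode=DELETE").fetchone()
    except sqlite3.OperationalError as exc:
        # Typically "database is locked" when another connection holds the DB.
        log.warning("could not leave WAL mode: %s", exc)
        return False

    mode = str(row[0]).lower()  # the PRAGMA returns the mode now in effect
    if mode != "delete":
        log.warning("journal_mode is still %r after requesting DELETE", mode)
        return False
    return True
```

Treating a `False` result as "retry on the next connect" rather than as a fatal error keeps startup robust if a stray process still holds the DB during the first connect after upgrading.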

Tradeoff this fix accepts

DELETE mode gives up WAL's reader/writer concurrency: a writer takes an exclusive lock on the DB file, so readers wait while it commits and writers serialize. For the kanban DB this looks invisible: the DB is small (~120KB in our case), the write rate is low (60s dispatcher tick + event-driven helpers, all sub-second), and the multi-process write pattern is exactly what makes WAL fragile here. The diff also removes the previous apply_wal_with_fallback call (which exists to fall back to DELETE on WAL-incompatible filesystems like NFS). DELETE mode is unconditional with this diff, so the NFS-fallback log message no longer fires.

A more surgical fix may preserve WAL. Options the maintainer might prefer:

  1. Shared long-lived gateway connection per board, used by all gateway threads (would need locking) — keeps WAL but eliminates the multi-connection-per-process pattern.
  2. Checkpoint discipline: pin the WAL/SHM lifecycle so verifier subprocesses can't trigger the "last connection cleanup" unlink (e.g., have the gateway hold a sentinel connection open for the lifetime of the process).
  3. Detect-and-reopen on I/O error: catch the symptom and refresh connections. This recovers from the wedge but doesn't fix the underlying race.
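Option 2 can be sketched in a few lines. The idea is that the gateway keeps one idle connection open for its whole lifetime, so no other connection is ever the last DB-holder and the cleanup unlink never runs while the gateway is alive. The class name is hypothetical, and we have not tested whether this prevents the wedge in Hermes. It only demonstrates that a held connection keeps `-wal` in place.

```python
import sqlite3


class WalSentinel:
    """Hold one idle connection so this process always counts as a DB holder."""

    def __init__(self, db_path):
        # check_same_thread=False only so any thread may call close() at shutdown.
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._conn.execute("PRAGMA journal_mode=WAL")
        # Touch the database so SQLite actually opens the WAL for this
        # connection; fetchall() finishes the statement so no read
        # transaction (and no pinned snapshot) is left open.
        self._conn.execute("SELECT count(*) FROM sqlite_master").fetchall()

    def close(self):
        """Release the sentinel; the normal last-connection cleanup can then run."""
        self._conn.close()
```

The sentinel must never hold an open transaction, otherwise it would block checkpoints and let the WAL grow. That is why it runs one short read and then stays idle.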

We picked DELETE because it removes the failure class by construction; we don't have visibility into upstream's wider design considerations (NFS support, expected write concurrency, etc.). The patch above is offered as one resolution; happy to defer to the maintainer's preferred approach.

Validation (measured)

After applying the patch + gateway restart:

  • Existing -wal file auto-removed on first connect (PRAGMA migration worked)
  • 3-task stress test: all dispatched in one tick, completed cleanly, zero I/O errors
  • lsof shows 0 -wal/-shm FDs on the gateway across multiple verifier-subprocess cycles
  • 0 dispatcher tick failures across the validation window

Happy to open a PR with the change shown above if the DELETE-mode direction is what you want — or to take a different cut at it if you'd prefer one of the surgical alternatives.
