hermes: Fix Kanban dispatcher corrupt-board handling and multi-profile gateway ownership ambiguity

Kanban dispatcher should avoid repeated corrupt-board work and clarify multi-profile gateway dispatcher ownership

Summary

A production server ran three Hermes gateway services (default, jimmy, and franky), all with kanban.dispatch_in_gateway: true and all sharing the same board (ghi-property-launch). After the shared board DB became corrupt during an OOM/restart incident, all three gateways repeatedly attempted dispatcher/notifier work against the same corrupt DB. This produced repeated tracebacks and backups and contributed to restart and resource pressure.

There is already closely related work in PR #33319 / issue #32593, but this incident adds a multi-profile gateway angle. The docs describe the gateway-embedded dispatcher as the default pattern and the standalone daemon as deprecated, but they do not clearly state whether a multi-profile install with several gateway services sharing one Kanban root should run one embedded dispatcher or one per profile.

Environment

  • Host: Ubuntu/Linux, systemd services
  • Services:
    • hermes-gateway.service with HERMES_HOME=/home/admin/.hermes
    • hermes-gateway-jimmy.service with HERMES_HOME=/home/admin/.hermes/profiles/jimmy
    • hermes-gateway-franky.service with HERMES_HOME=/home/admin/.hermes/profiles/franky
  • Shared board:
    • /home/admin/.hermes/kanban/boards/ghi-property-launch/kanban.db
  • All three profiles had:
    • kanban.dispatch_in_gateway: true
    • dispatch_interval_seconds: 60

Observed behavior

During the incident the gateway logs showed repeated errors across all three services:

KanbanDbCorruptError: Refusing to open corrupt kanban DB at /home/admin/.hermes/kanban/boards/ghi-property-launch/kanban.db: sqlite refused to open file: database disk image is malformed
KanbanDbCorruptError: ... integrity_check returned '*** in database main ***\nTree 2 page 2 cell 6: Rowid 14 out of order'
sqlite3.OperationalError: disk I/O error
sqlite3.OperationalError: no such table: tasks
WARNING gateway.run: kanban notifier tick failed: no such table: kanban_notify_subs

The board directory accumulated many timestamped corrupt DB backups/sidecars because multiple gateway processes were probing/backing up the same corrupt file.

The current live DB after recovery is valid but empty (integrity_check: ok, tasks: 0), while backups contain the pre-incident tasks:

  • kanban.db.empty-current-before-recover.20260528_054113.bak: integrity_check: ok, 30 tasks, 24 links, 42 runs, 169 events
  • kanban.db.bak: 30 tasks, 24 links, 620 runs, 2999 events, but integrity_check reports "wrong # of entries in index idx_runs_task"
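
Figures like these can be gathered for any backup with a read-only check. The following is a minimal sketch using only Python's standard sqlite3 module. inspect_db is a hypothetical helper, not part of Hermes, and tasks is the only table name taken from the logs in this report; pass other table names if your schema has them.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path


def inspect_db(path: Path, tables: tuple[str, ...] = ("tasks",)) -> dict:
    """Run PRAGMA integrity_check and count rows in the given tables (hypothetical helper)."""
    report: dict = {"path": str(path), "integrity": None, "counts": {}}
    # mode=ro opens the file read-only so the inspection cannot modify a backup.
    uri = f"{path.resolve().as_uri()}?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error as exc:
        report["integrity"] = f"open failed: {exc}"
        return report
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        report["integrity"] = "; ".join(str(row[0]) for row in rows)
        for table in tables:
            try:
                # Table names are supplied by the caller, not by untrusted input.
                count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
                report["counts"][table] = count
            except sqlite3.Error as exc:
                report["counts"][table] = f"error: {exc}"
    except sqlite3.Error as exc:
        # A file SQLite refuses to read at all ends up here.
        report["integrity"] = f"check failed: {exc}"
    finally:
        conn.close()
    return report
```

Running it over each file in the board directory gives a quick view of which backup is both intact and populated before choosing one to restore from.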

Why this seems like a product/code issue

hermes_cli.kanban_db.KanbanDbCorruptError currently inherits from RuntimeError, while gateway/run.py's corrupt-board disable path recognizes sqlite3.DatabaseError via _is_corrupt_board_db_error(). That means deeper integrity failures surfaced as KanbanDbCorruptError can miss the board-disable path and continue ticking/logging/backing up.
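
The mismatch can be illustrated with a few lines of Python. This is a sketch under stated assumptions: is_corrupt_board_db_error below is a simplified stand-in for the gateway's _is_corrupt_board_db_error() (whose real body may check more than the type), and the two exception classes are hypothetical mirrors of the current and an alternative base class, not the Hermes classes themselves.

```python
from __future__ import annotations

import sqlite3


def is_corrupt_board_db_error(exc: BaseException) -> bool:
    """Simplified stand-in: treats only sqlite3.DatabaseError as a corrupt-board signal."""
    return isinstance(exc, sqlite3.DatabaseError)


class CorruptErrorAsRuntime(RuntimeError):
    """Hypothetical mirror of an error that inherits from RuntimeError."""


class CorruptErrorAsDatabase(sqlite3.DatabaseError):
    """Hypothetical mirror of an error that inherits from sqlite3.DatabaseError."""


def reaches_disable_path(exc_type: type[BaseException]) -> bool:
    """Raise the given error type and report whether the type check recognizes it."""
    try:
        raise exc_type("integrity_check failed")
    except Exception as exc:
        return is_corrupt_board_db_error(exc)
```

Under these assumptions, reaches_disable_path(CorruptErrorAsRuntime) is False while reaches_disable_path(CorruptErrorAsDatabase) is True, which is why the base class decides whether a board gets disabled or keeps ticking.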

PR #33319 appears to address this by making KanbanDbCorruptError a sqlite3.DatabaseError and caching backups by file fingerprint. That seems directionally correct.
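
To make the fingerprint idea concrete, here is a minimal sketch of a backup that is taken at most once per unchanged file state. It is not the implementation from PR #33319: the function name backup_once, the (size, mtime) fingerprint, and the backup file naming are all assumptions chosen for illustration. Because the backup name is derived from the fingerprint, several processes looking at the same corrupt file converge on one backup instead of each writing a timestamped copy.

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path


def backup_once(db_path: Path) -> Path | None:
    """Copy db_path aside unless a backup for this exact file state already exists.

    Returns the new backup path, or None when the backup was skipped.
    """
    st = db_path.stat()
    # Assumed fingerprint: file size plus modification time in nanoseconds.
    fingerprint = f"{st.st_size}-{st.st_mtime_ns}"
    dest = db_path.with_name(f"{db_path.name}.corrupt.{fingerprint}.bak")
    if dest.exists():
        return None  # same file state was already backed up, possibly by another process
    # Copy to a per-process temp name, then rename atomically into place.
    tmp = dest.with_name(f"{dest.name}.{os.getpid()}.tmp")
    shutil.copy2(db_path, tmp)
    os.replace(tmp, dest)
    return dest
```

Two processes can still race between the exists() check and the rename, but the outcome is a single backup file per fingerprint rather than one per tick per gateway.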

The remaining ambiguity: multi-profile installs can run several gateway services that share the same Kanban root by design (get_default_hermes_root() resolves profile homes back to the root). Docs currently say:

  • dispatcher runs inside gateway by default
  • one dispatcher sweeps all boards per tick
  • standalone hermes kanban daemon plus gateway dispatcher against the same DB is unsupported

But they don't explicitly say whether multiple profile gateways should all run embedded dispatchers against the same shared Kanban board. Operationally, that appears risky under corruption/failure conditions.

Expected behavior / proposed improvements

  1. Ensure KanbanDbCorruptError follows the same corrupt-board disable/quarantine path as sqlite3.DatabaseError invalid-header cases.
  2. Cache corrupt DB backups per unchanged file fingerprint so repeated ticks don't create backup storms.
  3. Make notifier/ready-queue health probes also skip disabled corrupt board fingerprints.
  4. Clarify docs for multi-profile deployments:
    • Is exactly one embedded dispatcher per shared Kanban root the supported pattern?
    • Should non-owner profile gateways set HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 or kanban.dispatch_in_gateway: false?
    • If multiple gateway-embedded dispatchers are intended/supported, document the concurrency guarantees and failure-mode behavior.
  5. Consider a lock/leader-election mechanism per board/root so only one gateway process dispatches a shared board at a time, while other gateways can still run as platform adapters/workers.
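
As an illustration of item 5, a per-board advisory lock is one simple way to let only one gateway process dispatch a shared board. The sketch below uses fcntl.flock from the Python standard library and therefore applies to POSIX hosts only. The class name BoardDispatchLock and the lock file name dispatcher.lock are hypothetical; Hermes does not necessarily use or plan anything like this.

```python
from __future__ import annotations

import fcntl
import os
from pathlib import Path


class BoardDispatchLock:
    """Non-blocking advisory lock stored next to a board DB (hypothetical, POSIX only)."""

    def __init__(self, board_dir: Path, name: str = "dispatcher.lock") -> None:
        self.path = board_dir / name
        self._fd: int | None = None

    def try_acquire(self) -> bool:
        """Return True if this process now owns dispatch for the board."""
        if self._fd is not None:
            return True  # already held by this object
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            # LOCK_NB makes flock fail immediately instead of waiting for the holder.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            return False  # another gateway holds the lock: skip dispatch this tick
        self._fd = fd
        return True

    def release(self) -> None:
        if self._fd is not None:
            fcntl.flock(self._fd, fcntl.LOCK_UN)
            os.close(self._fd)
            self._fd = None
```

A gateway would call try_acquire() at the start of each dispatcher tick and skip the board when it returns False. Because the kernel drops a flock when the holding process exits, a gateway killed by the OOM killer does not leave a stale lock behind, and another gateway can take over on its next tick.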

Operational impact

The previous service units used Restart=always with no start limit, so all three gateways respawned repeatedly during the incident, creating high CPU/RAM pressure. Systemd memory caps and restart limits mitigated this, but the Kanban side should still fail closed with one actionable error rather than repeating work across profiles.
