A production server ran three Hermes gateway services (default, jimmy, and franky), all with kanban.dispatch_in_gateway: true and all sharing the same board (ghi-property-launch). After the shared board DB became corrupt during an OOM/restart incident, all three gateways repeatedly attempted dispatcher and notifier work against the same corrupt DB. This produced repeated tracebacks and backups and contributed to restart and resource pressure.

Closely related work already exists in PR #33319 / issue #32593, but this incident adds a multi-profile gateway angle. The docs describe the gateway-embedded dispatcher as the default pattern and the standalone daemon as deprecated, but they don't clearly state whether a multi-profile install with several gateway services sharing one Kanban root should run one embedded dispatcher or one per profile.

Kanban dispatcher should avoid repeated corrupt-board work and clarify multi-profile gateway dispatcher ownership

Environment

Host: Ubuntu/Linux, systemd services
Services:
- hermes-gateway.service with HERMES_HOME=/home/admin/.hermes
- hermes-gateway-jimmy.service with HERMES_HOME=/home/admin/.hermes/profiles/jimmy
- hermes-gateway-franky.service with HERMES_HOME=/home/admin/.hermes/profiles/franky
Shared board:
- /home/admin/.hermes/kanban/boards/ghi-property-launch/kanban.db
All three profiles had:
- kanban.dispatch_in_gateway: true
- dispatch_interval_seconds: 60

Observed behavior

During the incident the gateway logs showed repeated errors across all three services:

KanbanDbCorruptError: Refusing to open corrupt kanban DB at /home/admin/.hermes/kanban/boards/ghi-property-launch/kanban.db: sqlite refused to open file: database disk image is malformed
KanbanDbCorruptError: ... integrity_check returned '*** in database main ***\nTree 2 page 2 cell 6: Rowid 14 out of order'
sqlite3.OperationalError: disk I/O error
sqlite3.OperationalError: no such table: tasks
WARNING gateway.run: kanban notifier tick failed: no such table: kanban_notify_subs

The board directory accumulated many timestamped corrupt DB backups/sidecars because multiple gateway processes were probing/backing up the same corrupt file.

The current live DB after recovery is valid but empty (integrity_check: ok, tasks: 0), while backups contain the pre-incident tasks:

kanban.db.empty-current-before-recover.20260528_054113.bak: integrity_check: ok, 30 tasks, 24 links, 42 runs, 169 events
kanban.db.bak: 30 tasks, 24 links, 620 runs, 2999 events, but integrity_check reports wrong # of entries in index idx_runs_task
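A minimal sketch of how backup files like these could be inspected without modifying them, using only the Python standard library. The `inspect_db` helper is hypothetical and not part of Hermes; the only schema detail taken from this report is the `tasks` table name. A badly damaged file may raise `sqlite3.DatabaseError` (for example "database disk image is malformed") instead of returning rows, so callers should be prepared for that.

```python
import sqlite3
from pathlib import Path


def inspect_db(path):
    """Return (integrity_check rows, task count or None) for a SQLite file.

    Hypothetical helper for manual triage; not a Hermes API.
    """
    # Open read-only through a file: URI so inspection never writes to the backup.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # A healthy database yields the single row "ok".
        integrity = [row[0] for row in conn.execute("PRAGMA integrity_check")]
        try:
            tasks = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
        except sqlite3.OperationalError:
            tasks = None  # e.g. "no such table: tasks"
        return integrity, tasks
    finally:
        conn.close()


if __name__ == "__main__":
    import sys

    for candidate in sys.argv[1:]:
        print(candidate, inspect_db(candidate))
```

Running it over each backup in the board directory gives a quick comparison of which file is both intact and non-empty before choosing one to restore.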

Why this seems like a product/code issue

hermes_cli.kanban_db.KanbanDbCorruptError currently inherits from RuntimeError, while gateway/run.py's corrupt-board disable path recognizes sqlite3.DatabaseError via _is_corrupt_board_db_error(). That means deeper integrity failures surfaced as KanbanDbCorruptError can miss the board-disable path and continue ticking/logging/backing up.
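The mismatch can be illustrated with a small self-contained sketch. The classes below are stand-ins, not the Hermes source, and `is_corrupt_board_db_error` is an assumed simplification of what `_is_corrupt_board_db_error()` checks (the real function may inspect more than the exception type).

```python
import sqlite3


class KanbanDbCorruptError(RuntimeError):
    """Stand-in mirroring the current base class described in this report."""


class KanbanDbCorruptErrorFixed(sqlite3.DatabaseError):
    """Stand-in for the direction PR #33319 appears to take."""


def is_corrupt_board_db_error(exc):
    # Assumed simplification: treat only sqlite3.DatabaseError (and its
    # subclasses, such as sqlite3.OperationalError) as a corrupt-board signal.
    return isinstance(exc, sqlite3.DatabaseError)


if __name__ == "__main__":
    for exc in (
        KanbanDbCorruptError("integrity_check failed"),
        KanbanDbCorruptErrorFixed("integrity_check failed"),
        sqlite3.OperationalError("disk I/O error"),
    ):
        print(type(exc).__name__, is_corrupt_board_db_error(exc))
```

Under this assumption, the RuntimeError-based exception is the only one of the three that slips past the check, which matches the continued ticking seen in the logs.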

PR #33319 appears to address this by making KanbanDbCorruptError a sqlite3.DatabaseError and caching backups by file fingerprint. That seems directionally correct.
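As a rough illustration of fingerprint-based backup caching (not the PR's actual implementation), the sketch below backs up a corrupt file only when its size, mtime, or inode has changed since the last backup. The class name, the fingerprint fields, and the backup file naming are all assumptions made for this example.

```python
import os
import shutil
import time
from pathlib import Path


def file_fingerprint(path):
    """Cheap identity for 'same unchanged file': size, mtime, and inode."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)


class CorruptBackupCache:
    """Hypothetical per-process cache; remembers which fingerprint was backed up."""

    def __init__(self):
        self._seen = {}  # db path -> fingerprint already backed up

    def backup_once(self, db_path):
        """Copy db_path aside unless this exact fingerprint was already backed up.

        Returns the backup path, or None when the backup was skipped.
        """
        db_path = Path(db_path)
        fp = file_fingerprint(db_path)
        if self._seen.get(db_path) == fp:
            return None  # unchanged corrupt file: no new backup
        stamp = time.strftime("%Y%m%d_%H%M%S")
        dest = db_path.with_name(f"{db_path.name}.corrupt.{stamp}.bak")
        shutil.copy2(db_path, dest)
        self._seen[db_path] = fp
        return dest
```

Because this cache lives in memory, each gateway process would still take one backup of its own. Avoiding that across three services would need shared state, such as a marker file next to the DB or a single dispatcher owner.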

The remaining ambiguity: multi-profile installs can run several gateway services that share the same Kanban root by design (get_default_hermes_root() resolves profile homes back to the root). Docs currently say:

- dispatcher runs inside gateway by default
- one dispatcher sweeps all boards per tick
- standalone hermes kanban daemon plus gateway dispatcher against the same DB is unsupported

But they don't explicitly say whether multiple profile gateways should all run embedded dispatchers against the same shared Kanban board. Operationally, that appears risky under corruption/failure conditions.

Expected behavior / proposed improvements

Ensure KanbanDbCorruptError follows the same corrupt-board disable/quarantine path as sqlite3.DatabaseError invalid-header cases.
Cache corrupt DB backups per unchanged file fingerprint so repeated ticks don't create backup storms.
Make notifier/ready-queue health probes also skip disabled corrupt board fingerprints.
Clarify docs for multi-profile deployments:
- Is exactly one embedded dispatcher per shared Kanban root the supported pattern?
- Should non-owner profile gateways set HERMES_KANBAN_DISPATCH_IN_GATEWAY=0 or kanban.dispatch_in_gateway: false?
- If multiple gateway-embedded dispatchers are intended/supported, document the concurrency guarantees and failure-mode behavior.
Consider a lock/leader-election mechanism per board/root so only one gateway process dispatches a shared board at a time, while other gateways can still run as platform adapters/workers.
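One possible shape for the per-root lock idea in the last item is a non-blocking advisory file lock, sketched here with `fcntl.flock` (Linux/Unix only). The `try_dispatcher_lock` helper and the lock file location are hypothetical; Hermes does not necessarily work this way, and a real design would also need to decide how long the lock is held and how ownership is reported.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def try_dispatcher_lock(lock_path):
    """Yield True if this caller holds the exclusive dispatcher lock, else False.

    Hypothetical helper: never blocks, so a non-owner gateway can skip
    dispatching and keep serving as a platform adapter/worker.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    acquired = False
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            pass  # another open file description already holds the lock
        yield acquired
    finally:
        if acquired:
            fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def dispatch_tick(lock_path, do_dispatch):
    """Run do_dispatch() only when this process owns the lock for the root."""
    with try_dispatcher_lock(lock_path) as owner:
        if owner:
            do_dispatch()
        return owner
```

The kernel releases a flock when the holding process exits, so a crashed owner does not leave a stale lock behind. That property matters in an OOM/restart scenario like the one described here.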

Operational impact

With previous service units using Restart=always and unlimited start-limit behavior, all three gateways repeatedly respawned during the incident, creating high CPU/RAM pressure. Systemd memory caps/restart limits mitigated this, but the Kanban side should still fail closed with one actionable error rather than repeated work across profiles.
