hermes: Fix kanban.db index corruption after frequent gateway restarts (dispatcher disables board permanently)

The kanban dispatcher's SQLite database (~/.hermes/kanban.db) suffers repeated index corruption after frequent gateway restarts (SIGTERM). Once the database is corrupted, the dispatcher disables dispatch for the board and only retries when the file's (path, mtime_ns, size) fingerprint changes. A gateway restart alone does not help, because the restart may not change the fingerprint and the file is still corrupt, so the board stays disabled until the database is rebuilt manually.

Environment

  • Hermes Agent v0.14.0 (2026.5.16)
  • macOS (APFS sealed volume, journaled)
  • kanban.db uses WAL mode

Reproduction

  1. Run gateway with kanban dispatch enabled (kanban.dispatch_in_gateway: true)
  2. Restart gateway frequently (e.g., during development: hermes gateway restart multiple times within minutes)
  3. At some point release_stale_claims() hits sqlite3.OperationalError: disk I/O error during dispatch
  4. This causes an incomplete WAL checkpoint, leaving indices out of sync with the table data
  5. On next tick, connect() raises sqlite3.DatabaseError: database disk image is malformed
  6. The dispatcher disables the board. The recovery attempt on the next tick sees the same (path, mtime_ns, size) fingerprint, so the board stays disabled

Evidence

Corruption pattern (integrity check)

wrong # of entries in index idx_events_run
wrong # of entries in index idx_events_task
wrong # of entries in index idx_runs_status
wrong # of entries in index idx_runs_task
wrong # of entries in index idx_tasks_status
wrong # of entries in index idx_tasks_assignee_status
row 120 missing from index idx_events_run
... (90+ missing index entries across 6 indices)
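
The messages above have the shape of output from SQLite's PRAGMA integrity_check. As a minimal sketch, the same report can be collected from Python; the helper name integrity_report is hypothetical and not part of Hermes:

```python
import sqlite3


def integrity_report(db_path: str) -> list[str]:
    """Return the rows produced by PRAGMA integrity_check.

    A healthy database yields the single row 'ok'; otherwise each row
    is one problem description.
    """
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```

Any result other than a single 'ok' row means the file needs repair before dispatch can safely resume.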

Gateway log timeline

18:54:53 kanban dispatcher [default]: spawned=1 reclaimed=0 ...
18:55:53 kanban dispatcher: tick failed on board default
  -> sqlite3.OperationalError: disk I/O error (in release_stale_claims)
18:55:53+ kanban notifier tick failed: cannot rollback - no transaction is active (repeated)
18:56:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch
18:57:54 kanban dispatcher: board default database changed; retrying dispatch
18:57:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch

The retry at 18:57:54 detects a file change, but the database is still corrupt, so dispatch is disabled again. Only a .dump + rebuild repaired the indices.

Manual recovery that worked

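# Dump the logical contents, replay them into a fresh file, then swap it in.
sqlite3 kanban.db '.dump' > dump.sql
sqlite3 kanban.db.new < dump.sql
mv kanban.db.new kanban.db
# Verify the rebuilt database; this reported: ok
sqlite3 kanban.db 'PRAGMA integrity_check;'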

This happened twice, on 5/22 and 5/23, with identical symptoms.

Root Cause Analysis

The corruption chain:

  1. Frequent SIGTERM -> gateway doesn't always complete WAL checkpoint before exit
  2. APFS + WAL edge case -> disk I/O error during a read query on partially-checkpointed WAL
  3. Index/table desync -> indices become stale relative to table data
  4. _is_corrupt_board_db_error() catches DatabaseError -> disables the board correctly
  5. Recovery heuristic is fragile -> the fingerprint is (path, mtime_ns, size). A dump/rebuild does change the file, but simply restarting the gateway after the corrupt write can leave the fingerprint unchanged (for example, the file size may be identical)
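
To make step 5 concrete, a fingerprint of this shape can be built from os.stat. This is only a sketch, assuming the dispatcher compares tuples like this one. The function name db_fingerprint is hypothetical, and the actual Hermes code is not shown in this report:

```python
import os


def db_fingerprint(path: str) -> tuple[str, int, int]:
    """Return (path, mtime_ns, size) for the file at `path`."""
    st = os.stat(path)
    return (path, st.st_mtime_ns, st.st_size)
```

If nothing writes to the file between two calls, the two fingerprints compare equal, so a retry keyed only on this tuple never fires.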

Suggested Fixes

1. Auto-repair on corruption detection (recommended)

When _is_corrupt_board_db_error() fires in _tick_once_for_board(), instead of permanently disabling the board, attempt a self-heal:

  • Try REINDEX first (fast, handles most index-only corruption)
  • If that fails, fall back to .dump + rebuild (handles deeper page-level corruption)
  • If repair succeeds, retry dispatch; if it fails, then disable the board
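
The three steps above could be sketched as follows. This is an illustration under stated assumptions, not the Hermes implementation: the names is_healthy, reindex_ok, rebuild_ok and try_repair are hypothetical, and the sketch assumes no other connection has the database open while the repair runs.

```python
import os
import sqlite3


def is_healthy(conn: sqlite3.Connection) -> bool:
    """True when PRAGMA integrity_check returns the single row 'ok'."""
    return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]


def reindex_ok(db_path: str) -> bool:
    """Fast path: rebuild all indices from table data, then verify."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("REINDEX")
        return is_healthy(conn)
    except sqlite3.DatabaseError:
        return False
    finally:
        conn.close()


def rebuild_ok(db_path: str) -> bool:
    """Slow path: replay a logical dump into a fresh file and swap it in."""
    new_path = db_path + ".new"
    if os.path.exists(new_path):
        os.remove(new_path)
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(new_path)
    try:
        # iterdump() yields the same kind of SQL text as the CLI's .dump
        dst.executescript("\n".join(src.iterdump()))
        healthy = is_healthy(dst)
    except sqlite3.DatabaseError:
        healthy = False
    finally:
        src.close()
        dst.close()
    if healthy:
        os.replace(new_path, db_path)
    elif os.path.exists(new_path):
        os.remove(new_path)
    return healthy


def try_repair(db_path: str) -> bool:
    """Return True if the database is healthy after REINDEX or a rebuild."""
    return reindex_ok(db_path) or rebuild_ok(db_path)
```

A dispatcher could call try_repair(path) when corruption is detected and disable the board only when it returns False. A real implementation would also need to decide what to do with the -wal and -shm side files and to restore the journal mode, because a rebuilt file does not start in WAL mode.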

2. Periodic WAL checkpoint

Add a periodic PRAGMA wal_checkpoint(TRUNCATE) in the dispatcher tick (e.g., every N ticks or every M minutes) to keep the WAL file small and reduce the window for corruption on unclean shutdown.
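
A minimal sketch of the "every N ticks" variant, assuming the dispatcher owns a sqlite3 connection that is not inside an open transaction when the checkpoint runs. The class name CheckpointTimer and the default interval are hypothetical:

```python
import sqlite3


class CheckpointTimer:
    """Runs a TRUNCATE checkpoint once every `every_n_ticks` calls to tick()."""

    def __init__(self, every_n_ticks: int = 10) -> None:
        self.every_n_ticks = every_n_ticks
        self._ticks = 0

    def tick(self, conn: sqlite3.Connection) -> bool:
        """Return True if a checkpoint ran on this tick and was not blocked."""
        self._ticks += 1
        if self._ticks % self.every_n_ticks != 0:
            return False
        # The pragma returns one row: (busy, wal_frames, checkpointed_frames).
        busy, _wal_frames, _checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(TRUNCATE)"
        ).fetchone()
        return busy == 0
```

A busy result means a concurrent reader or writer blocked the checkpoint; the next scheduled tick simply tries again.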

3. Improve recovery fingerprint heuristic

The current disabled_corrupt_boards recovery only retries when (path, mtime_ns, size) changes. Consider also tracking a generation counter that increments on any gateway restart, so a restart always gets at least one retry attempt before permanently disabling.
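
One way to picture the generation counter is to add it to the comparison key. The sketch below assumes the disabled-board record survives a restart (this report does not show how disabled_corrupt_boards is stored) and that the gateway can supply a number that increases on every start. CorruptBoardTracker and its fields are hypothetical names:

```python
import os
from dataclasses import dataclass, field


@dataclass
class CorruptBoardTracker:
    generation: int  # hypothetical: incremented once per gateway start
    disabled: dict = field(default_factory=dict)  # board -> key seen when disabled

    def _key(self, path: str) -> tuple:
        st = os.stat(path)
        return (path, st.st_mtime_ns, st.st_size, self.generation)

    def mark_corrupt(self, board: str, path: str) -> None:
        """Record the key under which the board was disabled."""
        self.disabled[board] = self._key(path)

    def should_retry(self, board: str, path: str) -> bool:
        """True if the board was never disabled, or the file or generation changed."""
        return self.disabled.get(board) != self._key(path)
```

With the generation in the key, a restart yields exactly one retry even when the file is byte-for-byte unchanged; if that retry fails, mark_corrupt stores the new key and the board stays disabled for the rest of that run.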

4. Add PRAGMA journal_mode=TRUNCATE fallback

For systems where WAL is problematic, allow configuring journal_mode per-board. WAL is preferred for concurrent read/write but TRUNCATE is more resilient to unclean shutdowns.
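
A per-board setting could be applied when the connection is opened. This is a sketch only: open_board_db and its journal_mode parameter are hypothetical, and no such configuration key is known to exist in Hermes today.

```python
import sqlite3

ALLOWED_JOURNAL_MODES = {"WAL", "TRUNCATE", "DELETE", "PERSIST"}


def open_board_db(db_path: str, journal_mode: str = "WAL") -> sqlite3.Connection:
    """Open a board database and switch it to the requested journal mode."""
    mode = journal_mode.upper()
    if mode not in ALLOWED_JOURNAL_MODES:
        raise ValueError(f"unsupported journal_mode: {journal_mode!r}")
    conn = sqlite3.connect(db_path)
    # PRAGMA arguments cannot be bound as parameters; the allow-list check
    # above is what makes the string interpolation safe.
    applied = conn.execute(f"PRAGMA journal_mode={mode}").fetchone()[0]
    if applied.upper() != mode:
        # SQLite reports the mode actually in effect; leaving WAL can fail
        # while other connections are still open.
        conn.close()
        raise RuntimeError(f"journal_mode is {applied!r}, wanted {mode!r}")
    return conn
```

Checking the value returned by the pragma matters because SQLite silently keeps the old mode when it cannot switch.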
