hermes: Fix kanban.db index corruption after frequent gateway restarts (dispatcher disables board permanently)

The kanban dispatcher's SQLite database (~/.hermes/kanban.db) suffers repeated index corruption after frequent gateway restarts (SIGTERM). Once the database is corrupted, the dispatcher disables dispatch for that board until the file changes or the gateway restarts. Even after a restart it can detect the same fingerprint and stay disabled, because the change in file size or mtime is too small for the recovery heuristic to register.

Summary

Environment

Hermes Agent v0.14.0 (2026.5.16)
macOS (APFS sealed volume, journaled)
kanban.db uses WAL mode

Reproduction

1. Run the gateway with kanban dispatch enabled (kanban.dispatch_in_gateway: true).
2. Restart the gateway frequently (e.g., during development, run hermes gateway restart several times within a few minutes).
3. At some point release_stale_claims() hits sqlite3.OperationalError: disk I/O error during dispatch.
4. This leaves the WAL checkpoint incomplete, so the indices fall out of sync with the table data.
5. On the next tick, connect() raises sqlite3.DatabaseError: database disk image is malformed.
6. The dispatcher disables the board. The recovery attempt on the next tick sees the same (path, mtime, size) fingerprint and stays disabled.

Evidence

Corruption pattern (integrity check)

wrong # of entries in index idx_events_run
wrong # of entries in index idx_events_task
wrong # of entries in index idx_runs_status
wrong # of entries in index idx_runs_task
wrong # of entries in index idx_tasks_status
wrong # of entries in index idx_tasks_assignee_status
row 120 missing from index idx_events_run
... (90+ missing index entries across 6 indices)
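
Messages of this kind are what SQLite's PRAGMA integrity_check reports. As a minimal sketch of how the check can be run from Python (the helper name integrity_report is hypothetical and not part of Hermes):

```python
import sqlite3


def integrity_report(db_path):
    """Return the rows produced by PRAGMA integrity_check.

    A healthy database yields the single row "ok". Otherwise each row
    describes one problem, e.g. "wrong # of entries in index ...".
    A file that is not a SQLite database at all raises
    sqlite3.DatabaseError instead of returning rows.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```

Calling integrity_report() on the board database (by default ~/.hermes/kanban.db, per the summary) and printing each row gives a listing comparable to the one above.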

Gateway log timeline

18:54:53 kanban dispatcher [default]: spawned=1 reclaimed=0 ...
18:55:53 kanban dispatcher: tick failed on board default
  -> sqlite3.OperationalError: disk I/O error (in release_stale_claims)
18:55:53+ kanban notifier tick failed: cannot rollback - no transaction is active (repeated)
18:56:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch
18:57:54 kanban dispatcher: board default database changed; retrying dispatch
18:57:54 kanban dispatcher: board default database ... is not a valid SQLite database; disabling dispatch

The retry at 18:57:54 already detects a file change but still fails. A .dump and rebuild was required to fix the indices.

Manual recovery that worked

sqlite3 kanban.db '.dump' > dump.sql
sqlite3 kanban.db.new < dump.sql
mv kanban.db.new kanban.db
sqlite3 kanban.db 'PRAGMA integrity_check;'
# integrity_check: ok

This happened twice, on 5/22 and 5/23, with identical symptoms.

Root Cause Analysis

The corruption chain:

1. Frequent SIGTERM -> the gateway doesn't always complete the WAL checkpoint before exit.
2. APFS + WAL edge case -> a disk I/O error occurs during a read query on a partially checkpointed WAL.
3. Index/table desync -> the indices become stale relative to the table data.
4. _is_corrupt_board_db_error() catches DatabaseError -> the board is disabled, correctly.
5. Recovery heuristic is fragile -> the fingerprint is (path, mtime_ns, size). A dump/rebuild does change the file, but simply restarting the gateway after the corrupt write may not change the fingerprint enough (the file size may be identical).

Suggested Fixes

1. Auto-repair on corruption detection (recommended)

When _is_corrupt_board_db_error() fires in _tick_once_for_board(), instead of permanently disabling the board, attempt a self-heal:

Try REINDEX first (fast, handles most index-only corruption)
If that fails, fall back to .dump + rebuild (handles deeper page-level corruption)
If repair succeeds, retry dispatch; if it fails, then disable the board
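
The three steps above could be sketched roughly as follows. This is not Hermes code: try_repair and _integrity_ok are hypothetical names, the sketch assumes no other process has the database open during the repair, and it uses Python's Connection.iterdump() as a stand-in for the CLI .dump command.

```python
import os
import sqlite3
from contextlib import closing


def _integrity_ok(conn):
    """True when PRAGMA integrity_check reports the single row 'ok'."""
    return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]


def try_repair(db_path):
    """Hypothetical helper: REINDEX first, then dump + rebuild.

    db_path is a str. Returns True if the database passes
    integrity_check afterwards, False if both attempts fail.
    """
    # Step 1: REINDEX rebuilds all indices from the table data.
    try:
        with closing(sqlite3.connect(db_path)) as conn:
            conn.execute("REINDEX")
            conn.commit()
            if _integrity_ok(conn):
                return True
    except sqlite3.DatabaseError:
        pass  # deeper damage; fall through to the rebuild

    # Step 2: copy everything still readable into a fresh file.
    new_path = db_path + ".new"
    if os.path.exists(new_path):
        os.remove(new_path)
    try:
        with closing(sqlite3.connect(db_path)) as src, \
                closing(sqlite3.connect(new_path)) as dst:
            dst.executescript("\n".join(src.iterdump()))
            rebuilt_ok = _integrity_ok(dst)
    except sqlite3.DatabaseError:
        rebuilt_ok = False

    if not rebuilt_ok:
        if os.path.exists(new_path):
            os.remove(new_path)
        return False

    # Drop stale WAL/SHM sidecars so they are never paired with the new file.
    for suffix in ("-wal", "-shm"):
        if os.path.exists(db_path + suffix):
            os.remove(db_path + suffix)
    os.replace(new_path, db_path)
    return True
```

The dispatcher would call something like try_repair() when the corruption check fires, retry dispatch on True, and disable the board only on False. A rebuilt file starts in SQLite's default journal mode, so WAL would need to be re-enabled after a rebuild.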

2. Periodic WAL checkpoint

Add a periodic PRAGMA wal_checkpoint(TRUNCATE) in the dispatcher tick (e.g., every N ticks or every M minutes) to keep the WAL file small and reduce the window for corruption on unclean shutdown.
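
A minimal sketch of the "every N ticks" variant, assuming the dispatcher holds an open sqlite3 connection per board and calls a hook once per tick. CheckpointScheduler and its default interval are hypothetical:

```python
import sqlite3

CHECKPOINT_EVERY_N_TICKS = 10  # hypothetical tuning knob


class CheckpointScheduler:
    """Runs PRAGMA wal_checkpoint(TRUNCATE) on every N-th tick."""

    def __init__(self, every_n_ticks=CHECKPOINT_EVERY_N_TICKS):
        self.every_n_ticks = every_n_ticks
        self._ticks = 0

    def on_tick(self, conn):
        """Call once per dispatcher tick, outside any open transaction.

        Returns None on ticks where no checkpoint is due, otherwise the
        row SQLite reports: (busy, wal_frames, checkpointed_frames).
        busy == 1 means another connection blocked the checkpoint.
        """
        self._ticks += 1
        if self._ticks % self.every_n_ticks != 0:
            return None
        row = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        return tuple(row)
```

A busy result is not an error. The checkpoint can simply be attempted again on the next due tick.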

3. Improve recovery fingerprint heuristic

The current disabled_corrupt_boards recovery only retries when (path, mtime_ns, size) changes. Consider also tracking a generation counter that increments on any gateway restart, so a restart always gets at least one retry attempt before permanently disabling.
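
One way this could look, as a sketch only: the fingerprint gains a fourth field that differs on every gateway start. BoardFingerprint, GATEWAY_GENERATION, and should_retry are hypothetical names, and the process start time stands in for a persisted counter. The sketch also assumes the recorded fingerprints survive a restart, as the report describes.

```python
import os
import time
from typing import NamedTuple


class BoardFingerprint(NamedTuple):
    path: str
    mtime_ns: int
    size: int
    generation: int  # hypothetical addition: differs on every gateway start


# One value per process lifetime; a restarted gateway gets a new one.
GATEWAY_GENERATION = time.time_ns()


def fingerprint(path, generation=GATEWAY_GENERATION):
    """Build the extended fingerprint for a board database file."""
    st = os.stat(path)
    return BoardFingerprint(os.fspath(path), st.st_mtime_ns, st.st_size, generation)


def should_retry(disabled, path, generation=GATEWAY_GENERATION):
    """disabled maps path -> fingerprint recorded when the board was disabled.

    Retry when the board was never disabled, when the file changed, or
    when the gateway has restarted since the board was disabled.
    """
    previous = disabled.get(os.fspath(path))
    return previous is None or previous != fingerprint(path, generation)
```

With this shape, an unchanged file is skipped within one gateway lifetime but gets exactly one fresh attempt after each restart.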

4. Add `PRAGMA journal_mode=TRUNCATE` fallback

For systems where WAL is problematic, allow configuring journal_mode per-board. WAL is preferred for concurrent read/write but TRUNCATE is more resilient to unclean shutdowns.
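
A per-board setting could be applied when the connection is opened. The sketch below is an assumption about how that might look: open_board_db and the set of allowed modes are hypothetical, and how the value would be read from the Hermes config is left open.

```python
import sqlite3

# Hypothetical whitelist of modes a board may be configured with.
ALLOWED_JOURNAL_MODES = {"wal", "truncate", "delete", "persist"}


def open_board_db(path, journal_mode="wal"):
    """Open a board database and switch it to the requested journal mode."""
    mode = journal_mode.lower()
    if mode not in ALLOWED_JOURNAL_MODES:
        raise ValueError(f"unsupported journal_mode: {journal_mode!r}")
    conn = sqlite3.connect(path)
    # PRAGMA arguments cannot be bound as parameters; mode is validated above.
    actual = conn.execute(f"PRAGMA journal_mode={mode}").fetchone()[0]
    if actual.lower() != mode:
        # SQLite reports the mode actually in effect, e.g. when another
        # connection prevents switching away from WAL.
        conn.close()
        raise RuntimeError(f"journal_mode is {actual!r}, wanted {mode!r}")
    return conn
```

Checking the value returned by the PRAGMA matters because SQLite does not raise an error when the switch cannot be made. It reports the mode that remains in effect.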
