hermes - [Bug]: Kanban DB intermittent corruption after worker crash (missing WAL checkpoint)



Bug Description

Gateway Kanban dispatcher intermittently reports kanban.db is not a valid SQLite database and disables dispatch for the board. The DB auto-recovers after some time, but during the disabled window, ready tasks sit unprocessed.

Gateway log pattern (every few minutes):

16:31:34 kanban dispatcher: spawned=5    ← dispatch succeeds
16:33:35 kanban.db is not a valid SQLite database; disabling dispatch
16:51:37 kanban dispatcher: spawned=1    ← DB recovers, dispatch resumes  
16:53:38 kanban.db is not a valid SQLite database; disabling dispatch  ← recurs

The corruption always appears AFTER worker subprocesses complete (spawned=N → next tick: corrupted).

Root Cause

Three contributing factors in hermes_cli/kanban_db.py + gateway/run.py:

1. No explicit WAL checkpoint management. kanban_db.py has zero PRAGMA wal_checkpoint calls anywhere. In contrast, hermes_state.py (SessionDB) properly manages WAL with _try_wal_checkpoint() every 50 writes and in close(). Workers crash without proper connection close → WAL frames partially written → next connect() reads inconsistent WAL → sqlite3.DatabaseError.

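2. synchronous=NORMAL in kanban_db.py connect(). In WAL mode with NORMAL, SQLite does not fsync the WAL on every commit; it syncs only around checkpoints. The hypothesis is that a worker crashing between writing WAL frames and the next checkpoint leaves partially written frames in the WAL. Note that SQLite documents this setting as affecting durability across power loss or an OS crash rather than an application crash, so this factor may matter less than factors 1 and 3.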

3. Fingerprint only tracks .db file, not -wal (gateway/run.py _board_db_fingerprint()). If only -wal is corrupted but .db mtime/size is unchanged → fingerprint unchanged → board stays disabled until the .db file changes or the gateway restarts.
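Factor 3 can be illustrated with a small sketch of a fingerprint that also covers the -wal sidecar. This is not the code in gateway/run.py: the function name `board_db_fingerprint`, the helper `_stat_pair`, and the tuple layout are assumptions made for illustration only.

```python
import os
from typing import Optional, Tuple

# (mtime_ns, size) for a file, or None when the file does not exist.
Stat = Optional[Tuple[int, int]]


def _stat_pair(path: str) -> Stat:
    """Return (mtime_ns, size) for path, or None if it is absent."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_mtime_ns, st.st_size)


def board_db_fingerprint(db_path: str) -> Tuple[Stat, Stat]:
    """Hypothetical fingerprint covering the main DB file and its -wal file.

    A change to either file yields a different fingerprint, so a board that
    was disabled because of a bad WAL would be re-probed once the WAL changes.
    """
    return (_stat_pair(db_path), _stat_pair(db_path + "-wal"))
```

With this shape, a write that only touches kanban.db-wal still changes the second element of the tuple, which is the case the current .db-only fingerprint is reported to miss.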

Steps to Reproduce

  1. Run Hermes Gateway with Kanban dispatch enabled
  2. Create Kanban tasks that cause worker protocol violations (worker exits without kanban_complete())
  3. Gateway spawns workers → some crash
  4. Next dispatcher tick: connect() fails with "database disk image is malformed"
  5. Board disabled until .db file mtime changes or gateway restarts
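The steps above depend on a running gateway. As a starting point for isolating the SQLite side, here is a minimal harness, assuming a throwaway database and a hypothetical `tasks` table (the real Kanban schema is not shown in this report). A child process commits in WAL mode with synchronous=NORMAL and then exits abruptly without closing its connection; the parent then probes the file the way a dispatcher tick might.

```python
import os
import sqlite3
import subprocess
import sys
import textwrap

# Child script: commit one row in WAL mode, then die without closing.
# The table name and columns are placeholders, not the real Kanban schema.
_WORKER = textwrap.dedent("""
    import os, sqlite3, sys
    conn = sqlite3.connect(sys.argv[1])
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO tasks (status) VALUES ('ready')")
    conn.commit()
    os._exit(1)  # abrupt exit: no conn.close(), no checkpoint
""")


def crash_worker(db_path: str) -> int:
    """Run the crashing child against db_path and return its exit code."""
    return subprocess.run([sys.executable, "-c", _WORKER, db_path]).returncode


def probe(db_path: str) -> str:
    """Open the DB and return the PRAGMA quick_check result or the error text."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute("PRAGMA quick_check").fetchone()[0]
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        return f"error: {exc}"
```

A single crash like this may well not reproduce the failure on its own. The harness is meant to be extended with whatever the gateway adds on top (several concurrent workers, a dispatcher connection held open across ticks) until `probe()` starts returning an error.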

Expected Behavior

Worker crashes should not leave Kanban DB unreadable. WAL should be checkpointed to prevent partial-frame corruption from blocking the dispatcher.

Actual Behavior

After worker crashes, the Gateway sees the DB as corrupted, disables dispatch, and cannot recover until the .db file is externally modified or the gateway restarts.

Proposed Fix

  1. Add WAL checkpoint on connection close — in gateway/run.py before conn.close(): conn.execute("PRAGMA wal_checkpoint(PASSIVE)") (mirrors SessionDB.close() at hermes_state.py:458)

  2. Include -wal file in fingerprint — track (wal_mtime_ns, wal_size) so dispatcher auto-recovers when only WAL corrupted.

  3. Consider synchronous=FULL: in WAL mode this fsyncs the WAL on every commit, narrowing the window in which a crash can leave unsynced frames (trade-off: slower writes).
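Item 1 of the proposed fix could look roughly like the sketch below. The helper name `close_with_checkpoint` is hypothetical and the actual SessionDB.close() implementation is not reproduced here; the only SQLite feature relied on is `PRAGMA wal_checkpoint(PASSIVE)`, which returns one row of (busy, WAL frames, checkpointed frames).

```python
import sqlite3
from typing import Optional, Tuple


def close_with_checkpoint(conn: sqlite3.Connection) -> Optional[Tuple[int, int, int]]:
    """Best-effort PASSIVE checkpoint, then close the connection.

    Returns the (busy, wal_frames, checkpointed_frames) row, or None if the
    checkpoint itself failed. Hypothetical helper, not part of Hermes.
    """
    result = None
    try:
        # PASSIVE does not wait for readers or writers; it checkpoints
        # as many frames as it can and returns immediately.
        result = conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
    except sqlite3.Error:
        pass  # a failed checkpoint must never prevent the close
    finally:
        conn.close()
    return result
```

PASSIVE is the least intrusive mode, which suits a close path, but it can leave frames behind when other connections are reading. A checkpoint on close also only helps processes that reach close(); a worker that crashes outright never runs it, so the dispatcher side would likely need the same call.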

Environment

  • Hermes Agent v0.14.0
  • macOS 15.7.4, Python 3.11.11, SQLite 3.47.1
