hermes - 💡(How to fix) Fix [Bug]: Kanban DB intermittent corruption after worker crash — missing WAL checkpoint

hermes 2026-05-26 10:19:52


Root Cause

Three contributing factors in hermes_cli/kanban_db.py + gateway/run.py:

1. No explicit WAL checkpoint management. kanban_db.py contains no PRAGMA wal_checkpoint calls. By contrast, hermes_state.py (SessionDB) manages the WAL with _try_wal_checkpoint() every 50 writes and in close(). The suspected chain: a worker crashes without closing its connection, WAL frames are left partially written, the next connect() reads an inconsistent WAL, and sqlite3.DatabaseError is raised.

2. synchronous=NORMAL in kanban_db.py connect(). In WAL mode with NORMAL, SQLite does not fsync the WAL on every commit; it syncs around checkpoints instead. If a worker process crashes between writing WAL frames and the next checkpoint, the WAL may contain partially written frames.

3. The fingerprint tracks only the .db file, not the -wal file (gateway/run.py, _board_db_fingerprint()). If only the -wal file is corrupted while the .db mtime and size are unchanged, the fingerprint does not change, and the board stays disabled until the gateway restarts.
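The first two factors can be illustrated with a minimal sketch of a writer that checkpoints every N committed writes and again on close, in the spirit of what the report says SessionDB does. The names CheckpointingWriter, try_wal_checkpoint, checkpoint_every and the synchronous parameter are hypothetical and are not taken from the Hermes code base; only the PRAGMA statements are standard SQLite.

```python
import sqlite3

_CHECKPOINT_MODES = {"PASSIVE", "FULL", "RESTART", "TRUNCATE"}
_SYNC_LEVELS = {"OFF", "NORMAL", "FULL", "EXTRA"}


def try_wal_checkpoint(conn, mode="PASSIVE"):
    """Best-effort checkpoint.

    Returns SQLite's (busy, wal_frames, checkpointed_frames) row,
    or None if the checkpoint itself failed.
    """
    if mode not in _CHECKPOINT_MODES:
        raise ValueError(f"unsupported checkpoint mode: {mode}")
    try:
        return conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()
    except sqlite3.Error:
        return None  # a failed checkpoint should not break the caller


class CheckpointingWriter:
    """Hypothetical wrapper: checkpoint every N writes and on close."""

    def __init__(self, path, checkpoint_every=50, synchronous="NORMAL"):
        if synchronous not in _SYNC_LEVELS:
            raise ValueError(f"unsupported synchronous level: {synchronous}")
        self.conn = sqlite3.connect(path)
        self.conn.execute("PRAGMA journal_mode=WAL")
        # synchronous is a per-connection setting; FULL syncs the WAL on
        # every commit, NORMAL only around checkpoints.
        self.conn.execute(f"PRAGMA synchronous={synchronous}")
        self.checkpoint_every = checkpoint_every
        self._writes = 0

    def write(self, sql, params=()):
        # The connection context manager commits on success and rolls
        # back on an exception, so no transaction is open afterwards.
        with self.conn:
            self.conn.execute(sql, params)
        self._writes += 1
        if self._writes % self.checkpoint_every == 0:
            try_wal_checkpoint(self.conn)

    def close(self):
        # Move WAL content into the main database file before closing.
        try_wal_checkpoint(self.conn)
        self.conn.close()
```

The checkpoint runs only after a commit because a checkpoint issued while the same connection holds an open write transaction fails. PASSIVE is used so that a checkpoint never blocks on concurrent readers or writers; it simply does as much work as it can.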

Bug Description

The Gateway Kanban dispatcher intermittently reports that kanban.db is not a valid SQLite database and disables dispatch for the board. The DB recovers on its own after some time, but while dispatch is disabled, ready tasks sit unprocessed.

Gateway log pattern (every few minutes):

16:31:34 kanban dispatcher: spawned=5    ← dispatch succeeds
16:33:35 kanban.db is not a valid SQLite database; disabling dispatch
16:51:37 kanban dispatcher: spawned=1    ← DB recovers, dispatch resumes  
16:53:38 kanban.db is not a valid SQLite database; disabling dispatch  ← recurs


The corruption always appears AFTER worker subprocesses complete (spawned=N → next tick: corrupted).


Steps to Reproduce

1. Run Hermes Gateway with Kanban dispatch enabled.
2. Create Kanban tasks that cause worker protocol violations (a worker exits without calling kanban_complete()).
3. Gateway spawns workers, and some of them crash.
4. On the next dispatcher tick, connect() fails with "database disk image is malformed".
5. The board stays disabled until the .db file mtime changes or the gateway restarts.
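When reproducing the failure, it can help to check the database state independently of the gateway. The following is a minimal diagnostic sketch, not part of Hermes: probe_db is a hypothetical helper that opens the file with Python's sqlite3 module, reads the journal mode, and runs SQLite's PRAGMA quick_check. Note that it opens an ordinary read-write connection, so it is best run against a copy of the board files.

```python
import sqlite3


def probe_db(path):
    """Report whether SQLite can read the database at `path`.

    Returns a dict with:
      ok           - True only if quick_check returned the single row "ok"
      journal_mode - e.g. "wal" or "delete", or None if it could not be read
      detail       - quick_check rows, or the error message on failure
    """
    conn = sqlite3.connect(path, timeout=5)
    try:
        journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        rows = [row[0] for row in conn.execute("PRAGMA quick_check")]
        return {
            "ok": rows == ["ok"],
            "journal_mode": journal_mode,
            "detail": rows,
        }
    except sqlite3.DatabaseError as exc:
        # Raised for errors such as "file is not a database" or
        # "database disk image is malformed".
        return {"ok": False, "journal_mode": None, "detail": [str(exc)]}
    finally:
        conn.close()
```

Running the probe right after a failing dispatcher tick, and again once dispatch resumes, would show whether the file is unreadable for the whole disabled window or only briefly.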

Expected Behavior

Worker crashes should not leave Kanban DB unreadable. WAL should be checkpointed to prevent partial-frame corruption from blocking the dispatcher.

Actual Behavior

After worker crashes, the Gateway sees the DB as corrupted and disables dispatch. It cannot recover until the .db file is modified externally or the gateway restarts.

Proposed Fix

1. Add a WAL checkpoint on connection close: in gateway/run.py, before conn.close(), run conn.execute("PRAGMA wal_checkpoint(PASSIVE)"). This mirrors SessionDB.close() at hermes_state.py:458.
2. Include the -wal file in the fingerprint: track (wal_mtime_ns, wal_size) so the dispatcher auto-recovers when only the WAL is corrupted.
3. Consider synchronous=FULL: SQLite then syncs the WAL on every commit, which is intended to keep a crash around a checkpoint from corrupting the main DB (trade-off: slightly slower writes).
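The fingerprint change could look roughly like the sketch below. The real _board_db_fingerprint() in gateway/run.py is not shown in this report, so the function name, signature, and return shape here are assumptions; the sketch only demonstrates tracking (mtime_ns, size) for both the .db file and its -wal sibling.

```python
import os


def _stat_pair(path):
    """Return (mtime_ns, size) for `path`, or None if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_mtime_ns, st.st_size)


def board_db_fingerprint(db_path):
    """Hypothetical fingerprint covering the main DB file and its WAL.

    Returns ((db_mtime_ns, db_size), (wal_mtime_ns, wal_size)); either
    element is None when the corresponding file is absent.
    """
    db_path = os.fspath(db_path)
    return (_stat_pair(db_path), _stat_pair(db_path + "-wal"))
```

With this shape, a board that was disabled because of a bad WAL is re-examined as soon as the -wal file is rewritten, truncated, or removed, even if the .db file itself has not changed.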

Environment

Hermes Agent v0.14.0
macOS 15.7.4, Python 3.11.11, SQLite 3.47.1
