hermes - [Bug]: kanban.db corruption when multiple profile gateways share the same board DB

Description

When multiple Hermes profile gateways (e.g., --profile bingge, --profile pixiel, --profile mafei) run concurrently and share the same kanban board DB (~/.hermes/kanban.db), the SQLite database becomes corrupted. The corruption specifically affects the kanban_notify_subs table indexes.

This is not a false positive from PRAGMA integrity_check — the indexes genuinely become inconsistent with the table data.

Environment

  • macOS (APFS filesystem)
  • SQLite 3.51.0
  • Hermes Agent (latest main branch)
  • 4 gateway processes: default + 3 named profiles, all sharing kanban.db by design

Root Cause Analysis

Architecture

  • kanban_home() intentionally resolves to ~/.hermes/ for all profiles (by design, per the docstring: "The kanban board is shared across profiles")
  • Each profile gateway runs its own dispatcher, which opens independent SQLite connections
  • CLI commands (hermes kanban create/complete/block/link) also open new connections

The Race

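Multiple processes concurrently execute BEGIN IMMEDIATE write transactions against the same DB. SQLite WAL mode supports concurrent readers alongside a single writer at a time for the whole database (not per connection), so BEGIN IMMEDIATE serializes the writes themselves. The suspected failure mode is concurrent WAL checkpoints issued from separate processes corrupting the main DB file.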
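To make the single-writer behaviour concrete, the following minimal sketch uses Python's standard sqlite3 module to show that a second BEGIN IMMEDIATE is refused while another connection holds the write lock. The function name, the table t, and the zero timeout are illustrative assumptions, not taken from the Hermes code base.

```python
import sqlite3


def second_writer_error(db_path):
    """Return the error a second writer gets while a first one holds the write lock.

    Hypothetical helper for illustration only. timeout=0 disables the busy
    handler so the second writer fails immediately instead of waiting.
    """
    first = sqlite3.connect(db_path, isolation_level=None, timeout=0)
    second = sqlite3.connect(db_path, isolation_level=None, timeout=0)
    try:
        first.execute("PRAGMA journal_mode=WAL")
        first.execute("CREATE TABLE IF NOT EXISTS t (x)")
        first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
        try:
            second.execute("BEGIN IMMEDIATE")  # second writer must be refused
        except sqlite3.OperationalError as exc:
            return str(exc)
        second.execute("ROLLBACK")
        return None
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()
```

In real use a non-zero timeout lets the second writer wait for the lock instead of failing, which is why write serialization alone is not expected to be the source of the corruption.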

Evidence

  1. 4 gateway processes had open file descriptors on kanban.db at time of corruption
  2. Last events before corruption show rapid concurrent activity from different profile dispatchers:
    • bingge gateway: completed Sprint 3 PRD (13:04:34)
    • pixiel gateway: spawned Sprint 3 design task (13:04:36)
    • mafei gateway: protocol_violation → gave_up → re-spawned Sprint 2 (13:05:04)
  3. Corruption was in kanban_notify_subs indexes (idx_notify_task + sqlite_autoindex_kanban_notify_subs_1)
  4. PRAGMA integrity_check returned the following (SQLite result code 11 is SQLITE_CORRUPT):
    Tree 10 page 10: btreeInitPage() returns error code 11
    wrong # of entries in index idx_notify_task
    wrong # of entries in index sqlite_autoindex_kanban_notify_subs_1

Steps to Reproduce

  1. Start 3+ profile gateways: hermes --profile X gateway run --replace
  2. Run kanban operations that trigger concurrent writes (task create + claim + notify-subscribe)
  3. Observe corruption after roughly 30 to 60 minutes of active use
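The steps above depend on real gateways. As a smaller harness, the sketch below drives the same write pattern (several processes, each using BEGIN IMMEDIATE against one WAL database) and then runs PRAGMA integrity_check. The table demo_subs, the index idx_demo_task, and all function names are hypothetical stand-ins for the kanban schema, and the script is not guaranteed to reproduce the corruption; it only exercises the suspected access pattern.

```python
import multiprocessing
import sqlite3


def open_db(path):
    # Autocommit mode so that transactions are controlled explicitly.
    conn = sqlite3.connect(path, timeout=30.0, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def init_db(path):
    conn = open_db(path)
    # Hypothetical schema: a UNIQUE constraint plus an explicit index,
    # mirroring the two index kinds named in the report.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS demo_subs ("
        "task_id TEXT, subscriber TEXT, UNIQUE(task_id, subscriber))"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_demo_task ON demo_subs(task_id)")
    conn.close()


def writer(path, worker_id, n_writes):
    """One simulated gateway: n_writes small write transactions."""
    conn = open_db(path)
    try:
        for i in range(n_writes):
            conn.execute("BEGIN IMMEDIATE")
            conn.execute(
                "INSERT INTO demo_subs VALUES (?, ?)",
                (f"task-{i}", f"worker-{worker_id}"),
            )
            conn.execute("COMMIT")
    finally:
        conn.close()


def run_stress(path, n_workers=4, n_writes=500):
    """Run the writers in separate processes and return integrity_check rows."""
    init_db(path)
    procs = [
        multiprocessing.Process(target=writer, args=(path, w, n_writes))
        for w in range(n_workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    conn = sqlite3.connect(path)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```

Call run_stress from under an `if __name__ == "__main__":` guard so that it also works with the spawn start method, which is the default on macOS. A healthy database yields the single row "ok".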

Suggested Fixes

Short-term

Add a file-level advisory lock (fcntl.flock) around all kanban write operations in kanban_db.py. The existing BEGIN IMMEDIATE handles SQLite-level serialization, but doesn't protect against concurrent WAL checkpoints from separate processes.
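A minimal sketch of that short-term fix follows, assuming a sidecar lock file next to the database. The names kanban_write_lock and locked_write and the ".lock" suffix are hypothetical and do not come from kanban_db.py. The sidecar file is a deliberate choice: SQLite on Unix uses its own advisory locks on the database file, so the sketch avoids locking that file directly.

```python
import contextlib
import fcntl  # Unix-only module
import os
import sqlite3


@contextlib.contextmanager
def kanban_write_lock(db_path):
    """Hold an exclusive advisory lock on a sidecar file (hypothetical helper)."""
    lock_path = db_path + ".lock"  # assumed naming, not from the Hermes source
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other process holds it
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the flock


def locked_write(db_path, sql, params=()):
    """Run one write transaction entirely inside the cross-process lock."""
    with kanban_write_lock(db_path):
        conn = sqlite3.connect(db_path, timeout=30.0, isolation_level=None)
        try:
            conn.execute("BEGIN IMMEDIATE")
            conn.execute(sql, params)
            conn.execute("COMMIT")
        except Exception:
            if conn.in_transaction:
                conn.execute("ROLLBACK")
            raise
        finally:
            conn.close()
```

Because the connection is opened and closed inside the lock, any automatic checkpoint triggered by this connection's commit or close also runs while the lock is held. Long-lived connections held elsewhere by a gateway would need the same treatment for the lock to cover every checkpoint.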

Medium-term

Serialize kanban writes through a single writer process/thread. Each gateway could send write requests to a central kanban writer instead of opening independent connections.
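The single-writer idea can be sketched in-process with a queue and one thread that owns the only write connection. The class SingleWriter and its methods are hypothetical. How gateways in other processes would deliver requests to it (for example over a local socket) is left open here and would need its own design.

```python
import queue
import sqlite3
import threading
from concurrent.futures import Future


class SingleWriter:
    """Funnel all writes through one thread that owns the only write connection."""

    def __init__(self, db_path):
        self._db_path = db_path
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, sql, params=()):
        """Queue one write; the returned Future resolves once it is committed."""
        fut = Future()
        self._requests.put((sql, params, fut))
        return fut

    def close(self):
        self._requests.put(None)  # sentinel: stop after draining earlier requests
        self._thread.join()

    def _run(self):
        # The connection is created and used only on the writer thread.
        conn = sqlite3.connect(self._db_path, timeout=30.0, isolation_level=None)
        try:
            while True:
                item = self._requests.get()
                if item is None:
                    break
                sql, params, fut = item
                try:
                    conn.execute("BEGIN IMMEDIATE")
                    cur = conn.execute(sql, params)
                    conn.execute("COMMIT")
                    fut.set_result(cur.rowcount)
                except sqlite3.Error as exc:
                    if conn.in_transaction:
                        conn.execute("ROLLBACK")
                    fut.set_exception(exc)
        finally:
            conn.close()
```

Callers use `writer.submit(sql, params).result()` to wait for the commit or to receive the sqlite3 error, so failures still surface at the call site while only one connection ever writes.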

Long-term

Consider PostgreSQL as an optional backend for multi-profile setups. SQLite's WAL mode allows only one writer at a time, and all processes using the database must run on the same host, which limits how far concurrent multi-process writing can scale.

Workaround

Currently working around by:

  1. Monitoring for corruption via PRAGMA integrity_check
  2. Recovering from .recover.*.sql dumps when corruption is detected
  3. Restarting all gateways after recovery

This is fragile: the recovery SQL can be stale, so recent task state is lost.
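The monitoring step can be sketched with the standard sqlite3 module. The function name check_integrity is hypothetical, and the sketch only detects corruption; it does not attempt recovery.

```python
import sqlite3


def check_integrity(db_path):
    """Return (ok, messages) from PRAGMA integrity_check.

    Hypothetical monitoring helper. A file that SQLite cannot read at all
    raises DatabaseError, which is reported as a failed check.
    """
    conn = sqlite3.connect(db_path, timeout=30.0)
    try:
        messages = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        messages = [str(exc)]
    finally:
        conn.close()
    return messages == ["ok"], messages
```

A healthy database returns the single row "ok"; any other rows, such as the "wrong # of entries in index" lines quoted in this report, mean the check failed. Note that sqlite3.connect creates an empty database if the path does not exist, so a monitor should confirm the file is present first.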
