hermes - [Bug]: kanban.db corruption when multiple profile gateways share the same board DB

hermes · 2026-05-26 05:43:48


When multiple Hermes profile gateways (e.g., --profile bingge, --profile pixiel, --profile mafei) run concurrently and share the same kanban board DB (~/.hermes/kanban.db), the SQLite database becomes corrupted. The corruption specifically affects the kanban_notify_subs table indexes.

This is not a false positive from PRAGMA integrity_check — the indexes genuinely become inconsistent with the table data.

Error Message

Tree 10 page 10: btreeInitPage() returns error code 11


Workaround

Currently working around by:

- Monitoring for corruption via PRAGMA integrity_check
- Recovering from .recover.*.sql dumps when corruption is detected
- Restarting all gateways after recovery

This is fragile — the recovery SQL can be stale, losing recent task state.
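The monitoring step in this workaround could be scripted roughly as follows. This is a minimal sketch using Python's standard `sqlite3` module; the function name `check_integrity` is hypothetical and not part of Hermes, and the path passed in is assumed to be the shared board file.

```python
import sqlite3


def check_integrity(db_path: str) -> list[str]:
    """Return the problems reported by PRAGMA integrity_check.

    An empty list means SQLite reported "ok". (Hypothetical helper,
    not taken from the Hermes code base.)
    """
    conn = sqlite3.connect(db_path, timeout=5.0)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # A badly damaged file can fail before the check produces any rows.
        return [str(exc)]
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

A watchdog could call this on a timer and trigger the recovery steps only when the returned list is non-empty. Running the check while the gateways are writing adds read load, so the polling interval is a trade-off.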

Code Example

Tree 10 page 10: btreeInitPage() returns error code 11
   wrong # of entries in index idx_notify_task
   wrong # of entries in index sqlite_autoindex_kanban_notify_subs_1



Environment

- macOS (APFS filesystem)
- SQLite 3.51.0
- Hermes Agent (latest main branch)
- 4 gateway processes: default + 3 named profiles, all sharing kanban.db by design

Root Cause Analysis

Architecture

- kanban_home() intentionally resolves to ~/.hermes/ for all profiles (by design, per the docstring: "The kanban board is shared across profiles")
- Each profile gateway runs its own dispatcher, which opens independent SQLite connections
- CLI commands (hermes kanban create/complete/block/link) also open new connections

The Race

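Multiple processes concurrently execute BEGIN IMMEDIATE write transactions against the same DB. SQLite WAL mode supports concurrent readers alongside a single writer at a time for the whole database (not one writer per connection), and BEGIN IMMEDIATE serializes those writers. The suspected failure mode is that concurrent WAL checkpoints from separate processes corrupt the main DB file. SQLite is designed to coordinate checkpoints across processes through file locks, so the exact mechanism is still unconfirmed.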
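For readers unfamiliar with the pattern, the write path described here looks roughly like the sketch below. It is an illustration built on Python's standard `sqlite3` module, not the code in kanban_db.py; `open_board` and `write_txn` are hypothetical names.

```python
import sqlite3


def open_board(db_path: str, timeout_s: float = 10.0) -> sqlite3.Connection:
    # isolation_level=None disables sqlite3's implicit transactions,
    # so BEGIN/COMMIT below are issued explicitly.
    conn = sqlite3.connect(db_path, timeout=timeout_s, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def write_txn(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> None:
    # BEGIN IMMEDIATE takes the write lock up front; other writers wait
    # up to the connection timeout and then fail with "database is locked".
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(sql, params)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With this pattern, a second connection that issues BEGIN IMMEDIATE while the first transaction is open blocks and eventually raises `sqlite3.OperationalError`, which is the serialization the issue refers to. It says nothing about when checkpoints run, which is the part the report suspects.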

Evidence

- 4 gateway processes had open file descriptors on kanban.db at time of corruption
- Last events before corruption show rapid concurrent activity from different profile dispatchers:
  - bingge gateway: completed Sprint 3 PRD (13:04:34)
  - pixiel gateway: spawned Sprint 3 design task (13:04:36)
  - mafei gateway: protocol_violation → gave_up → re-spawned Sprint 2 (13:05:04)
- Corruption was in kanban_notify_subs indexes (idx_notify_task + sqlite_autoindex_kanban_notify_subs_1)

PRAGMA integrity_check returned:

Tree 10 page 10: btreeInitPage() returns error code 11
wrong # of entries in index idx_notify_task
wrong # of entries in index sqlite_autoindex_kanban_notify_subs_1

Steps to Reproduce

- Start 3+ profile gateways: hermes --profile X gateway run --replace
- Run kanban operations that trigger concurrent writes (task create + claim + notify-subscribe)
- Observe corruption after ~30-60 minutes of active use
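Because the reproduction above takes 30-60 minutes of real use, a standalone stress harness may help narrow things down. The sketch below is an assumption-laden stand-in: it does not run Hermes, the table `demo_subs` and index `idx_demo_task` are invented to loosely resemble kanban_notify_subs, and there is no evidence it triggers the same corruption. It only hammers one SQLite file from several OS processes and then runs the integrity check.

```python
import sqlite3
import subprocess
import sys

# Source for each child process; it stands in for one gateway's dispatcher.
WORKER = """
import sqlite3, sys
path, worker_id, n = sys.argv[1], sys.argv[2], int(sys.argv[3])
conn = sqlite3.connect(path, timeout=30.0, isolation_level=None)
for i in range(n):
    conn.execute("BEGIN IMMEDIATE")
    conn.execute(
        "INSERT INTO demo_subs(task_id, worker) VALUES (?, ?)", (i, worker_id)
    )
    conn.execute("COMMIT")
conn.close()
"""


def stress(db_path: str, workers: int = 4, writes: int = 200):
    """Run concurrent writer processes, then return (row_count, integrity_report)."""
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=WAL")  # WAL mode persists in the file
    setup.execute(
        "CREATE TABLE IF NOT EXISTS demo_subs("
        "task_id INTEGER, worker TEXT, UNIQUE(task_id, worker))"
    )
    setup.execute("CREATE INDEX IF NOT EXISTS idx_demo_task ON demo_subs(task_id)")
    setup.commit()
    setup.close()

    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER, db_path, f"w{k}", str(writes)])
        for k in range(workers)
    ]
    for proc in procs:
        proc.wait()

    check = sqlite3.connect(db_path)
    rows = check.execute("SELECT COUNT(*) FROM demo_subs").fetchone()[0]
    report = [r[0] for r in check.execute("PRAGMA integrity_check")]
    check.close()
    return rows, report
```

If the harness stays clean under heavy load on the same machine and filesystem, that would suggest the trigger lies in something Hermes-specific (for example how connections are opened, forked, or closed) rather than in plain concurrent BEGIN IMMEDIATE writes.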

Suggested Fixes

Short-term

Add a file-level advisory lock (fcntl.flock) around all kanban write operations in kanban_db.py. The existing BEGIN IMMEDIATE serializes writers at the SQLite level, but if concurrent WAL checkpoints from separate processes are indeed the cause, it does not guard against them.
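A minimal sketch of such a lock, assuming a Unix host (the `fcntl` module is not available on Windows). The name `board_write_lock` and the sidecar path are hypothetical. The sketch locks a separate lock file rather than kanban.db itself, since SQLite takes its own POSIX locks on the database file and opening or closing extra descriptors on that file from the same process can interfere with them.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def board_write_lock(lock_path: str):
    """Hold an exclusive advisory lock on a sidecar file for the duration
    of the with-block. (Hypothetical helper, not Hermes code.)"""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no one else holds the lock
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Usage would wrap each write transaction, for example `with board_write_lock(path_to_lock_file): ...`, where the lock file lives next to the database. The lock is advisory: it only helps if every writer, including the CLI commands, takes it, and if checkpoints happen only while it is held.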

Medium-term

Serialize kanban writes through a single writer process/thread. Each gateway could send write requests to a central kanban writer instead of opening independent connections.
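The single-writer idea can be sketched in-process as a queue drained by one thread that owns the only write connection. The class `KanbanWriter` is hypothetical, and this covers only the thread variant; routing writes from separate gateway processes to one writer would additionally need some IPC channel (a socket, for instance), which is not shown.

```python
import queue
import sqlite3
import threading


class KanbanWriter:
    """Hypothetical single writer: every write runs on one thread, one connection."""

    def __init__(self, db_path: str):
        self._jobs: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, args=(db_path,), daemon=True)
        self._thread.start()

    def _run(self, db_path: str) -> None:
        # The connection is created and used only on this thread.
        conn = sqlite3.connect(db_path, isolation_level=None)
        while True:
            job = self._jobs.get()
            if job is None:  # shutdown sentinel
                break
            sql, params, done = job
            try:
                conn.execute("BEGIN IMMEDIATE")
                conn.execute(sql, params)
                conn.execute("COMMIT")
            except Exception as exc:
                if conn.in_transaction:
                    conn.execute("ROLLBACK")
                done["error"] = exc
            finally:
                done["event"].set()
        conn.close()

    def submit(self, sql: str, params: tuple = ()) -> None:
        """Queue one write and block until the writer thread has applied it."""
        done = {"event": threading.Event(), "error": None}
        self._jobs.put((sql, params, done))
        done["event"].wait()
        if done["error"] is not None:
            raise done["error"]

    def close(self) -> None:
        self._jobs.put(None)
        self._thread.join()
```

Callers on any thread use `writer.submit("INSERT ...", params)`; errors raised inside the writer are re-raised in the caller. The design trades write throughput for a single, predictable point where transactions and checkpoints occur.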

Long-term

Consider PostgreSQL as an optional backend for multi-profile setups. SQLite allows only one writer at a time, and its WAL mode is documented to require that all processes accessing the database run on the same host (it does not work over network filesystems).

