When the guard detects a corrupt DB:

- Back it up **once** (first detection) as `kanban.db.corrupt.<timestamp>.bak`
- On subsequent ticks, **skip the backup write** if the corrupt source file's `mtime`/`size`/`sha256` matches the last `.bak`
- Optionally: cap total `.bak` files in the dir to N (e.g., 5) with FIFO eviction

This bounds disk impact and stops the runaway.

Root Cause

A bot with an active kanban board can silently consume gigabytes of disk over days without operator intervention, with no notification mechanism beyond raw journal log spam. The runaway behavior turned a small 104 KB corruption into a 1.7 GB problem, and would have continued indefinitely without manual cleanup.

Bug: `KanbanDbCorruptError` backup writes a new `.bak` on every dispatcher tick → runaway disk usage

Hermes version: v2026.5.16-881-g186bf25cb (HEAD as of 2026-05-24)
Profile: non-default profile (sophia)
Symptom: 7,861 `.corrupt.*.bak` files (1.7 GB) accumulated in a single per-profile kanban board directory over ~37 hours, with no built-in pruning.

What happened

A board's kanban.db became corrupt at a specific point in time:

sqlite3.OperationalError: disk I/O error

(The newer guard later reclassified this as `database disk image is malformed`.)

After the corruption, `gateway/run.py:_tick_once_for_board()` calls `_kb.connect(board=slug)` every minute, which in turn calls `_guard_existing_db_is_healthy()` (`hermes_cli/kanban_db.py:1132`). The guard correctly refuses to open the corrupt DB and raises `KanbanDbCorruptError`, but it also writes a new `.corrupt.*.bak` copy of the same corrupt file on every tick.

Over ~37 hours, including periods of high-retry activity (which produced the `.1.bak`, `.2.bak`, `.3.bak` suffixes from sub-second retries), this accumulated 7,861 backup files at ~224 KB each, or ~1.7 GB in total, all bit-identical (or near-identical) copies of the same corrupt source.

The directory just before cleanup:

$ ls /home/.../kanban/boards/<slug>/ | wc -l
7861   # all kanban.db.corrupt.<timestamp>.bak files

$ du -sh /home/.../kanban/boards/<slug>/
1.7G

Filename pattern with sub-second collision suffixes:

kanban.db.corrupt.20260525_190559.1.bak
kanban.db.corrupt.20260525_190559.2.bak
kanban.db.corrupt.20260525_190559.3.bak
kanban.db.corrupt.20260525_190604.bak
kanban.db.corrupt.20260525_190609.bak
... (~3.5 backups/min average, peaking higher)

Each backup is essentially a clone of the same bytes: the corrupt source DB cannot change between dispatcher ticks (no successful writes are possible), so the backups are redundant.
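One way to check the claim that the backups are duplicates is to group them by content hash. The following is a minimal sketch, not part of Hermes: the function names `sha256_of` and `group_backups_by_content` are hypothetical, and the only assumption taken from this report is the `kanban.db.corrupt.*.bak` filename pattern.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_backups_by_content(board_dir: Path) -> dict[str, list[Path]]:
    """Map each distinct content hash to the backup files that share it."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for bak in sorted(board_dir.glob("kanban.db.corrupt.*.bak")):
        groups[sha256_of(bak)].append(bak)
    return dict(groups)
```

If every backup really is bit-identical, `len(group_backups_by_content(board_dir))` would be 1; a handful of groups would point to the "near-identical" case.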

Expected behavior

When the guard detects a corrupt DB:

- Back it up once (first detection) as `kanban.db.corrupt.<timestamp>.bak`
- On subsequent ticks, skip the backup write if the corrupt source file's `mtime`/`size`/`sha256` matches the last `.bak`
- Optionally: cap total `.bak` files in the dir to N (e.g., 5) with FIFO eviction

This bounds disk impact and stops the runaway.
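The optional cap could be sketched as a standalone pruning helper. This is an illustration only: `prune_corrupt_backups` and `MAX_CORRUPT_BAKS` are hypothetical names, and the sketch assumes the timestamped filenames sort in roughly chronological order when sorted by name (collision suffixes such as `.1.bak` may not sort strictly by creation time).

```python
from pathlib import Path

MAX_CORRUPT_BAKS = 5  # hypothetical cap, matching the "e.g., 5" above


def prune_corrupt_backups(db_path: Path, keep: int = MAX_CORRUPT_BAKS) -> list[Path]:
    """Delete the oldest .corrupt.*.bak files next to db_path, keeping the newest `keep`.

    Returns the list of files that were removed.
    """
    # Name order is used as a proxy for age (timestamps are embedded in the names).
    baks = sorted(db_path.parent.glob(f"{db_path.name}.corrupt.*.bak"))
    # baks[:-0] would be empty, so keep == 0 (remove everything) is handled explicitly.
    excess = baks[:-keep] if keep > 0 else baks
    for old_bak in excess:
        old_bak.unlink()
    return excess
```

Unlike a one-file-per-tick eviction, a helper like this would also shrink a directory that has already run away, since it removes every file beyond the cap in one pass.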

Suggested fix sketch (in `_guard_existing_db_is_healthy`)

import os
import shutil
from glob import glob

MAX_CORRUPT_BAKS = 5  # assumed cap; tune as needed

def _guard_existing_db_is_healthy(path):
    # ... existing corruption check sets `corrupt` and `reason` ...
    if corrupt:
        # Find the most recent existing .bak (timestamped names sort roughly chronologically)
        existing_baks = sorted(glob(f"{path}.corrupt.*.bak"))
        latest_bak = existing_baks[-1] if existing_baks else None

        # Skip the write if the corrupt source hasn't changed since the last backup
        if latest_bak and _file_signatures_match(path, latest_bak):
            raise KanbanDbCorruptError(path, latest_bak, reason)

        # Otherwise back up fresh
        new_bak = f"{path}.corrupt.{timestamp()}.bak"
        shutil.copy2(path, new_bak)

        # Optional: FIFO eviction, oldest first, until at most MAX_CORRUPT_BAKS remain
        excess = len(existing_baks) + 1 - MAX_CORRUPT_BAKS
        for old_bak in existing_baks[:max(excess, 0)]:
            os.remove(old_bak)

        raise KanbanDbCorruptError(path, new_bak, reason)

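Here `_file_signatures_match` could compare `(st_size, st_mtime_ns)` for cheapness, or `sha256` for correctness. The cheap comparison is viable in this sketch because `shutil.copy2` attempts to preserve the source file's modification time on the copy.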

Workaround used

Used Hermes' own `kanban boards delete <slug>` (default = archive) to remove the corrupt board, followed by `kanban boards create <slug>` to recreate it fresh. This worked cleanly. The 7,861 stale `.bak` files were cleaned up as a side effect of the `boards delete` action, since the entire board dir is removed.

Root cause of the original corruption

Unknown. We never identified the trigger. The first OperationalError: disk I/O error fired at 2026-05-25 14:59:13 UTC and journalctl for the surrounding 10-minute window showed no kernel events, no disk pressure, no OOM, no apt activity, no fail2ban — nothing systemic. Disk had 36 GB free throughout. Both kanban DBs passed PRAGMA integrity_check at the time on a different (non-active) path; the corruption was confined to one nested per-board DB.

The corruption may correlate with concurrent Honcho/compression refactoring that occurred ~2-3 hours earlier (visible from backup filenames in the profile's migration scripts dir), but this is circumstantial.

This bug report focuses only on the unbounded-backup behavior, not the underlying corruption cause.

Why this matters

Reproducibility

The corruption itself is difficult to reproduce, but the runaway-backup behavior is trivial to reproduce:

1. Run any Hermes profile with kanban enabled.
2. Corrupt the per-board `kanban.db` (e.g., truncate it to 64 bytes, or overwrite a random middle page).
3. Restart the gateway.
4. Wait, and watch `ls | wc -l` in the board dir grow.
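The corruption step can be tried safely against a throwaway SQLite file before touching a real board. The sketch below does not involve Hermes at all: the function names and the `tasks` table are hypothetical, and `db_is_healthy` is only a stand-in for whatever check the real guard performs. It should only ever be pointed at a scratch file.

```python
import sqlite3
from pathlib import Path


def make_healthy_db(path: Path) -> None:
    """Create a small, valid SQLite database to experiment on."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
        conn.execute("INSERT INTO tasks (title) VALUES ('example')")
        conn.commit()
    finally:
        conn.close()


def truncate_db(path: Path, size: int = 64) -> None:
    """Corrupt the file by cutting it down to `size` bytes."""
    with path.open("r+b") as fh:
        fh.truncate(size)


def db_is_healthy(path: Path) -> bool:
    """Return True only if SQLite opens the file and integrity_check reports 'ok'."""
    try:
        conn = sqlite3.connect(path)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised for unreadable or malformed database files.
        return False
    return rows == [("ok",)]
```

Calling `make_healthy_db`, then `truncate_db`, then `db_is_healthy` on a scratch path gives a quick way to confirm that a truncated file is detected as unhealthy, which is the state the gateway would then encounter on each tick.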

Environment

Hermes Agent: v2026.5.16-881-g186bf25cb
OS: Ubuntu (Hetzner VPS)
Python: 3.x via ~/.hermes/hermes-agent/venv
Profile setup: non-default profile (sophia) at ~/.hermes/profiles/sophia/, alongside default profile at ~/.hermes/
Service: per-profile systemd unit (hermes-gateway-sophia.service)

