hermes: Kanban dispatcher writes a new .corrupt.bak on every tick after corruption (7,861 backups / 1.7 GB over 37h)


Bug: KanbanDbCorruptError backup writes a new .bak on every dispatcher tick → runaway disk usage

Hermes version: v2026.5.16-881-g186bf25cb (HEAD as of 2026-05-24)
Profile: non-default profile (sophia)
Symptom: 7,861 .corrupt.*.bak files (1.7 GB) accumulated in a single per-profile kanban board directory over ~37 hours, with no built-in pruning.

What happened

A board's kanban.db became corrupt at a specific point in time:

sqlite3.OperationalError: disk I/O error

(later re-classified by the newer guard as database disk image is malformed)

After the corruption, gateway/run.py:_tick_once_for_board() calls _kb.connect(board=slug) every minute, which calls _guard_existing_db_is_healthy() (hermes_cli/kanban_db.py:1132). The guard correctly refuses to open the corrupt DB and raises KanbanDbCorruptError, but it also writes a NEW .corrupt.*.bak of the same corrupt file on every tick.

Over ~37 hours and across periods of high-retry activity (with .1.bak, .2.bak, .3.bak suffixes from sub-second retries), this accumulated 7,861 backup files at ~224 KB each = ~1.7 GB, all bit-identical (or near-identical) copies of the same corrupt source.

The directory just before cleanup:

$ ls /home/.../kanban/boards/<slug>/ | wc -l
7861   # all kanban.db.corrupt.<timestamp>.bak files

$ du -sh /home/.../kanban/boards/<slug>/
1.7G

Filename pattern with sub-second collision suffixes:

kanban.db.corrupt.20260525_190559.1.bak
kanban.db.corrupt.20260525_190559.2.bak
kanban.db.corrupt.20260525_190559.3.bak
kanban.db.corrupt.20260525_190604.bak
kanban.db.corrupt.20260525_190609.bak
... (~3.5 backups/min average, peaking higher)

Each backup is essentially a clone of the same bytes — the source corrupt DB doesn't change between dispatcher ticks (no successful writes possible), so the backups are duplicative.
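
To check the claim that the backups are duplicates before deleting anything, the files can be grouped by content hash. The following is a minimal sketch, not part of Hermes: the helper names (sha256_of, group_backups_by_content, redundant_bytes) are hypothetical, and the glob pattern is assumed from the filenames listed above.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_backups_by_content(board_dir: Path) -> dict[str, list[Path]]:
    """Map each distinct content hash to the backup files that share it."""
    groups: dict[str, list[Path]] = defaultdict(list)
    # Pattern assumed from the filenames shown in this report.
    for bak in sorted(board_dir.glob("kanban.db.corrupt.*.bak")):
        groups[sha256_of(bak)].append(bak)
    return dict(groups)


def redundant_bytes(groups: dict[str, list[Path]]) -> int:
    """Bytes that would be freed by keeping only one file per distinct hash."""
    return sum(p.stat().st_size for paths in groups.values() for p in paths[1:])
```

If group_backups_by_content returns a single key for the whole directory, every backup is byte-identical and all but one copy is redundant.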

Expected behavior

When the guard detects a corrupt DB:

  • Back it up once (first detection) as kanban.db.corrupt.<timestamp>.bak
  • On subsequent ticks, skip the backup write if the corrupt source file's mtime/size/sha256 matches the last .bak
  • Optionally: cap total .bak files in the dir to N (e.g., 5) with FIFO eviction

This bounds disk impact and stops the runaway.

Suggested fix sketch (in _guard_existing_db_is_healthy)

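import glob
import os
import shutil
from datetime import datetime

MAX_CORRUPT_BAKS = 5  # example cap from "Expected behavior"

def _guard_existing_db_is_healthy(path):
    # ... existing corruption check (assumed to set `corrupt` and `reason`) ...
    if corrupt:
        # Find existing backups; timestamped names sort oldest-first
        existing_baks = sorted(glob.glob(f"{glob.escape(str(path))}.corrupt.*.bak"))
        latest_bak = existing_baks[-1] if existing_baks else None

        # Skip the copy if the corrupt source hasn't changed since the last backup
        if latest_bak and _file_signatures_match(path, latest_bak):
            raise KanbanDbCorruptError(path, latest_bak, reason)

        # Otherwise take a fresh backup (copy2 also preserves the source mtime)
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        new_bak = f"{path}.corrupt.{stamp}.bak"
        shutil.copy2(path, new_bak)

        # Optional: FIFO eviction so at most MAX_CORRUPT_BAKS backups remain
        for old_bak in (existing_baks + [new_bak])[:-MAX_CORRUPT_BAKS]:
            os.remove(old_bak)

        raise KanbanDbCorruptError(path, new_bak, reason)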

Where _file_signatures_match could compare (st_size, st_mtime_ns) for cheapness, or sha256 for correctness.

Workaround used

Used Hermes' own kanban boards delete <slug> (default = archive) to remove the corrupt board, followed by kanban boards create <slug> to recreate fresh. Worked cleanly. The destructive cleanup of the 7,861 stale .bak files happened as a side effect of the boards delete action since the entire board dir is removed.
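
Where deleting the whole board is not acceptable, the stale backups could instead be pruned in place. The sketch below is a hypothetical standalone helper, not a Hermes command. It assumes the filename pattern shown earlier and sorts by name rather than mtime, since copies made with shutil.copy2 would all share the source's mtime. Name order is only approximately chronological for the sub-second .1/.2/.3 suffixes.

```python
from pathlib import Path


def prune_corrupt_backups(board_dir: Path, keep: int = 5, dry_run: bool = True) -> list[Path]:
    """Select all but the newest `keep` corrupt-DB backups; delete them unless dry_run."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Timestamped names sort (approximately) oldest-first.
    backups = sorted(board_dir.glob("kanban.db.corrupt.*.bak"), key=lambda p: p.name)
    doomed = backups[:-keep]
    if not dry_run:
        for bak in doomed:
            bak.unlink()
    return doomed
```

The default dry_run=True returns the list of files that would be removed without touching them, so the selection can be reviewed first.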

Root cause of the original corruption

Unknown. We never identified the trigger. The first OperationalError: disk I/O error fired at 2026-05-25 14:59:13 UTC and journalctl for the surrounding 10-minute window showed no kernel events, no disk pressure, no OOM, no apt activity, no fail2ban — nothing systemic. Disk had 36 GB free throughout. Both kanban DBs passed PRAGMA integrity_check at the time on a different (non-active) path; the corruption was confined to one nested per-board DB.
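
For reference, a PRAGMA integrity_check like the one mentioned above can be run from Python's standard sqlite3 module. This is a minimal sketch and not the Hermes guard itself: db_is_healthy is a hypothetical name, and the database is opened read-only so the check cannot create or modify the file.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def db_is_healthy(path: str) -> tuple[bool, str]:
    """Run PRAGMA integrity_check read-only and return (ok, detail)."""
    # as_uri() percent-encodes the path; mode=ro prevents creating or writing the file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        with closing(sqlite3.connect(uri, uri=True)) as conn:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.Error as exc:
        # Covers errors such as "disk I/O error" or "database disk image is malformed".
        return False, str(exc)
    detail = "; ".join(str(row[0]) for row in rows)
    return detail == "ok", detail
```

A healthy database yields the single row "ok"; anything else, including an exception while opening or reading, is reported as unhealthy along with the message.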

The corruption may correlate with concurrent Honcho/compression refactoring that occurred ~2-3 hours earlier (visible from backup filenames in the profile's migration scripts dir), but this is circumstantial.

This bug report focuses only on the unbounded-backup behavior, not the underlying corruption cause.

Why this matters

A bot with an active kanban board can silently consume gigabytes of disk over days without operator intervention, with no notification mechanism beyond raw journal log spam. The runaway behavior turned a small 104 KB corruption into a 1.7 GB problem, and would have continued indefinitely without manual cleanup.

Reproducibility

Difficult to repro the corruption itself, but the runaway-backup behavior is trivial to repro:

  1. Run any Hermes profile with kanban enabled.
  2. Corrupt the per-board kanban.db (e.g., truncate to 64 bytes, overwrite a random middle page).
  3. Restart the gateway.
  4. Wait. Watch ls | wc -l in the board dir grow.
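
Step 2 can be scripted. The helpers below are a hypothetical sketch for use on a disposable test board only. They implement the two corruption methods mentioned (truncating to 64 bytes, overwriting a middle page) and read the page size from bytes 16-17 of the SQLite file header.

```python
import os
import random


def truncate_db(path: str, size: int = 64) -> None:
    """Cut the file down to `size` bytes, leaving only part of the 100-byte SQLite header."""
    with open(path, "r+b") as fh:
        fh.truncate(size)


def overwrite_middle_page(path: str, seed: int = 0) -> int:
    """Fill one page near the middle of the file with random bytes; return its 1-based number."""
    with open(path, "r+b") as fh:
        header = fh.read(100)
        # Bytes 16-17 of the SQLite header hold the page size (big-endian); 1 means 65536.
        page_size = int.from_bytes(header[16:18], "big")
        if page_size == 1:
            page_size = 65536
        if page_size < 512:
            raise ValueError("file does not have a valid SQLite header")
        page_count = os.path.getsize(path) // page_size
        if page_count < 3:
            raise ValueError("database too small to have a middle page")
        page_no = page_count // 2 + 1
        fh.seek((page_no - 1) * page_size)
        fh.write(random.Random(seed).randbytes(page_size))
    return page_no
```

Stop the gateway before corrupting the file and work on a copy of the board directory, since neither helper can be undone.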

Environment

  • Hermes Agent: v2026.5.16-881-g186bf25cb
  • OS: Ubuntu (Hetzner VPS)
  • Python: 3.x via ~/.hermes/hermes-agent/venv
  • Profile setup: non-default profile (sophia) at ~/.hermes/profiles/sophia/, alongside default profile at ~/.hermes/
  • Service: per-profile systemd unit (hermes-gateway-sophia.service)
