hermes: kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100



Summary

~/.hermes/kanban.db returns database disk image is malformed on the affected profile under a workload of multiple concurrent long-running workers when the reclaim path force-kills a worker mid-transaction. PRAGMA settings from #30973 (synchronous=FULL + wal_autocheckpoint=100) protect against clean-shutdown durability races but do not protect against SIGKILL during a WAL frame write.

Related prior incidents:

  • #30896: initial concurrent-writer corruption report
  • #30973: raised synchronous from NORMAL to FULL and tightened wal_autocheckpoint; resolved the #30896 case but did not generalise to the kill-mid-write path described here

Symptom

After ~45 minutes of a workload that fires a new ready task every ~5 minutes (with some skill runs exceeding max_runtime_seconds), kanban operations on the affected profile start failing:

$ hermes kanban heartbeat <task_id>
Error: database disk image is malformed

$ hermes kanban complete <task_id>
Error: database disk image is malformed

$ sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"
*** in database main ***
Page <N>: btreeInitPage() returns error code 11
... (multiple damaged pages)

The actual skill work completes successfully (skills write back to their downstream system of record via API); only the kanban metadata layer is poisoned.

Repro recipe

Workload that surfaces this reliably:

  1. A profile with a poller-style dispatcher that enqueues one new ready task every N minutes (in our case, every 5).
  2. Skills with long runtime (we saw it at 1800s, 2490s, 41min — exceeding the profile's max_runtime_seconds).
  3. Heartbeat interval ~30s during the skill execution.
  4. No max_concurrent_workers ceiling (every ready task spawns its own worker subprocess; up to ~5 workers ran concurrently in our incident).
  5. After some number M of concurrent SIGKILLs (kill -9) sent to workers by the reclaim path, kanban.db becomes corrupt.

Re-creating in isolation should be possible by running the existing kanban concurrent-writer stress test with a SIGKILL injection partway through a write transaction.
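
One way to build the isolated reproduction suggested above is a small harness along these lines. It is a sketch only: the `heartbeats` table, the inline writer script, and `run_kill_injection` are hypothetical stand-ins rather than the existing kanban stress test, and whether it triggers the corruption on a given machine is untested.

```python
import sqlite3
import subprocess
import sys
import time

# Hypothetical writer: commits batches of rows in a tight loop so that a
# SIGKILL is likely to land inside an open write transaction.
WRITER_SRC = """
import sqlite3, sys, time
conn = sqlite3.connect(sys.argv[1], timeout=30, isolation_level=None)
conn.execute("PRAGMA synchronous=FULL")
conn.execute("PRAGMA wal_autocheckpoint=100")
while True:
    conn.execute("BEGIN IMMEDIATE")
    for _ in range(200):
        conn.execute("INSERT INTO heartbeats(worker, ts) VALUES (?, ?)",
                     (sys.argv[2], time.time()))
    conn.execute("COMMIT")
"""


def run_kill_injection(db_path, workers=3, rounds=5, run_for=0.5):
    """Spawn writers, SIGKILL them mid-run, and return integrity_check rows."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")  # persistent, so writers inherit it
    conn.execute("CREATE TABLE IF NOT EXISTS heartbeats (worker TEXT, ts REAL)")
    conn.commit()
    conn.close()

    for _ in range(rounds):
        procs = [
            subprocess.Popen(
                [sys.executable, "-c", WRITER_SRC, db_path, f"w{i}"],
                stderr=subprocess.DEVNULL,
            )
            for i in range(workers)
        ]
        time.sleep(run_for)
        for proc in procs:
            proc.kill()  # SIGKILL on POSIX: no chance to close the connection
        for proc in procs:
            proc.wait()

    try:
        conn = sqlite3.connect(db_path)
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
        conn.close()
        return rows
    except sqlite3.DatabaseError as exc:
        # A badly damaged file can fail before integrity_check returns rows.
        return [str(exc)]
```

Calling `run_kill_injection("/tmp/kanban-repro.db", workers=5, rounds=20)` against a scratch file and inspecting the returned rows would show whether the kill path alone is enough, or whether the real worker code paths are needed to hit the failure.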

Root-cause hypothesis

Workers don't install a SIGTERM handler that closes the SQLite connection. Reclaim path:

# hermes_cli/kanban_db.py:_terminate_reclaimed_worker
kill(pid, signal.SIGTERM)
for _ in range(10):
    if not _pid_alive(pid):
        return
    time.sleep(0.5)
# 5s elapsed
kill(pid, signal.SIGKILL)

5 seconds is tight when the child is mid-LLM-call. The child receives SIGTERM but doesn't drain its DB connection before SIGKILL lands. If the SIGKILL hits:

  • between BEGIN-allocated WAL frame and matching COMMIT, or
  • during wal_autocheckpoint=100 rollover (frequent because tight), or
  • while another worker is in a read transaction holding a shared lock

… the WAL header can desync from the main DB pages. Once desynced, every subsequent open returns "disk image is malformed".

synchronous=FULL fsyncs committed WAL frames. It cannot rescue an in-flight transaction that gets killed before commit. WAL is normally rollback-safe for crashes, but the combination of:

  • concurrent writers holding write locks
  • one writer killed mid-transaction
  • another writer trying to checkpoint at wal_autocheckpoint=100

…seems to produce a state SQLite's recovery can't reconcile.

Suggested fix directions

In rough order of effort / impact:

  1. Worker-side SIGTERM handler that flushes the kanban DB connection:

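    import signal
    import sys

    # _kanban_conn is the worker's module-level sqlite3 connection (None until opened).
    def _on_sigterm(signum, frame):
        try:
            if _kanban_conn is not None:
                # An open transaction would block the checkpoint, so roll it back first.
                _kanban_conn.rollback()
                _kanban_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
                _kanban_conn.close()
        finally:
            sys.exit(143)  # 128 + SIGTERM (15)

    signal.signal(signal.SIGTERM, _on_sigterm)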

    Install this in the worker entry point that opens the kanban connection.

  2. Configurable reclaim grace window: HERMES_KANBAN_RECLAIM_GRACE_SECONDS, default 30s. 5s is too aggressive for children that are mid-LLM-call.

  3. BEGIN IMMEDIATE for kanban writes — moves the lock acquisition to the start of the transaction rather than the first write, so contention is serialised at acquire time. Reduces the window in which a kill leaves the WAL inconsistent.

  4. max_concurrent_workers in dispatcher config — bound the number of concurrent worker subprocess spawns per profile. Currently _default_spawn is fire-and-forget per ready task; a flood of ready tasks → unbounded subprocess concurrency. A bounded pool (semaphore) keeps the write-contention surface area small.

  5. Optional, longer-term: route all kanban writes through a single coordinator process (dispatcher) and have workers send write intents via a pipe/socket. Pushes the SQLite single-writer principle to its logical conclusion. Higher complexity, but eliminates the multi-writer correctness burden entirely.
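
Direction (2) could look roughly like the sketch below. Only the environment variable name and the 30s default come from the list above; `reclaim_grace_seconds`, `pid_alive`, and `terminate_with_grace` are hypothetical names, and `pid_alive` is an assumed stand-in for the existing `_pid_alive` helper.

```python
import os
import signal
import time

DEFAULT_GRACE_SECONDS = 30.0


def reclaim_grace_seconds(environ=os.environ):
    """Read the grace window from the environment, falling back to 30s."""
    raw = environ.get("HERMES_KANBAN_RECLAIM_GRACE_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_GRACE_SECONDS
    return value if value > 0 else DEFAULT_GRACE_SECONDS


def pid_alive(pid):
    """Probe a pid with signal 0. An unreaped zombie still counts as alive."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def terminate_with_grace(pid, grace=None, poll=0.5):
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL.

    Returns True if SIGKILL had to be sent, False if the worker exited in time.
    """
    grace = reclaim_grace_seconds() if grace is None else grace
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return False
        time.sleep(poll)
    os.kill(pid, signal.SIGKILL)
    return True
```

A longer grace window only helps if the worker actually reacts to SIGTERM, so this pairs naturally with the handler in direction (1).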

Happy to send a PR for (1) and (2) if those directions are agreeable; they're the smallest deltas with the highest blast-radius reduction. (3) and (4) deserve their own design discussions.
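
As input to the design discussion for (3), here is a minimal sketch of a `BEGIN IMMEDIATE` wrapper. `immediate_txn` is a hypothetical helper, not existing hermes code, and it assumes the connection is opened with `isolation_level=None` so that Python's `sqlite3` module does not issue its own implicit `BEGIN`.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def immediate_txn(conn):
    """Run a block inside BEGIN IMMEDIATE ... COMMIT, rolling back on error."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


# Usage against a throwaway in-memory database with a made-up schema.
demo = sqlite3.connect(":memory:", isolation_level=None)
demo.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
with immediate_txn(demo):
    demo.execute("INSERT INTO tasks(status) VALUES ('ready')")
```

With this shape, a writer that cannot get the lock fails or waits at `BEGIN IMMEDIATE` (subject to the connection's busy timeout) rather than partway through its writes.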

Workaround for affected users

While a fix is in flight:

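# When corruption is suspected:
sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"

# Recovery (stop the gateway and any running workers first):
sqlite3 ~/.hermes/kanban.db ".recover" > /tmp/recover.sql
sqlite3 ~/.hermes/kanban-new.db < /tmp/recover.sql
stamp=$(date +%Y-%m-%d)
mv ~/.hermes/kanban.db ~/.hermes/kanban.db.broken-$stamp
# Move any leftover WAL/shm sidecar files too, so they are not paired with the rebuilt database.
for ext in wal shm; do
  if [ -e ~/.hermes/kanban.db-$ext ]; then mv ~/.hermes/kanban.db-$ext ~/.hermes/kanban.db.broken-$stamp-$ext; fi
done
mv ~/.hermes/kanban-new.db ~/.hermes/kanban.db
# restart the gateway

# Mitigation: throttle concurrent dispatches at the producer layer
# (our poller now caps at 2 concurrent running tasks per profile).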
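
The producer-side cap mentioned in the mitigation can be approximated with a semaphore around worker spawning. This is a sketch under assumptions: `BoundedSpawner` and `try_spawn` are hypothetical names, and the real poller and `_default_spawn` may be structured differently.

```python
import subprocess
import threading


class BoundedSpawner:
    """Hypothetical cap on concurrent worker subprocesses for one profile."""

    def __init__(self, max_concurrent_workers=2):
        self._slots = threading.BoundedSemaphore(max_concurrent_workers)

    def try_spawn(self, argv):
        """Start a worker if a slot is free; return the Popen, or None at the cap."""
        if not self._slots.acquire(blocking=False):
            return None  # at the cap: leave the task ready for the next poll
        try:
            proc = subprocess.Popen(argv)
        except Exception:
            self._slots.release()
            raise
        # Free the slot once the worker exits, however it exits.
        threading.Thread(target=self._reap, args=(proc,), daemon=True).start()
        return proc

    def _reap(self, proc):
        try:
            proc.wait()
        finally:
            self._slots.release()
```

Returning `None` instead of blocking keeps the poller loop responsive: a task that does not get a slot simply stays ready until a later poll.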

Environment

  • hermes-agent: based on ca63746f3 (one commit ahead of 7cd1f6e2e from #30973)
  • Python 3.12, Linux x86_64
  • SQLite 3.45.x (system default)
  • WAL mode, synchronous=FULL, wal_autocheckpoint=100 (per #30973)
  • Workload: profile with 5-minute poller-driven dispatch cadence, long-running skills (>30 min in some cases), 5+ concurrent workers during peak
