hermes: kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100

hermes · 2026-05-24 18:41:55

Operations on ~/.hermes/kanban.db fail with "database disk image is malformed" on the affected profile when multiple long-running workers run concurrently and the reclaim path force-kills a worker mid-transaction. The PRAGMA settings from #30973 (synchronous=FULL + wal_autocheckpoint=100) protect against clean-shutdown durability races, but they do not protect against a SIGKILL during a WAL frame write.

Related prior incidents:

#30896 — initial concurrent-writer corruption report
#30973 — synchronous=NORMAL → FULL + tight wal_autocheckpoint fix; resolved the #30896 case but did not generalise to the kill-mid-write path described here

Symptom

After ~45 minutes of a workload that fires a new ready task every ~5 minutes (with some skill runs exceeding max_runtime_seconds), kanban operations on the affected profile start failing:

$ hermes kanban heartbeat <task_id>
Error: database disk image is malformed

$ hermes kanban complete <task_id>
Error: database disk image is malformed

$ sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"
*** in database main ***
Page <N>: btreeInitPage() returns error code 11
... (multiple damaged pages)

The actual skill work completes successfully (skills write back to their downstream system of record via API); only the kanban metadata layer is poisoned.

Repro recipe

Workload that surfaces this reliably:

A profile with a poller-style dispatcher that enqueues one new ready task every N minutes (in our case, every 5).
Skills with long runtimes (we saw 1800 s, 2490 s, and 41 min), exceeding the profile's max_runtime_seconds.
Heartbeat interval of ~30s during skill execution.
No max_concurrent_workers ceiling (every ready task spawns its own worker subprocess; up to ~5 workers ran concurrently in our incident).
After M concurrent SIGKILLs (kill -9) of workers on the reclaim path, kanban.db corrupts.

Re-creating in isolation should be possible by running the existing kanban concurrent-writer stress test with a SIGKILL injection partway through a write transaction.
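A minimal sketch of such an injection, assuming a scratch database and a hypothetical table `t` rather than the real kanban schema; it does not use the existing stress test, whose harness is not shown here. A child process opens the database with the pragmas from #30973, starts a write transaction, and is SIGKILLed before COMMIT. The parent then reopens the file and runs integrity_check. With a single writer the uncommitted rows should simply be discarded, so a faithful repro would need to layer concurrent writers and checkpoint pressure on top of this.

```python
import signal
import sqlite3
import subprocess
import sys

# Child script: open a write transaction, signal readiness, then hang
# with the transaction still open so the parent can kill it mid-write.
WRITER = r'''
import sqlite3, sys, time
conn = sqlite3.connect(sys.argv[1], isolation_level=None)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=FULL")
conn.execute("PRAGMA wal_autocheckpoint=100")
conn.execute("BEGIN IMMEDIATE")
for i in range(1000):
    conn.execute("INSERT INTO t(v) VALUES (?)", (i,))
print("ready", flush=True)
time.sleep(60)  # never reaches COMMIT
'''


def kill_mid_transaction(db_path):
    """SIGKILL a writer before COMMIT; return (integrity_check rows, row count)."""
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute("CREATE TABLE IF NOT EXISTS t(id INTEGER PRIMARY KEY, v INTEGER)")
    setup.commit()
    setup.close()

    child = subprocess.Popen(
        [sys.executable, "-c", WRITER, db_path],
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdout.readline()              # block until the child prints "ready"
    child.send_signal(signal.SIGKILL)    # POSIX only
    child.wait()
    child.stdout.close()

    check = sqlite3.connect(db_path)
    rows = [r[0] for r in check.execute("PRAGMA integrity_check")]
    count = check.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    check.close()
    return rows, count
```

Running several of these children against the same file, with one of them committing small transactions in a loop, would be the next step toward the multi-writer conditions described above.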

Root-cause hypothesis

Workers don't install a SIGTERM handler that closes the SQLite connection. Reclaim path:

# hermes_cli/kanban_db.py:_terminate_reclaimed_worker
kill(pid, signal.SIGTERM)
for _ in range(10):
    if not _pid_alive(pid):
        return
    time.sleep(0.5)
# 5s elapsed
kill(pid, signal.SIGKILL)

The 5-second window (10 polls at 0.5 s each) is tight when the child is mid-LLM-call. The child receives SIGTERM but doesn't drain its DB connection before SIGKILL lands. If the SIGKILL hits:

between BEGIN-allocated WAL frame and matching COMMIT, or
during wal_autocheckpoint=100 rollover (frequent because tight), or
while another worker is in a read transaction holding a shared lock

… the WAL header can desync from the main DB pages. Once desynced, every subsequent open returns "disk image is malformed".

synchronous=FULL fsyncs committed WAL frames. It cannot rescue an in-flight transaction that gets killed before commit. WAL is normally rollback-safe for crashes, but the combination of:

concurrent writers holding write locks
one writer killed mid-transaction
another writer trying to checkpoint at wal_autocheckpoint=100

…seems to produce a state SQLite's recovery can't reconcile.

Suggested fix directions

In rough order of effort / impact:

1. Worker-side SIGTERM handler that flushes the kanban DB connection:

import signal
import sys

_kanban_conn = None  # assumed: assigned by the worker when it opens kanban.db

def _on_sigterm(signum, frame):
    # Checkpoint the WAL and close the connection before exiting.
    try:
        if _kanban_conn is not None:
            _kanban_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
            _kanban_conn.close()
    finally:
        sys.exit(143)  # 128 + SIGTERM (15)

signal.signal(signal.SIGTERM, _on_sigterm)

Install this in the worker entry point that opens the kanban connection.

2. Configurable reclaim grace window: HERMES_KANBAN_RECLAIM_GRACE_SECONDS, default 30s. 5s is too aggressive for children that are mid-LLM-call.
3. BEGIN IMMEDIATE for kanban writes: moves the lock acquisition to the start of the transaction rather than the first write, so contention is serialised at acquire time. Reduces the window in which a kill leaves the WAL inconsistent.
4. max_concurrent_workers in dispatcher config: bound the number of concurrent worker subprocess spawns per profile. Currently _default_spawn is fire-and-forget per ready task, so a flood of ready tasks leads to unbounded subprocess concurrency. A bounded pool (semaphore) keeps the write-contention surface area small.
5. Optional, longer-term: route all kanban writes through a single coordinator process (dispatcher) and have workers send write intents via a pipe/socket. Pushes the SQLite single-writer principle to its logical conclusion. Higher complexity, but eliminates the multi-writer correctness burden entirely.
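The configurable grace window could look roughly like the sketch below. It assumes the worker is available as a subprocess.Popen handle rather than the pid and _pid_alive helper used in kanban_db.py, and the function names reclaim_grace_seconds and terminate_with_grace are hypothetical. Only the environment variable name and the 30 s default come from the proposal above.

```python
import os
import signal
import time


def reclaim_grace_seconds(default=30.0):
    """Read the grace window from the environment, falling back to `default`."""
    raw = os.environ.get("HERMES_KANBAN_RECLAIM_GRACE_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return default
    # Reject zero, negative, NaN and infinite values.
    return value if 0 < value < float("inf") else default


def terminate_with_grace(proc, grace=None, poll_interval=0.5):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL.

    `proc` is a subprocess.Popen (an assumption of this sketch).
    Returns the signal that ended the worker.
    """
    if grace is None:
        grace = reclaim_grace_seconds()
    proc.terminate()                      # SIGTERM on POSIX
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if proc.poll() is not None:       # worker exited within the window
            return signal.SIGTERM
        time.sleep(poll_interval)
    proc.kill()                           # SIGKILL on POSIX
    proc.wait()
    return signal.SIGKILL
```

A longer window only helps if the worker actually reacts to SIGTERM, so this pairs naturally with the worker-side handler.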

Happy to send a PR for (1) and (2) if those directions are agreeable; they're the smallest deltas with the highest blast-radius reduction. (3) and (4) deserve their own design discussions.
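As a starting point for those discussions, here is a rough sketch of the BEGIN IMMEDIATE and max_concurrent_workers directions. The names immediate_txn and BoundedSpawner are hypothetical and not part of hermes. The transaction helper assumes the connection was opened with isolation_level=None, so that Python's sqlite3 module does not open transactions implicitly.

```python
import sqlite3
import threading
from contextlib import contextmanager


@contextmanager
def immediate_txn(conn):
    """Acquire the write lock at BEGIN; COMMIT on success, ROLLBACK on error."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


class BoundedSpawner:
    """Cap the number of live workers with a semaphore."""

    def __init__(self, max_concurrent_workers):
        self._slots = threading.BoundedSemaphore(max_concurrent_workers)

    def try_spawn(self, spawn):
        """Call `spawn()` if a slot is free; return None to leave the task ready."""
        if not self._slots.acquire(blocking=False):
            return None
        try:
            return spawn()
        except BaseException:
            self._slots.release()   # spawn failed, give the slot back
            raise

    def release(self):
        """Call once per worker when it exits or is reclaimed."""
        self._slots.release()
```

Usage would be along the lines of `with immediate_txn(conn): conn.execute(...)` for each kanban write, and `spawner.try_spawn(lambda: start_worker(task))` in the dispatcher loop, where start_worker stands in for whatever _default_spawn does today.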

Workaround for affected users

While a fix is in flight:

# When corruption is suspected:
sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"

# Recovery:
sqlite3 ~/.hermes/kanban.db ".recover" > /tmp/recover.sql
sqlite3 ~/.hermes/kanban-new.db < /tmp/recover.sql
mv ~/.hermes/kanban.db ~/.hermes/kanban.db.broken-$(date +%Y-%m-%d)
mv ~/.hermes/kanban-new.db ~/.hermes/kanban.db
# restart the gateway

# Mitigation: throttle concurrent dispatches at the producer layer
# (our poller now caps at 2 concurrent running tasks per profile).
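For a poller or watchdog that wants to detect the problem before operators see failures, the integrity check can also be run from Python. This is a minimal sketch; the function name integrity_problems is hypothetical, and it relies only on the standard sqlite3 module.

```python
import sqlite3


def integrity_problems(db_path):
    """Return [] when integrity_check reports "ok", else the reported messages.

    Note: sqlite3.connect() creates the file if it does not exist, and an
    empty database passes the check, so verify the path beforehand.
    """
    try:
        conn = sqlite3.connect(db_path)
        try:
            rows = [r[0] for r in conn.execute("PRAGMA integrity_check")]
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # Raised for errors such as "database disk image is malformed".
        return [str(exc)]
    return [] if rows == ["ok"] else rows
```

A non-empty result is the cue to stop dispatching on that profile and run the recovery steps above.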

Environment

hermes-agent: based on ca63746f3 (one commit ahead of 7cd1f6e2e from #30973)
Python 3.12, Linux x86_64
SQLite 3.45.x (system default)
WAL mode, synchronous=FULL, wal_autocheckpoint=100 (per #30973)
Workload: profile with 5-minute poller-driven dispatch cadence, long-running skills (>30 min in some cases), 5+ concurrent workers during peak
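When comparing environments, note that journal_mode=WAL is stored in the database file, while synchronous and wal_autocheckpoint are per-connection settings that each worker must apply itself. A small sketch for confirming what a given connection actually uses (the helper name effective_pragmas is hypothetical):

```python
import sqlite3


def effective_pragmas(conn):
    """Report the durability-related pragmas in effect on this connection."""
    return {
        "journal_mode": conn.execute("PRAGMA journal_mode").fetchone()[0],
        # 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA
        "synchronous": conn.execute("PRAGMA synchronous").fetchone()[0],
        "wal_autocheckpoint": conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0],
    }
```

Logging this once per worker at startup would show whether every writer really runs with the #30973 settings.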
