hermes - [Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips

When a kanban task is dispatched to a profile that doesn't exist on the board (or hits any other deterministic spawn-time crash that returns very quickly), the embedded dispatcher can fire spawn attempts faster than consecutive_failures is incremented, committed, and checked against DEFAULT_FAILURE_LIMIT. We observed 11 spawn attempts in a handful of seconds and ended up with a corrupted board DB (PRAGMA integrity_check reported B-tree errors), recoverable only via sqlite3 .dump | sed 's/ROLLBACK; -- due to errors$/COMMIT;/' | sqlite3 fresh.db. Reproduced twice in one session on the same board (2026-05-22).

Distinct from existing issues:

  • #30417 Bug 1 (slow variant): that case is slow (60s ticks, ~1500 crashes over a long period) and non-corrupting. Ours is fast (11 crashes in seconds) and DOES corrupt.
  • #30445 (multi-gateway corruption): not multi-gateway / inode reuse — single-host single-gateway.
  • #30687 (corruption recovery): corruption originates HERE; #30687 covers what happens after the DB is already corrupt.
  • #30678 (board override ignored in worker env): the misrouting root cause that fed our case — a worker called kanban_create(board="<other-board>") from worker context, the board arg was silently dropped, the assigned profile didn't exist on the worker's pinned board, so the dispatcher hit a deterministic crash loop on the misrouted card.

Steps to reproduce

  1. Create a kanban task on board B with assignee=<profile-that-does-not-exist-on-B> (e.g. board exists but profile is bound to a different board).
  2. Let the embedded dispatcher tick. Worker process is spawned, immediately exits with profile-not-found / skill-not-found.
  3. Observe rapid re-spawns (in our case 11 in a handful of seconds, much faster than the 60s tick suggests; it appears the dispatcher does not wait a full tick between consecutive crash-retries for the same task).
  4. After ~10 rapid cycles, sqlite3 kanban.db "PRAGMA integrity_check" reports B-tree corruption.
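
The check in step 4 can also be scripted, which is handy when polling the board DB during a reproduction run. The following is a minimal sketch using only Python's standard sqlite3 module; the helper name integrity_errors is hypothetical and not part of Hermes.

```python
import sqlite3


def integrity_errors(db_path: str) -> list[str]:
    """Return integrity problems for the SQLite file, or [] if it is healthy.

    Hypothetical helper for illustration. Note that sqlite3.connect()
    creates an empty database if db_path does not exist yet.
    """
    conn = sqlite3.connect(db_path)
    try:
        # integrity_check yields a single row "ok" on a healthy database,
        # otherwise one row per detected problem.
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can fail before the check even runs.
        return [str(exc)]
    finally:
        conn.close()
    return [] if rows == ["ok"] else rows
```

An empty list means the file passed the check; anything else should be treated as a signal to stop the dispatcher and copy the DB file (plus its -wal and -shm siblings) aside before attempting recovery.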

Expected behavior

  • consecutive_failures increments and failure_limit (default 2) hard-stops spawning before corruption can occur.
  • DB writes that count crashes are transactional with strong WAL fences so concurrent rapid spawn attempts can't race the failure counter.
  • At worst, the dispatcher should rate-limit re-spawn for the same task (exponential backoff) so deterministic crashes can't loop faster than the failure-counter can commit.

Actual behavior

  • ~11 spawns complete before any block is registered. The task_runs table + dispatcher state-writes appear to be racing with each other (and/or with the worker's own DB handles in WAL mode), producing B-tree corruption.
  • Recovery requires offline .dump/.read rebuild; the in-tree recovery path (#30687) silently recreates an empty DB on top of the corruption.

Suggested fix

  1. Enforce failure_limit in the spawn path with a transactional consecutive_failures += 1; if >= limit: block; commit BEFORE the next spawn attempt is allowed. This prevents the race entirely.
  2. Per-task exponential backoff between spawn-crash retries (e.g. 1s → 4s → 16s). Even if (1) misses an edge case, this caps the loop frequency below corruption threshold.
  3. Defense in depth: validate assignee against kanban_known_profiles_for_board BEFORE the first spawn, returning auto_blocked with a clear last_failure_error like "profile X is not registered for board Y; check dispatch routing". (Fixing #30678 also closes most of the inputs to this path, but defense-in-depth here is cheap.)
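
Fix (1) can be sketched as follows: the counter increment, the limit check, and the block decision happen inside one write transaction that commits before the caller is allowed to spawn again. The table layout, the 'blocked' status value, and the function names are hypothetical; the real Hermes schema may differ. The connection is assumed to be opened with isolation_level=None so that the explicit BEGIN IMMEDIATE controls the transaction.

```python
import sqlite3

FAILURE_LIMIT = 2  # the default failure_limit cited in this report

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'ready',
    consecutive_failures INTEGER NOT NULL DEFAULT 0,
    last_failure_error TEXT
)
"""


def record_spawn_failure(conn: sqlite3.Connection, task_id: str,
                         error: str, limit: int = FAILURE_LIMIT) -> bool:
    """Count one spawn crash; return True if the task is now blocked."""
    # BEGIN IMMEDIATE takes the write lock up front, so two rapid
    # failure reports for the same task cannot interleave.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "UPDATE tasks SET consecutive_failures = consecutive_failures + 1,"
            " last_failure_error = ? WHERE id = ?",
            (error, task_id),
        )
        row = conn.execute(
            "SELECT consecutive_failures FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        blocked = row is not None and row[0] >= limit
        if blocked:
            conn.execute(
                "UPDATE tasks SET status = 'blocked' WHERE id = ?", (task_id,)
            )
        conn.execute("COMMIT")  # durable before any further spawn attempt
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return blocked


def may_spawn(conn: sqlite3.Connection, task_id: str) -> bool:
    """Spawn gate: refuse unknown or blocked tasks."""
    row = conn.execute(
        "SELECT status FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    return row is not None and row[0] != "blocked"
```

The key design point is ordering: the dispatcher would call may_spawn before every attempt and record_spawn_failure synchronously after every crash, so the second deterministic crash blocks the task instead of racing an uncommitted counter.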

Update — partial mitigation submitted

PR #30973 adds PRAGMA synchronous=FULL + PRAGMA wal_autocheckpoint=100 to connect(). This addresses the durability gap that lets the WAL race in the first place. It doesn't fix the failure-limit-not-enforced-in-spawn-path root cause, but it raises the bar enough that we observed no further corruption under the same workload that produced 5 corruptions in succession on the unpatched build. The 3 suggested fixes above are still the proper structural fix.
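
For readers who want to apply the same settings locally, the sketch below shows how those two pragmas can be set on a connection with Python's sqlite3 module. It is not the code from the PR: the function name connect_board_db and the 30-second busy timeout are assumptions, and enabling WAL here simply mirrors the WAL mode this report says the board already uses.

```python
import sqlite3


def connect_board_db(db_path: str) -> sqlite3.Connection:
    """Open a board DB with the durability settings named in this report.

    Hypothetical helper; the real connect() in Hermes may differ.
    """
    conn = sqlite3.connect(db_path, timeout=30.0)  # timeout is an assumption
    conn.execute("PRAGMA journal_mode=WAL")
    # FULL syncs the WAL on every commit, trading throughput for durability.
    conn.execute("PRAGMA synchronous=FULL")
    # Checkpoint once the WAL reaches 100 pages (SQLite's default is 1000).
    conn.execute("PRAGMA wal_autocheckpoint=100")
    return conn
```

Both synchronous and wal_autocheckpoint are per-connection settings, so every process that opens the board DB, including workers, needs to apply them for the mitigation to hold.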

Environment

  • OS: Amazon Linux 2023 (AArch64)
  • Python: 3.11.15
  • Hermes version: v0.14.0 (2026.5.16); commit ba9964ff0 on 2026-05-21
  • Affected component: comp/gateway (kanban dispatcher; embedded gateway-dispatcher path)

Related issues

  • #30678 (misrouting root cause)
  • #30417 (slow variant, no corruption)
  • #30445 (multi-gateway corruption)
  • #30687 (corruption recovery)
  • #29320 (circuit-breaker for repeated bails — different signal)
