
hermes: [Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips

StepCodex · 2026-05-23T10:43:08Z



When a kanban task is dispatched to a profile that doesn't exist on the board (or hits any other deterministic spawn-time crash that returns very quickly), the embedded dispatcher can fire spawn attempts faster than the `consecutive_failures` counter, checked against `DEFAULT_FAILURE_LIMIT`, can increment and commit. We observed 11 spawn attempts in a handful of seconds and ended with a corrupted board DB (`PRAGMA integrity_check` reported B-tree errors), recoverable only via `sqlite3 .dump | sed 's/ROLLBACK; -- due to errors$/COMMIT;/' | sqlite3 fresh.db`. Reproduced twice in one session on the same board (2026-05-22).

Distinct from existing issues:

- #30417 Bug 1 (slow variant): that case is slow (60s ticks, ~1500 crashes over a long period) and non-corrupting. Ours is fast (11 crashes in seconds) and DOES corrupt.
- #30445 (multi-gateway corruption): ours involves neither multiple gateways nor inode reuse; it is single-host, single-gateway.
- #30687 (corruption recovery): the corruption originates HERE; #30687 covers what happens after the DB is already corrupt.
- #30678 (board override ignored in worker env): the misrouting root cause that fed our case. A worker called `kanban_create(board="<other-board>")` from worker context, the `board` arg was silently dropped, and the assigned profile didn't exist on the worker's pinned board, so the dispatcher hit a deterministic crash loop on the misrouted card.



Steps to reproduce

1. Create a kanban task on board `B` with `assignee=<profile-that-does-not-exist-on-B>` (e.g. the board exists but the profile is bound to a different board).
2. Let the embedded dispatcher tick. The worker process is spawned and immediately exits with profile-not-found / skill-not-found.
3. Observe rapid re-spawns. In our case there were 11 in a handful of seconds, much faster than the 60s tick suggests; it appears the dispatcher does not wait a full tick between consecutive crash-retries for the same task.
4. After ~10 rapid cycles, `sqlite3 kanban.db "PRAGMA integrity_check"` reports B-tree corruption.
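The check in the last step can also be scripted, for example to poll the board DB while the crash loop runs. This is a minimal sketch using only Python's standard `sqlite3` module; the helper name `integrity_errors` is made up for illustration and is not part of Hermes. Note that on a badly damaged file, opening or querying the database may itself raise `sqlite3.DatabaseError` instead of returning rows.

```python
import sqlite3


def integrity_errors(db_path: str) -> list[str]:
    """Return the problems reported by PRAGMA integrity_check.

    Hypothetical helper: an empty list means SQLite reported "ok".
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    # A healthy database yields exactly one row containing "ok".
    return [] if messages == ["ok"] else messages
```

Calling `integrity_errors("kanban.db")` after each spawn cycle would narrow down which cycle first damages the file.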

Expected behavior

- `consecutive_failures` increments and `failure_limit` (default 2) hard-stops spawning before corruption can occur.
- DB writes that count crashes are transactional with strong WAL fences so concurrent rapid spawn attempts can't race the failure counter.
- At worst, the dispatcher should rate-limit re-spawn for the same task (exponential backoff) so deterministic crashes can't loop faster than the failure counter can commit.
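To make the first two expectations concrete, here is a minimal sketch of a failure counter that is incremented, compared against the limit, and committed inside one write transaction. It is not the Hermes implementation: the `tasks` table, its columns, and the function names are assumptions for illustration only. `BEGIN IMMEDIATE` is standard SQLite and takes the write lock before the counter is read, so a second writer waits rather than racing the increment.

```python
import sqlite3

FAILURE_LIMIT = 2  # default stated in the issue


def open_board(path: str) -> sqlite3.Connection:
    # isolation_level=None disables implicit transactions, so the
    # BEGIN/COMMIT statements below are the only transaction boundaries.
    conn = sqlite3.connect(path, isolation_level=None)
    # Hypothetical schema, not the real kanban tables.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id TEXT PRIMARY KEY,"
        " consecutive_failures INTEGER NOT NULL DEFAULT 0,"
        " status TEXT NOT NULL DEFAULT 'ready')"
    )
    return conn


def record_spawn_failure(conn, task_id: str, limit: int = FAILURE_LIMIT) -> bool:
    """Count one crash; return True if the task is now blocked."""
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock up front
    try:
        conn.execute(
            "UPDATE tasks SET consecutive_failures = consecutive_failures + 1"
            " WHERE id = ?",
            (task_id,),
        )
        row = conn.execute(
            "SELECT consecutive_failures FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        if row is None:
            raise KeyError(task_id)
        blocked = row[0] >= limit
        if blocked:
            conn.execute(
                "UPDATE tasks SET status = 'blocked' WHERE id = ?", (task_id,)
            )
        conn.execute("COMMIT")  # counter and block flag land together
        return blocked
    except BaseException:
        conn.execute("ROLLBACK")
        raise


def may_spawn(conn, task_id: str) -> bool:
    """The spawn path consults committed state before every attempt."""
    row = conn.execute(
        "SELECT status FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    return row is not None and row[0] == "ready"
```

With the default limit of 2, the second recorded crash blocks the task, and any later `may_spawn` call returns `False` because the block was committed in the same transaction as the increment.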

Actual behavior

- ~11 spawns complete before any block is registered. The `task_runs` table + dispatcher state-writes appear to be racing with each other (and/or with the worker's own DB handles in WAL mode), producing B-tree corruption.
- Recovery requires an offline `.dump`/`.read` rebuild; the in-tree recovery path (#30687) silently recreates an empty DB on top of the corruption.

Suggested fix

1. Enforce `failure_limit` in the spawn path with a transactional `consecutive_failures += 1; if >= limit: block; commit` that completes BEFORE the next spawn attempt is allowed. This prevents the race entirely.
2. Per-task exponential backoff between spawn-crash retries (e.g. 1s → 4s → 16s). Even if (1) misses an edge case, this caps the loop frequency below the corruption threshold.
3. Defense in depth: validate the assignee against `kanban_known_profiles_for_board` BEFORE the first spawn, returning `auto_blocked` with a clear `last_failure_error` like "profile X is not registered for board Y; check dispatch routing". (Fixing #30678 also closes most of the inputs to this path, but defense in depth here is cheap.)

Update — partial mitigation submitted

PR #30973 adds `PRAGMA synchronous=FULL` + `PRAGMA wal_autocheckpoint=100` to `connect()`. This addresses the durability gap that lets the WAL race in the first place. It doesn't fix the failure-limit-not-enforced-in-spawn-path root cause, but it raises the bar enough that we observed no further corruption under the same workload that produced 5 corruptions in succession on the unpatched build. The 3 suggested fixes above are still the proper structural fix.
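For readers who want to try the same settings on their own connection, the following is a rough sketch of what a `connect()` applying these two pragmas might look like. It is not the code from PR #30973; the function body and the explicit `journal_mode=WAL` line are assumptions, the latter based on the report's statement that the board DB runs in WAL mode. All three pragmas are standard SQLite.

```python
import sqlite3


def connect(path: str) -> sqlite3.Connection:
    """Open a board DB with the durability settings described above (sketch)."""
    conn = sqlite3.connect(path)
    # Assumption: WAL mode, as the report says the board DB uses it.
    conn.execute("PRAGMA journal_mode=WAL")
    # FULL makes SQLite sync the WAL on every commit instead of only at checkpoints.
    conn.execute("PRAGMA synchronous=FULL")
    # Checkpoint after ~100 WAL pages rather than SQLite's default of 1000.
    conn.execute("PRAGMA wal_autocheckpoint=100")
    return conn
```

These pragmas apply per connection, so every process that opens the board DB (dispatcher and workers alike) would need to go through the same helper for the settings to hold everywhere.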

Environment

OS: Amazon Linux 2023 (AArch64)
Python: 3.11.15
Hermes version: v0.14.0 (2026.5.16); commit ba9964ff0 on 2026-05-21
Affected component: comp/gateway (kanban dispatcher; embedded gateway-dispatcher path)

Related issues

#30678 (misrouting root cause)
#30417 (slow variant, no corruption)
#30445 (multi-gateway corruption)
#30687 (corruption recovery)
#29320 (circuit-breaker for repeated bails — different signal)
