hermes: Kanban _has_sticky_block guard ineffective in cross-connection WAL-mode production (recurrence of #28712 loop)

hermes · 2026-05-29 15:36:52


Root Cause

Querying the SQLite database directly confirms that, just before each promoted event, the most recent blocked/unblocked event in task_events is blocked. In other words, _has_sticky_block would have returned True had it been evaluated against that data.

Fix Action

Fix / Workaround

The _has_sticky_block guard (introduced in #28712 / commit 34120a0ae) is designed to stop the kanban dispatcher from auto-promoting a task back to ready after a worker has blocked it with kanban_block. It checks the most recent blocked/unblocked event in task_events: if the last one is blocked, the task should stay blocked until an explicit kanban_unblock.

However, in production (cross-process, WAL-mode SQLite), the guard does not work. The dispatcher keeps auto-promoting blocked tasks and re-spawning workers, which immediately produce protocol_violation → gave_up → auto-promote, and the loop repeats.
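
The guard's intended behaviour can be sketched as follows. This is a minimal illustration, not the code from hermes_cli/kanban_db.py: the task_events columns (id, task_id, kind) and the function name has_sticky_block are assumptions made for the example.

```python
import sqlite3


def has_sticky_block(conn: sqlite3.Connection, task_id: str) -> bool:
    """Return True when the latest blocked/unblocked event for the task is 'blocked'."""
    row = conn.execute(
        "SELECT kind FROM task_events "
        "WHERE task_id = ? AND kind IN ('blocked', 'unblocked') "
        "ORDER BY id DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    return row is not None and row[0] == "blocked"


# Hypothetical minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_events (id INTEGER PRIMARY KEY, task_id TEXT, kind TEXT)"
)
conn.executemany(
    "INSERT INTO task_events (task_id, kind) VALUES (?, ?)",
    [
        ("t1", "blocked"),
        ("t1", "promoted"),   # other event kinds are ignored by the guard
        ("t2", "blocked"),
        ("t2", "unblocked"),  # explicit unblock releases the sticky state
    ],
)

sticky_t1 = has_sticky_block(conn, "t1")  # last blocked/unblocked event is 'blocked'
sticky_t2 = has_sticky_block(conn, "t2")  # last blocked/unblocked event is 'unblocked'
sticky_t3 = has_sticky_block(conn, "t3")  # no events at all
```

Only blocked and unblocked events take part in the decision, so an intervening promoted or gave_up event does not release the block in this sketch.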

Bug Description

Evidence

Task t_eccbc5b6 on the default board:

Event ID | Time     | Kind               | Run
   13    | 23:08:06 | blocked            | run=2   ← Worker blocked for user confirmation
   14    | 23:08:40 | promoted           | None    ← Auto-promoted 34s later!
   19    | 23:10:40 | protocol_violation | run=3   ← New worker spawned, crashed
   20    | 23:10:40 | gave_up            | None    ← Circuit breaker tripped
   21    | 23:10:40 | promoted           | None    ← Promoted again!
   26    | 23:11:23 | blocked            | run=4
   27    | 23:11:40 | promoted           | None
   32    | 23:12:45 | blocked            | run=5
   33    | 23:13:41 | promoted           | None
   41    | 23:15:58 | completed          | run=6

Every blocked was followed by an automatic promoted within a minute (34 s, 17 s, and 56 s in the log above), that is, by the next dispatcher tick, without any user invoking kanban_unblock.
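
The same check can be run mechanically over an event log. The sketch below replays the (event id, kind) pairs from the table above and flags every promoted event that occurs while the latest blocked/unblocked event is still blocked. The function name unexplained_promotions is made up for this example.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (event id, kind)


def unexplained_promotions(events: List[Event]) -> List[int]:
    """Return ids of 'promoted' events that happen while a sticky block is in force."""
    flagged = []
    sticky = False
    for event_id, kind in events:
        if kind == "blocked":
            sticky = True
        elif kind == "unblocked":
            sticky = False
        elif kind == "promoted" and sticky:
            flagged.append(event_id)
    return flagged


# Events for task t_eccbc5b6, copied from the table above.
EVENTS = [
    (13, "blocked"),
    (14, "promoted"),
    (19, "protocol_violation"),
    (20, "gave_up"),
    (21, "promoted"),
    (26, "blocked"),
    (27, "promoted"),
    (32, "blocked"),
    (33, "promoted"),
    (41, "completed"),
]

flagged = unexplained_promotions(EVENTS)
print(flagged)
```

All four promoted events in the log are flagged, because no unblocked event appears anywhere in the sequence.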

Root Cause Analysis

The unit test and production environments differ in how connections are used:

Unit test (test_kanban_blocked_sticky.py): All operations (block_task + recompute_ready) happen on the same sqlite3.Connection within a single process, so changes are immediately visible.
Production: The worker (child process) opens Connection A, calls kanban_block, writes the blocked event, and commits. On its next tick the dispatcher (gateway process) opens Connection B and calls recompute_ready(conn_B). The hypothesis is that, because of WAL-mode read-snapshot timing, the BEGIN IMMEDIATE transaction that recompute_ready starts on Connection B does not see the blocked event committed on Connection A. This has not been confirmed.

The relevant code path is in hermes_cli/kanban_db.py:

def recompute_ready(conn):
    with write_txn(conn):  # BEGIN IMMEDIATE on conn_B
        todo_rows = conn.execute(
            "SELECT id, status FROM tasks WHERE status IN ('todo', 'blocked')"
        ).fetchall()
        for row in todo_rows:
            # Unpacking added for readability; the original excerpt elides it.
            task_id, cur_status = row[0], row[1]
            if cur_status == "blocked" and _has_sticky_block(conn, task_id):
                continue  # ← This check should prevent promotion
            # ... but the task gets promoted anyway
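
One way to test the visibility hypothesis in isolation is a two-connection probe using only the standard sqlite3 module. The sketch below is not Hermes code: the table layout and the names open_conn and probe_visibility are assumptions. It commits a blocked event on one connection and then reads it inside BEGIN IMMEDIATE on a second connection that was opened earlier.

```python
import os
import sqlite3
import tempfile


def open_conn(path: str) -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN/COMMIT statements below are the only transaction boundaries.
    conn = sqlite3.connect(path, isolation_level=None, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL").fetchone()
    return conn


def probe_visibility(path: str) -> bool:
    """Commit on connection A, then check whether connection B sees the row."""
    conn_a = open_conn(path)
    conn_a.execute(
        "CREATE TABLE IF NOT EXISTS task_events "
        "(id INTEGER PRIMARY KEY, task_id TEXT, kind TEXT)"
    )
    conn_b = open_conn(path)  # opened before A writes, like a long-lived dispatcher

    # "Worker": write the blocked event and commit.
    conn_a.execute("BEGIN IMMEDIATE")
    conn_a.execute("INSERT INTO task_events (task_id, kind) VALUES ('t1', 'blocked')")
    conn_a.execute("COMMIT")
    conn_a.close()

    # "Dispatcher": start a write transaction afterwards and read the event.
    conn_b.execute("BEGIN IMMEDIATE")
    row = conn_b.execute(
        "SELECT kind FROM task_events WHERE task_id = 't1' ORDER BY id DESC LIMIT 1"
    ).fetchone()
    conn_b.execute("COMMIT")
    conn_b.close()
    return row is not None and row[0] == "blocked"


with tempfile.TemporaryDirectory() as tmp:
    visible = probe_visibility(os.path.join(tmp, "probe.db"))
print("blocked event visible to connection B:", visible)
```

SQLite's WAL documentation describes a transaction as seeing every commit that completed before the transaction began, so the probe would be expected to report the event as visible. If it does so against the production configuration as well, the cause of the promotion probably lies elsewhere in recompute_ready, and the visibility hypothesis would need revisiting.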

Steps to Reproduce

Create a kanban task assigned to any profile
Have the worker call kanban_block(reason="review-required: please confirm")
Wait for the next dispatcher tick (default 60s)
Observe: the task is auto-promoted back to ready and a new worker is spawned
The new worker finds nothing to do, exits cleanly → protocol_violation → gave_up → auto-promote → loop

Alternatively, reproduce the cross-connection scenario directly:

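# Assumed import: kb is taken to be the module at hermes_cli/kanban_db.py.
from hermes_cli import kanban_db as kb

# tid is the id of an existing task and run_id the id of its current run;
# the original report elides how they are obtained.

# Connection A (worker)
conn_a = kb.connect()
kb.claim_task(conn_a, tid)
kb.block_task(conn_a, tid, reason="test", expected_run_id=run_id)
conn_a.close()

# Connection B (dispatcher), a separate connection
conn_b = kb.connect()
promoted = kb.recompute_ready(conn_b)
conn_b.close()
# promoted == 1 (BUG: should be 0)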

Environment

Hermes Agent: commit dc235e93c (May 29, 2026) — already includes the #28712 fix
OS: Linux (Ubuntu, kernel 6.17)
Kanban: default config (dispatch_in_gateway: true, dispatch_interval_seconds: 60)
SQLite: WAL mode with synchronous=FULL, wal_autocheckpoint=100
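
To confirm that a given connection really runs with these settings, the pragmas can be read back. The sketch below uses a throwaway database file and a made-up helper name, pragma_report. In SQLite, journal_mode=WAL is stored in the database file, whereas synchronous and wal_autocheckpoint are per-connection settings, so the worker and the dispatcher each have to apply them.

```python
import os
import sqlite3
import tempfile


def pragma_report(conn: sqlite3.Connection) -> dict:
    """Read back the settings listed in the Environment section."""
    return {
        "journal_mode": conn.execute("PRAGMA journal_mode").fetchone()[0],
        "synchronous": conn.execute("PRAGMA synchronous").fetchone()[0],
        "wal_autocheckpoint": conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0],
    }


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "kanban_probe.db"))
    conn.execute("PRAGMA journal_mode=WAL").fetchone()
    conn.execute("PRAGMA synchronous=FULL")
    conn.execute("PRAGMA wal_autocheckpoint=100").fetchone()
    report = pragma_report(conn)
    conn.close()

print(report)  # synchronous is reported numerically; 2 corresponds to FULL
```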

Expected Behavior

A worker-initiated kanban_block should be sticky — the task must stay blocked until an explicit kanban_unblock or hermes kanban unblock is issued by a human operator. This is the documented contract in the kanban-worker and kanban-orchestrator skills.

Actual Behavior

The _has_sticky_block guard appears to be a no-op in cross-connection production scenarios. The dispatcher promotes the blocked task to ready on the very next tick.

Suggested Fix

The WAL mode cross-connection visibility needs investigation. Possible approaches:

Store the sticky-block flag on the tasks row directly (e.g. a sticky_blocked_until timestamp column). kanban_block sets it; kanban_unblock clears it; recompute_ready checks the column directly on the same row it just read — no cross-table WAL visibility dependency. This is the most robust option.
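Force a WAL checkpoint after writing the blocked event. Readers in WAL mode already consult the WAL file, so a checkpoint should not be required for visibility; this option is mainly useful for ruling the WAL out as the cause.
Add explicit read isolation hints, e.g. query task_events before entering the write_txn. Changing PRAGMA schema.synchronous is unlikely to help on its own, since that setting controls when writes are flushed to disk rather than what other connections can see.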
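
The first option can be sketched with plain sqlite3 as follows. This is a simplified illustration under stated assumptions, not a patch for kanban_db.py: the schema is hypothetical, an integer sticky_blocked flag stands in for the proposed sticky_blocked_until timestamp column, and recompute_ready here promotes every non-sticky todo/blocked task without the dependency checks the real dispatcher presumably performs.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema: the sticky flag lives on the tasks row itself.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS tasks ("
    "id TEXT PRIMARY KEY, "
    "status TEXT NOT NULL, "
    "sticky_blocked INTEGER NOT NULL DEFAULT 0)"
)


def connect(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path, isolation_level=None, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL").fetchone()
    conn.execute(SCHEMA)
    return conn


def block_task(conn: sqlite3.Connection, task_id: str) -> None:
    """Worker side: mark the task blocked and set the sticky flag in one write."""
    conn.execute("BEGIN IMMEDIATE")
    conn.execute(
        "UPDATE tasks SET status = 'blocked', sticky_blocked = 1 WHERE id = ?",
        (task_id,),
    )
    conn.execute("COMMIT")


def unblock_task(conn: sqlite3.Connection, task_id: str) -> None:
    """Operator side: clear the flag so the next tick may promote the task."""
    conn.execute("BEGIN IMMEDIATE")
    conn.execute("UPDATE tasks SET sticky_blocked = 0 WHERE id = ?", (task_id,))
    conn.execute("COMMIT")


def recompute_ready(conn: sqlite3.Connection) -> int:
    """Dispatcher side: promote todo/blocked tasks unless the sticky flag is set."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        cur = conn.execute(
            "UPDATE tasks SET status = 'ready' "
            "WHERE status IN ('todo', 'blocked') AND sticky_blocked = 0"
        )
        promoted = cur.rowcount
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return promoted


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kanban_sketch.db")
    worker = connect(path)
    dispatcher = connect(path)  # separate connection, as in production

    worker.execute("INSERT INTO tasks (id, status) VALUES ('t1', 'todo'), ('t2', 'todo')")
    block_task(worker, "t1")
    worker.close()

    promoted_first = recompute_ready(dispatcher)  # only t2 is eligible
    status_t1 = dispatcher.execute(
        "SELECT status FROM tasks WHERE id = 't1'"
    ).fetchone()[0]

    unblock_task(dispatcher, "t1")
    promoted_second = recompute_ready(dispatcher)  # t1 is eligible now
    dispatcher.close()
```

Because the flag is read from the same row that the promotion statement updates, the decision no longer depends on a second query against task_events.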

Related Issues

#28712 — Original report of the same auto-promote bug (fixed by _has_sticky_block)
#28903 — "Kanban auto-unblock is too eager"
#29014 — "Kanban dispatcher repeatedly respawns blocked/manual-gate tasks"
#29171 — "Kanban needs first-class waiting states for human, approval, and review gates"
#30417 — Bug 3: archived parent silently promotes children
