hermes - ✅ (Solved) Fix kanban: dispatcher auto-promotes blocked task → respawn worker → protocol_violation loop [3 pull requests, 1 participant]

GitHub stats
NousResearch/hermes-agent#28712 · Fetched 2026-05-20 04:02:22
Comments: 0 · Participants: 1 · Timeline: 15 · Reactions: 0
Timeline (top): cross-referenced ×6, referenced ×4, labeled ×3, closed ×1

Kanban dispatcher auto-promotes a task that was correctly blocked by its worker (with outcome=blocked, reason=review-required), spawning a fresh worker that has no actionable instructions left. The fresh worker reads the task body, sees the existing review-required handoff comment, finds nothing to do (work is already applied to disk), and exits cleanly without calling kanban_complete or kanban_block — the dispatcher records this as protocol_violation and loops indefinitely.

PR fix notes

PR #28726: fix(kanban): worker-initiated block must not be auto-promoted (#28712)

Description (problem / solution / changelog)

What does this PR do?

#28712 describes a kanban infinite loop:

  1. Worker calls kanban_block(reason="review-required: ...") to hand off to a human.
  2. Dispatcher's recompute_ready() flips the task back to ready on the next tick.
  3. Fresh worker spawns, finds no actionable instructions (work already applied, review-required comment already posted), exits cleanly.
  4. detect_crashed_workers records protocol_violation → _record_task_failure(failure_limit=1) → gave_up → blocked.
  5. Next tick: recompute_ready promotes again → goto 3.
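The five steps above can be sketched as a toy simulation. This is only an illustrative model, not the dispatcher's code: the `Task` class and both functions are hypothetical stand-ins, and the task is assumed to have no unsatisfied parents.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Starts in the state left by the worker's kanban_block call (step 1).
    status: str = "blocked"
    events: list = field(default_factory=lambda: ["blocked"])


def recompute_ready_prefix(task: Task) -> None:
    """Toy model of the pre-fix tick: every blocked task is promoted (step 2)."""
    if task.status == "blocked":
        task.status = "ready"
        task.events.append("promoted")


def run_idle_worker(task: Task) -> None:
    """Toy model of steps 3-4: the worker exits cleanly with nothing to do."""
    if task.status != "ready":
        return
    task.events += ["claimed", "spawned", "protocol_violation", "gave_up"]
    task.status = "blocked"  # circuit breaker trips with failure_limit=1


def simulate(ticks: int) -> Task:
    task = Task()
    for _ in range(ticks):
        recompute_ready_prefix(task)
        run_idle_worker(task)
    return task


task = simulate(3)
print(task.events.count("protocol_violation"))  # prints 3: one phantom run per tick
```

Every tick ends in the same `blocked` state it started from, so nothing in the model ever stops the cycle.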

Result: burned API calls, phantom "crashed" runs polluting task history, misleading repeated_crashes flag from kanban diag, and amplified pressure on rate-limited providers (the reporter hit 429s during the loop on Kiro/Anthropic).

Root cause: recompute_ready was treating every blocked task with satisfied parents as eligible for promotion, with no way to distinguish:

  • Worker / operator-initiated blocks (kanban_block) — deliberate human-in-the-loop handoff; must stay blocked until explicit kanban_unblock, per the documented kanban-orchestrator skill contract.
  • Circuit-breaker blocks (_record_task_failure tripping on repeated crashes) — should auto-recover when conditions change (parents complete, transient infra clears). This is the original intent of #40c1decb3 ("promote blocked tasks when parent dependencies complete").

Fix — distinguish the two using the cheapest available signal: the most recent "blocked"/"unblocked" event in task_events.

  • kanban_block already emits a "blocked" event row (with the review-required: … reason).
  • kanban_unblock already emits an "unblocked" event row.
  • Circuit-breaker _record_task_failure emits "gave_up", not "blocked".

New helper _has_sticky_block(conn, task_id) returns True iff the most recent of the two block-related events is "blocked". recompute_ready consults it and skips sticky-blocked tasks. The only legitimate exit is unblock_task(), which emits "unblocked" and flips the predicate back — exactly the documented human-in-the-loop pattern.
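A minimal sketch of the predicate described above, run against an in-memory SQLite database. The `task_events` layout used here (columns `id`, `task_id`, `kind`) is an assumption made for illustration, and `has_sticky_block` is a hypothetical stand-in rather than the helper from the PR.

```python
import sqlite3


def has_sticky_block(conn: sqlite3.Connection, task_id: str) -> bool:
    """True iff the newest 'blocked'/'unblocked' event for the task is 'blocked'."""
    row = conn.execute(
        "SELECT kind FROM task_events "
        "WHERE task_id = ? AND kind IN ('blocked', 'unblocked') "
        "ORDER BY id DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    return row is not None and row[0] == "blocked"


conn = sqlite3.connect(":memory:")
# Assumed schema: an auto-incrementing id gives the event ordering.
conn.execute(
    "CREATE TABLE task_events ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, task_id TEXT NOT NULL, kind TEXT NOT NULL)"
)


def emit(task_id: str, kind: str) -> None:
    conn.execute("INSERT INTO task_events (task_id, kind) VALUES (?, ?)", (task_id, kind))


emit("t1", "blocked")   # worker hands off for review
emit("t1", "gave_up")   # a later circuit-breaker event does not change the answer
worker_block = has_sticky_block(conn, "t1")

emit("t1", "unblocked")  # operator releases the task
after_unblock = has_sticky_block(conn, "t1")

emit("t2", "gave_up")    # circuit-breaker block only, never explicitly blocked
breaker_only = has_sticky_block(conn, "t2")

print(worker_block, after_unblock, breaker_only)  # prints: True False False
```

Because only the two block-related event kinds are considered, interleaved `gave_up` rows cannot make a circuit-breaker block sticky or clear a worker-initiated one.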

Also fixes the tangentially related schema-init crash the reporter flagged at the bottom of #28712 (init_db failed with no such column: session_id on a kanban.db that pre-dated the session_id migration). Three CREATE INDEX statements on tasks(<late-added-column>) were sitting at the top of SCHEMA_SQL, where they run before the additive-column migrations. On a legacy DB the table's CREATE TABLE IF NOT EXISTS is a no-op, the column doesn't exist yet, and the index DDL crashes the whole init script — including the migration that would have fixed it. The reporter had to ALTER TABLE + CREATE INDEX by hand to unstick their install. Moved all three (idx_tasks_tenant, idx_tasks_idempotency, idx_tasks_session_id) into _migrate_add_optional_columns, after the ALTER calls that guarantee the columns exist.
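The ordering problem can be reproduced in a few lines. This is a sketch under simplifying assumptions: the two-column legacy `tasks` table and the function names are hypothetical, and only the `session_id` column and `idx_tasks_session_id` index names come from the description above.

```python
import sqlite3


def legacy_db() -> sqlite3.Connection:
    """A tasks table that pre-dates the session_id column (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO tasks VALUES ('t_legacy', 'old row')")
    return conn


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


CREATE_TABLE = "CREATE TABLE IF NOT EXISTS tasks (id TEXT PRIMARY KEY, title TEXT, session_id TEXT)"
CREATE_INDEX = "CREATE INDEX IF NOT EXISTS idx_tasks_session_id ON tasks(session_id)"


def migrate(conn: sqlite3.Connection) -> None:
    if not column_exists(conn, "tasks", "session_id"):
        conn.execute("ALTER TABLE tasks ADD COLUMN session_id TEXT")


def init_index_first(conn: sqlite3.Connection) -> None:
    conn.execute(CREATE_TABLE)  # no-op on a legacy DB: the table already exists
    conn.execute(CREATE_INDEX)  # fails: the column is not there yet
    migrate(conn)               # never reached


def init_migrate_first(conn: sqlite3.Connection) -> None:
    conn.execute(CREATE_TABLE)
    migrate(conn)               # guarantees the column exists
    conn.execute(CREATE_INDEX)  # now safe on both fresh and legacy DBs


try:
    init_index_first(legacy_db())
    old_order_failed = False
except sqlite3.OperationalError:
    old_order_failed = True

conn = legacy_db()
init_migrate_first(conn)
index_present = conn.execute(
    "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = 'idx_tasks_session_id'"
).fetchone() is not None
```

In the second ordering the legacy row is untouched; only the column and index are added.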

Related Issue

Fixes #28712

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✅ Tests (adding or improving test coverage)

Changes Made

  • hermes_cli/kanban_db.py
    • New _has_sticky_block(conn, task_id) -> bool helper that reads the most recent "blocked"/"unblocked" event for a task.
    • recompute_ready() now continues past blocked tasks whose latest block-related event is "blocked"; circuit-breaker blocks (with no event, or "gave_up" event) continue to auto-recover when parents complete.
    • Moved idx_tasks_tenant, idx_tasks_idempotency and idx_tasks_session_id out of SCHEMA_SQL and into _migrate_add_optional_columns, asserted unconditionally with IF NOT EXISTS so new DBs get them on first init and legacy DBs get them after the additive ALTER TABLE calls.
    • Combined: 92 added / 15 removed lines.
  • tests/hermes_cli/test_kanban_blocked_sticky.py — 344 lines, 7 new tests:
    • test_worker_block_is_not_auto_promoted_by_recompute_ready — five back-to-back ticks leave the task blocked.
    • test_worker_block_on_child_with_done_parents_is_still_sticky — the worst false-positive (parent-completion path) is closed.
    • test_circuit_breaker_block_still_auto_promotes — preserves the pre-#28712 recovery semantics for the original 40c1decb3 intent.
    • test_gave_up_event_alone_does_not_make_block_sticky — explicit guard so the protocol_violation loop's second leg can't regress.
    • test_unblock_clears_sticky_state_and_lets_block_recover — the only legitimate exit, and subsequent circuit-breaker blocks still auto-recover.
    • test_protocol_violation_loop_is_broken — full bug reproduction: block → tick → (would-be) crash + gave_up → next tick still blocked. Would loop indefinitely without the fix.
    • test_init_db_recovers_from_legacy_tasks_table_without_session_id — hand-crafted pre-tenant / pre-idempotency_key / pre-session_id tasks table, calls init_db, asserts all three columns + indexes end up present and legacy rows survive.

How to Test

# New regression suite (7 tests).
scripts/run_tests.sh tests/hermes_cli/test_kanban_blocked_sticky.py -q

# Existing dispatcher tests must still pass, including
# test_recompute_ready_promotes_blocked_with_done_parents which
# pins the circuit-breaker recovery contract.
scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -q
# Expected: 156 passed, 1 failed (pre-existing
# `test_max_runtime_uses_current_run_start_after_retry` — unrelated
# `os.kill(999999, 0)` live-system guard, verified on upstream/main).

# Broader sweep.
scripts/run_tests.sh \
  tests/hermes_cli/test_kanban_cli.py \
  tests/hermes_cli/test_kanban_core_functionality.py \
  tests/hermes_cli/test_kanban_notify.py \
  tests/hermes_cli/test_kanban_diagnostics.py \
  tests/tools/test_kanban_tools.py \
  -q
# Expected: only the 3 pre-existing `os.kill` flakes fail
# (verified identical on upstream/main).

Manual reproduction of the loop fix:

  1. From a worker session, call kanban_block with reason="review-required: please verify".
  2. Without running hermes kanban unblock, wait through one or more dispatcher ticks (or call hermes kanban dispatch directly).
  3. Before this PR: task flips back to ready, fresh worker spawns, exits cleanly with protocol_violation, repeats.
  4. After this PR: task stays blocked indefinitely until you run hermes kanban unblock <id>. Once unblocked, normal promotion + claim semantics resume.

Manual reproduction of the schema-init fix:

# Take any older kanban.db that pre-dates the session_id migration.
sqlite3 ~/.hermes/kanban.db 'PRAGMA table_info(tasks)' | grep -q session_id || echo "DB is pre-session_id"
# Run hermes kanban init — before this PR, crashed with
# `OperationalError: no such column: session_id`.  After this PR,
# completes silently and adds the missing columns + indexes.
hermes kanban init

Checklist

Code

  • My commit messages follow Conventional Commits (fix(kanban):, test(kanban):)
  • I searched for existing PRs — no open duplicate (the related #27796 was a different auto-unblock-on-comment design, closed)
  • My PR contains only changes related to this fix
  • I've run scripts/run_tests.sh tests/hermes_cli/test_kanban_blocked_sticky.py -q (7 passed)
  • I've added regression tests covering every leg of the loop and the migration path
  • I've tested on my platform: macOS 15.x (darwin 24.6.0)

Documentation & Housekeeping

  • N/A — no user-facing docs change. The fix restores the documented behaviour from the kanban-orchestrator skill ("Any task can kanban_block() to wait for input. Dispatcher respawns after /unblock."), which the bug was violating.
  • N/A — no config keys changed
  • N/A — no architecture / workflow change
  • I've considered cross-platform impact — pure SQLite-via-sqlite3, no platform-specific code
  • N/A — no tool description/schema changes (the kanban_block / kanban_unblock tool surfaces are unchanged; only the dispatcher's interpretation of their outputs changes)

Changed files

  • hermes_cli/kanban_db.py (modified, +92/-15)
  • tests/hermes_cli/test_kanban_blocked_sticky.py (added, +344/-0)

PR #28724: fix(kanban): prevent recompute_ready from unblocking parent-free blocked tasks

Description (problem / solution / changelog)

Summary

recompute_ready() was treating every blocked task with no parents as eligible for promotion because all(...) over an empty parent set returns True.

This meant an intentionally blocked task could jump back to ready whenever an unrelated board action triggered a recompute pass.
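The vacuous-truth behaviour of `all()` is easy to demonstrate. The two functions below are hypothetical and only illustrate the idea; the PR's actual guard is not shown in this description.

```python
def eligible_before(parent_statuses: list) -> bool:
    # all() over an empty iterable is True, so a parent-free task qualifies.
    return all(status == "done" for status in parent_statuses)


def eligible_after(parent_statuses: list) -> bool:
    # Only dependency-gated tasks (at least one parent) can be auto-promoted.
    return bool(parent_statuses) and all(status == "done" for status in parent_statuses)


print(eligible_before([]))               # prints True: the bug
print(eligible_after([]))                # prints False: manual block stays put
print(eligible_after(["done", "done"]))  # prints True: dependency gate cleared
```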

What changed

  • only auto-promote blocked tasks when they are actually dependency-gated
  • keep parent-free manual blocks in blocked until an explicit unblock
  • add a regression test covering the parent-free manual block case

Repro

  1. Create a parent-free task
  2. Move it to blocked
  3. Complete some unrelated task on the board
  4. Before this fix, the blocked task moved back to ready

Validation

  • pytest tests/hermes_cli/test_kanban_db.py -k recompute_ready
  • Result: 6 passed

Changed files

  • hermes_cli/kanban_db.py (modified, +8/-2)
  • tests/hermes_cli/test_kanban_db.py (modified, +18/-0)

PR #28994: fix(kanban): worker-initiated block must not be auto-promoted (#28712)

Description (problem / solution / changelog)

Summary

Salvage of #28726 — worker / operator-initiated kanban_block() is now sticky. recompute_ready skips tasks whose latest block-related event in task_events is "blocked" while continuing to auto-promote circuit-breaker blocks (which emit "gave_up", not "blocked"). Fixes #28712.

Why salvage instead of merge #28726

#28726 also included a schema-init ordering fix for idx_tasks_session_id on legacy DBs. That half is already on main via #28754 (Michael Nguyen) and #28781 (kshitijk4poor), which is why the original PR was showing CONFLICTING. This salvage cherry-picks only the sticky-block half — xxxigm's authorship is preserved per-commit.

Changes

  • hermes_cli/kanban_db.py — new _has_sticky_block(conn, task_id) helper; recompute_ready() skips tasks whose latest blocked/unblocked event is blocked. +55 / −1 lines.
  • tests/hermes_cli/test_kanban_blocked_sticky.py — 6 regression tests covering both legs of the loop, the circuit-breaker preservation, the unblock exit, and the full protocol-violation reproduction. The original PR's 7th test (legacy-DB init recovery) was dropped during salvage — that contract is already covered by test_kanban_db.py::test_connect_migrates_legacy_db_before_optional_column_indexes on main.

Validation

Result

  • tests/hermes_cli/test_kanban_blocked_sticky.py: 6 passed
  • tests/hermes_cli/test_kanban_db.py: 158 passed
  • tests/hermes_cli/test_kanban_{cli,core_functionality,notify,diagnostics}.py + tests/tools/test_kanban_tools.py: 350 passed
  • E2E (real SQLite DB, real kanban_block / recompute_ready / unblock_task): sticky-block holds 5 ticks; unblock_task clears it; circuit-breaker blocks (gave_up event only) still auto-recover; full protocol-violation loop confirmed broken

Credit

  • @xxxigm — original fix design, helper, tests, RCA write-up in #28726
  • Closes #28712 (reported by @vyductan)

Changed files

  • hermes_cli/kanban_db.py (modified, +55/-1)
  • tests/hermes_cli/test_kanban_blocked_sticky.py (added, +268/-0)

Summary

Kanban dispatcher auto-promotes a task that was correctly blocked by its worker (with outcome=blocked, reason=review-required), spawning a fresh worker that has no actionable instructions left. The fresh worker reads the task body, sees the existing review-required handoff comment, finds nothing to do (work is already applied to disk), and exits cleanly without calling kanban_complete or kanban_block — the dispatcher records this as protocol_violation and loops indefinitely.

Hermes version

Hermes Agent v0.14.0 (2026.5.16)
Up to date

Reproduce evidence (real task t_9d1f36e2)

Worker successfully blocks for human review:

[2026-05-19 17:23] [run 7] claimed
[2026-05-19 17:23] [run 7] spawned {pid: 78840}
[2026-05-19 17:32] commented {author: default, len: 1981}    # review-required handoff posted
[2026-05-19 17:32] [run 7] blocked {reason: "review-required: ..."}

Dispatcher then promotes the blocked task back to ready and respawns:

[2026-05-19 17:43] promoted
[2026-05-19 17:43] [run 11] claimed
[2026-05-19 17:43] [run 11] spawned {pid: 83656}
[2026-05-19 17:45] [run 11] protocol_violation {exit_code: 0}
[2026-05-19 17:45] gave_up
[2026-05-19 17:45] promoted                  # <- loops again
[2026-05-19 17:45] [run 12] claimed
[2026-05-19 17:45] [run 12] spawned
[2026-05-19 17:56] [run 12] protocol_violation {exit_code: 0}
[2026-05-19 17:56] gave_up
[2026-05-19 17:56] promoted                  # <- and again
[2026-05-19 17:56] [run 13] claimed

This loop only stopped after I manually ran hermes kanban reclaim and then hermes kanban block again.

Expected behavior

A task whose latest run ended with outcome=blocked should NOT be auto-promoted by the dispatcher. Promotion to ready should require explicit operator action (hermes kanban unblock <id>), exactly like the documented human-in-the-loop pattern in the kanban-orchestrator skill:

Any task can kanban_block() to wait for input. Dispatcher respawns after /unblock.

The dispatcher is respawning even without /unblock.

Actual behavior

Dispatcher promotes the blocked task back to ready after some interval, even though no unblock was issued. The fresh worker has no instructions to act on (work already applied, review handoff already posted) so it exits cleanly without calling kanban_complete/kanban_block. Dispatcher records protocol_violation, gives up that run, and... promotes again.

Impact

  • Burns API calls in a tight loop (each respawn = full agent boot, context load, possibly tool calls before the worker realizes there's nothing to do).
  • Pollutes task history with phantom crashed runs that aren't actually crashes.
  • hermes kanban diag flags it as repeated_crashes which is misleading — the original work succeeded.
  • On rate-limited providers (we hit 429s on Kiro/Anthropic during this), the loop amplifies the rate-limit pressure.

Suggested fix

In the dispatcher promotion logic, check the task's most recent run's outcome:

  • If outcome IN ('blocked', 'completed') — don't promote unless an unblock event has been recorded since.
  • Or: track a requires_human_unblock flag on the task that gets set when a worker calls kanban_block and only cleared by unblock.

Either way, a worker-issued kanban_block should be sticky until the operator unblocks it.
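The second option, a per-task flag, might look like the sketch below. Everything here is hypothetical (the `Task` class and all four function names are toy stand-ins), and the PR notes above describe a different approach that reads the event log instead of adding a flag.

```python
from dataclasses import dataclass


@dataclass
class Task:
    status: str = "running"
    requires_human_unblock: bool = False


def worker_block(task: Task) -> None:
    """Worker hands off to a human: the block is marked sticky."""
    task.status = "blocked"
    task.requires_human_unblock = True


def breaker_block(task: Task) -> None:
    """Circuit breaker gives up: blocked, but allowed to auto-recover."""
    task.status = "blocked"


def operator_unblock(task: Task) -> None:
    """The only path that clears the sticky flag."""
    task.requires_human_unblock = False
    task.status = "ready"


def dispatcher_tick(task: Task) -> None:
    if task.status == "blocked" and not task.requires_human_unblock:
        task.status = "ready"


review = Task()
worker_block(review)
for _ in range(5):
    dispatcher_tick(review)
print(review.status)  # prints blocked

operator_unblock(review)
print(review.status)  # prints ready
```

A flag of this kind would need a schema change, whereas deriving stickiness from existing events does not.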

Tangentially related

While reproducing this I also found that hermes kanban init failed with no such column: session_id on a kanban.db that pre-dated the session_id migration. Workaround was ALTER TABLE tasks ADD COLUMN session_id TEXT; CREATE INDEX IF NOT EXISTS idx_tasks_session_id ON tasks(session_id);. The migration code in hermes_cli/kanban_db.py:1168-1179 exists but didn't fire on my DB — possibly because init errored out before reaching the schema-upgrade pass. Probably a separate issue but flagging in case it's part of the same code path.
