A task whose latest run ended with `outcome=blocked` should NOT be auto-promoted by the dispatcher. Promotion to `ready` should require explicit operator action (`hermes kanban unblock <id>`), exactly like the documented human-in-the-loop pattern in the kanban-orchestrator skill: "Any task can `kanban_block()` to wait for input. Dispatcher respawns after `/unblock`." In practice, the dispatcher is respawning even without `/unblock`.

hermes - ✅(Solved) Fix kanban: dispatcher auto-promotes blocked task → respawn worker → protocol_violation loop [3 pull requests, 1 participants]

hermes · 2026-05-19 11:43:56


GitHub stats

NousResearch/hermes-agent#28712 • Fetched 2026-05-20 04:02:22


Author

vyductan

Participants

vyductan

Timeline (top)

cross-referenced ×6, referenced ×4, labeled ×3, closed ×1

Kanban dispatcher auto-promotes a task that was correctly blocked by its worker (with outcome=blocked, reason=review-required), spawning a fresh worker that has no actionable instructions left. The fresh worker reads the task body, sees the existing review-required handoff comment, finds nothing to do (work is already applied to disk), and exits cleanly without calling kanban_complete or kanban_block — the dispatcher records this as protocol_violation and loops indefinitely.

Root Cause

While reproducing this I also found that hermes kanban init failed with no such column: session_id on a kanban.db that pre-dated the session_id migration. Workaround was ALTER TABLE tasks ADD COLUMN session_id TEXT; CREATE INDEX IF NOT EXISTS idx_tasks_session_id ON tasks(session_id);. The migration code in hermes_cli/kanban_db.py:1168-1179 exists but didn't fire on my DB — possibly because init errored out before reaching the schema-upgrade pass. Probably a separate issue but flagging in case it's part of the same code path.

Fix Action

Fix / Workaround

Dispatcher then promotes the blocked task back to ready and respawns:

[2026-05-19 17:43] promoted
[2026-05-19 17:43] [run 11] claimed
[2026-05-19 17:43] [run 11] spawned {pid: 83656}
[2026-05-19 17:45] [run 11] protocol_violation {exit_code: 0}
[2026-05-19 17:45] gave_up
[2026-05-19 17:45] promoted                  # <- loops again
[2026-05-19 17:45] [run 12] claimed
[2026-05-19 17:45] [run 12] spawned
[2026-05-19 17:56] [run 12] protocol_violation {exit_code: 0}
[2026-05-19 17:56] gave_up
[2026-05-19 17:56] promoted                  # <- and again
[2026-05-19 17:56] [run 13] claimed

A task whose latest run ended with outcome=blocked should NOT be auto-promoted by the dispatcher. Promotion to ready should require explicit operator action (hermes kanban unblock <id>), exactly like the documented human-in-the-loop pattern in the kanban-orchestrator skill: "Any task can kanban_block() to wait for input. Dispatcher respawns after /unblock."

PR fix notes

PR #28726: fix(kanban): worker-initiated block must not be auto-promoted (#28712)

Repository: NousResearch/hermes-agent
Author: xxxigm
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/28726

Description (problem / solution / changelog)

What does this PR do?

#28712 describes a kanban infinite loop:

1. Worker calls kanban_block(reason="review-required: ...") to hand off to a human.
2. Dispatcher's recompute_ready() flips the task back to ready on the next tick.
3. Fresh worker spawns, finds no actionable instructions (work already applied, review-required comment already posted), exits cleanly.
4. detect_crashed_workers records protocol_violation → _record_task_failure(failure_limit=1) → gave_up → blocked.
5. Next tick: recompute_ready promotes again → goto 3.

Result: burned API calls, phantom "crashed" runs polluting task history, misleading repeated_crashes flag from kanban diag, and amplified pressure on rate-limited providers (the reporter hit 429s during the loop on Kiro/Anthropic).

Root cause — recompute_ready was treating every blocked task with satisfied parents as eligible for promotion, with no way to distinguish:

- Worker / operator-initiated blocks (kanban_block): a deliberate human-in-the-loop handoff that must stay blocked until an explicit kanban_unblock, per the documented kanban-orchestrator skill contract.
- Circuit-breaker blocks (_record_task_failure tripping on repeated crashes): these should auto-recover when conditions change (parents complete, transient infra clears). This is the original intent of #40c1decb3 ("promote blocked tasks when parent dependencies complete").

Fix — distinguish the two using the cheapest available signal: the most recent "blocked"/"unblocked" event in task_events.

- kanban_block already emits a "blocked" event row (with the review-required: … reason).
- kanban_unblock already emits an "unblocked" event row.
- Circuit-breaker _record_task_failure emits "gave_up", not "blocked".

New helper _has_sticky_block(conn, task_id) returns True iff the most recent of the two block-related events is "blocked". recompute_ready consults it and skips sticky-blocked tasks. The only legitimate exit is unblock_task(), which emits "unblocked" and flips the predicate back — exactly the documented human-in-the-loop pattern.
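The predicate described above can be sketched in a few lines. This is a minimal illustration, not the code from the PR: the `task_events` layout used here (an auto-incrementing `id` plus `task_id` and `kind` columns) and the function name `has_sticky_block_sketch` are assumptions made for the example, and the real table almost certainly carries more columns.

```python
import sqlite3


def has_sticky_block_sketch(conn: sqlite3.Connection, task_id: str) -> bool:
    """True iff the newest 'blocked'/'unblocked' event for the task is 'blocked'.

    Assumed schema (hypothetical): task_events(id, task_id, kind).
    """
    row = conn.execute(
        "SELECT kind FROM task_events "
        "WHERE task_id = ? AND kind IN ('blocked', 'unblocked') "
        "ORDER BY id DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    return row is not None and row[0] == "blocked"


def demo() -> list[bool]:
    """Replay a short event history and record the predicate after each step."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_events ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, task_id TEXT, kind TEXT)"
    )

    def emit(kind: str) -> None:
        conn.execute(
            "INSERT INTO task_events (task_id, kind) VALUES (?, ?)", ("t1", kind)
        )

    results = [has_sticky_block_sketch(conn, "t1")]  # no events at all
    for kind in ("gave_up", "blocked", "gave_up", "unblocked", "gave_up"):
        emit(kind)
        results.append(has_sticky_block_sketch(conn, "t1"))
    return results
```

In this sketch a "gave_up" event never changes the answer in either direction: it neither makes a block sticky on its own nor clears a block that a worker requested. Only "unblocked" flips the predicate back, which matches the contract described in the PR.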

Also fixes the tangentially related schema-init crash the reporter flagged at the bottom of #28712 (init_db failed with no such column: session_id on a kanban.db that pre-dated the session_id migration). Three CREATE INDEX statements on tasks(<late-added-column>) were sitting at the top of SCHEMA_SQL, where they run before the additive-column migrations. On a legacy DB the table's CREATE TABLE IF NOT EXISTS is a no-op, the column doesn't exist yet, and the index DDL crashes the whole init script — including the migration that would have fixed it. The reporter had to ALTER TABLE + CREATE INDEX by hand to unstick their install. Moved all three (idx_tasks_tenant, idx_tasks_idempotency, idx_tasks_session_id) into _migrate_add_optional_columns, after the ALTER calls that guarantee the columns exist.
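The ordering problem can be reproduced in isolation with an in-memory database. The sketch below is a simplified stand-in for the real init path: the two-column legacy table, the function names, and the single `session_id` column are assumptions for illustration, and only the index name `idx_tasks_session_id` comes from the PR text.

```python
import sqlite3

# Hypothetical, heavily simplified schema; the real tasks table has more columns.
BASE_SCHEMA = (
    "CREATE TABLE IF NOT EXISTS tasks "
    "(id TEXT PRIMARY KEY, title TEXT, session_id TEXT)"
)
INDEX_DDL = "CREATE INDEX IF NOT EXISTS idx_tasks_session_id ON tasks(session_id)"


def legacy_db() -> sqlite3.Connection:
    """Build a database whose tasks table pre-dates the session_id column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO tasks (id, title) VALUES ('t1', 'legacy row')")
    return conn


def column_names(conn: sqlite3.Connection) -> set[str]:
    return {row[1] for row in conn.execute("PRAGMA table_info(tasks)")}


def migrate_add_columns(conn: sqlite3.Connection) -> None:
    """Additive migration: add session_id only when it is missing."""
    if "session_id" not in column_names(conn):
        conn.execute("ALTER TABLE tasks ADD COLUMN session_id TEXT")


def init_index_first(conn: sqlite3.Connection) -> None:
    """Problematic order: the index DDL runs before the column exists."""
    conn.execute(BASE_SCHEMA)  # no-op on a legacy DB, the table already exists
    conn.execute(INDEX_DDL)    # raises sqlite3.OperationalError on a legacy DB
    migrate_add_columns(conn)  # never reached


def init_migrate_first(conn: sqlite3.Connection) -> None:
    """Corrected order: guarantee the column, then create the index."""
    conn.execute(BASE_SCHEMA)
    migrate_add_columns(conn)
    conn.execute(INDEX_DDL)
```

Calling `init_index_first(legacy_db())` fails before the migration can run, while `init_migrate_first(legacy_db())` completes and leaves the existing row in place. On a fresh database both orders succeed, which is presumably why the problem only appeared on older installs.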

Related Issue

Fixes #28712

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✅ Tests (adding or improving test coverage)

Changes Made

hermes_cli/kanban_db.py —
- New _has_sticky_block(conn, task_id) -> bool helper that reads the most recent "blocked"/"unblocked" event for a task.
- recompute_ready() now continues past blocked tasks whose latest block-related event is "blocked"; circuit-breaker blocks (with no event, or "gave_up" event) continue to auto-recover when parents complete.
- Moved idx_tasks_tenant, idx_tasks_idempotency and idx_tasks_session_id out of SCHEMA_SQL and into _migrate_add_optional_columns, asserted unconditionally with IF NOT EXISTS so new DBs get them on first init and legacy DBs get them after the additive ALTER TABLE calls.
- Combined: 92 added / 15 removed lines.
tests/hermes_cli/test_kanban_blocked_sticky.py — 344 lines, 7 new tests:
- test_worker_block_is_not_auto_promoted_by_recompute_ready — five back-to-back ticks leave the task blocked.
- test_worker_block_on_child_with_done_parents_is_still_sticky — the worst false-positive (parent-completion path) is closed.
- test_circuit_breaker_block_still_auto_promotes — preserves the pre-#28712 recovery semantics for the original 40c1decb3 intent.
- test_gave_up_event_alone_does_not_make_block_sticky — explicit guard so the protocol_violation loop's second leg can't regress.
- test_unblock_clears_sticky_state_and_lets_block_recover — the only legitimate exit, and subsequent circuit-breaker blocks still auto-recover.
- test_protocol_violation_loop_is_broken — full bug reproduction: block → tick → (would-be) crash + gave_up → next tick still blocked. Would loop indefinitely without the fix.
- test_init_db_recovers_from_legacy_tasks_table_without_session_id — hand-crafted pre-tenant / pre-idempotency_key / pre-session_id tasks table, calls init_db, asserts all three columns + indexes end up present and legacy rows survive.

How to Test

# New regression suite (7 tests).
scripts/run_tests.sh tests/hermes_cli/test_kanban_blocked_sticky.py -q

# Existing dispatcher tests must still pass, including
# test_recompute_ready_promotes_blocked_with_done_parents which
# pins the circuit-breaker recovery contract.
scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -q
# Expected: 156 passed, 1 failed (pre-existing
# `test_max_runtime_uses_current_run_start_after_retry` — unrelated
# `os.kill(999999, 0)` live-system guard, verified on upstream/main).

# Broader sweep.
scripts/run_tests.sh \
  tests/hermes_cli/test_kanban_cli.py \
  tests/hermes_cli/test_kanban_core_functionality.py \
  tests/hermes_cli/test_kanban_notify.py \
  tests/hermes_cli/test_kanban_diagnostics.py \
  tests/tools/test_kanban_tools.py \
  -q
# Expected: only the 3 pre-existing `os.kill` flakes fail
# (verified identical on upstream/main).

Manual reproduction of the loop fix:

1. From a worker session, call kanban_block with reason="review-required: please verify".
2. Without running hermes kanban unblock, wait through one or more dispatcher ticks (or call hermes kanban dispatch directly).
3. Before this PR: task flips back to ready, fresh worker spawns, exits cleanly with protocol_violation, repeats.
4. After this PR: task stays blocked indefinitely until you run hermes kanban unblock <id>. Once unblocked, normal promotion + claim semantics resume.

Manual reproduction of the schema-init fix:

# Take any older kanban.db that pre-dates the session_id migration.
sqlite3 ~/.hermes/kanban.db 'PRAGMA table_info(tasks)' | grep -q session_id || echo "DB is pre-session_id"
# Run hermes kanban init — before this PR, crashed with
# `OperationalError: no such column: session_id`.  After this PR,
# completes silently and adds the missing columns + indexes.
hermes kanban init

Checklist

Code

My commit messages follow Conventional Commits (fix(kanban):, test(kanban):)
I searched for existing PRs — no open duplicate (the related #27796 was a different auto-unblock-on-comment design, closed)
My PR contains only changes related to this fix
I've run scripts/run_tests.sh tests/hermes_cli/test_kanban_blocked_sticky.py -q (7 passed)
I've added regression tests covering every leg of the loop and the migration path
I've tested on my platform: macOS 15.x (darwin 24.6.0)

Documentation & Housekeeping

N/A — no user-facing docs change. The fix restores the documented behaviour from the kanban-orchestrator skill ("Any task can kanban_block() to wait for input. Dispatcher respawns after /unblock."), which the bug was violating.
N/A — no config keys changed
N/A — no architecture / workflow change
I've considered cross-platform impact — pure SQLite-via-sqlite3, no platform-specific code
N/A — no tool description/schema changes (the kanban_block / kanban_unblock tool surfaces are unchanged; only the dispatcher's interpretation of their outputs changes)

Changed files

hermes_cli/kanban_db.py (modified, +92/-15)
tests/hermes_cli/test_kanban_blocked_sticky.py (added, +344/-0)

PR #28724: fix(kanban): prevent recompute_ready from unblocking parent-free blocked tasks

Repository: NousResearch/hermes-agent
Author: Dusk1e
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/28724

Description (problem / solution / changelog)

Summary

recompute_ready() was treating every blocked task with no parents as eligible for promotion because all(...) over an empty parent set returns True.

This meant an intentionally blocked task could jump back to ready whenever an unrelated board action triggered a recompute pass.

What changed

only auto-promote blocked tasks when they are actually dependency-gated
keep parent-free manual blocks in blocked until an explicit unblock
add a regression test covering the parent-free manual block case
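The empty-parent-set behaviour is ordinary Python semantics: `all()` over an empty iterable is vacuously true. The sketch below contrasts an unguarded check with a guarded one. Both function names and the list-of-status-strings representation are hypothetical, and this is not the diff from the PR.

```python
def parents_done_unguarded(parent_statuses: list[str]) -> bool:
    """Treats a task with no parents as 'all parents done' (vacuous truth)."""
    return all(status == "done" for status in parent_statuses)


def parents_done_guarded(parent_statuses: list[str]) -> bool:
    """Only true when the task has at least one parent and all are done."""
    return bool(parent_statuses) and all(
        status == "done" for status in parent_statuses
    )
```

With the guarded form, a parent-free blocked task is never considered dependency-gated, so a recompute pass leaves it alone, while a task whose parents have all finished still qualifies.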

Repro

1. Create a parent-free task
2. Move it to blocked
3. Complete some unrelated task on the board
4. Before this fix, the blocked task moved back to ready

Validation

pytest tests/hermes_cli/test_kanban_db.py -k recompute_ready
Result: 6 passed

Changed files

hermes_cli/kanban_db.py (modified, +8/-2)
tests/hermes_cli/test_kanban_db.py (modified, +18/-0)

PR #28994: fix(kanban): worker-initiated block must not be auto-promoted (#28712)

Repository: NousResearch/hermes-agent
Author: teknium1
State: closed | merged: True
Link: https://github.com/NousResearch/hermes-agent/pull/28994

Description (problem / solution / changelog)

Summary

Salvage of #28726 — worker / operator-initiated kanban_block() is now sticky. recompute_ready skips tasks whose latest block-related event in task_events is "blocked" while continuing to auto-promote circuit-breaker blocks (which emit "gave_up", not "blocked"). Fixes #28712.

Why salvage instead of merge #28726

#28726 also included a schema-init ordering fix for idx_tasks_session_id on legacy DBs. That half is already on main via #28754 (Michael Nguyen) and #28781 (kshitijk4poor), which is why the original PR was showing CONFLICTING. This salvage cherry-picks only the sticky-block half — xxxigm's authorship is preserved per-commit.

Changes

hermes_cli/kanban_db.py — new _has_sticky_block(conn, task_id) helper; recompute_ready() skips tasks whose latest blocked/unblocked event is blocked. +55 / −1 lines.
tests/hermes_cli/test_kanban_blocked_sticky.py — 6 regression tests covering both legs of the loop, the circuit-breaker preservation, the unblock exit, and the full protocol-violation reproduction. The original PR's 7th test (legacy-DB init recovery) was dropped during salvage — that contract is already covered by test_kanban_db.py::test_connect_migrates_legacy_db_before_optional_column_indexes on main.

Validation

Results:
- `tests/hermes_cli/test_kanban_blocked_sticky.py`: 6 passed
- `tests/hermes_cli/test_kanban_db.py`: 158 passed
- `tests/hermes_cli/test_kanban_{cli,core_functionality,notify,diagnostics}.py + tests/tools/test_kanban_tools.py`: 350 passed
- E2E (real SQLite DB, real `kanban_block` / `recompute_ready` / `unblock_task`): sticky-block holds 5 ticks; `unblock_task` clears it; circuit-breaker blocks (gave_up event only) still auto-recover; full protocol-violation loop confirmed broken

Credit

@xxxigm — original fix design, helper, tests, RCA write-up in #28726
Closes #28712 (reported by @vyductan)

Changed files

hermes_cli/kanban_db.py (modified, +55/-1)
tests/hermes_cli/test_kanban_blocked_sticky.py (added, +268/-0)



Summary

Hermes version

Hermes Agent v0.14.0 (2026.5.16)
Up to date

Reproduce evidence (real task `t_9d1f36e2`)

Worker successfully blocks for human review:

[2026-05-19 17:23] [run 7] claimed
[2026-05-19 17:23] [run 7] spawned {pid: 78840}
[2026-05-19 17:32] commented {author: default, len: 1981}    # review-required handoff posted
[2026-05-19 17:32] [run 7] blocked {reason: "review-required: ..."}

Dispatcher then promotes the blocked task back to ready and respawns:

[2026-05-19 17:43] promoted
[2026-05-19 17:43] [run 11] claimed
[2026-05-19 17:43] [run 11] spawned {pid: 83656}
[2026-05-19 17:45] [run 11] protocol_violation {exit_code: 0}
[2026-05-19 17:45] gave_up
[2026-05-19 17:45] promoted                  # <- loops again
[2026-05-19 17:45] [run 12] claimed
[2026-05-19 17:45] [run 12] spawned
[2026-05-19 17:56] [run 12] protocol_violation {exit_code: 0}
[2026-05-19 17:56] gave_up
[2026-05-19 17:56] promoted                  # <- and again
[2026-05-19 17:56] [run 13] claimed

This loop only stopped after I manually ran hermes kanban reclaim and then hermes kanban block again.

Expected behavior

Any task can kanban_block() to wait for input. Dispatcher respawns after /unblock.

The dispatcher is respawning even without /unblock.

Actual behavior

Dispatcher promotes the blocked task back to ready after some interval, even though no unblock was issued. The fresh worker has no instructions to act on (work already applied, review handoff already posted) so it exits cleanly without calling kanban_complete/kanban_block. Dispatcher records protocol_violation, gives up that run, and... promotes again.

Impact

Burns API calls in a tight loop (each respawn = full agent boot, context load, possibly tool calls before the worker realizes there's nothing to do).
Pollutes task history with phantom crashed runs that aren't actually crashes.
hermes kanban diag flags it as repeated_crashes which is misleading — the original work succeeded.
On rate-limited providers (we hit 429s on Kiro/Anthropic during this), the loop amplifies the rate-limit pressure.

Suggested fix

In the dispatcher promotion logic, check the task's most recent run's outcome:

If outcome IN ('blocked', 'completed') — don't promote unless an unblock event has been recorded since.
Or: track a requires_human_unblock flag on the task that gets set when a worker calls kanban_block and only cleared by unblock.

Either way, a worker-issued kanban_block should be sticky until the operator unblocks it.
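The second option, a flag on the task, can be sketched with an in-memory model. Everything here is hypothetical: the `Task` dataclass, the function names, and the one-line dispatcher tick are illustrations of the idea rather than Hermes code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    status: str = "ready"
    requires_human_unblock: bool = False  # hypothetical flag from the suggestion


def worker_block(task: Task) -> None:
    """A worker asks for human input: block and mark the block as sticky."""
    task.status = "blocked"
    task.requires_human_unblock = True


def circuit_breaker_block(task: Task) -> None:
    """Failure-driven block: the sticky flag is left untouched."""
    task.status = "blocked"


def operator_unblock(task: Task) -> None:
    """Explicit operator action: the only way to clear the sticky flag."""
    task.requires_human_unblock = False
    task.status = "ready"


def dispatcher_tick(task: Task) -> None:
    """Promote blocked tasks unless a human unblock is still required."""
    if task.status == "blocked" and not task.requires_human_unblock:
        task.status = "ready"
```

A flag like this needs a schema change and must be kept in sync on every block and unblock path, whereas deriving the same answer from events that are already recorded needs neither.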

Tangentially related
