hermes - 💡(How to fix) Fix kanban: three gaps in blocked / iteration-exhausted handling that permit infinite "unblock → re-stuck" loops

hermes · 2026-05-21 09:13:13


We hit a real task that cycled through blocked three times over ~3 hours, each cycle requiring manual unblock to resume — and the system never escalated, never tripped failure_limit, never auto-failed. After tracing the code paths we found three independent gaps that together make this pattern silent:

1. `_rule_stuck_in_blocked` only measures the age of the current blocked stint, and any commented / unblocked event resets the timer. A task that gets re-blocked every few minutes is therefore invisible to it, no matter how many cycles it goes through.
2. Iteration-budget exhaustion maps to `kanban_block` (status=blocked), but `_rule_consecutive_failures` explicitly excludes the blocked outcome (see kanban_diagnostics.py line ~696: "Other outcomes (timed_out, blocked, spawn_failed, gave_up)", all of which are skipped). Budget-exhausted runs therefore never increment consecutive_failures, and the kanban.failure_limit=5 (DEFAULT_FAILURE_LIMIT) breaker is bypassed.
3. `release_stale_claims` decides on `_pid_alive(worker_pid)` alone and ignores the last_heartbeat_at it reads from the row (see kanban_db.py ~L2384). This is deliberate per issue #23025 (don't kill slow-but-healthy LLMs in long tool-free calls), and the documented backstop is enforce_max_runtime. But enforce_max_runtime is opt-in per task (max_runtime_seconds defaults to NULL), so a task created without that field has no upper bound on wall-clock runtime as long as the PID stays alive. We observed a single run hold its claim for 91 minutes with last_heartbeat_at frozen at t+10min because the worker entered a logic loop with no tool calls.


Fix / Workaround

In our case the workaround was "human notices and takes over manually". For users running unattended kanban swarms (which is the design intent per the README), this pattern silently burns budget and stalls dependents. The three gaps compound — fixing any one of them would have stopped our scenario, but the combination is what makes it invisible.


Reproduction (real task, summarized)

| Run | Duration | Terminating event | How it ended |
|---|---|---|---|
| 1 | 26 min | worker called `kanban_block` with `review-required` handoff | status=`blocked` |
| 2 | 11 min | `Iteration budget exhausted (80/80)` → worker emitted `kanban_block` | status=`blocked` |
| 3 | 91 min | worker self-detected logic loop, killed its child PID, emitted `kanban_block` with partial-progress note | status=`blocked` |

Throughout, consecutive_failures stayed at 0 (none of the three outcomes counted). _rule_stuck_in_blocked never fired because each unblock reset its timer well under the 24h default. release_stale_claims extended the claim every 15 min during run 3 because _pid_alive was true; last_heartbeat_at had been stale for over an hour but was only recorded into the event payload, not consulted for the decision.

Net effect: the system has zero automated stop signal for "this task has been bouncing in/out of blocked repeatedly" — only a tired human noticing.

Proposed fixes

Gap 1 — add a count-based sibling to `_rule_stuck_in_blocked`

Add a new rule, e.g. `_rule_blocked_thrashing`, that fires when count(events.kind='blocked' for this task) >= N, regardless of recency. Suggested config:

kanban:
  blocked_count_limit: 3       # warning at 3, error at 5

Alternatively, count blocked outcomes toward consecutive_failures when the reason is a known auto-block (iteration_exhausted, worker-self-stuck-detected, etc.) rather than a review-required handoff. See Gap 2.
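The count-based rule described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's code: the `Event` dataclass, the `rule_blocked_thrashing` signature, and the `warn_at` / `error_at` parameters are hypothetical stand-ins, and the thresholds simply mirror the "warning at 3, error at 5" comment in the suggested config.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for one row of a task's event log."""
    task_id: str
    kind: str  # e.g. "blocked", "unblocked", "commented"


def rule_blocked_thrashing(
    task_id: str,
    events: Iterable[Event],
    warn_at: int = 3,
    error_at: int = 5,
) -> Optional[str]:
    """Return "error", "warning", or None from the lifetime blocked count.

    Unlike an age-based check, nothing resets this counter: unblocked and
    commented events are simply ignored.
    """
    blocked = sum(
        1 for e in events if e.task_id == task_id and e.kind == "blocked"
    )
    if blocked >= error_at:
        return "error"
    if blocked >= warn_at:
        return "warning"
    return None
```

With three blocked / unblocked cycles, as in the reproduction, this sketch returns "warning" even though every individual blocked stint was short.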

Gap 2 — taxonomize `blocked` reasons and feed auto-block outcomes into `consecutive_failures`

Today kanban_block is one channel for both:

- Intentional review-required handoffs (the worker is healthy and waiting on a human)
- Defensive self-reports of failure (budget exhausted, self-detected loop, stuck-too-long)

_rule_consecutive_failures shouldn't treat these the same. Suggestion: add a block_kind field (review_required | auto_failure) to the kanban_block payload, and have _record_run_outcome map auto_failure blocks to consecutive_failures += 1. The failure_limit breaker then catches Gap 2 naturally.

Minimal version: hardcode "Iteration budget exhausted" as auto-failure for now; add block_kind later.
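A rough sketch of how both variants might fit together, assuming a simplified in-memory task object. `TaskState`, `classify_block`, and `record_blocked_outcome` are hypothetical names for illustration and do not reflect the real `_record_run_outcome` signature; whether a review-required block should also reset the counter is left open here.

```python
from dataclasses import dataclass
from typing import Optional

AUTO_FAILURE = "auto_failure"
REVIEW_REQUIRED = "review_required"
DEFAULT_FAILURE_LIMIT = 5  # value quoted in the issue for kanban.failure_limit


@dataclass
class TaskState:
    """Hypothetical stand-in for the task row's failure counter."""
    consecutive_failures: int = 0


def classify_block(reason: str, block_kind: Optional[str] = None) -> str:
    """Prefer an explicit block_kind; otherwise fall back to the minimal
    hardcoded match on the budget-exhaustion message."""
    if block_kind in (AUTO_FAILURE, REVIEW_REQUIRED):
        return block_kind
    if "Iteration budget exhausted" in reason:
        return AUTO_FAILURE
    return REVIEW_REQUIRED


def record_blocked_outcome(
    task: TaskState,
    reason: str,
    block_kind: Optional[str] = None,
    failure_limit: int = DEFAULT_FAILURE_LIMIT,
) -> bool:
    """Count auto-failure blocks; return True once the breaker should trip."""
    if classify_block(reason, block_kind) == AUTO_FAILURE:
        task.consecutive_failures += 1
    # Assumption: review-required handoffs leave the counter unchanged.
    return task.consecutive_failures >= failure_limit
```

In this sketch a review-required handoff never moves the counter, while five auto-failure blocks in a row reach the default limit.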

Gap 3 — make `last_heartbeat_at` an upper bound on `claim_extended`

Currently release_stale_claims only checks _pid_alive. The fix is small: extend the claim only if both _pid_alive(pid) and now - last_heartbeat_at < HEARTBEAT_STALE_SECONDS (suggested default 1800s, configurable). If the heartbeat is older than the threshold, fall through to the normal reclaim path — that branch already handles a "live but unresponsive" worker (SIGTERM/SIGKILL).

This preserves the #23025 design intent (don't kill on TTL-alone for slow LLMs that are mid-call) while still bounding "PID alive but agent looped". As a second-order benefit it makes enforce_max_runtime non-essential for stuck-detection — it can remain an explicit per-task SLA cap.
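The extension decision proposed for Gap 3 can be sketched as a pure function. This is an illustration only: `should_extend_claim` and its parameters are hypothetical, timestamps are assumed to be seconds, and the treatment of a missing heartbeat is an assumption rather than something the issue specifies.

```python
from typing import Optional

HEARTBEAT_STALE_SECONDS = 1800  # suggested default from the proposal


def should_extend_claim(
    pid_alive: bool,
    now: float,
    last_heartbeat_at: Optional[float],
    stale_after: float = HEARTBEAT_STALE_SECONDS,
) -> bool:
    """Extend only if the PID is alive AND the heartbeat is fresh.

    Returning False means "fall through to the normal reclaim path".
    """
    if not pid_alive:
        return False
    if last_heartbeat_at is None:
        # Assumption: with no heartbeat recorded yet, keep the current
        # PID-only behaviour instead of reclaiming immediately.
        return True
    return (now - last_heartbeat_at) < stale_after
```

Applied to run 3, where the heartbeat froze at t+10min, the claim would still be extended at the t+15min and t+30min checks but not at t+45min, so the run would be reclaimed well before the observed 91 minutes.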

Affected files

- hermes_cli/kanban_diagnostics.py: _rule_stuck_in_blocked (count-based sibling), _rule_consecutive_failures (treat auto-block as failure)
- hermes_cli/kanban_db.py: release_stale_claims (~L2384, gate claim_extended on heartbeat freshness), _record_run_outcome (block_kind plumbing)
- gateway/run.py: iteration_budget exhaustion path (emit block_kind=auto_failure)

Why we're filing this

Happy to send a PR for Gap 3 (smallest, most contained) if the design direction looks right.
