hermes - 💡(How to fix) Fix kanban: three gaps in blocked / iteration-exhausted handling that permit infinite "unblock → re-stuck" loops

Summary

We hit a real task that cycled through blocked three times over ~3 hours, each cycle requiring manual unblock to resume — and the system never escalated, never tripped failure_limit, never auto-failed. After tracing the code paths we found three independent gaps that together make this pattern silent:

  1. _rule_stuck_in_blocked only measures how long the task has been in its current blocked stint, and any commented / unblocked event resets the timer → a task that gets re-blocked every few minutes is invisible to it, no matter how many cycles it goes through.
  2. An exhausted iteration budget maps to kanban_block (status=blocked), but _rule_consecutive_failures explicitly skips the blocked outcome (see kanban_diagnostics.py line ~696: "Other outcomes (timed_out, blocked, spawn_failed, gave_up)"). So budget-exhausted runs never increment consecutive_failures, and the kanban.failure_limit=5 (DEFAULT_FAILURE_LIMIT) breaker is bypassed.
  3. release_stale_claims decides on _pid_alive(worker_pid) alone and ignores the last_heartbeat_at it reads from the row (see kanban_db.py ~L2384). This is deliberate per issue #23025 (don't kill slow-but-healthy LLMs in long tool-free calls), and the documented backstop is enforce_max_runtime. But enforce_max_runtime is opt-in per task (max_runtime_seconds defaults to NULL), so a task created without that field has no upper bound on wall-clock runtime as long as the PID stays alive. We observed a single run hold its claim for 91 minutes with last_heartbeat_at frozen at t+10min because the worker entered a logic loop with no tool calls.

Reproduction (real task, summarized)

| run | duration | terminating event | how it ended |
|---|---|---|---|
| 1 | 26 min | worker called kanban_block with review-required handoff | status=blocked |
| 2 | 11 min | Iteration budget exhausted (80/80) → worker emitted kanban_block | status=blocked |
| 3 | 91 min | worker self-detected logic loop, killed its child PID, emitted kanban_block with partial-progress note | status=blocked |

Throughout, consecutive_failures stayed at 0 (none of the three outcomes counted). _rule_stuck_in_blocked never fired because each unblock reset its timer well under the 24h default. release_stale_claims extended the claim every 15 min during run 3 because _pid_alive was true; last_heartbeat_at had been stale for over an hour but was only recorded into the event payload, not consulted for the decision.

Net effect: the system has zero automated stop signal for "this task has been bouncing in/out of blocked repeatedly" — only a tired human noticing.

Proposed fixes

Gap 1 — add a count-based sibling to _rule_stuck_in_blocked

A new rule, e.g. _rule_blocked_thrashing, that fires when count(events.kind='blocked' for this task) >= N regardless of recency. Suggested:

kanban:
  blocked_count_limit: 3       # warning at 3, error at 5

Alternatively, count blocked outcomes toward consecutive_failures when the reason is a known auto-block (iteration_exhausted, worker-self-stuck-detected, etc.) rather than a review-required handoff. See Gap 2.
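The count-based rule could be sketched roughly as follows. This is a minimal illustration, not the actual diagnostics code: the Event shape, the function name rule_blocked_thrashing, and the two threshold constants are assumptions made for the example (the thresholds mirror the "warning at 3, error at 5" comment above).

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical thresholds, taken from the "warning at 3, error at 5" comment.
BLOCKED_COUNT_WARNING = 3
BLOCKED_COUNT_ERROR = 5


@dataclass(frozen=True)
class Event:
    """Minimal stand-in for a kanban event row (assumed shape)."""
    task_id: str
    kind: str  # e.g. "blocked", "unblocked", "commented"


def rule_blocked_thrashing(task_id: str, events: Iterable[Event]) -> Optional[str]:
    """Return a severity when a task has been blocked too many times.

    Unlike an age-based check, this counts every 'blocked' event for the
    task, so an unblock or comment in between does not reset anything.
    """
    blocked_count = sum(
        1 for e in events if e.task_id == task_id and e.kind == "blocked"
    )
    if blocked_count >= BLOCKED_COUNT_ERROR:
        return "error"
    if blocked_count >= BLOCKED_COUNT_WARNING:
        return "warning"
    return None


# The three-cycle history from the reproduction: block, manual unblock, repeat.
history = [
    Event("t1", "blocked"), Event("t1", "unblocked"),
    Event("t1", "blocked"), Event("t1", "unblocked"),
    Event("t1", "blocked"),
]
print(rule_blocked_thrashing("t1", history))  # prints "warning"
```

With this shape, the task from the reproduction would have raised a warning on its third block, even though each individual blocked stint was far shorter than the 24h age threshold.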

Gap 2 — taxonomize blocked reasons and feed auto-block outcomes into consecutive_failures

Today kanban_block is one channel for both:

  • Intentional review-required handoffs (the worker is healthy and waiting on a human)
  • Defensive self-reports of failure (budget exhausted, self-detected loop, stuck-too-long)

_rule_consecutive_failures shouldn't treat these the same. Suggestion: add a block_kind field (review_required | auto_failure) to the kanban_block payload, and have _record_run_outcome map auto_failure blocks to consecutive_failures += 1. The failure_limit breaker then catches Gap 2 naturally.

Minimal version: hardcode "Iteration budget exhausted" as auto-failure for now; add block_kind later.
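A rough sketch of how the two variants could fit together, assuming a simplified task record. TaskState, classify_block, and record_blocked_outcome are hypothetical names for illustration only; the real _record_run_outcome signature is not shown in this issue and may differ.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_LIMIT = 5  # mirrors kanban.failure_limit=5 quoted above

# The two block_kind values proposed above.
REVIEW_REQUIRED = "review_required"
AUTO_FAILURE = "auto_failure"


@dataclass
class TaskState:
    """Assumed minimal task record; only the counter matters here."""
    consecutive_failures: int = 0


def classify_block(reason: str, block_kind: Optional[str] = None) -> str:
    """Prefer an explicit block_kind; otherwise fall back to the minimal
    hardcoded match on the budget-exhaustion message."""
    if block_kind in (REVIEW_REQUIRED, AUTO_FAILURE):
        return block_kind
    if "iteration budget exhausted" in reason.lower():
        return AUTO_FAILURE
    return REVIEW_REQUIRED


def record_blocked_outcome(
    task: TaskState, reason: str, block_kind: Optional[str] = None
) -> bool:
    """Update the counter for one blocked run; return True if the breaker trips."""
    if classify_block(reason, block_kind) == AUTO_FAILURE:
        task.consecutive_failures += 1
    # Review-required handoffs leave the counter untouched (assumption:
    # this matches how blocked outcomes are skipped today).
    return task.consecutive_failures >= FAILURE_LIMIT


task = TaskState()
record_blocked_outcome(task, "review required: please check the diff")
tripped = False
for _ in range(5):
    tripped = record_blocked_outcome(task, "Iteration budget exhausted (80/80)")
print(task.consecutive_failures, tripped)  # prints "5 True"
```

The string match covers the minimal version; once workers send block_kind explicitly, the fallback branch can be dropped without changing callers.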

Gap 3 — make last_heartbeat_at an upper bound on claim_extended

Currently release_stale_claims only checks _pid_alive. The fix is small: extend the claim only if both _pid_alive(pid) and now - last_heartbeat_at < HEARTBEAT_STALE_SECONDS (suggested default 1800s, configurable). If the heartbeat is older than the threshold, fall through to the normal reclaim path — that branch already handles a "live but unresponsive" worker (SIGTERM/SIGKILL).

This preserves the #23025 design intent (don't kill on TTL-alone for slow LLMs that are mid-call) while still bounding "PID alive but agent looped". As a second-order benefit it makes enforce_max_runtime non-essential for stuck-detection — it can remain an explicit per-task SLA cap.
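The gating condition could look something like the sketch below. should_extend_claim is a hypothetical helper, not an existing function in kanban_db.py, and timestamps are assumed to be epoch seconds. How to treat a run that has not recorded any heartbeat yet is also an assumption here: the sketch keeps today's PID-only behaviour in that case.

```python
import time
from typing import Optional

HEARTBEAT_STALE_SECONDS = 1800  # suggested default from the proposal above


def should_extend_claim(
    pid_alive: bool,
    last_heartbeat_at: Optional[float],
    now: Optional[float] = None,
    stale_after: float = HEARTBEAT_STALE_SECONDS,
) -> bool:
    """Extend a claim only if the worker PID is alive AND its heartbeat is fresh.

    Returning False means "fall through to the normal reclaim path".
    """
    if not pid_alive:
        return False
    if last_heartbeat_at is None:
        # Assumption: with no heartbeat recorded yet, keep PID-only behaviour.
        return True
    if now is None:
        now = time.time()
    return (now - last_heartbeat_at) < stale_after


# Run 3 from the reproduction, in seconds since the run started:
# the heartbeat froze at t+10min while the PID stayed alive.
frozen_heartbeat = 10 * 60
print(should_extend_claim(True, frozen_heartbeat, now=25 * 60))  # prints "True"
print(should_extend_claim(True, frozen_heartbeat, now=91 * 60))  # prints "False"
```

Under these assumptions the 15-minute check at t+25min would still extend the claim, and the first check more than 30 minutes after the last heartbeat would hand the run to the reclaim path instead of letting it hold the claim for 91 minutes.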

Affected files

  • hermes_cli/kanban_diagnostics.py: _rule_stuck_in_blocked (count-based sibling), _rule_consecutive_failures (treat auto-block as failure)
  • hermes_cli/kanban_db.py: release_stale_claims (~L2384, gate claim_extended on heartbeat freshness), _record_run_outcome (block_kind plumbing)
  • gateway/run.py: iteration_budget exhaustion path (emit block_kind=auto_failure)

Why we're filing this

In our case the workaround was "human notices and takes over manually". For users running unattended kanban swarms (which is the design intent per the README), this pattern silently burns budget and stalls dependents. The three gaps compound — fixing any one of them would have stopped our scenario, but the combination is what makes it invisible.

Happy to send a PR for Gap 3 (smallest, most contained) if the design direction looks right.
