hermes - 💡(How to fix) Fix Umbrella: Kanban orchestration gaps — stale detection, silent recovery, orphan sweep, subagent supervision, and related reliability issues


Summary

This is an umbrella issue for known gaps in the Kanban multi-agent orchestration feature. It does not report a single bug. It maps the landscape of open issues and design gaps that collectively prevent Kanban from serving as a reliable, human-inspectable, failure-resistant multi-agent coordination substrate. The intent is to provide a discussion frame for prioritising and tracking closure across the gap surface.


Gap 1 — Stale detection is opt-in and off by default

The dispatcher's heartbeat-based stale detection (stale_timeout_seconds configured via kanban.dispatch_stale_timeout_seconds) defaults to 0 — disabled. A worker that is stuck in a non-crashing loop (PID alive, API calls returning but no progress) can sit in running indefinitely unless the claim TTL (15 min) or the 1-hour heartbeat-staleness backstop catches it.

Covered by:

  • No dedicated issue exists for the stale_timeout_seconds=0 default being a footgun. The config schema documents it as "0 disables stale detection" but there is no discussion of whether that default is appropriate for production multi-agent workloads.
  • The auto-heartbeat bridge (#31752, filed by @faisfamilytravel — thank you) mitigated the most acute version of this (workers reclaimed mid-flight because runtime activity didn't touch the board heartbeat). That fix is in place. But it only covers the case where the worker is making observable progress.

What's missing: A proposal to change the default to a non-zero value (e.g., 300s), and/or a way for tasks/boards to opt into time-based heartbeats at creation time without requiring global config changes.
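A minimal sketch of how a non-zero default and a per-task opt-in could combine. Everything here is an assumption for illustration: the `Task` shape, the per-task `stale_timeout_seconds` field, the board-level override, and the 300s default are hypothetical and are not taken from the Hermes codebase. The only behaviour borrowed from the existing config is that 0 means "disabled".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical non-zero default suggested in the text (today's default is 0).
PROPOSED_DEFAULT_STALE_TIMEOUT = 300


@dataclass
class Task:
    task_id: str
    last_heartbeat: float  # epoch seconds of the last board heartbeat
    stale_timeout_seconds: Optional[int] = None  # hypothetical per-task override


def effective_stale_timeout(
    task: Task,
    board_timeout: Optional[int] = None,
    global_timeout: int = PROPOSED_DEFAULT_STALE_TIMEOUT,
) -> int:
    """Most specific setting wins: task, then board, then global config."""
    for value in (task.stale_timeout_seconds, board_timeout):
        if value is not None:
            return value
    return global_timeout


def is_stale(
    task: Task,
    now: float,
    board_timeout: Optional[int] = None,
    global_timeout: int = PROPOSED_DEFAULT_STALE_TIMEOUT,
) -> bool:
    timeout = effective_stale_timeout(task, board_timeout, global_timeout)
    if timeout <= 0:
        # 0 keeps its documented meaning: stale detection disabled.
        return False
    return now - task.last_heartbeat > timeout
```

With this layering, a board used for production workflows could opt in at creation time while the global setting stays untouched.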

Current P3. Warranting P2 consideration: a kanban board used for production multi-agent workflows that has no stale detection configured is vulnerable to silently stalled workflows. This is a robustness gap, not a cosmetic one.


Gap 2 — Silent recovery: no aggregated escalation path when reclaim/circuit-breaker fires

When the dispatcher reclaims a task (stale claim, crash, timeout, circuit-breaker trip), the task goes back to ready or blocked. The Telegram notifier fires for connected gateway users, but there is no:

  • Aggregated "this task has failed N times across M retries" view on the board itself
  • Escalation path when a blocked task sits unattended beyond a configurable threshold
  • Notification for gave_up events (the circuit breaker tripping)

Covered by:

  • #22995 (filed by @chrischjh — thank you): Push notification on task block/crash. The Telegram notification layer exists as a result, but it's per-event, not aggregated.
  • #30587 (filed by @Skippy-the-Magnificent-one — thank you): "Adaptive retry with model escalation and owner notification." Directly proposes Tier 3: notification on gave_up and stale blocked alerts. This is the closest issue to closing the gap, but it's P3 and has had no maintainer engagement.
  • #24329 (filed by @yepyhun — thank you): Surface non-runnable ready tasks and unknown assignees. Improves board visibility but doesn't address the escalation path.
  • #25641 (filed by @qWaitCrypto — thank you): Diagnostics alignment with circuit breaker thresholds. Fixes the diagnostics lag but not the absence of an escalation surface.

What's missing: A design for what happens after the circuit breaker fires or a task sits in blocked for too long. The current architecture treats blocked as a terminal state the human discovers by polling. There's no stale-blocked watchdog, no second-level escalation (e.g., auto-assign to a different profile, notify a different channel), and no way for the board to surface "this workflow is stuck" as a first-class signal rather than a card-status side effect.
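One possible shape for the missing pieces, sketched under stated assumptions: an aggregated failure summary per task, plus a watchdog pass that raises `gave_up` and stale-`blocked` alerts. The `Task` and `TaskEvent` records, the event kind strings, and the threshold parameter are all hypothetical; how alerts would be delivered (a second channel, a different profile) is left open.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical event kinds; the real event names may differ.
FAILURE_KINDS = {"crash", "timeout", "reclaim", "gave_up"}


@dataclass
class TaskEvent:
    kind: str
    at: float  # epoch seconds


@dataclass
class Task:
    task_id: str
    status: str
    status_since: float  # when the task entered its current status
    events: List[TaskEvent] = field(default_factory=list)


def failure_summary(task: Task) -> dict:
    """Aggregate per-event failures into one board-level view."""
    failures = [e for e in task.events if e.kind in FAILURE_KINDS]
    last: Optional[float] = max((e.at for e in failures), default=None)
    return {
        "task_id": task.task_id,
        "failures": len(failures),
        "gave_up": any(e.kind == "gave_up" for e in failures),
        "last_failure_at": last,
    }


def escalations(
    tasks: List[Task], now: float, blocked_threshold_seconds: float
) -> List[Tuple[str, str]]:
    """Return (task_id, reason) pairs that deserve a second-level alert."""
    alerts = []
    for task in tasks:
        if failure_summary(task)["gave_up"]:
            alerts.append((task.task_id, "gave_up"))
        elif (
            task.status == "blocked"
            and now - task.status_since > blocked_threshold_seconds
        ):
            alerts.append((task.task_id, "stale_blocked"))
    return alerts
```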

Current P3. Warranting P2 consideration: a multi-agent orchestration system whose only failure notification path is per-event Telegram messages, and whose stuck tasks sit silently until a human polls the board, is not suitable for unattended or semi-attended workflows.


Gap 3 — No cross-status orphan sweep

The dispatcher only acts on tasks in ready (claims and spawns) and running (reclaims, timeouts, crash detection). Tasks stuck in other statuses — blocked with no way to unblock, triage with no specifier configured, todo that never gets its dependencies met — are invisible to the supervision loop. No tick-level scan asks "are there cards in any status that have been abandoned too long?"

Covered by:

  • #29171 (filed by @franksong2702 — thank you): "Kanban needs first-class waiting states for human, approval, and review gates." Addresses the status-model gap but not the orphan-detection gap.
  • #24329 (filed by @yepyhun): Surfaces non-runnable ready tasks, which helps for one status but not all.

What's missing: A design for a periodic sweep across all statuses that surfaces cards that have been in their current state beyond a configurable threshold. This is distinct from the dispatcher's existing reclaim logic (which only checks running). A blocked card from 48 hours ago, a triage card from last week, a review card the reviewer never picked up — none of these are visible to any automated mechanism today.
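The sweep described above could look roughly like the following. This is a sketch only: the `Card` record, the status names in the threshold table, and the threshold values are assumptions chosen to mirror the examples in the text, not values from the project.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

HOUR = 3600

# Hypothetical per-status thresholds in seconds. Statuses that are absent
# (ready, running) stay the responsibility of the existing dispatcher logic.
DEFAULT_THRESHOLDS: Dict[str, int] = {
    "blocked": 24 * HOUR,
    "review": 24 * HOUR,
    "triage": 72 * HOUR,
    "todo": 7 * 24 * HOUR,
}


@dataclass(frozen=True)
class Card:
    card_id: str
    status: str
    status_since: int  # epoch seconds when the card entered this status


def sweep_orphans(
    cards: Iterable[Card],
    now: int,
    thresholds: Optional[Dict[str, int]] = None,
) -> List[Tuple[str, str, int]]:
    """Return (card_id, status, age_seconds) for cards past their threshold."""
    limits = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    orphans = []
    for card in cards:
        limit = limits.get(card.status)
        if limit is None:
            continue  # status not covered by the sweep
        age = now - card.status_since
        if age > limit:
            orphans.append((card.card_id, card.status, age))
    # Oldest first, so the most neglected cards surface at the top.
    return sorted(orphans, key=lambda item: item[2], reverse=True)
```

Running this once per dispatcher tick, or on a slower timer, would surface the 48-hour-old blocked card without changing how running tasks are reclaimed.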

Current P3 (no issue exists for this at all). Warranting P2 consideration: orphaned cards in non-running statuses are the most common failure mode in multi-agent orchestration — a worker completes, nothing picks up the output, and the workflow silently stops. A system that cannot detect this condition is not suitable for multi-step agent workflows.


Gap 4 — Circuit breaker bypassed on pre-heartbeat crashes

When a worker crashes before sending its first heartbeat (e.g., invalid skill name, import error at startup), the consecutive_failures counter is never incremented because no lock entry was created. The task is reclaimed and respawned every tick indefinitely, bypassing failure_limit entirely.

Covered by:

  • #35202 (filed by @frankyh75 — thank you): "failure_limit circuit breaker bypassed when worker crashes before first heartbeat." Well-diagnosed, root cause identified, suggested fix outlined. No PR yet.
  • #30417 (filed by @skowalik — thank you): Three-bug filing covering the same spawn-loop problem (among others) with real production evidence of 1,500+ identical crash cycles. P2. No PR yet.
  • #23025 (filed by @fwends — thank you): Earlier report of the same stall-loop pattern.

Coverage assessment: Well-diagnosed across multiple issues. The root cause (lock-entry-dependent counter) is understood. What's missing is a fix that tracks pre-heartbeat failures in a separate counter on the task record itself.
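A minimal sketch of the fix direction named above, assuming a counter stored on the task record rather than on the lock entry. The `TaskRecord` shape, the `spawn_failures` field, and `record_worker_exit` are hypothetical names; only `consecutive_failures` and `failure_limit` come from the issue text.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    status: str = "ready"
    consecutive_failures: int = 0  # counter named in the issue
    spawn_failures: int = 0  # hypothetical: crashes before the first heartbeat


def record_worker_exit(
    task: TaskRecord, saw_heartbeat: bool, failure_limit: int
) -> str:
    """Count the failure on the task itself, so no lock entry is required."""
    if saw_heartbeat:
        task.consecutive_failures += 1
    else:
        task.spawn_failures += 1
    total = task.consecutive_failures + task.spawn_failures
    # Trip the breaker on the combined count; otherwise allow another attempt.
    task.status = "blocked" if total >= failure_limit else "ready"
    return task.status
```

Because the count survives a worker that never created a lock entry, an invalid skill name would trip the breaker after `failure_limit` attempts instead of respawning every tick.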

Current P3 (#35202) and P2 (#30417). The P2 classification on #30417 is appropriate, but #35202 should also be P2 because the failure mode is identical in impact.


Gap 5 — Subagent (delegate_task) layer has no supervision

Kanban workers have heartbeat, TTL, crash detection, and circuit-breaker recovery. Subagents spawned via delegate_task have none of these. They are synchronous children of the parent agent's turn: if the parent is interrupted (gateway restart, user interrupt), the subagents are orphaned and can take 12 or more minutes to time out. There is no reaper, no heartbeat mechanism, and no crash classification for subagents.

Covered by:

  • #26315 (filed by @wjameswen888 — thank you): "Gateway restart orphans in-flight delegate subagents — 12+ min timeout." P2, no PR.
  • #17308: Subagent timeout lacks tool_trace diagnostics — P3, no PR.
  • #35688 (filed by @crayfish-ai — thank you): "Background multi-agent harness (Doer/Reviewer + Hindsight shared memory)." Proposes a complementary pattern that sits between delegate_task and kanban, with async background execution and automatic review escalation. P3, no maintainer engagement.
  • #4949: "Persistent ACP background subagents" — the real durable subagent lifecycle. P3, no PR.
  • #21658: Subagent tool delegation inconsistent (tools missing for children) — P1, open.
  • #13041 (CLOSED): Subagents can idle-timeout without completing — was patched but the fix only covers one timeout path.

Coverage assessment: The subagent supervision gap is the most fragmented in the tracker. Each symptom has an issue, but there is no umbrella or design document that frames these as facets of a missing abstraction: a supervision service for subagents that matches what kanban workers already have.
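To make the missing abstraction concrete, here is a rough sketch of a supervision registry for subagents, with a heartbeat and a reaper. It is an assumption-heavy illustration: `SubagentRegistry`, its methods, and the "orphaned" and "stale" classifications are hypothetical and do not describe how delegate_task works today.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class SubagentEntry:
    subagent_id: str
    parent_id: str
    last_heartbeat: float  # epoch seconds


class SubagentRegistry:
    """Hypothetical supervisor that tracks subagents like kanban workers."""

    def __init__(self, heartbeat_timeout: float) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self._entries: Dict[str, SubagentEntry] = {}

    def register(self, subagent_id: str, parent_id: str, now: float) -> None:
        self._entries[subagent_id] = SubagentEntry(subagent_id, parent_id, now)

    def heartbeat(self, subagent_id: str, now: float) -> None:
        self._entries[subagent_id].last_heartbeat = now

    def reap(self, now: float, live_parents: Set[str]) -> Dict[str, str]:
        """Remove dead subagents and return {subagent_id: reason}."""
        reaped: Dict[str, str] = {}
        for entry in list(self._entries.values()):
            if entry.parent_id not in live_parents:
                reason = "orphaned"  # parent turn was interrupted
            elif now - entry.last_heartbeat > self.heartbeat_timeout:
                reason = "stale"  # parent alive, subagent silent
            else:
                continue
            reaped[entry.subagent_id] = reason
            del self._entries[entry.subagent_id]
        return reaped
```

A reaper of this kind could be run on gateway startup, which would cut the orphan window from a long timeout to a single sweep.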

Current P2 (#26315) and P3 (others). Promotion request: #35688 and #4949 are both P3 but represent the two most viable architectural approaches for closing this gap. Considering them for P2 would signal that the project is open to a design proposal in this area.


Gap 6 — Deterministic spawn-crash loop: circuit breaker doesn't latch

When a task has a deterministic spawn-time failure (wrong skill name, bad config, missing dependency), the dispatcher can cycle it through block→promote→spawn→crash in an infinite loop because the auto-block + recompute_ready cycle re-promotes the task on the next tick.

Covered by:

  • #30417 (filed by @skowalik — thank you): Bug 1 documents this exact failure mode with real production evidence and dispatcher output showing the cycle. P2.
  • #30896: "Rapid worker spawn-crash loop corrupts board SQLite B-tree before failure_limit trips." P2.
  • #29320 (filed by @akamel001 — thank you): "Circuit breaker for repeated worker bails with identical block reason." Closed; the error-fingerprinting mechanism was added. But it only deduplicates — it doesn't prevent the loop when the auto-block + re-promote cycle exists.

Coverage assessment: Partially fixed by the error-fingerprinting circuit breaker (#29320), but the block→promote→spawn cycle is still possible in some paths. The remaining gap is that recompute_ready treats blocked as promotable under certain conditions.
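One way to make the breaker latch is sketched below, assuming a flag on the task that promotion must respect. `breaker_latched`, `trip_breaker`, `manual_unblock`, and `recompute_ready_sketch` are hypothetical stand-ins written for this example; the last one is not the project's recompute_ready, only a simplified model of it.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Task:
    task_id: str
    status: str
    deps_satisfied: bool = True
    breaker_latched: bool = False  # hypothetical latch flag


def trip_breaker(task: Task) -> None:
    """Block the task and latch it so automatic promotion skips it."""
    task.status = "blocked"
    task.breaker_latched = True


def manual_unblock(task: Task) -> None:
    """Only an explicit human or operator action clears the latch."""
    task.breaker_latched = False


def recompute_ready_sketch(tasks: Iterable[Task]) -> List[str]:
    """Simplified stand-in for promotion: latched tasks are never promoted."""
    promoted = []
    for task in tasks:
        if task.status not in ("todo", "blocked"):
            continue
        if task.breaker_latched:
            continue  # breaks the block -> promote -> spawn -> crash loop
        if task.deps_satisfied:
            task.status = "ready"
            promoted.append(task.task_id)
    return promoted
```

The design choice here is that the latch is separate from the status, so a task blocked for an ordinary reason can still be promoted while a task blocked by the breaker cannot.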

Current P2 on #30417 and #30896. Appropriate classification.


Gap 7 — archived parent silently promotes children as if completed

Archiving a parent task (used to cancel/retire it) counts as a satisfied dependency for promotion purposes. Children become ready and may run on the assumption the parent's output exists — which it doesn't, because the parent was cancelled, not completed.

Covered by:

  • #30417 (filed by @skowalik — thank you): Bug 3 documents this with code references showing done and archived sharing the same promotion check. P2.

Coverage assessment: Well-documented, single root cause, clear fix path (treat archived as blocking, not satisfying, for dependency promotion). No PR yet.
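The fix path can be sketched as a promotion check that separates done from archived. This is an illustration under assumptions: the function names and the three-way return value are hypothetical, and the status strings are taken from the prose rather than from the code.

```python
from typing import Iterable

# Only a completed parent produces the output its children depend on.
SATISFYING_STATUSES = frozenset({"done"})


def dependencies_satisfied(parent_statuses: Iterable[str]) -> bool:
    """True only when every parent actually completed."""
    return all(status in SATISFYING_STATUSES for status in parent_statuses)


def promotion_decision(parent_statuses: Iterable[str]) -> str:
    """Return the child's next status given its parents' statuses."""
    statuses = list(parent_statuses)
    if any(status == "archived" for status in statuses):
        # A cancelled parent will never produce output: block, do not promote.
        return "blocked"
    if dependencies_satisfied(statuses):
        return "ready"
    return "todo"  # still waiting on unfinished parents
```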

Current P2. Appropriate.


Summary of Priority Status

| Gap | Best-Covering Issue(s) | Current Priority | Suggested Priority |
|---|---|---|---|
| 1. Stale detection off by default | None | | P2 |
| 2. Silent recovery / no escalation | #30587 | P3 | P2 |
| 3. Cross-status orphan sweep | None | | P2 |
| 4. Pre-heartbeat circuit breaker bypass | #35202, #30417 | P3 / P2 | P2 (promote #35202) |
| 5. Subagent supervision | #26315, #35688, #4949 | P2 / P3 / P3 | P2 (all) |
| 6. Deterministic spawn-crash loop | #30417, #30896 | P2 | Keep P2 |
| 7. Archived parent promotes children | #30417 (Bug 3) | P2 | Keep P2 |

Suggested Next Step

This umbrella could serve as the discussion frame for creating a Kanban Reliability Milestone (e.g., kanban-reliability-v1) that tracks these seven gaps to closure. The milestone would provide:

  • A single place to track whether the Kanban feature is suitable for production multi-agent orchestration
  • A prioritisation anchor for triage — gaps in this milestone take precedence over new feature work in the same area
  • A clear signal to community contributors about which problems are most valued

If this framing is useful, I'm happy to lay out the milestone with per-gap gates and suggested implementation approaches.


Acknowledgements

Thank you to the contributors whose issue filing and PR work have documented, diagnosed, or partially closed these gaps:

  • @frankyh75 — #35202
  • @Skippy-the-Magnificent-one — #30587
  • @skowalik — #30417
  • @qWaitCrypto — #25641
  • @crayfish-ai — #35688
  • @wjameswen888 — #26315
  • @akamel001 — #29320
  • @chrischjh — #22995
  • @franksong2702 — #29171
  • @yepyhun — #24329
  • @fwends — #23025
  • @faisfamilytravel — #31752

Every issue and PR in this landscape moves the feature forward, and this umbrella exists to connect those contributions into a coherent picture of what's left to do.


Filed by Jasper (AI agent on behalf of Magnus Hedemark)
