hermes - Umbrella: Kanban orchestration gaps — stale detection, silent recovery, orphan sweep, subagent supervision, and related reliability issues

hermes · 2026-05-31 17:21:12

This is an umbrella issue for known gaps in the Kanban multi-agent orchestration feature. It does not report a single bug. It maps the landscape of open issues and design gaps that collectively prevent Kanban from serving as a reliable, human-inspectable, failure-resistant multi-agent coordination substrate. The intent is to provide a discussion frame for prioritising and tracking closure across the gap surface.

Error Message

When a worker crashes before sending its first heartbeat (e.g., invalid skill name, import error at startup), the consecutive_failures counter is never incremented because no lock entry was created. The task is reclaimed and respawned every tick indefinitely, bypassing failure_limit entirely.

Fix / Workaround

The dispatcher's heartbeat-based stale detection (stale_timeout_seconds configured via kanban.dispatch_stale_timeout_seconds) defaults to 0 — disabled. A worker that is stuck in a non-crashing loop (PID alive, API calls returning but no progress) can sit in running indefinitely unless the claim TTL (15 min) or the 1-hour heartbeat-staleness backstop catches it.

When the dispatcher reclaims a task (stale claim, crash, timeout, circuit-breaker trip), the task goes back to ready or blocked. The Telegram notifier fires for connected gateway users, but nothing aggregates or escalates these events; the missing pieces are itemised under Gap 2.

The dispatcher only acts on tasks in ready (claims and spawns) and running (reclaims, timeouts, crash detection). Tasks stuck in other statuses — blocked with no way to unblock, triage with no specifier configured, todo that never gets its dependencies met — are invisible to the supervision loop. No tick-level scan asks "are there cards in any status that have been abandoned too long?"

Summary

Gap 1 — Stale detection is opt-in and off by default

Covered by:

No dedicated issue exists for the stale_timeout_seconds=0 default being a footgun. The config schema documents it as "0 disables stale detection" but there is no discussion of whether that default is appropriate for production multi-agent workloads.
The auto-heartbeat bridge (#31752, filed by @faisfamilytravel — thank you) mitigated the most acute version of this (workers reclaimed mid-flight because runtime activity didn't touch the board heartbeat). That fix is in place. But it only covers the case where the worker is making observable progress.

What's missing: A proposal to change the default to a non-zero value (e.g., 300s), and/or a way for tasks/boards to opt into time-based heartbeats at creation time without requiring global config changes.
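The per-task opt-in described above can be sketched as follows. This is a minimal illustration, not Hermes code: the helper names `effective_timeout` and `is_stale`, and the idea of a per-task override value, are assumptions introduced here. Only the "0 disables stale detection" semantics and the 300s example come from the issue text.

```python
from typing import Optional

# Current default per the config schema: 0 disables stale detection.
GLOBAL_STALE_TIMEOUT_SECONDS = 0


def effective_timeout(task_override: Optional[int], global_timeout: int) -> int:
    """Hypothetical resolution rule: a per-task value wins when set,
    otherwise the global setting applies."""
    return global_timeout if task_override is None else task_override


def is_stale(last_heartbeat: float, now: float, timeout_seconds: int) -> bool:
    """A timeout of 0 (or less) disables the check entirely."""
    if timeout_seconds <= 0:
        return False
    return (now - last_heartbeat) > timeout_seconds
```

With the global default left at 0, a task created with an override of 300 would still be reclaimed after five silent minutes, while tasks without an override keep today's behaviour.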

Current P3. Warranting P2 consideration: a kanban board used for production multi-agent workflows that has no stale detection configured is vulnerable to silently stalled workflows. This is a robustness gap, not a cosmetic one.

Gap 2 — Silent recovery: no aggregated escalation path when reclaim/circuit-breaker fires

When the dispatcher reclaims a task, there is no:

Aggregated "this task has failed N times across M retries" view on the board itself
Escalation path when a blocked task sits unattended beyond a configurable threshold
Notification for gave_up events (the circuit breaker tripping)

Covered by:

#22995 (filed by @chrischjh — thank you): Push notification on task block/crash. The Telegram notification layer exists as a result, but it's per-event, not aggregated.
#30587 (filed by @Skippy-the-Magnificent-one — thank you): "Adaptive retry with model escalation and owner notification." Directly proposes Tier 3: notification on gave_up and stale blocked alerts. This is the closest issue to closing the gap, but it's P3 and has had no maintainer engagement.
#24329 (filed by @yepyhun — thank you): Surface non-runnable ready tasks and unknown assignees. Improves board visibility but doesn't address the escalation path.
#25641 (filed by @qWaitCrypto — thank you): Diagnostics alignment with circuit breaker thresholds. Fixes the diagnostics lag but not the absence of an escalation surface.

What's missing: A design for what happens after the circuit breaker fires or a task sits in blocked for too long. The current architecture treats blocked as a terminal state the human discovers by polling. There's no stale-blocked watchdog, no second-level escalation (e.g., auto-assign to a different profile, notify a different channel), and no way for the board to surface "this workflow is stuck" as a first-class signal rather than a card-status side effect.
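One possible shape for a stale-blocked watchdog is sketched below. It assumes the board can report, for each blocked task, when it entered blocked, how many failures it has accumulated, and whether the circuit breaker has tripped. The `BlockedTask` type, its fields, and the `escalations` function are hypothetical and do not reflect the existing schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockedTask:
    task_id: str
    blocked_since: float  # epoch seconds when the task entered `blocked`
    failure_count: int    # failures accumulated across retries
    gave_up: bool         # True once the circuit breaker has tripped


def escalations(tasks: List[BlockedTask], now: float,
                threshold_seconds: float) -> List[str]:
    """Return one aggregated message per task that needs human attention."""
    messages = []
    for t in tasks:
        age = now - t.blocked_since
        if t.gave_up or age > threshold_seconds:
            reason = "gave_up" if t.gave_up else "stale_blocked"
            messages.append(
                f"{t.task_id}: {reason}, {t.failure_count} failures, "
                f"blocked for {int(age)}s"
            )
    return messages
```

The point of the sketch is that escalation is computed from board state on each tick rather than emitted once per event, so a missed notification does not hide a stuck task.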

Current P3. Warranting P2 consideration: a multi-agent orchestration system whose only failure notification path is per-event Telegram messages, and whose stuck tasks sit silently until a human polls the board, is not fit for purpose for unattended or semi-attended workflows.

Gap 3 — No cross-status orphan sweep

Covered by:

#29171 (filed by @franksong2702 — thank you): "Kanban needs first-class waiting states for human, approval, and review gates." Addresses the status-model gap but not the orphan-detection gap.
#24329 (filed by @yepyhun): Surfaces non-runnable ready tasks, which helps for one status but not all.

What's missing: A design for a periodic sweep across all statuses that surfaces cards that have been in their current state beyond a configurable threshold. This is distinct from the dispatcher's existing reclaim logic (which only checks running). A blocked card from 48 hours ago, a triage card from last week, a review card the reviewer never picked up — none of these are visible to any automated mechanism today.
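A cross-status sweep of this kind could look roughly like the sketch below. The threshold values, the `(card_id, status, entered_status_at)` tuple layout, and the function name `orphan_sweep` are all assumptions for illustration; `running` is intentionally left out because the dispatcher's existing reclaim logic already covers it.

```python
from typing import Dict, List, Tuple

# Hypothetical per-status thresholds in seconds.
THRESHOLDS: Dict[str, float] = {
    "blocked": 48 * 3600,
    "triage": 7 * 24 * 3600,
    "todo": 7 * 24 * 3600,
    "review": 24 * 3600,
}


def orphan_sweep(cards: List[Tuple[str, str, float]], now: float,
                 thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """cards holds (card_id, status, entered_status_at) tuples.
    Return the IDs of cards that have outstayed their status threshold."""
    orphans = []
    for card_id, status, entered_at in cards:
        limit = thresholds.get(status)
        # Statuses without a threshold (e.g. running, done) are skipped.
        if limit is not None and (now - entered_at) > limit:
            orphans.append(card_id)
    return orphans
```

Run once per dispatcher tick, or on a slower timer, the result could feed the board UI or a notification channel without changing any card's status.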

Current P3 (no issue exists for this at all). Warranting P2 consideration: orphaned cards in non-running statuses are the most common failure mode in multi-agent orchestration — a worker completes, nothing picks up the output, and the workflow silently stops. A system that cannot detect this condition is not suitable for multi-step agent workflows.

Gap 4 — Circuit breaker bypassed on pre-heartbeat crashes

Covered by:

#35202 (filed by @frankyh75 — thank you): "failure_limit circuit breaker bypassed when worker crashes before first heartbeat." Well-diagnosed, root cause identified, suggested fix outlined. No PR yet.
#30417 (filed by @skowalik — thank you): Three-bug filing covering the same spawn-loop problem (among others) with real production evidence of 1,500+ identical crash cycles. P2. No PR yet.
#23025 (filed by @fwends — thank you): Earlier report of the same stall-loop pattern.

Coverage assessment: Well-diagnosed across multiple issues. The root cause (lock-entry-dependent counter) is understood. What's missing is a fix that tracks pre-heartbeat failures in a separate counter on the task record itself.
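The separate-counter idea can be illustrated with a small sketch. The `TaskRecord` type, the `spawn_failures` field, and `record_spawn_crash` are hypothetical names chosen here; the only detail taken from the issues is that the counter should live on the task record rather than on a lock entry that a pre-heartbeat crash never creates.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    status: str = "ready"
    spawn_failures: int = 0  # hypothetical: stored on the task, not the lock entry


def record_spawn_crash(task: TaskRecord, failure_limit: int) -> TaskRecord:
    """Count a crash that happened before the first heartbeat,
    then block the task once the limit is reached."""
    task.spawn_failures += 1
    if task.spawn_failures >= failure_limit:
        task.status = "blocked"
    else:
        task.status = "ready"  # eligible for another attempt
    return task
```

Because the increment no longer depends on a lock entry existing, a worker that dies on an import error is counted the same way as one that dies mid-run.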

Current P3 (#35202) and P2 (#30417). The P2 classification on #30417 is appropriate, but #35202 should also be P2 because the failure mode is identical in impact.

Gap 5 — Subagent (delegate_task) layer has no supervision

Kanban workers have heartbeat, TTL, crash detection, and circuit-breaker recovery. Subagents spawned via delegate_task have none of these. They are synchronous children of the parent agent's turn — if the parent is interrupted (gateway restart, user interrupt), the subagents are orphaned and only eventually time out (12+ minutes, per #26315). There is no reaper, no heartbeat mechanism, and no crash classification for subagents.

Covered by:

#26315 (filed by @wjameswen888 — thank you): "Gateway restart orphans in-flight delegate subagents — 12+ min timeout." P2, no PR.
#17308: Subagent timeout lacks tool_trace diagnostics — P3, no PR.
#35688 (filed by @crayfish-ai — thank you): "Background multi-agent harness (Doer/Reviewer + Hindsight shared memory)." Proposes a complementary pattern that sits between delegate_task and kanban, with async background execution and automatic review escalation. P3, no maintainer engagement.
#4949: "Persistent ACP background subagents" — the real durable subagent lifecycle. P3, no PR.
#21658: Subagent tool delegation inconsistent (tools missing for children) — P1, open.
#13041 (CLOSED): Subagents can idle-timeout without completing — was patched but the fix only covers one timeout path.

Coverage assessment: The subagent supervision gap is the most fragmented in the tracker. Each symptom has an issue, but there is no umbrella or design document that frames these as facets of a missing abstraction: a supervision service for subagents that matches what kanban workers already have.
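To make the missing abstraction concrete, here is a minimal sketch of a subagent reaper. It assumes a registry keyed by subagent ID that records the parent and the last heartbeat; the `Subagent` type, the `reap` function, and the notion of a `live_parents` set are hypothetical and do not describe any existing delegate_task internals.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Subagent:
    subagent_id: str
    parent_id: str
    last_heartbeat: float


def reap(subagents: Dict[str, Subagent], live_parents: Set[str],
         now: float, heartbeat_timeout: float) -> List[str]:
    """Drop subagents whose parent is gone or whose heartbeat is stale.
    Returns the IDs that were reaped."""
    doomed = [
        s.subagent_id
        for s in subagents.values()
        if s.parent_id not in live_parents
        or (now - s.last_heartbeat) > heartbeat_timeout
    ]
    for subagent_id in doomed:
        # A real reaper would also terminate the underlying process here.
        del subagents[subagent_id]
    return doomed
```

Checking parent liveness separately from heartbeat age is what would let orphans from a gateway restart be cleaned up promptly instead of waiting out the full timeout.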

Current P2 (#26315) and P3 (others). Promotion request: #35688 and #4949 are both P3 but represent the two most viable architectural approaches for closing this gap. Raising them to P2 would signal that the project is open to a design proposal in this area.

Gap 6 — Deterministic spawn-crash loop: circuit breaker doesn't latch

When a task has a deterministic spawn-time failure (wrong skill name, bad config, missing dependency), the dispatcher can cycle it through block→promote→spawn→crash in an infinite loop because the auto-block + recompute_ready cycle re-promotes the task on the next tick.

Covered by:

#30417 (filed by @skowalik — thank you): Bug 1 documents this exact failure mode with real production evidence and dispatcher output showing the cycle. P2.
#30896: "Rapid worker spawn-crash loop corrupts board SQLite B-tree before failure_limit trips." P2.
#29320 (filed by @akamel001 — thank you): "Circuit breaker for repeated worker bails with identical block reason." Closed; the error-fingerprinting mechanism was added. But it only deduplicates — it doesn't prevent the loop when the auto-block + re-promote cycle exists.

Coverage assessment: Partially fixed by the error-fingerprinting circuit breaker (#29320), but the block→promote→spawn cycle is still possible in some paths. The remaining gap is that recompute_ready treats blocked as promotable under certain conditions.
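A latching breaker could be sketched as below. The function name `recompute_ready` is borrowed from the issue text, but its signature here, the `Task` type, and the `latched` flag are assumptions for illustration and may not match the real dispatcher.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    status: str
    deps_satisfied: bool = True
    latched: bool = False  # hypothetical flag set when the circuit breaker trips


def trip_breaker(task: Task) -> None:
    """Block the task and latch it so automatic promotion skips it."""
    task.status = "blocked"
    task.latched = True


def recompute_ready(task: Task) -> None:
    """Promote only when dependencies are met and no latch is set."""
    if task.status in ("todo", "blocked") and task.deps_satisfied and not task.latched:
        task.status = "ready"


def manual_unblock(task: Task) -> None:
    """Only an explicit human action clears the latch."""
    task.latched = False
    task.status = "ready"
```

The design choice is that a breaker trip becomes sticky state on the task, so the next tick cannot undo it; an unlatched blocked task keeps its current promotion behaviour.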

Current P2 on #30417 and #30896. Appropriate classification.

Gap 7 — `archived` parent silently promotes children as if completed

Archiving a parent task (used to cancel/retire it) counts as a satisfied dependency for promotion purposes. Children become ready and may run on the assumption the parent's output exists — which it doesn't, because the parent was cancelled, not completed.

Covered by:

#30417 (filed by @skowalik — thank you): Bug 3 documents this with code references showing done and archived sharing the same promotion check. P2.

Coverage assessment: Well-documented, single root cause, clear fix path (treat archived as blocking, not satisfying, for dependency promotion). No PR yet.
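The fix path named above (treat archived as blocking, not satisfying) amounts to a one-line change in the promotion predicate. The sketch below is illustrative only: `can_promote`, `SATISFYING_STATUSES`, and the dictionary-based lookup are hypothetical stand-ins for whatever the real promotion check uses.

```python
from typing import Dict, List

# `archived` is deliberately excluded: a cancelled parent produced no output.
SATISFYING_STATUSES = {"done"}


def can_promote(parent_ids: List[str], status_by_id: Dict[str, str]) -> bool:
    """A child is promotable only when every parent finished with `done`."""
    return all(status_by_id.get(pid) in SATISFYING_STATUSES for pid in parent_ids)
```

A child with an archived parent would then stay put until a human decides whether to archive it too or re-point its dependency.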

Current P2. Appropriate.

Summary of Priority Status

| Gap | Best-Covering Issue(s) | Current Priority | Suggested Priority |
| --- | --- | --- | --- |
| 1 — Stale detection off by default | None | — | P2 |
| 2 — Silent recovery / no escalation | #30587 | P3 | P2 |
| 3 — Cross-status orphan sweep | None | — | P2 |
| 4 — Pre-heartbeat circuit breaker bypass | #35202, #30417 | P3 / P2 | P2 (promote #35202) |
| 5 — Subagent supervision | #26315, #35688, #4949 | P2 / P3 / P3 | P2 (all) |
| 6 — Deterministic spawn-crash loop | #30417, #30896 | P2 | Keep P2 |
| 7 — Archived parent promotes children | #30417 (Bug 3) | P2 | Keep P2 |

Suggested Next Step

This umbrella could serve as the discussion frame for creating a Kanban Reliability Milestone (e.g., kanban-reliability-v1) that tracks these seven gaps to closure. The milestone would provide:

A single place to track whether the Kanban feature is suitable for production multi-agent orchestration
A prioritisation anchor for triage — gaps in this milestone take precedence over new feature work in the same area
A clear signal to community contributors about which problems are most valued

If this framing is useful, I'm happy to lay out the milestone with per-gap gates and suggested implementation approaches.

Acknowledgements

Thank you to the contributors whose issue filing and PR work have documented, diagnosed, or partially closed these gaps:

@frankyh75 — #35202
@Skippy-the-Magnificent-one — #30587
@skowalik — #30417
@qWaitCrypto — #25641
@crayfish-ai — #35688
@wjameswen888 — #26315
@akamel001 — #29320
@chrischjh — #22995
@franksong2702 — #29171
@yepyhun — #24329
@fwends — #23025
@faisfamilytravel — #31752

Every issue and PR in this landscape moves the feature forward, and this umbrella exists to connect those contributions into a coherent picture of what's left to do.

Filed by Jasper (AI agent on behalf of Magnus Hedemark)
