hermes: Fix Kanban recovery workflow loops on Cortex blocked replacement chains


Bug description

Cortex Kanban recovery is not healthy: the board can stall with no running or ready tasks while accumulating multiple blocked recovery cards for the same dashboard/plugin activation chain.

Observed evidence

On 2026-05-21, hermes kanban --board cortex stats showed:

running: 0
ready: 0
blocked: 8
todo: 14
done: 98

The blocked cards were mostly recursive replacement/recovery attempts around the same scope:

  • t_bdade2d9 — Recovery: fix CLI status for bundled image_gen backends
  • t_8031addf — Recovery split: locate dashboard plugin UX files and patch plan
  • t_cc13c09d — Recovery replacement: dashboard UX for bundled backend plugins
  • t_62b6e1ab — Recovery replacement: locate dashboard plugin UX files after workspace restore
  • t_a3048522 — Recovery v2: locate dashboard plugin UX files in persistent workspace
  • t_47999539 — Recovery v3: locate dashboard plugin UX files and patch plan
  • t_0c4a7596 — Recovery v3: remediate CLI plugin status tuple/API regression
  • t_1549ebd7 — Recovery v4: remediate plugin tuple/API status regression after workspace restore

Manual watchdog invocation correctly detected the stall and dispatched an operator card (t_6536b53d), so the watchdog path itself is alive. The failure is that recovery keeps generating more blocked replacement chains instead of converging to one clean implementation/review path.

Expected behavior

When a board stalls because of blocked recovery cards:

  1. The operator/watchdog should consolidate duplicate/superseded recovery cards.
  2. Superseded failed attempts should be archived, not left as active blockers.
  3. There should be exactly one active implementation chain for the current root cause.
  4. blocked should mean a real unresolved blocker, not stale/replaced work.
  5. The dispatcher should resume with at least one ready/running actionable card, or escalate to a human for a decision, with evidence.

Actual behavior

The board repeatedly returns to:

running: 0
ready: 0
blocked: N

where N grows as more recovery cards are spawned. The board's history looks active, but operationally it is dead.
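The stall signature above can be checked mechanically. The following is a minimal sketch that assumes the stats output is plain `name: count` lines, as in the observed evidence. `parse_stats` and `is_stalled` are hypothetical helper names, not part of the hermes CLI.

```python
def parse_stats(text: str) -> dict[str, int]:
    """Parse 'name: count' lines into a dict, ignoring anything else."""
    counts = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counts[name.strip()] = int(value.strip())
    return counts


def is_stalled(counts: dict[str, int]) -> bool:
    """Stalled: nothing running or ready, but at least one blocked card."""
    return (
        counts.get("running", 0) == 0
        and counts.get("ready", 0) == 0
        and counts.get("blocked", 0) > 0
    )


# Counts copied from the observed evidence in this report.
observed = """\
running: 0
ready: 0
blocked: 8
todo: 14
done: 98
"""
counts = parse_stats(observed)
print(counts["blocked"], is_stalled(counts))  # prints: 8 True
```

A check like this only detects the stall. It says nothing about whether the blocked cards are real blockers or superseded attempts.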

Suspected root cause

The current recovery workflow is missing a convergence gate:

  • Watchdog creates a new operator recovery card when stalled.
  • Operator creates replacement chains.
  • Old blocked recovery cards remain active.
  • Replacement chains can block again on the same underlying workspace/API/test confusion.
  • The next watchdog run sees even more blocked cards and repeats the cycle.

This is a workflow bug, not just an implementation bug in the dashboard/plugin code.
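To make the suspected loop concrete, here is a toy model of it, not a reproduction of the actual watchdog. It assumes every replacement card blocks again on the same underlying problem, and the numbers are illustrative rather than taken from the board.

```python
def simulate(runs: int, blocked: int, spawned_per_run: int,
             archive_superseded: bool) -> list[int]:
    """Return the blocked count after each watchdog run.

    Toy assumption: every replacement card blocks again on the same
    underlying problem, so nothing leaves `blocked` on its own.
    """
    history = []
    for _ in range(runs):
        if archive_superseded:
            # Convergence gate: older attempts are archived, one chain remains.
            blocked = spawned_per_run
        else:
            # No gate: old blockers stay active and new ones pile on top.
            blocked += spawned_per_run
        history.append(blocked)
    return history


print(simulate(4, blocked=2, spawned_per_run=2, archive_superseded=False))
# prints: [4, 6, 8, 10]
print(simulate(4, blocked=2, spawned_per_run=2, archive_superseded=True))
# prints: [2, 2, 2, 2]
```

Without the gate the blocked count grows linearly with each run. With it, the count stays bounded even though the underlying problem is still unresolved.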

Proposed fix

Add a recovery-convergence workflow for Kanban stalls:

  • Detect duplicate/superseded recovery cards by title/body lineage and source blocker references.
  • Before creating a new replacement chain, archive or explicitly reconcile older superseded blockers.
  • Add an operator rule/test: after recovery completes, the board must have either:
    • active actionable work (ready or running), or
    • only true human-decision blockers with evidence.
  • Add guardrails so watchdog does not recursively create unlimited recovery chains for the same stalled cluster.
  • Add a smoke/CLI test around this workflow if feasible.
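The consolidation step might look roughly like the sketch below. It assumes each card carries a lineage key and a creation order. The `Card` class, its `root_cause` and `created` fields, and `consolidate` are hypothetical and do not reflect the real Kanban data model; the card ids are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    card_id: str
    root_cause: str  # hypothetical lineage key shared by related attempts
    created: int     # hypothetical creation order, larger is newer
    status: str = "blocked"
    comments: list[str] = field(default_factory=list)


def consolidate(cards: list[Card]) -> list[Card]:
    """Keep the newest blocked card per root cause and archive the rest."""
    newest: dict[str, Card] = {}
    for card in cards:
        if card.status != "blocked":
            continue
        current = newest.get(card.root_cause)
        if current is None or card.created > current.created:
            newest[card.root_cause] = card
    for card in cards:
        if card.status != "blocked":
            continue
        keeper = newest[card.root_cause]
        if card is not keeper:
            # Archive with a comment so the supersession is traceable.
            card.status = "archived"
            card.comments.append(f"Superseded by {keeper.card_id}")
    return list(newest.values())


cards = [Card("c1", "ux-files", 1), Card("c2", "ux-files", 2), Card("c3", "cli-status", 1)]
kept = consolidate(cards)
print([c.card_id for c in kept])        # prints: ['c2', 'c3']
print(cards[0].status, cards[0].comments)  # prints: archived ['Superseded by c2']
```

Keeping the newest card is one possible policy. Deriving the lineage key from titles, bodies, and source blocker references, as proposed above, is the harder part and is left out here.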

Acceptance criteria

  • A stalled board with multiple duplicate recovery blockers is reduced to one current path.
  • Superseded recovery cards are archived with comments.
  • Running watchdog twice on the same stalled cluster does not create unbounded duplicate chains.
  • Cortex board can return to truthful states: actionable work running/ready, or a small set of real blockers.
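Two of these criteria lend themselves to a small smoke test: the watchdog guardrail and the post-recovery board rule. The sketch below assumes a stalled cluster can be identified by a stable key. `RecoveryGuard` and `board_is_truthful` are hypothetical names, not existing hermes APIs.

```python
class RecoveryGuard:
    """Allow at most `max_chains` recovery chains per stalled cluster."""

    def __init__(self, max_chains: int = 1) -> None:
        self.max_chains = max_chains
        self._chains: dict[str, int] = {}

    def may_dispatch(self, cluster_key: str) -> bool:
        used = self._chains.get(cluster_key, 0)
        if used >= self.max_chains:
            return False  # caller should escalate instead of spawning more
        self._chains[cluster_key] = used + 1
        return True


def board_is_truthful(counts: dict[str, int], human_blockers_with_evidence: int) -> bool:
    """Post-recovery rule: actionable work exists, or every remaining
    blocker is a human-decision blocker with evidence."""
    actionable = counts.get("running", 0) + counts.get("ready", 0)
    if actionable > 0:
        return True
    return counts.get("blocked", 0) == human_blockers_with_evidence


guard = RecoveryGuard()
print(guard.may_dispatch("dashboard-plugin"))  # prints: True
print(guard.may_dispatch("dashboard-plugin"))  # prints: False
print(board_is_truthful({"running": 0, "ready": 0, "blocked": 8}, 0))  # prints: False
```

The second `may_dispatch` call returning `False` models running the watchdog twice on the same stalled cluster without creating another chain. A real guard would also need to reset once the cluster is resolved.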
