hermes: Fix Kanban recovery workflow loops on Cortex blocked replacement chains

StepCodex · 2026-05-21T11:06:24Z

A manual watchdog invocation correctly detected the stall and dispatched an operator card (`t_6536b53d`), so the watchdog path itself is alive. The failure is that recovery keeps generating more blocked replacement chains instead of converging on one clean implementation/review path.

Bug description

Cortex Kanban recovery is not healthy: the board can stall with no `running` or `ready` tasks while accumulating multiple `blocked` recovery cards for the same dashboard/plugin activation chain.

Observed evidence

On 2026-05-21, `hermes kanban --board cortex stats` showed:

running: 0
ready: 0
blocked: 8
todo: 14
done: 98

The blocked cards were mostly recursive replacement/recovery attempts around the same scope:

- `t_bdade2d9` — Recovery: fix CLI status for bundled image_gen backends
- `t_8031addf` — Recovery split: locate dashboard plugin UX files and patch plan
- `t_cc13c09d` — Recovery replacement: dashboard UX for bundled backend plugins
- `t_62b6e1ab` — Recovery replacement: locate dashboard plugin UX files after workspace restore
- `t_a3048522` — Recovery v2: locate dashboard plugin UX files in persistent workspace
- `t_47999539` — Recovery v3: locate dashboard plugin UX files and patch plan
- `t_0c4a7596` — Recovery v3: remediate CLI plugin status tuple/API regression
- `t_1549ebd7` — Recovery v4: remediate plugin tuple/API status regression after workspace restore

Expected behavior

When a board stalls because of blocked recovery cards:

1. The operator/watchdog should consolidate duplicate/superseded recovery cards.
2. Superseded failed attempts should be archived, not left as active blockers.
3. There should be exactly one active implementation chain for the current root cause.
4. `blocked` should mean a real unresolved blocker, not stale/replaced work.
5. The dispatcher should resume with at least one `ready`/`running` actionable card, or escalate a human decision with evidence.
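
The last two expectations can be read as a post-recovery invariant. A minimal sketch of such a check follows, assuming a hypothetical in-memory `Card` record; the `needs_human_decision` and `evidence` fields are illustrative and are not taken from the hermes data model.

```python
from dataclasses import dataclass


@dataclass
class Card:
    # Hypothetical card record; field names are assumptions for illustration.
    card_id: str
    status: str  # e.g. "ready", "running", "blocked", "todo", "done"
    needs_human_decision: bool = False
    evidence: str = ""


def recovery_converged(cards: list[Card]) -> bool:
    """Return True if the board is in one of the two truthful states."""
    # State 1: there is actionable work for the dispatcher.
    if any(c.status in ("ready", "running") for c in cards):
        return True
    # State 2: every blocked card is a human-decision blocker with evidence.
    # Note: a board with no blocked cards passes this branch vacuously.
    blocked = [c for c in cards if c.status == "blocked"]
    return all(c.needs_human_decision and c.evidence for c in blocked)


stale_board = [Card("a", "blocked"), Card("b", "todo")]
escalated_board = [Card("c", "blocked", True, "workspace restore log attached")]
active_board = [Card("d", "blocked"), Card("e", "ready")]
```

Here `stale_board` fails the check because its only blocker carries no human-decision evidence, while the other two boards pass.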

Actual behavior

The board repeatedly returns to:

running: 0
ready: 0
blocked: N

where `N` grows as more recovery cards are spawned. The board's history makes it look active, but operationally it is dead.
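
For illustration, the stalled pattern can be detected mechanically from the stats text. This is a sketch only: it assumes the output has exactly the `status: count` shape quoted in this report, and `parse_stats` and `is_stalled` are hypothetical helper names, not hermes functions.

```python
def parse_stats(text: str) -> dict[str, int]:
    """Parse 'status: count' lines into a dict, skipping anything else."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        status, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counts[status.strip()] = int(value.strip())
    return counts


def is_stalled(counts: dict[str, int]) -> bool:
    """Stalled: nothing running or ready, but at least one blocked card."""
    return (
        counts.get("running", 0) == 0
        and counts.get("ready", 0) == 0
        and counts.get("blocked", 0) > 0
    )


# Counts copied from the observed evidence in this report.
SAMPLE = """running: 0
ready: 0
blocked: 8
todo: 14
done: 98"""

counts = parse_stats(SAMPLE)
```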

Suspected root cause

The current recovery workflow is missing a convergence gate:

- Watchdog creates a new operator recovery card when stalled.
- Operator creates replacement chains.
- Old blocked recovery cards remain active.
- Replacement chains can block again on the same underlying workspace/API/test confusion.
- The next watchdog run sees even more blocked cards and repeats the cycle.

This is a workflow bug, not just an implementation bug in the dashboard/plugin code.
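
A toy model of this feedback loop, to show why the blocked count only grows without a convergence gate. It assumes each watchdog cycle produces exactly one replacement chain that blocks again on the same root cause; that is a simplification for illustration, not measured hermes behavior.

```python
def run_cycles(initial_blocked: int, cycles: int, archive_superseded: bool) -> list[int]:
    """Return the blocked-card count after each watchdog/operator cycle."""
    blocked = initial_blocked
    history = []
    for _ in range(cycles):
        if archive_superseded:
            # Convergence gate: older blockers are archived first, so only
            # the newly blocked replacement remains.
            blocked = 1
        else:
            # No gate: the new replacement blocks and the old cards stay active.
            blocked += 1
        history.append(blocked)
    return history


without_gate = run_cycles(8, 3, archive_superseded=False)
with_gate = run_cycles(8, 3, archive_superseded=True)
```

Starting from the 8 blocked cards observed above, the ungated run climbs by one per cycle, while the gated run stays at a single current blocker.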

Proposed fix

Add a recovery-convergence workflow for Kanban stalls:

- Detect duplicate/superseded recovery cards by title/body lineage and source blocker references.
- Before creating a new replacement chain, archive or explicitly reconcile older superseded blockers.
- Add an operator rule/test: after recovery completes, the board must have either:
  - active actionable work (`ready` or `running`), or
  - only true human-decision blockers with evidence.
- Add guardrails so watchdog does not recursively create unlimited recovery chains for the same stalled cluster.
- Add a smoke/CLI test around this workflow if feasible.
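
The first two bullets could be sketched roughly as below. Everything here is an assumption for illustration: the `Card` record, the deliberately crude `lineage_key` heuristic (strip the `Recovery ...:` prefix, compare the first five words of the title), and the convention that cards are listed oldest first. A real implementation would also need body lineage and source blocker references, which this sketch ignores.

```python
import re
from dataclasses import dataclass, field

# Matches prefixes such as "Recovery:", "Recovery split:", "Recovery v3:".
PREFIX = re.compile(r"^Recovery[^:]*:\s*", re.IGNORECASE)


@dataclass
class Card:
    # Hypothetical card record, not the hermes data model.
    card_id: str
    title: str
    status: str = "blocked"
    comments: list[str] = field(default_factory=list)


def lineage_key(title: str, words: int = 5) -> str:
    """Crude lineage heuristic: first few words after the recovery prefix."""
    stem = PREFIX.sub("", title).lower()
    return " ".join(stem.split()[:words])


def consolidate(cards: list[Card]) -> list[Card]:
    """Archive all but the newest blocked card per lineage; return archived cards."""
    newest: dict[str, Card] = {}
    archived: list[Card] = []
    for card in cards:  # assumed to be ordered oldest first
        if card.status != "blocked":
            continue
        key = lineage_key(card.title)
        older = newest.get(key)
        if older is not None:
            older.status = "archived"
            older.comments.append(f"superseded by {card.card_id}")
            archived.append(older)
        newest[key] = card
    return archived


cards = [
    Card("t_bdade2d9", "Recovery: fix CLI status for bundled image_gen backends"),
    Card("t_8031addf", "Recovery split: locate dashboard plugin UX files and patch plan"),
    Card("t_62b6e1ab", "Recovery replacement: locate dashboard plugin UX files after workspace restore"),
    Card("t_a3048522", "Recovery v2: locate dashboard plugin UX files in persistent workspace"),
    Card("t_47999539", "Recovery v3: locate dashboard plugin UX files and patch plan"),
]
archived = consolidate(cards)
```

With these five titles from the report, the four "locate dashboard plugin UX files" cards collapse to the newest one (`t_47999539`), and each archived card gets a comment naming its successor. The heuristic would not merge the two "remediate ..." cards listed in the evidence, which is one reason title matching alone is unlikely to be enough.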

Acceptance criteria

- A stalled board with multiple duplicate recovery blockers is reduced to one current path.
- Superseded recovery cards are archived with comments.
- Running watchdog twice on the same stalled cluster does not create unbounded duplicate chains.
- The Cortex board can return to truthful states: actionable work in `running`/`ready`, or a small set of real blockers.
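
The third criterion could be exercised with a guard along these lines. This is a hedged sketch: `Board`, `watchdog_tick`, the cluster key, and the `max_chains` cap are hypothetical names chosen for illustration, and the real watchdog may track chains differently.

```python
from dataclasses import dataclass, field


@dataclass
class Board:
    # Hypothetical state: cluster key -> ids of active recovery chains.
    active_chains: dict[str, list[str]] = field(default_factory=dict)
    # Clusters handed to a human because the cap was reached.
    escalations: list[str] = field(default_factory=list)


def watchdog_tick(board: Board, cluster: str, new_chain_id: str, max_chains: int = 1) -> bool:
    """Open a recovery chain for `cluster` unless the cap is already reached.

    Returns True if a chain was created, False if the cluster was escalated.
    """
    chains = board.active_chains.setdefault(cluster, [])
    if len(chains) >= max_chains:
        if cluster not in board.escalations:
            board.escalations.append(cluster)
        return False
    chains.append(new_chain_id)
    return True


board = Board()
first = watchdog_tick(board, "dashboard-plugin", "chain-1")
second = watchdog_tick(board, "dashboard-plugin", "chain-2")
```

Running the tick twice on the same cluster creates one chain and then escalates instead of spawning a duplicate, which is the behavior the acceptance criterion asks for.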
