hermes - Fix kanban dispatcher: add circuit-breaker for repeated worker bails with identical block reason

StepCodex · 2026-05-20T13:36:04Z

# Feature request: circuit-breaker on repeated worker bails with identical block reason

## Summary

When a kanban worker bails on an external blocker (e.g. saturated CI runners, third-party API down, upstream dependency PR not merged), the dispatcher re-claims the task on its next tick. If the external blocker persists, the worker bails again with the same block reason. This can loop for hundreds of cycles before the human-in-the-loop notices, burning provider quota and flooding the kanban event history with identical "still blocked, unchanged" rows.

## Repro

1. Set up a kanban task whose work depends on an external condition the worker cannot fix (e.g. a PR waiting on CI checks that are queued indefinitely because runners are saturated).
2. Worker claims, observes the unchanged external condition, runs `kanban_block` with a reason like "CI queued, 0 progress, unchanged".
3. Dispatcher's next tick re-claims, worker observes the same condition, blocks again with the same reason.
4. Loop continues until something external changes or a human intervenes.

## Expected

After N consecutive bails (suggest N=5) with substantially-identical block reasons, the dispatcher should:

- Auto-pause the task (status=`blocked`, no auto re-claim)
- Force an orchestrator handoff (or escalate if a handoff already exists and is past its SLA)
- Surface in `hermes kanban list` with a distinct diagnostic flag (e.g. `circuit_open`)

This is distinct from `max_retries` (which counts run failures, not voluntary bails) and from the triage-watcher handoff escalation (which triggers at 60min but does not stop the re-claim loop).

## Actual

Observed in production: ~230 worker spawn cycles across 4 tasks over a ~7 hour window during a CI runner saturation event. Each spawn was a full agent boot + context load + situation re-discovery, all bailing in 24-30 seconds on the same unchanged external condition. The triage watcher *did* correctly escalate orchestrator handoffs at the 60-minute mark, but the dispatcher kept re-claiming the source tasks because nothing stopped it.

Sample event sequence (one task, abbreviated):

```
#107 blocked → "98th consecutive run — PR still queued, 0 progress, unchanged"
#108 blocked → "99th consecutive run — PR still queued, 0 progress, unchanged"
#109 blocked → "100th run — same CI infra wall, unchanged"
... (continues)
```

## Suspected cause

`kanban_block` records the reason but doesn't track consecutive-with-same-reason counts on the task. Dispatcher's claim selection only considers `status=ready|blocked-with-unblock-time-passed` and doesn't penalize tasks that have repeatedly bailed on the same external condition.
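The sample reasons above differ mostly in their run counter, so an exact string comparison would never see two of them as "the same reason". One way to detect substantially-identical reasons is sketched below, assuming a normalize-then-compare approach. The names `normalize_reason` and `is_same_blocker` and the 0.8 threshold are hypothetical and not part of hermes; only the standard-library `re` and `difflib` modules are used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical helpers for illustration; hermes does not ship these.
_NUMBER = re.compile(r"\b\d+(?:st|nd|rd|th)?\b")   # "98th", "0", "101st"
_PUNCT = re.compile(r"[^a-z#\s]+")                 # anything but letters, '#', whitespace


def normalize_reason(reason: str) -> str:
    """Lowercase, replace numbers/ordinals with '#', drop punctuation, squeeze spaces."""
    text = _NUMBER.sub("#", reason.lower())
    text = _PUNCT.sub(" ", text)
    return " ".join(text.split())


def is_same_blocker(prev: str, curr: str, threshold: float = 0.8) -> bool:
    """Fuzzy-compare two block reasons on their normalized forms.

    The threshold is an assumed tuning knob, not a value from the issue.
    """
    a, b = normalize_reason(prev), normalize_reason(curr)
    return SequenceMatcher(None, a, b).ratio() >= threshold


if __name__ == "__main__":
    r98 = "98th consecutive run: PR still queued, 0 progress, unchanged"
    r99 = "99th consecutive run: PR still queued, 0 progress, unchanged"
    print(normalize_reason(r98))
    print(is_same_blocker(r98, r99))
```

Reasons that differ only in counters normalize to the same string and always match. A reworded reason, such as the third line in the sample sequence, may or may not clear the threshold, which is why the reset conditions in the proposal matter as much as the matching rule.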
## Suggested implementation sketch

- Track a `consecutive_identical_bails` counter on the task, incremented when a new block event's reason is fuzzy-matched to the prior one (or simply substring-matched on a normalized form)
- Reset the counter when the block reason changes substantively, when a comment is added by a non-worker (human/orchestrator intervention), or when the task transitions through `done`
- At `consecutive_identical_bails >= N` (default 5, configurable), refuse to re-claim and emit a `circuit_open` diagnostic + force a triage-watcher handoff if one doesn't exist

## Workaround

Operator-side: orchestrator must manually monitor blocked tasks and intervene before the loop runs hundreds of cycles. Encoded as a checklist in the `kanban-orchestrator` skill, but easy to miss when the orchestrator is on a long side-quest.

## Related

- Triage watcher orchestrator-handoff SLA (60min): works correctly but is downstream of the loop, not the loop itself
- `max_retries` on tasks: counts failures, not voluntary bails, so doesn't trip on this pattern
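The counter, reset, and claim gate from the suggested implementation sketch could look roughly like the following. This is a minimal sketch under stated assumptions: the `Task` dataclass, `record_block`, `reset_circuit`, and `is_claimable` are hypothetical stand-ins rather than hermes internals. It compares reasons by exact match on a normalized form (the simpler of the two matching options) and leaves out the triage-watcher handoff.

```python
import re
from dataclasses import dataclass
from typing import Optional

DEFAULT_MAX_IDENTICAL_BAILS = 5  # the N=5 default suggested in the issue


def _normalize(reason: str) -> str:
    """Crude normalized form: lowercase, numbers/ordinals and punctuation removed."""
    text = re.sub(r"\b\d+(?:st|nd|rd|th)?\b", " ", reason.lower())
    text = re.sub(r"[^a-z\s]+", " ", text)
    return " ".join(text.split())


@dataclass
class Task:
    """Hypothetical stand-in for a kanban task row."""
    task_id: str
    status: str = "ready"
    last_block_reason: Optional[str] = None
    consecutive_identical_bails: int = 0
    circuit_open: bool = False


def record_block(task: Task, reason: str,
                 max_bails: int = DEFAULT_MAX_IDENTICAL_BAILS) -> None:
    """Record a voluntary bail and open the circuit after max_bails identical reasons."""
    normalized = _normalize(reason)
    if task.last_block_reason == normalized:
        task.consecutive_identical_bails += 1
    else:
        task.consecutive_identical_bails = 1  # a new reason restarts the count
    task.last_block_reason = normalized
    task.status = "blocked"
    if task.consecutive_identical_bails >= max_bails:
        task.circuit_open = True  # would also emit the circuit_open diagnostic


def reset_circuit(task: Task) -> None:
    """Reset on a non-worker comment or when the task passes through done."""
    task.consecutive_identical_bails = 0
    task.last_block_reason = None
    task.circuit_open = False


def is_claimable(task: Task, unblock_time_passed: bool) -> bool:
    """The status check described under 'Suspected cause', plus the circuit gate."""
    if task.circuit_open:
        return False
    return task.status == "ready" or (task.status == "blocked" and unblock_time_passed)


if __name__ == "__main__":
    task = Task("t-1")
    for run in range(98, 103):  # five consecutive bails with the same reason
        record_block(task, f"run {run}: PR still queued, 0 progress, unchanged")
    print(task.circuit_open, is_claimable(task, unblock_time_passed=True))  # True False
```

Keeping the gate in the claim filter rather than in the worker means a task with an open circuit is never spawned at all, which is what saves the agent boot and context load on each cycle.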



