hermes: kanban dispatcher: add circuit-breaker for repeated worker bails with identical block reason



Feature request: circuit-breaker on repeated worker bails with identical block reason

Summary

When a kanban worker bails on an external blocker (e.g. saturated CI runners, third-party API down, upstream dependency PR not merged), the dispatcher re-claims the task on its next tick. If the external blocker persists, the worker bails again with the same block reason. This can loop for hundreds of cycles before the human-in-the-loop notices, burning provider quota and flooding the kanban event history with identical "still blocked, unchanged" rows.

Repro

  1. Set up a kanban task whose work depends on an external condition the worker cannot fix (e.g. PR waiting on CI checks that are queued indefinitely because runners are saturated)
  2. Worker claims, observes the unchanged external condition, runs kanban_block with a reason like "CI queued, 0 progress, unchanged"
  3. Dispatcher's next tick re-claims, worker observes the same condition, blocks again with the same reason
  4. Loop continues until something external changes or a human intervenes

Expected

After N consecutive bails (suggested default N=5) with substantially identical block reasons, the dispatcher should:

  • Auto-pause the task (status=blocked, no auto re-claim)
  • Force an orchestrator handoff (or escalate if a handoff already exists and is past its SLA)
  • Surface in hermes kanban list with a distinct diagnostic flag (e.g. circuit_open)

This is distinct from max_retries (which counts run failures, not voluntary bails) and from the triage-watcher handoff escalation (which triggers at 60min but does not stop the re-claim loop).
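The three expected responses can be sketched as a single function. This is a minimal illustration, not hermes code: the `Task` fields (`auto_reclaim`, `flags`, `handoff_open`, `handoff_past_sla`) and the action names are hypothetical, chosen only to mirror the bullets above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # All field names are assumptions made for this sketch.
    task_id: str
    status: str = "ready"
    auto_reclaim: bool = True
    flags: set = field(default_factory=set)
    handoff_open: bool = False
    handoff_past_sla: bool = False


def open_circuit(task: Task) -> list:
    """Apply the expected responses and return the actions taken, in order."""
    actions = []

    # 1. Auto-pause: keep the task blocked and stop automatic re-claims.
    task.status = "blocked"
    task.auto_reclaim = False
    actions.append("pause")

    # 2. Force a handoff, or escalate one that already exists and is past SLA.
    if not task.handoff_open:
        task.handoff_open = True
        actions.append("handoff")
    elif task.handoff_past_sla:
        actions.append("escalate")

    # 3. Mark the task so a listing command can surface the diagnostic.
    task.flags.add("circuit_open")
    actions.append("flag")
    return actions
```

With a fresh task the function pauses, opens a handoff, and sets the flag; with a handoff already past its SLA it escalates instead of opening a second one.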

Actual

Observed in production: ~230 worker spawn cycles across 4 tasks over a ~7 hour window during a CI runner saturation event. Each spawn was a full agent boot + context load + situation re-discovery, all bailing in 24-30 seconds on the same unchanged external condition. The triage watcher did correctly escalate orchestrator handoffs at the 60-minute mark, but the dispatcher kept re-claiming the source tasks because nothing stopped it.

Sample event sequence (one task, abbreviated):

#107 blocked → "98th consecutive run — PR still queued, 0 progress, unchanged"
#108 blocked → "99th consecutive run — PR still queued, 0 progress, unchanged"
#109 blocked → "100th run — same CI infra wall, unchanged"
... (continues)
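The sample rows show why exact string comparison is not enough: each reason embeds a run counter. The sketch below is one possible way to compare reasons, assuming a normalization step that replaces numbers and ordinals with a placeholder and a similarity threshold of 0.9. Both `normalize_reason` and `reasons_match` are hypothetical helpers, and the threshold is an arbitrary starting point.

```python
import re
from difflib import SequenceMatcher


def normalize_reason(reason: str) -> str:
    """Lowercase, replace numbers/ordinals with '#', collapse punctuation."""
    text = reason.lower()
    text = re.sub(r"\d+(st|nd|rd|th)?", "#", text)  # "98th" -> "#", "0" -> "#"
    text = re.sub(r"[^a-z#]+", " ", text)           # punctuation/dashes -> space
    return text.strip()


def reasons_match(a: str, b: str, min_ratio: float = 0.9) -> bool:
    """True when two block reasons are near-identical after normalization."""
    na, nb = normalize_reason(a), normalize_reason(b)
    return SequenceMatcher(None, na, nb).ratio() >= min_ratio


r107 = "98th consecutive run — PR still queued, 0 progress, unchanged"
r108 = "99th consecutive run — PR still queued, 0 progress, unchanged"
r109 = "100th run — same CI infra wall, unchanged"

print(reasons_match(r107, r108))  # counters differ, wording is the same
print(reasons_match(r108, r109))  # same blocker, but reworded by the worker
```

Under these assumptions the first pair matches and the second does not, even though a human would read all three as the same blocker. Any text-based matcher needs a threshold loose enough to tolerate this kind of rewording, or the counter will reset on cosmetic changes.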

Suspected cause

kanban_block records the reason but does not track a count of consecutive blocks with the same reason on the task. The dispatcher's claim selection only considers status=ready|blocked-with-unblock-time-passed and does not penalize tasks that have repeatedly bailed on the same external condition.
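To make the suspected gap concrete, the sketch below contrasts a claim filter that looks only at status and unblock time with one that also checks a bail counter. The `Task` shape, the `unblock_at` field, and both function names are assumptions for illustration; the real dispatcher's data model may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    # Hypothetical fields; only the concepts come from the issue text.
    status: str
    unblock_at: Optional[datetime] = None
    consecutive_identical_bails: int = 0


def claimable_without_breaker(task: Task, now: datetime) -> bool:
    """Suspected current rule: ready, or blocked with unblock time passed."""
    if task.status == "ready":
        return True
    return (
        task.status == "blocked"
        and task.unblock_at is not None
        and task.unblock_at <= now
    )


def claimable_with_breaker(task: Task, now: datetime, threshold: int = 5) -> bool:
    """Same rule, plus a refusal once the bail counter reaches the threshold."""
    return (
        claimable_without_breaker(task, now)
        and task.consecutive_identical_bails < threshold
    )


now = datetime(2024, 1, 1, 12, 0)
looping = Task("blocked", unblock_at=now - timedelta(minutes=1),
               consecutive_identical_bails=100)
print(claimable_without_breaker(looping, now))  # re-claimed on every tick
print(claimable_with_breaker(looping, now))     # refused once the breaker trips
```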

Suggested implementation sketch

  • Track consecutive_identical_bails counter on the task, incremented when a new block event's reason is fuzzy-matched to the prior one (or simply substring-matched on a normalized form)
  • Reset counter when the block reason changes substantively, when a comment is added by a non-worker (human/orchestrator intervention), or when the task transitions through done
  • At consecutive_identical_bails >= N (default 5, configurable), refuse to re-claim and emit a circuit_open diagnostic + force a triage-watcher handoff if one doesn't exist
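The counter rules above can be sketched as a small tracker. This is an illustration under stated assumptions, not the proposed patch: the class name `BailTracker`, the role string `"worker"`, and the method names are hypothetical. It uses the simpler exact-match-on-normalized-form option, and it counts the first bail as 1 so that the circuit opens on the Nth consecutive bail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BailTracker:
    threshold: int = 5  # N, configurable
    last_reason: Optional[str] = None
    consecutive_identical_bails: int = 0

    @staticmethod
    def _normalize(reason: str) -> str:
        # Simplest normalization: case-fold and collapse whitespace.
        return " ".join(reason.lower().split())

    def record_block(self, reason: str) -> bool:
        """Record a block event; return True when the circuit should open."""
        key = self._normalize(reason)
        if key == self.last_reason:
            self.consecutive_identical_bails += 1
        else:
            # Reason changed substantively: start a new streak.
            self.last_reason = key
            self.consecutive_identical_bails = 1
        return self.consecutive_identical_bails >= self.threshold

    def record_comment(self, author_role: str) -> None:
        """A comment from anyone but the worker counts as intervention."""
        if author_role != "worker":
            self.reset()

    def record_done(self) -> None:
        """Transitioning through done clears the streak."""
        self.reset()

    def reset(self) -> None:
        self.last_reason = None
        self.consecutive_identical_bails = 0


tracker = BailTracker()
results = [tracker.record_block("CI queued, 0 progress, unchanged") for _ in range(5)]
print(results)  # only the fifth identical bail trips the breaker
```

Swapping `_normalize` for a fuzzy comparison changes only the equality check in `record_block`; the reset rules stay the same.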

Workaround

Operator-side: orchestrator must manually monitor blocked tasks and intervene before the loop runs hundreds of cycles. Encoded as a checklist in the kanban-orchestrator skill, but easy to miss when the orchestrator is on a long side-quest.

Related

  • Triage watcher orchestrator-handoff SLA (60min) — works correctly but is downstream of the loop, not the loop itself
  • max_retries on tasks — counts failures, not voluntary bails, so doesn't trip on this pattern
