hermes - bug(kanban): failure_limit circuit breaker bypassed when worker crashes before first heartbeat

Summary

kanban.failure_limit does not auto-block a task when its worker crashes before sending its first heartbeat. Affected cases include invalid skill names, import errors, and any other startup exception. The task respawns indefinitely, once every dispatch_interval_seconds, regardless of the failure_limit value.

Reproduction

  1. Create a kanban task with a skill name that doesn't exist:
    hermes kanban create --board <board> --body "test" --skills nonexistent-skill
  2. Set kanban.failure_limit: 2 in the config (any value reproduces the bug)
  3. Observe: the task spawns, crashes with ValueError: Unknown skill(s), and is reclaimed and respawned every 60s without ever being auto-blocked

Root cause

The crash happens before the worker writes its first heartbeat or lock entry. detect_crashed_workers identifies the dead process, but the consecutive_failures counter stored against the task lock entry is never created (no lock = no counter). Each dispatch cycle therefore sees the task as having 0 failures and re-claims it, resetting the circuit breaker window.

Post-heartbeat crashes (worker dies mid-execution) correctly increment consecutive_failures because a lock entry exists.
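
The difference between the two crash paths can be illustrated with a toy model. This is a minimal sketch, not hermes code: `LockEntry`, `Board`, `record_crash`, and `run_cycles` are hypothetical names, and the only detail taken from the report is that consecutive_failures lives on the lock entry, which does not exist until the worker's first heartbeat.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

FAILURE_LIMIT = 2  # mirrors kanban.failure_limit: 2 from the report


@dataclass
class LockEntry:
    """Hypothetical stand-in for the transient per-task lock entry."""
    consecutive_failures: int = 0


@dataclass
class Board:
    """Toy board: a lock exists only once a worker has sent a heartbeat."""
    locks: Dict[str, LockEntry] = field(default_factory=dict)
    blocked: Set[str] = field(default_factory=set)

    def record_crash(self, task_id: str) -> None:
        # Models the described behaviour: the counter is stored on the lock
        # entry, so a crash without a lock entry has nowhere to be counted.
        lock = self.locks.get(task_id)
        if lock is None:
            return
        lock.consecutive_failures += 1
        if lock.consecutive_failures >= FAILURE_LIMIT:
            self.blocked.add(task_id)


def run_cycles(board: Board, task_id: str, cycles: int,
               heartbeat_before_crash: bool) -> int:
    """Spawn, crash, and reclaim the task; return the number of spawns."""
    spawns = 0
    for _ in range(cycles):
        if task_id in board.blocked:
            break  # circuit breaker tripped, no further dispatch
        spawns += 1
        if heartbeat_before_crash:
            # The worker lived long enough to create its lock entry.
            board.locks.setdefault(task_id, LockEntry())
        board.record_crash(task_id)
    return spawns
```

In this model, `run_cycles(Board(), "t", 10, heartbeat_before_crash=True)` stops after FAILURE_LIMIT spawns, while the same call with `heartbeat_before_crash=False` uses all ten cycles and never blocks the task.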

Impact (real-world case)

Observed in production on hermes-agent 0.14.0:

  • Skill name typo (foundry-vtt-mcp-macos / foundry-vtt-macos) caused 1192 consecutive crashes over 48 hours
  • failure_limit: 2 never triggered
  • System load peaked at 17.26 on a 20-core M1 Ultra
  • Contributed to gateway crash loop → 1000+ Discord reconnect attempts → system reboot
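
As a rough sanity check on these figures, the arithmetic below compares the reported crash count with the one-spawn-per-cycle ceiling. It assumes the 60-second dispatch_interval_seconds from the Environment section held for the whole 48 hours; the variable names are illustrative.

```python
DISPATCH_INTERVAL_SECONDS = 60  # kanban.dispatch_interval_seconds from the report
FAILURE_LIMIT = 2               # kanban.failure_limit from the report
WINDOW_HOURS = 48
OBSERVED_CRASHES = 1192

window_seconds = WINDOW_HOURS * 3600

# Upper bound on spawns if every dispatch cycle re-claimed the task.
max_cycles = window_seconds // DISPATCH_INTERVAL_SECONDS

# Average spacing between the crashes that were actually observed.
mean_gap_seconds = window_seconds / OBSERVED_CRASHES

# How many times over the configured limit was exceeded.
limit_multiple = OBSERVED_CRASHES // FAILURE_LIMIT

print(max_cycles)               # 2880
print(round(mean_gap_seconds))  # 145
print(limit_multiple)           # 596
```

The observed count sits below the 2880-cycle ceiling and is 596 times the configured limit of 2.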

Expected behaviour

After failure_limit consecutive pre-heartbeat crashes, the task should be auto-blocked with a diagnostic message explaining the startup failure (same as post-heartbeat auto-block behaviour).

Suggested fix

Track pre-heartbeat spawn failures in a separate counter on the task record itself (not on the transient lock entry), so the circuit breaker fires regardless of whether a heartbeat was ever sent.
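
One way the suggestion above could look is sketched here. This is an illustration under stated assumptions, not the project's implementation: `TaskRecord`, `spawn_failures`, `record_spawn_failure`, `record_first_heartbeat`, and the "ready"/"blocked" status strings are all hypothetical. The point is only that the counter lives on the task record, so it survives crashes that never create a lock entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical persistent task record; field names are illustrative."""
    task_id: str
    spawn_failures: int = 0            # consecutive pre-heartbeat crashes
    status: str = "ready"              # assumed status values: "ready", "blocked"
    block_reason: Optional[str] = None


def record_spawn_failure(task: TaskRecord, failure_limit: int, error: str) -> bool:
    """Count one pre-heartbeat crash; return True if the task was auto-blocked."""
    task.spawn_failures += 1
    if task.spawn_failures >= failure_limit:
        task.status = "blocked"
        task.block_reason = (
            f"worker crashed before first heartbeat "
            f"{task.spawn_failures} times in a row; last error: {error}"
        )
        return True
    return False


def record_first_heartbeat(task: TaskRecord) -> None:
    """A successful startup ends the streak of consecutive spawn failures."""
    task.spawn_failures = 0
```

With failure_limit set to 2, the second call to `record_spawn_failure` blocks the task and stores the last startup error as the diagnostic message. Resetting the counter on the first heartbeat keeps the check limited to consecutive failures, in line with the expected behaviour described earlier.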

Note: PR #33747 (preflight skill validation at dispatch time) addresses the skill-name case specifically and will prevent this for that class of errors. This issue covers the general mechanism — any startup crash bypasses the circuit breaker today.

Environment

  • hermes-agent 0.14.0 / 0.15.1
  • macOS 15 (M1 Ultra)
  • kanban.failure_limit: 2, kanban.dispatch_interval_seconds: 60
