hermes - bug(kanban): `failure_limit` circuit breaker bypassed when worker crashes before first heartbeat

StepCodex · 2026-05-30T07:03:21Z


## Summary

`kanban.failure_limit` does not auto-block tasks when the worker crashes **before sending its first heartbeat**. Affected cases include invalid skill names, import errors, and any other startup exception. The task respawns indefinitely, once every `dispatch_interval_seconds`, regardless of the `failure_limit` value.

## Root cause

The crash happens before the worker writes its first heartbeat or lock entry. `detect_crashed_workers` identifies the dead process, but the `consecutive_failures` counter stored against the task lock entry **is never created** (no lock = no counter). Each dispatch cycle therefore sees the task as having 0 failures and re-claims it, resetting the circuit breaker window.

Post-heartbeat crashes (worker dies mid-execution) correctly increment `consecutive_failures` because a lock entry exists.
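
The mechanism described above can be modelled in a few lines. This is a minimal sketch, not hermes code: `Board`, `LockEntry`, `record_crash`, and `simulate` are hypothetical names invented for illustration, and the only things taken from the issue are the idea that `consecutive_failures` lives on the lock entry and that `failure_limit` gates the auto-block.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class LockEntry:
    """Hypothetical lock entry; assumed to exist only after a first heartbeat."""
    consecutive_failures: int = 0


@dataclass
class Board:
    """Hypothetical in-memory stand-in for the kanban store."""
    failure_limit: int
    locks: Dict[str, LockEntry] = field(default_factory=dict)
    blocked: Set[str] = field(default_factory=set)

    def record_crash(self, task_id: str) -> None:
        # Models the reported behaviour: the counter is stored on the lock
        # entry, so without a lock there is nothing to increment.
        lock = self.locks.get(task_id)
        if lock is None:
            return
        lock.consecutive_failures += 1
        if lock.consecutive_failures >= self.failure_limit:
            self.blocked.add(task_id)


def simulate(board: Board, task_id: str, cycles: int, heartbeat_before_crash: bool) -> int:
    """Run dispatch cycles and return how many times the task was spawned."""
    spawns = 0
    for _ in range(cycles):
        if task_id in board.blocked:
            break
        spawns += 1
        if heartbeat_before_crash:
            # The worker lived long enough to write its lock entry.
            board.locks.setdefault(task_id, LockEntry())
        board.record_crash(task_id)
    return spawns
```

With `failure_limit=2`, a task whose worker heartbeats before crashing is spawned twice and then blocked, while a task whose worker dies at startup is spawned on every cycle and never blocked.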

## Fix / Workaround

Note: PR #33747 (preflight skill validation at dispatch time) addresses the skill-name case specifically and will prevent this for that class of errors. This issue covers the general mechanism: any startup crash bypasses the circuit breaker today.
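
To illustrate what preflight validation means in general, here is a rough sketch that rejects unknown skill names before any worker is spawned. It is not the code from PR #33747: the function names and the `installed` set are assumptions made for this example, and only the `Unknown skill(s)` error text is taken from the report.

```python
from typing import Iterable, List, Set


def find_unknown_skills(requested: Iterable[str], installed: Set[str]) -> List[str]:
    """Return the requested skill names that are not installed, in request order."""
    return [name for name in requested if name not in installed]


def preflight(requested: Iterable[str], installed: Set[str]) -> None:
    """Fail fast at dispatch time instead of letting the worker crash at startup."""
    unknown = find_unknown_skills(requested, installed)
    if unknown:
        raise ValueError(f"Unknown skill(s): {', '.join(unknown)}")
```

A check of this kind only covers errors that can be detected up front, which is why the general circuit-breaker gap still needs its own fix.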

## Reproduction

1. Create a kanban task with a skill name that doesn't exist: `hermes kanban create --board <board> --body "test" --skills nonexistent-skill`
2. Set `kanban.failure_limit: 2` in config (or any value).
3. Observe: the task spawns, crashes with `ValueError: Unknown skill(s)`, and is reclaimed and respawned every 60s **without ever being auto-blocked**.

## Impact (real-world case)

Observed in production on hermes-agent 0.14.0:

- Skill name typo (`foundry-vtt-mcp-macos` → `foundry-vtt-macos`) caused **1192 consecutive crashes over 48 hours**
- `failure_limit: 2` never triggered
- System load peaked at 17.26 on a 20-core M1 Ultra
- Contributed to gateway crash loop → 1000+ Discord reconnect attempts → system reboot

## Expected behaviour

After `failure_limit` consecutive pre-heartbeat crashes, the task should be auto-blocked with a diagnostic message explaining the startup failure (same as post-heartbeat auto-block behaviour).

## Suggested fix

Track pre-heartbeat spawn failures in a separate counter on the task record itself (not on the transient lock entry), so the circuit breaker fires regardless of whether a heartbeat was ever sent.
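
One possible shape for this suggestion is sketched below. It assumes a persistent task record with a hypothetical `spawn_failures` field, and that the dispatcher calls `record_spawn_failure` whenever it finds a worker that died without ever heartbeating. None of these names come from hermes, and resetting the counter on the first heartbeat is a design assumption rather than something the issue specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical persistent task record (not the transient lock entry)."""
    task_id: str
    spawn_failures: int = 0  # hypothetical field: consecutive pre-heartbeat crashes
    status: str = "ready"
    block_reason: Optional[str] = None


def record_spawn_failure(task: TaskRecord, failure_limit: int, error: str) -> bool:
    """Count a crash that happened before the first heartbeat.

    Returns True if this call auto-blocked the task.
    """
    task.spawn_failures += 1
    if task.spawn_failures >= failure_limit:
        task.status = "blocked"
        task.block_reason = (
            f"auto-blocked after {task.spawn_failures} consecutive startup "
            f"failures; last error: {error}"
        )
        return True
    return False


def record_first_heartbeat(task: TaskRecord) -> None:
    """A worker that starts successfully ends the consecutive-failure streak."""
    task.spawn_failures = 0
```

Because the counter is stored on the task rather than on the lock, it survives across dispatch cycles even when no lock entry is ever written, and the block reason carries the startup error as the diagnostic message.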

## Environment

- hermes-agent 0.14.0 / 0.15.1
- macOS 15 (M1 Ultra)
- `kanban.failure_limit: 2`, `kanban.dispatch_interval_seconds: 60`
