hermes - [Bug]: Kanban stale claim locks from dead workers have no auto-cleanup (tasks permanently stuck until manual intervention)

Summary

When a kanban worker process dies unexpectedly (OOM, segfault, SIGKILL, system reboot), its claim on the task remains in the DB (claim_lock, claim_expires, worker_pid). The claim TTL (~15 minutes) is supposed to handle this, but in practice:

  1. The dispatcher doesn't always re-check expired claims on its tick
  2. The claim expiry timestamp may not be set correctly for all failure modes
  3. There is no watchdog that periodically scans for and clears stale claims

The result: tasks get permanently stuck in running status with a dead worker's claim. They are invisible to hermes kanban dispatch and don't appear in the blocked/crashed dashboard view. The only fix is manual DB surgery:

-- Release the dead worker's claim and requeue the task (replace t_xxx with the stuck task's id)
UPDATE tasks SET
  claim_lock = NULL, claim_expires = NULL, worker_pid = NULL, current_run_id = NULL,
  status = 'ready'
WHERE id = 't_xxx';
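
Before running that UPDATE, the stuck tasks have to be found. The following is a minimal discovery sketch, not Hermes code. It assumes the kanban store is a SQLite database, that claim_expires holds Unix epoch seconds, and that the tasks table has the columns named in the UPDATE above. The function names find_stale_claims and make_demo_db are hypothetical. Treating a NULL expiry on a claimed running task as stale is a design choice made here to cover the "expiry not set" failure mode.

```python
import sqlite3
import time


def find_stale_claims(conn, now=None):
    """List running tasks whose claim has expired or never got an expiry."""
    now = time.time() if now is None else now
    return conn.execute(
        """
        SELECT id, worker_pid, claim_expires
        FROM tasks
        WHERE status = 'running'
          AND claim_lock IS NOT NULL
          AND (claim_expires IS NULL OR claim_expires < ?)
        ORDER BY id
        """,
        (now,),
    ).fetchall()


def make_demo_db():
    """Build an in-memory table with the columns named in this issue (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, claim_lock TEXT,"
        " claim_expires REAL, worker_pid INTEGER, current_run_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("t_live", "running", "lock-a", 2000.0, 111, "r1"),  # claim still valid
            ("t_dead", "running", "lock-b", 500.0, 222, "r2"),   # claim expired
            ("t_noexp", "running", "lock-c", None, 333, "r3"),   # expiry never set
            ("t_ready", "ready", None, None, None, None),        # unclaimed
        ],
    )
    return conn


if __name__ == "__main__":
    # With "now" fixed at 1000, only t_dead and t_noexp are reported.
    for row in find_stale_claims(make_demo_db(), now=1000.0):
        print(row)
```

The query only reads, so it is safe to run against a live board to decide which task ids need the manual reset.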

Steps to Reproduce

  1. Start a kanban task with a long-running worker
  2. Kill the worker process with SIGKILL (or let it OOM)
  3. Wait 15+ minutes
  4. The task is still in running status with the dead worker's claim, and the dispatcher does not pick it up

Expected Behavior

  1. The dispatcher should check claim_expires on every tick and clear expired claims
  2. A periodic watchdog (or gateway startup check) should scan for status = 'running' tasks with expired claims and reset them to ready
  3. The dashboard should show "stale claim" as a recovery option with a one-click "reclaim" button
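
The first two expectations could be met by one idempotent sweep that the dispatcher calls on each tick and the gateway calls once at startup. This is a rough sketch under assumptions: a SQLite store, claim_expires stored as Unix epoch seconds, and the column names taken from the manual UPDATE in this issue. sweep_expired_claims and make_sweep_demo_db are hypothetical names, not existing Hermes functions.

```python
import sqlite3
import time


def sweep_expired_claims(conn, now=None):
    """Reset running tasks with an expired claim to 'ready'; return how many were reset."""
    now = time.time() if now is None else now
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            """
            UPDATE tasks SET
              claim_lock = NULL, claim_expires = NULL, worker_pid = NULL,
              current_run_id = NULL, status = 'ready'
            WHERE status = 'running'
              AND claim_expires IS NOT NULL
              AND claim_expires < ?
            """,
            (now,),
        )
    return cur.rowcount


def make_sweep_demo_db():
    """In-memory table using the assumed schema, with one live and one expired claim."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, claim_lock TEXT,"
        " claim_expires REAL, worker_pid INTEGER, current_run_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("t_live", "running", "lock-a", 2000.0, 111, "r1"),
            ("t_dead", "running", "lock-b", 500.0, 222, "r2"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = make_sweep_demo_db()
    print("reset:", sweep_expired_claims(conn, now=1000.0))
    print(conn.execute("SELECT id, status FROM tasks ORDER BY id").fetchall())
```

Because the WHERE clause matches only running tasks with a past expiry, calling the sweep repeatedly is harmless. Claims whose expiry was never written are not covered by this sketch and would need a separate check.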

Suggested Fix

  • Add a claim-expiry sweep in the dispatcher's main loop
  • On gateway startup, run a one-time scan for orphaned claims
  • Add hermes kanban reclaim --stale to bulk-reclaim all expired-claim tasks
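
For the startup scan, one option is to ignore the expiry timestamp and instead check whether the recorded worker_pid still exists. The sketch below is an illustration only: it assumes a SQLite store with the columns named in this issue, a POSIX host, and workers running on the same machine as the scan. pid_is_alive, reclaim_orphaned_claims, and make_orphan_demo_db are hypothetical names.

```python
import os
import sqlite3


def pid_is_alive(pid):
    """Best-effort POSIX check: signal 0 probes for the process without signalling it."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def reclaim_orphaned_claims(conn, is_alive=pid_is_alive):
    """Reset running tasks whose recorded worker process is gone; return their ids."""
    rows = conn.execute(
        "SELECT id, worker_pid FROM tasks"
        " WHERE status = 'running' AND claim_lock IS NOT NULL ORDER BY id"
    ).fetchall()
    orphaned = [(task_id, pid) for task_id, pid in rows if not is_alive(pid)]
    with conn:
        conn.executemany(
            # Re-check status and worker_pid so a task re-claimed in between is left alone.
            "UPDATE tasks SET claim_lock = NULL, claim_expires = NULL, worker_pid = NULL,"
            " current_run_id = NULL, status = 'ready'"
            " WHERE id = ? AND status = 'running' AND worker_pid IS ?",
            orphaned,
        )
    return [task_id for task_id, _ in orphaned]


def make_orphan_demo_db():
    """In-memory table using the assumed schema, with two claimed running tasks."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, claim_lock TEXT,"
        " claim_expires REAL, worker_pid INTEGER, current_run_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("t_a", "running", "lock-a", 2000.0, 101, "r1"),
            ("t_b", "running", "lock-b", 2000.0, 202, "r2"),
        ],
    )
    return conn
```

A PID can be reused by an unrelated process, especially after a reboot, so a liveness check like this can miss orphans. It works best combined with the expiry-based sweep, and a bulk command such as the proposed hermes kanban reclaim --stale could call both.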

Environment

  • Hermes Agent v2.x
  • The hermes kanban reclaim <task_id> command exists for single tasks, but each stuck task must first be found manually
  • The kanban-orchestrator skill documents this under "Recovering stuck workers"
