hermes [Bug]: kanban_heartbeat tool doesn't extend claim TTL; diligent workers reclaimed at 15 min

Summary

The kanban_heartbeat tool that workers call (registered via tools/kanban_tools.py) only updates last_heartbeat_at — it does not extend claim_expires. As a result, a diligent worker that loops kanban_heartbeat while running a long synchronous tool call (e.g. xcodebuild archive, large flutter test, training loop) still gets reclaimed at the default 15-minute claim TTL and re-spawned by the dispatcher. The function name and its docstring imply otherwise.

This is likely the underlying cause of the "reclaims & respawns were exactly 15 minutes apart" symptom reported in #21141. That issue addresses the post-reclaim cleanup (old worker not killed). The two are complementary fixes, not duplicates: this issue keeps diligent workers from being reclaimed in the first place; #21141 ensures that when reclamation does happen (a truly stuck worker), the old process is actually terminated.

Repro

  1. Create a task with default settings: hermes kanban create "long task" --assignee my-profile --workspace dir:/tmp/foo
  2. Worker is dispatched. In its loop it calls kanban_heartbeat every 30 s.
  3. Worker's current shell command runs longer than DEFAULT_CLAIM_TTL_SECONDS (15 min).
  4. Dispatcher's release_stale_claims() (kanban_db.py:1846) reclaims the task because claim_expires < now, even though last_heartbeat_at is fresh.
  5. A new worker is spawned for the same task — duplicate work / corruption risk on shared workspaces.
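
The timing in these steps can be modelled without a running dispatcher. The sketch below is a simplified in-memory stand-in, not the hermes implementation: the Task class and the is_stale and simulate helpers are hypothetical, and only the comparison claim_expires < now is taken from the description of release_stale_claims() in step 4.

```python
from dataclasses import dataclass

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # 15-minute default named in the issue


@dataclass
class Task:
    """Hypothetical stand-in for a row in the tasks table."""
    claim_expires: int
    last_heartbeat_at: int


def is_stale(task: Task, now: int) -> bool:
    # Mirrors the condition described in step 4: only claim_expires is
    # consulted, so a fresh last_heartbeat_at does not protect the task.
    return task.claim_expires < now


def simulate(extend_claim: bool, run_seconds: int, beat_every: int = 30) -> bool:
    """Return True if the task would be reclaimed while the worker runs."""
    task = Task(claim_expires=DEFAULT_CLAIM_TTL_SECONDS, last_heartbeat_at=0)
    for now in range(beat_every, run_seconds + 1, beat_every):
        if is_stale(task, now):
            return True
        task.last_heartbeat_at = now  # what the tool does today
        if extend_claim:
            # What the proposed fix adds on every heartbeat.
            task.claim_expires = now + DEFAULT_CLAIM_TTL_SECONDS
    return False


reclaimed_today = simulate(extend_claim=False, run_seconds=20 * 60)
reclaimed_fixed = simulate(extend_claim=True, run_seconds=20 * 60)
print(reclaimed_today, reclaimed_fixed)  # True False
```

Under this model a 20-minute command is reclaimed on the first check after the 15-minute mark when only last_heartbeat_at moves, and is never reclaimed once each heartbeat also pushes claim_expires forward.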

Root cause

tools/kanban_tools.py:317-348 (the _handle_heartbeat function) calls kb.heartbeat_worker(...):

ok = kb.heartbeat_worker(
    conn,
    tid,
    note=note,
    expected_run_id=_worker_run_id(tid),
)

heartbeat_worker (hermes_cli/kanban_db.py:2641-2691) only updates last_heartbeat_at on tasks and task_runs, plus appends a heartbeat event. It never touches claim_expires.

The TTL-extending function is heartbeat_claim (hermes_cli/kanban_db.py:1817-1844). Its docstring even states the contract:

"Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership."

But no caller in the worker tool path invokes it. Workers can't call it themselves either — heartbeat_claim is not exposed via any tool.
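
To make the split between the two functions concrete, here is a minimal sketch against an in-memory SQLite database. The schema, the explicit now parameter, and the function bodies are assumptions for illustration; they borrow the names heartbeat_worker and heartbeat_claim from the issue but are not copied from hermes_cli/kanban_db.py.

```python
import sqlite3

TTL = 15 * 60  # seconds; mirrors the 15-minute default named in the issue


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical, minimal schema; the real tasks table has more columns.
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, claim_lock TEXT, "
        "claim_expires INTEGER, last_heartbeat_at INTEGER)"
    )
    conn.execute("INSERT INTO tasks VALUES ('t1', 'lock-a', ?, 0)", (TTL,))
    return conn


def heartbeat_worker(conn, task_id, now):
    # Liveness only: claim_expires is left untouched.
    cur = conn.execute(
        "UPDATE tasks SET last_heartbeat_at = ? WHERE id = ?", (now, task_id)
    )
    return cur.rowcount == 1


def heartbeat_claim(conn, task_id, claimer, now):
    # Ownership: extend the TTL only if the caller still holds the lock.
    cur = conn.execute(
        "UPDATE tasks SET claim_expires = ? WHERE id = ? AND claim_lock = ?",
        (now + TTL, task_id, claimer),
    )
    return cur.rowcount == 1


def claim_expires(conn, task_id):
    return conn.execute(
        "SELECT claim_expires FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()[0]


conn = make_db()
heartbeat_worker(conn, "t1", now=600)
after_worker_beat = claim_expires(conn, "t1")  # still the original expiry
heartbeat_claim(conn, "t1", "lock-a", now=600)
after_claim_beat = claim_expires(conn, "t1")   # pushed out by one TTL
stale_owner_ok = heartbeat_claim(conn, "t1", "lock-b", now=700)  # wrong lock
```

In this sketch a worker that only ever reaches heartbeat_worker keeps last_heartbeat_at fresh while claim_expires stays at its original value, which is the state the dispatcher then treats as stale.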

Test gap

The kanban_heartbeat tool tests (tests/tools/test_kanban_tools.py:202-218) only check the tool returns ok: true — they don't verify claim_expires actually moves. The heartbeat_claim function is well-tested in isolation (tests/hermes_cli/test_kanban_db.py:231 test_heartbeat_extends_claim), but the integration through the tool is unverified, which is how this regression slipped past CI.

Proposed fix

In tools/kanban_tools.py, _handle_heartbeat should also extend the claim. Two-line change:

def _handle_heartbeat(args: dict, **kw) -> str:
    tid = _default_task_id(args.get("task_id"))
    if not tid:
        return tool_error(...)
    ownership_err = _enforce_worker_task_ownership(tid)
    if ownership_err:
        return ownership_err
    note = args.get("note")
    try:
        kb, conn = _connect()
        try:
            # Extend the claim TTL — without this, a worker that heartbeats
            # diligently still gets reclaimed at DEFAULT_CLAIM_TTL_SECONDS.
            # The claim_lock check inside heartbeat_claim prevents extending
            # a claim we no longer own.
            claim_lock = os.environ.get("HERMES_KANBAN_CLAIM_LOCK")
            kb.heartbeat_claim(conn, tid, claimer=claim_lock)

            ok = kb.heartbeat_worker(
                conn, tid, note=note,
                expected_run_id=_worker_run_id(tid),
            )
            if not ok:
                return tool_error(
                    f"could not heartbeat {tid} (unknown id or not running)"
                )
            return _ok(task_id=tid)
        finally:
            conn.close()
    except Exception as e:
        logger.exception("kanban_heartbeat failed")
        return tool_error(f"kanban_heartbeat: {e}")

The dispatcher already sets HERMES_KANBAN_CLAIM_LOCK in the worker env (hermes_cli/kanban_db.py:3293), so claim_lock is the right value to pass. If heartbeat_claim returns False (the worker no longer owns the claim because it was reclaimed), heartbeat_worker is left to fail as well, and the tool surfaces the standard "not running" error to the worker, which can then exit cleanly.
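
On the worker side, "exit cleanly" could look roughly like the following. This is a sketch under assumptions, not part of hermes: run_with_heartbeat is a hypothetical helper, and beat is a hypothetical callable wrapping the kanban_heartbeat tool that returns False once the tool reports an error. The helper does not interrupt the long-running call; it only reports that the claim was lost so the caller can decide what to discard.

```python
import threading
import time
from typing import Callable


def run_with_heartbeat(
    work: Callable[[], object],
    beat: Callable[[], bool],
    interval: float = 30.0,
) -> tuple[object, bool]:
    """Run `work` while calling `beat` every `interval` seconds.

    Returns (result, claim_lost). Both callables are supplied by the caller;
    nothing here talks to the kanban database directly.
    """
    stop = threading.Event()
    lost = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout, i.e. when it is time to beat.
        while not stop.wait(interval):
            if not beat():
                lost.set()  # heartbeat rejected: the claim is gone
                return

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    try:
        result = work()
    finally:
        stop.set()
        thread.join()
    return result, lost.is_set()


# Usage with stand-ins: the third heartbeat is rejected.
beats = iter([True, True, False])
result, claim_lost = run_with_heartbeat(
    work=lambda: time.sleep(0.3) or "done",
    beat=lambda: next(beats, False),
    interval=0.02,
)
```

A worker that sees claim_lost set would skip publishing its result and exit, since another run may already own the task.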

Test that would have caught this

def test_heartbeat_extends_claim(worker_env):
    """The kanban_heartbeat tool must extend claim_expires, not just
    update last_heartbeat_at — otherwise long-running workers are reclaimed
    despite heartbeating."""
    import json
    import time

    from tools import kanban_tools as kt
    from hermes_cli import kanban_db as kb

    conn = kb.connect()
    try:
        before = conn.execute(
            "SELECT claim_expires FROM tasks WHERE id = ?", (worker_env,)
        ).fetchone()["claim_expires"]
    finally:
        conn.close()

    time.sleep(1)  # ensure now() > before
    out = kt._handle_heartbeat({"note": "still alive"})
    assert json.loads(out)["ok"] is True

    conn = kb.connect()
    try:
        after = conn.execute(
            "SELECT claim_expires FROM tasks WHERE id = ?", (worker_env,)
        ).fetchone()["claim_expires"]
    finally:
        conn.close()

    assert after > before, (
        f"claim_expires did not advance ({before} -> {after}); "
        f"worker would be reclaimed at TTL despite heartbeating"
    )

Severity

Medium. Workers that finish under 15 min are unaffected. Workers that exceed 15 min on a single tool call (Xcode archive, large image generation, dataset processing) experience a silent re-spawn: they appear to "loop" from the user's perspective and their first run's progress is discarded. This is particularly painful when combined with --max-runtime, since the per-task wall-clock budget is consumed by the reclaimed first run, leaving the re-spawn with less budget than expected.
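
The budget effect can be put in numbers with a small model. It rests on assumptions that are not confirmed by the source: that --max-runtime is a single wall-clock budget shared by all runs of a task, that each run restarts the job from scratch, and that every run is reclaimed exactly at the TTL. The function name is hypothetical.

```python
TTL_S = 15 * 60  # default claim TTL in seconds


def runs_until_budget_exhausted(max_runtime_s: int, job_s: int, ttl_s: int = TTL_S):
    """Return (completed, runs_started) under the assumed model.

    Each run gets at most ttl_s seconds before it is reclaimed, and every
    second it uses is subtracted from the shared budget.
    """
    remaining = max_runtime_s
    runs = 0
    while remaining > 0:
        runs += 1
        slice_s = min(remaining, ttl_s)
        if job_s <= slice_s:
            return True, runs  # the job fits inside this run
        remaining -= slice_s   # run reclaimed; its progress is discarded
    return False, runs


# A 20-minute job under a 40-minute budget.
outcome = runs_until_budget_exhausted(max_runtime_s=40 * 60, job_s=20 * 60)
print(outcome)  # (False, 3)
```

Under these assumptions a 20-minute job never completes: three runs are started, each is cut off before it can finish, and the 40-minute budget is spent entirely on discarded work.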

Related

  • tools/kanban_tools.py:317-348 - bug site
  • hermes_cli/kanban_db.py:1817-1844 - heartbeat_claim
  • hermes_cli/kanban_db.py:2641-2691 - heartbeat_worker
  • hermes_cli/kanban_db.py:1846+ - release_stale_claims (the function that reclaims)
  • hermes_cli/kanban_db.py:3293 - dispatcher sets HERMES_KANBAN_CLAIM_LOCK in worker env

I'm happy to follow up with a PR if useful.
