hermes - 💡(How to fix) Fix [Bug]: kanban_heartbeat tool doesn't extend claim TTL — diligent workers reclaimed at 15min

hermes · 2026-05-07 10:40:16

The kanban_heartbeat tool that workers call (registered via tools/kanban_tools.py) only updates last_heartbeat_at — it does not extend claim_expires. As a result, a diligent worker that loops kanban_heartbeat while running a long synchronous tool call (e.g. xcodebuild archive, large flutter test, training loop) still gets reclaimed at the default 15-minute claim TTL and re-spawned by the dispatcher. The function name and its docstring imply otherwise.

This is likely the underlying cause of the "reclaims & respawns were exactly 15 minutes apart" symptom reported in #21141 — that issue addresses the post-reclaim cleanup (old worker not killed). The two issues are complementary fixes, not duplicates: my issue keeps diligent workers from being reclaimed in the first place; #21141 ensures that when reclamation does happen (truly stuck worker), the old process is actually terminated.

Fix / Workaround

Have _handle_heartbeat call kb.heartbeat_claim(conn, tid, claimer=claim_lock) before kb.heartbeat_worker(...), so that every kanban_heartbeat call also pushes claim_expires forward.

The dispatcher already sets HERMES_KANBAN_CLAIM_LOCK in the worker env (hermes_cli/kanban_db.py:3293), so claim_lock is the right value to pass. If heartbeat_claim returns False (the worker no longer owns the claim because it was reclaimed), we let heartbeat_worker also fail and the tool surfaces the standard "not running" error to the worker, which can then exit cleanly.
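On the worker side, the "exit cleanly" part could look roughly like the sketch below. It is not taken from the hermes code base: run_with_heartbeat and send_heartbeat are hypothetical names, and send_heartbeat stands in for whatever wraps the kanban_heartbeat tool and returns False on the "not running" error.

```python
import threading
import time
from typing import Callable, Tuple


def run_with_heartbeat(
    work: Callable[[], object],
    send_heartbeat: Callable[[], bool],  # hypothetical wrapper around kanban_heartbeat
    interval_s: float = 30.0,
) -> Tuple[object, bool]:
    """Run `work` while a background thread heartbeats every `interval_s` seconds.

    Returns (result, claim_lost). claim_lost is True if any heartbeat
    reported failure, i.e. the task was reclaimed while `work` was running.
    """
    stop = threading.Event()
    claim_lost = threading.Event()

    def beat() -> None:
        # Event.wait returns False on timeout and True once `stop` is set.
        while not stop.wait(interval_s):
            if not send_heartbeat():
                claim_lost.set()
                return

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        result = work()
    finally:
        stop.set()
        thread.join()
    return result, claim_lost.is_set()
```

The thread only records that the claim was lost; it does not interrupt the synchronous call. A caller would check the flag once work returns and skip any further writes to the shared workspace.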

Repro

1. Create a task with default settings: hermes kanban create "long task" --assignee my-profile --workspace dir:/tmp/foo
2. The worker is dispatched. In its loop it calls kanban_heartbeat every 30 s.
3. The worker's current shell command runs longer than DEFAULT_CLAIM_TTL_SECONDS (15 min).
4. The dispatcher's release_stale_claims() (kanban_db.py:1846) reclaims the task because claim_expires < now, even though last_heartbeat_at is fresh.
5. A new worker is spawned for the same task, risking duplicate work and corruption on shared workspaces.
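The timeline above can be replayed against a toy model. This is a minimal sketch, not the real schema: the table layout, the 'running' and 'ready' status values, and the helper names are assumptions made for illustration. Only the column names claim_expires and last_heartbeat_at and the 15-minute TTL come from the report.

```python
import sqlite3

CLAIM_TTL_S = 15 * 60  # stands in for DEFAULT_CLAIM_TTL_SECONDS


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, "
        "claim_expires INTEGER, last_heartbeat_at INTEGER)"
    )
    return conn


def claim(conn, tid, now):
    # Claiming a task sets the expiry one TTL into the future.
    conn.execute(
        "INSERT INTO tasks VALUES (?, 'running', ?, ?)",
        (tid, now + CLAIM_TTL_S, now),
    )


def heartbeat_liveness_only(conn, tid, now):
    # Mirrors the reported behaviour: liveness is refreshed, the claim is not.
    conn.execute("UPDATE tasks SET last_heartbeat_at = ? WHERE id = ?", (now, tid))


def release_stale(conn, now):
    # Reclaim is decided by claim_expires alone; last_heartbeat_at is ignored.
    cur = conn.execute(
        "UPDATE tasks SET status = 'ready' "
        "WHERE status = 'running' AND claim_expires < ?",
        (now,),
    )
    return cur.rowcount


conn = make_db()
claim(conn, "t1", now=0)
for t in range(30, 901, 30):  # heartbeat every 30 s for 15 min
    heartbeat_liveness_only(conn, "t1", now=t)
reclaimed = release_stale(conn, now=901)  # 1: reclaimed despite a heartbeat at t=900
```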

Root cause

tools/kanban_tools.py:317-348 (the _handle_heartbeat function) calls kb.heartbeat_worker(...):

ok = kb.heartbeat_worker(
    conn,
    tid,
    note=note,
    expected_run_id=_worker_run_id(tid),
)

heartbeat_worker (hermes_cli/kanban_db.py:2641-2691) only updates last_heartbeat_at on tasks and task_runs, plus appends a heartbeat event. It never touches claim_expires.

The TTL-extending function is heartbeat_claim (hermes_cli/kanban_db.py:1817-1844). Its docstring even states the contract:

"Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership."

But no caller in the worker tool path invokes it. Workers can't call it themselves either — heartbeat_claim is not exposed via any tool.

Test gap

The kanban_heartbeat tool tests (tests/tools/test_kanban_tools.py:202-218) only check the tool returns ok: true — they don't verify claim_expires actually moves. The heartbeat_claim function is well-tested in isolation (tests/hermes_cli/test_kanban_db.py:231 test_heartbeat_extends_claim), but the integration through the tool is unverified, which is how this regression slipped past CI.

Proposed fix

In tools/kanban_tools.py, _handle_heartbeat should also extend the claim. Two-line change:

import os  # needed for os.environ below, if the module does not already import it

def _handle_heartbeat(args: dict, **kw) -> str:
    tid = _default_task_id(args.get("task_id"))
    if not tid:
        return tool_error(...)
    ownership_err = _enforce_worker_task_ownership(tid)
    if ownership_err:
        return ownership_err
    note = args.get("note")
    try:
        kb, conn = _connect()
        try:
            # Extend the claim TTL — without this, a worker that heartbeats
            # diligently still gets reclaimed at DEFAULT_CLAIM_TTL_SECONDS.
            # The claim_lock check inside heartbeat_claim prevents extending
            # a claim we no longer own.
            claim_lock = os.environ.get("HERMES_KANBAN_CLAIM_LOCK")
            kb.heartbeat_claim(conn, tid, claimer=claim_lock)

            ok = kb.heartbeat_worker(
                conn, tid, note=note,
                expected_run_id=_worker_run_id(tid),
            )
            if not ok:
                return tool_error(
                    f"could not heartbeat {tid} (unknown id or not running)"
                )
            return _ok(task_id=tid)
        finally:
            conn.close()
    except Exception as e:
        logger.exception("kanban_heartbeat failed")
        return tool_error(f"kanban_heartbeat: {e}")
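The comment about the claim_lock check is worth illustrating. The issue does not show the body of heartbeat_claim, so the following is only a sketch of how such a guard might work, assuming it is a single UPDATE filtered on the lock value. The function name extend_claim and the table layout are hypothetical.

```python
import sqlite3

TTL_S = 15 * 60  # stands in for DEFAULT_CLAIM_TTL_SECONDS


def extend_claim(conn, tid, claimer, now):
    """Push claim_expires forward only if `claimer` still holds the lock."""
    cur = conn.execute(
        "UPDATE tasks SET claim_expires = ? WHERE id = ? AND claim_lock = ?",
        (now + TTL_S, tid, claimer),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id TEXT PRIMARY KEY, claim_lock TEXT, claim_expires INTEGER)"
)
conn.execute("INSERT INTO tasks VALUES ('t1', 'lock-A', 900)")

owner_ok = extend_claim(conn, "t1", "lock-A", now=600)  # True: expiry moves to 1500
stale_ok = extend_claim(conn, "t1", "lock-B", now=700)  # False: another worker's lock
unset_ok = extend_claim(conn, "t1", None, now=800)      # False: NULL never equals a value
```

Under this assumption, a missing HERMES_KANBAN_CLAIM_LOCK (os.environ.get returning None) would simply fail to extend the claim rather than extend someone else's.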

Test that would have caught this

import json
import time


def test_heartbeat_extends_claim(worker_env):
    """The kanban_heartbeat tool must extend claim_expires, not just
    update last_heartbeat_at — otherwise long-running workers are reclaimed
    despite heartbeating."""
    from tools import kanban_tools as kt
    from hermes_cli import kanban_db as kb

    conn = kb.connect()
    try:
        before = conn.execute(
            "SELECT claim_expires FROM tasks WHERE id = ?", (worker_env,)
        ).fetchone()["claim_expires"]
    finally:
        conn.close()

    time.sleep(1)  # ensure now() > before
    out = kt._handle_heartbeat({"note": "still alive"})
    assert json.loads(out)["ok"] is True

    conn = kb.connect()
    try:
        after = conn.execute(
            "SELECT claim_expires FROM tasks WHERE id = ?", (worker_env,)
        ).fetchone()["claim_expires"]
    finally:
        conn.close()

    assert after > before, (
        f"claim_expires did not advance ({before} -> {after}); "
        f"worker would be reclaimed at TTL despite heartbeating"
    )

Severity

Medium. Workers that finish in under 15 min are unaffected. Workers that exceed 15 min on a single tool call (Xcode Archive, large image generation, dataset processing) are silently re-spawned: from the user's perspective they appear to "loop", and the first run's progress is discarded. This is particularly painful when combined with --max-runtime, since the per-task wall-clock budget is consumed by the reclaimed first run, leaving the re-spawn with less budget than expected.
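As a rough worked example of the budget effect, assume each reclaimed run burns exactly one claim TTL of the per-task budget (in practice dispatcher tick latency would add a little). The helper name is hypothetical.

```python
CLAIM_TTL_S = 15 * 60  # stands in for DEFAULT_CLAIM_TTL_SECONDS


def budget_left_after_reclaims(max_runtime_s, reclaims, claim_ttl_s=CLAIM_TTL_S):
    """Wall-clock seconds left for the next run after `reclaims` TTL-length runs."""
    return max(0, max_runtime_s - reclaims * claim_ttl_s)


# A 40-minute --max-runtime leaves the first re-spawn with 25 minutes.
left = budget_left_after_reclaims(40 * 60, reclaims=1)  # 1500
```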

Related

tools/kanban_tools.py:317-348 - bug site
hermes_cli/kanban_db.py:1817-1844 - heartbeat_claim
hermes_cli/kanban_db.py:2641-2691 - heartbeat_worker
hermes_cli/kanban_db.py:1846+ - release_stale_claims (the function that reclaims)
hermes_cli/kanban_db.py:3293 - dispatcher sets HERMES_KANBAN_CLAIM_LOCK in worker env

I'm happy to follow up with a PR if useful.
