hermes - Kanban worker runtime activity does not update board heartbeat, causing stale reclaim of active workers

Kanban worker activity does not update board heartbeat, causing stale reclaim of active workers

Summary

Dispatcher-spawned Kanban workers can remain active in the agent runtime while the Kanban board still shows tasks.last_heartbeat_at = NULL and task_runs.last_heartbeat_at = NULL. The dispatcher watchdog reads the board heartbeat fields, not the agent's in-process activity timestamp, so long-running active workers can be reclaimed and respawned as stale.

Local fix candidate: bridge AIAgent._touch_activity() to a non-tool Kanban heartbeat helper when HERMES_KANBAN_TASK is set. The bridge updates the board heartbeat and claim TTL, is rate-limited to one write per 60 seconds, does not persist activity descriptions, and is non-fatal on DB errors.

Observed failure mode

  • Workers were actively executing probe work.
  • Board liveness stayed null: tasks.last_heartbeat_at / task_runs.last_heartbeat_at were not updated unless the model explicitly called kanban_heartbeat.
  • detect_stale_running() saw a long-running task with null heartbeat and reclaimed it.
  • The task was returned to ready and could be re-spawned, causing worker context loss.

This is distinct from task_events.kind='heartbeat' rows with payload=NULL. A null event payload is expected when a heartbeat carries no note. The defect here is that the board liveness fields stay null while the worker is active.

Root cause

Hermes currently has two separate liveness signals:

  1. Agent/runtime activity: AIAgent._touch_activity(desc) updates in-process activity fields.
  2. Kanban watchdog liveness: tasks.last_heartbeat_at and task_runs.last_heartbeat_at.

If the model does not explicitly call the kanban_heartbeat tool, ordinary runtime activity does not reach the Kanban DB. The watchdog then acts correctly on incomplete liveness data and reclaims an active worker.
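The gap between the two signals can be illustrated with a minimal sketch. This is not Hermes code: BoardRow, AgentState, looks_stale, and the 900-second timeout are hypothetical stand-ins, and the fallback to the start time when the heartbeat is null is an assumption about how a watchdog would treat a missing heartbeat.

```python
from dataclasses import dataclass
from typing import Optional

STALE_TIMEOUT_S = 900  # hypothetical threshold, not taken from Hermes


@dataclass
class BoardRow:
    """Simplified stand-in for a running task row on the Kanban board."""
    started_at: float
    last_heartbeat_at: Optional[float] = None  # NULL until a heartbeat is written


@dataclass
class AgentState:
    """Simplified stand-in for the agent's in-process activity fields."""
    last_activity_at: float = 0.0

    def touch_activity(self, now: float) -> None:
        # Before the fix: only in-process state changes; the board is never told.
        self.last_activity_at = now


def looks_stale(row: BoardRow, now: float, timeout: float = STALE_TIMEOUT_S) -> bool:
    """Watchdog view: only board fields are consulted, never AgentState."""
    # Assumption: a NULL heartbeat falls back to the task start time.
    last_seen = row.last_heartbeat_at if row.last_heartbeat_at is not None else row.started_at
    return now - last_seen > timeout


row = BoardRow(started_at=0.0)
agent = AgentState()
agent.touch_activity(now=1000.0)             # worker was active 5 seconds before the check
stale_before = looks_stale(row, now=1005.0)  # True: board heartbeat is still NULL
row.last_heartbeat_at = 1000.0               # what a heartbeat write would record
stale_after = looks_stale(row, now=1005.0)   # False: the board now reflects the activity
```

The watchdog logic is the same in both calls; only the data it can see changes, which is why the fix targets the worker side rather than reclaim semantics.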

Local patch shape

Files changed locally:

  • run_agent.py

    • AIAgent._touch_activity() now calls a best-effort Kanban heartbeat bridge when HERMES_KANBAN_TASK is set.
    • Write rate limit: 60 seconds minimum between auto-heartbeat DB writes.
    • Exceptions are swallowed and logged at debug level.
    • Runtime activity descriptions are not written to durable task events.
  • tools/kanban_tools.py

    • Adds heartbeat_current_worker_from_env() helper.
    • Uses worker env identity:
      • HERMES_KANBAN_TASK
      • HERMES_KANBAN_RUN_ID
      • HERMES_KANBAN_CLAIM_LOCK
    • Calls heartbeat_claim() and heartbeat_worker().

Explicit kanban_heartbeat remains unchanged and remains the correct path for worker-provided human-readable heartbeat notes.
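A rough sketch of the bridge behaviour described above, for readers who want the shape without the diff. ActivityHeartbeatBridge, on_activity, and the injected write_heartbeat callable are hypothetical names; in the actual patch the write would go through heartbeat_current_worker_from_env(), whose signature is not shown in this issue. Counting a failed write toward the rate limit is a choice made for this sketch, not something the issue states.

```python
import logging
import os
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)

AUTO_HEARTBEAT_MIN_INTERVAL_S = 60.0  # the 60-second limit described in the issue


class ActivityHeartbeatBridge:
    """Hypothetical bridge; not the actual Hermes implementation."""

    def __init__(
        self,
        write_heartbeat: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
        min_interval: float = AUTO_HEARTBEAT_MIN_INTERVAL_S,
    ) -> None:
        self._write_heartbeat = write_heartbeat
        self._clock = clock
        self._min_interval = min_interval
        self._last_attempt: Optional[float] = None

    def on_activity(self, desc: str = "") -> bool:
        """Return True if a heartbeat write was attempted for this activity."""
        if not os.environ.get("HERMES_KANBAN_TASK"):
            return False  # not a Kanban worker: never touch the board
        now = self._clock()
        if self._last_attempt is not None and now - self._last_attempt < self._min_interval:
            return False  # rate-limited
        # Sketch-level choice: a failed write still counts toward the limit,
        # so a broken DB does not trigger a retry on every activity tick.
        self._last_attempt = now
        try:
            self._write_heartbeat()  # desc is deliberately not forwarded
        except Exception:
            logger.debug("auto-heartbeat failed; continuing", exc_info=True)
        return True
```

Injecting the clock and the write callable keeps the rate limit and the failure path testable without a real database or real sleeps.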

Tests added locally

New focused tests in tests/run_agent/test_kanban_auto_heartbeat.py prove:

  1. _touch_activity() in a Kanban worker sets tasks.last_heartbeat_at and task_runs.last_heartbeat_at.
  2. _touch_activity() outside a Kanban worker does not connect to or mutate Kanban.
  3. Auto-heartbeat is rate-limited to prevent write churn.
  4. Auto-heartbeat extends claim_expires through the claim heartbeat path.
  5. A long-running task that is older than the stale timeout but was recently auto-heartbeated is not reclaimed by detect_stale_running().
  6. Heartbeat bridge failures are non-fatal to _touch_activity().
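The database-side effect that tests 1 and 4 check can be sketched against an in-memory SQLite database. The schema below is a simplified assumption (only last_heartbeat_at, claim_expires, and the running status are named in this issue; the claim_lock column, CLAIM_TTL_S, and auto_heartbeat are hypothetical), and it does not reproduce heartbeat_claim() or heartbeat_worker().

```python
import sqlite3

CLAIM_TTL_S = 300  # hypothetical claim lifetime, not taken from Hermes

# Simplified, assumed schema: the real Kanban tables have more columns.
SCHEMA = """
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    claim_lock TEXT,
    claim_expires INTEGER,
    last_heartbeat_at INTEGER
);
CREATE TABLE task_runs (
    id INTEGER PRIMARY KEY,
    task_id TEXT NOT NULL,
    last_heartbeat_at INTEGER
);
"""


def auto_heartbeat(conn, task_id, run_id, claim_lock, now):
    """Stamp both heartbeat columns and extend the claim, if we still hold it."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE tasks SET last_heartbeat_at = ?, claim_expires = ? "
            "WHERE id = ? AND status = 'running' AND claim_lock = ?",
            (now, now + CLAIM_TTL_S, task_id, claim_lock),
        )
        if cur.rowcount == 0:
            return False  # claim lost or task no longer running
        conn.execute(
            "UPDATE task_runs SET last_heartbeat_at = ? WHERE id = ? AND task_id = ?",
            (now, run_id, task_id),
        )
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO tasks VALUES ('t_demo', 'running', 'lock-a', 100, NULL)")
conn.execute("INSERT INTO task_runs VALUES (1, 't_demo', NULL)")
conn.commit()

ok = auto_heartbeat(conn, "t_demo", 1, "lock-a", now=1000)      # holder of the claim
stolen = auto_heartbeat(conn, "t_demo", 1, "lock-b", now=2000)  # wrong lock: no write
task_row = conn.execute("SELECT last_heartbeat_at, claim_expires FROM tasks").fetchone()
run_row = conn.execute("SELECT last_heartbeat_at FROM task_runs").fetchone()
```

Guarding the update on the claim lock means a worker that has already been reclaimed cannot refresh a claim it no longer owns.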

Local validation:

venv/bin/python -m pytest tests/run_agent/test_kanban_auto_heartbeat.py -q
6 passed

venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py -q
172 passed

venv/bin/python -m pytest tests/run_agent/test_run_agent.py -q
339 passed

venv/bin/python -m pytest tests/tools/test_kanban_tools.py -q
81 passed

Local live validation

Mission Control was restarted to load the patched runtime. A synthetic live Kanban task validated that runtime activity without explicit kanban_heartbeat set both task and run heartbeat timestamps, did not persist activity text in the heartbeat event payload, and was not reclaimed by detect_stale_running().

Validation task: t_f887eedb
Validation run: 184

Result:

{
  "heartbeat_event_count": 1,
  "heartbeat_payload_contains_activity_text": false,
  "reclaimed": [],
  "run_id": 184,
  "run_last_heartbeat_at": 1779670296,
  "task_id": "t_f887eedb",
  "task_last_heartbeat_at": 1779670296,
  "task_status": "running"
}

Relation to prior local fixes

Related local context:

  • DEC-024: WAL-safe Kanban DB connection close reduced FD/WAL leakage risk.
  • Phase 2 dispatcher persistent connection refactor reduced per-tick dispatcher SQLite connection churn and was filed as #31736, later linked to #29610.

This heartbeat fix is independent of those dispatcher/WAL fixes. It adds rate-limited worker-side writes and does not change dispatcher connection caching or WAL-safe close behavior.

Local commit:

82e66d75a09ca1e83f6e1b7cb88934f04385245a fix(kanban): bridge worker activity to heartbeat

Why this should be upstreamable

The fix is not site-specific:

  • It uses existing worker env variables.
  • It uses existing DB primitives (heartbeat_claim, heartbeat_worker).
  • It preserves the explicit heartbeat tool.
  • It avoids durable storage of runtime activity text.
  • It rate-limits writes to avoid WAL churn.
  • It leaves watchdog reclaim semantics intact.

Suggested acceptance criteria

  • Dispatcher-spawned workers update board heartbeat timestamps during normal runtime activity, even if the model never explicitly calls kanban_heartbeat.
  • Auto-heartbeat writes are rate-limited.
  • Auto-heartbeat failure cannot crash worker execution.
  • Activity descriptions are not persisted as heartbeat event payloads.
  • Existing explicit kanban_heartbeat behavior remains unchanged.
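As an illustration, several of these criteria can be checked mechanically against a validation result shaped like the JSON reported under "Local live validation". check_validation_result is a hypothetical helper written for this sketch; only the field names and values come from the reported result.

```python
import json
from typing import List


def check_validation_result(result: dict) -> List[str]:
    """Return a list of failed criteria; an empty list means all checks passed."""
    failures = []
    if result.get("task_last_heartbeat_at") is None:
        failures.append("tasks.last_heartbeat_at is null")
    if result.get("run_last_heartbeat_at") is None:
        failures.append("task_runs.last_heartbeat_at is null")
    if result.get("heartbeat_payload_contains_activity_text"):
        failures.append("activity text was persisted in a heartbeat payload")
    if result.get("reclaimed"):
        failures.append("task was reclaimed as stale")
    if result.get("task_status") != "running":
        failures.append("task is no longer running")
    return failures


# The result reported for validation task t_f887eedb, run 184.
reported = json.loads("""
{
  "heartbeat_event_count": 1,
  "heartbeat_payload_contains_activity_text": false,
  "reclaimed": [],
  "run_id": 184,
  "run_last_heartbeat_at": 1779670296,
  "task_id": "t_f887eedb",
  "task_last_heartbeat_at": 1779670296,
  "task_status": "running"
}
""")
failures = check_validation_result(reported)
```

The rate-limit and non-fatal-failure criteria are not visible in a single result snapshot and still need the focused tests listed earlier.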
