hermes - ✅(Solved) Fix recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns [3 pull requests, 1 comment, 2 participants]

hermes · 2026-05-10 06:35:49

GitHub stats

NousResearch/hermes-agent#23025 • Fetched 2026-05-11 03:31:39

Author

fwends

Participants

fwends

konsisumer

Timeline (top)

referenced ×4, cross-referenced ×3, labeled ×3, closed ×1

Error Message

Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock=gregs-MacBook-Pro.local:68653)

Fix / Workaround

Observed stall loop:

1. Worker spawns (status: running)
2. Worker produces no output: workspace stays empty, no files written
3. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock=gregs-MacBook-Pro.local:68653)
4. Dispatcher immediately respawns a new worker
5. New worker immediately hits the same pattern
6. This repeats 3-5 times with zero progress before the task either completes or is abandoned

This is a constant, recurring problem: every significant task on kimicoder hits this stall loop. It wastes compute, blocks progress, and makes Kanban unreliable. Current workaround: manually reclaim and nudge the task repeatedly, but the pattern always recurs.

PR fix notes

PR #23071: fix(kanban): extend stale claim instead of killing live worker

Repository: NousResearch/hermes-agent
Author: konsisumer
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/23071

Description (problem / solution / changelog)

Stop reclaiming kanban tasks whose worker subprocess is still alive (#23025).

What changed and why

release_stale_claims now skips reclaim when the host-local worker_pid is alive, extending claim_expires by DEFAULT_CLAIM_TTL_SECONDS and emitting a new claim_extended event. Slow models (kimi-k2.6 in the report) can spend longer than the 15-min TTL inside a single tool-free LLM call, so kanban_heartbeat never fires; the previous behavior killed those healthy workers and respawned new ones that hit the same trap, producing the empty-workspace stall loop the reporter described.
enforce_max_runtime and detect_crashed_workers remain the upper bounds for genuinely wedged or dead workers — neither is touched here.
reclaimed events now carry claim_expires, last_heartbeat_at, worker_pid, host_local, and now, so operators can tell at a glance whether a kill was timing-driven or a worker that genuinely went away.
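The decision described above can be sketched as a small pure function. This is a minimal illustration, not the code from `hermes_cli/kanban_db.py`: `pid_alive` and `decide_stale_claim` are hypothetical names, the probe is POSIX-only, and measuring the extension from `now` (rather than from the old `claim_expires`) is an assumption.

```python
import os

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # the 15-minute TTL mentioned in the PR


def pid_alive(pid):
    """POSIX-only liveness probe; signal 0 only performs the existence check."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def decide_stale_claim(now, claim_expires, worker_pid, host_local, alive=pid_alive):
    """Return (action, new_claim_expires) for one running task (hypothetical helper)."""
    if now < claim_expires:
        return ("keep", claim_expires)
    if host_local and alive(worker_pid):
        # Assumption: the extension is measured from `now`.
        return ("extend", now + DEFAULT_CLAIM_TTL_SECONDS)
    return ("reclaim", None)
```

Passing `alive` as a parameter keeps the decision testable without spawning real processes, which is roughly what the PR's tests achieve by simulating a dead PID.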

How to test

pytest tests/hermes_cli/test_kanban_db.py tests/tools/test_kanban_tools.py tests/stress/test_concurrency_reclaim_race.py -q --timeout=60 (112 passed locally).
New tests: test_stale_claim_with_live_pid_extends_instead_of_reclaiming (live PID → claim extended, no SIGTERM, claim_extended event emitted) and test_stale_claim_reclaim_event_records_diagnostic_payload (dead PID → reclaim event records expiry + heartbeat).
Existing test_stale_claim_reclaimed updated to simulate a dead PID, exercising the path that should still kill + reclaim.

What platforms tested on

macOS on darwin-arm64 (local)

Fixes #23025

Changed files

hermes_cli/kanban_db.py (modified, +70/-4)
tests/hermes_cli/test_kanban_db.py (modified, +85/-5)

PR #23108: fix(kanban): add heartbeat grace window to prevent reclaiming slow workers (#23025)

Repository: NousResearch/hermes-agent
Author: KhanCold
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/23108

Description (problem / solution / changelog)

Bug Description

Kanban workers on the default board repeatedly entered a stall loop:

1. Worker spawns (status: running)
2. Worker produces no output
3. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock)
4. Dispatcher immediately respawns a new worker
5. New worker immediately hits the same pattern

This was particularly bad for slow models like kimi-k2.6.

Root Cause

release_stale_claims() reclaimed any running task whose claim_expires had passed, regardless of whether the worker had heartbeated recently. The default claim TTL is 15 minutes, but slow models can take longer between heartbeats.

Fix

Add a 5-minute heartbeat grace window to release_stale_claims():

If a worker heartbeated within the last 5 minutes, skip reclaim even if claim_expires has passed
Workers that are truly stuck (no heartbeat for >5 min) are still reclaimed
This gives slow-model workers extra time while still catching dead workers
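The grace-window rule can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `should_reclaim` and `HEARTBEAT_GRACE_SECONDS` are hypothetical names, and timestamps are assumed to be integer epoch seconds.

```python
HEARTBEAT_GRACE_SECONDS = 5 * 60  # the 5-minute grace window from the PR description


def should_reclaim(now, claim_expires, last_heartbeat_at):
    """Decide whether an expired claim should be reclaimed (hypothetical helper)."""
    if now < claim_expires:
        return False  # claim has not expired yet
    if last_heartbeat_at is None:
        return True   # worker never heartbeated
    # Expired claim: reclaim only if the last heartbeat is older than the grace window.
    return (now - last_heartbeat_at) > HEARTBEAT_GRACE_SECONDS


now = 10_000
print(should_reclaim(now, now - 1, now - 2 * 60))   # False: heartbeat 2 minutes ago
print(should_reclaim(now, now - 1, now - 10 * 60))  # True: heartbeat 10 minutes ago
print(should_reclaim(now, now - 1, None))           # True: never heartbeated
```

Under this rule a worker that never heartbeats is still reclaimed, so a worker that stays inside one tool-free LLM call for the whole TTL would not be protected by the grace window alone.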

Tests

Added tests/hermes_cli/test_kanban_stale_claim_grace_regression_23025.py with 3 tests:

Does NOT reclaim when worker heartbeated 2 minutes ago
DOES reclaim when worker last heartbeated 10 minutes ago
DOES reclaim when worker never heartbeated

All tests pass.

Fixes #23025

Changed files

hermes_cli/kanban_db.py (modified, +10/-1)
tests/hermes_cli/test_kanban_stale_claim_grace_regression_23025.py (added, +230/-0)

PR #23442: fix(kanban): extend stale claim instead of killing live worker (salvage #23071)

Repository: NousResearch/hermes-agent
Author: teknium1
State: closed | merged: True
Link: https://github.com/NousResearch/hermes-agent/pull/23442

Description (problem / solution / changelog)

Summary

Stops the kanban dispatcher from killing healthy workers that are slow. Workers running slow models (kimi-k2.6 was the reported case) can spend longer than the 15-min DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call — they make no tool calls, so they don't heartbeat, so the dispatcher used to mark the claim stale and SIGTERM the worker mid-flight. The respawned worker hit the same trap, producing the empty-workspace stall loop reported in #23025.

How

release_stale_claims() now checks if the worker's host-local PID is alive before reclaiming. If alive: extend the claim by another DEFAULT_CLAIM_TTL_SECONDS and emit a claim_extended event. If dead (or non-host-local): reclaim as before.

Upper bounds are unchanged:

enforce_max_runtime still hard-caps task runtime per the max_runtime_seconds column (catches genuinely-stuck-but-PID-alive workers — deadlocks, infinite loops).
detect_crashed_workers still reaps workers whose PID has vanished.

The host-local check (lock.startswith(host_prefix) from _claimer_id().split(":", 1)[0]) means we only trust _pid_alive when the lock was set by THIS host. Cross-host claims (rare; happens if you migrate the kanban DB between machines) fall through to the normal reclaim path because we can't safely interpret a PID number from a different host.
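A rough sketch of the host-local check, assuming the lock string has the `hostname:pid` shape seen in the reported error (`gregs-MacBook-Pro.local:68653`). The helper names are hypothetical, and appending `":"` to the prefix is an assumption made here so that one hostname cannot match another that merely starts with it.

```python
import os
import socket


def claimer_id():
    """Hypothetical stand-in for _claimer_id(): '<hostname>:<pid>' of this process."""
    return f"{socket.gethostname()}:{os.getpid()}"


def is_host_local(lock, claimer):
    """True when the lock was set by the same host as `claimer`."""
    host_prefix = claimer.split(":", 1)[0] + ":"  # assumption: include the separator
    return lock.startswith(host_prefix)


def lock_pid(lock):
    """Extract the PID part of a 'hostname:pid' lock string."""
    return int(lock.rsplit(":", 1)[1])


lock = "gregs-MacBook-Pro.local:68653"
print(is_host_local(lock, "gregs-MacBook-Pro.local:123"))  # True
print(is_host_local(lock, "other-host:123"))               # False
print(lock_pid(lock))                                      # 68653
```

Only when `is_host_local` is true does it make sense to probe the PID, since a PID number from another machine says nothing about processes on this one.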

Changes

hermes_cli/kanban_db.py::release_stale_claims: add the live-PID extension branch with a CAS-guarded UPDATE, run-row sync, and claim_extended event. The reclaim path's payload is also enriched with claim_expires, last_heartbeat_at, worker_pid, host_local, and now so operators can tell from task_events whether a kill was timing-driven or a genuinely-dead worker.
tests/hermes_cli/test_kanban_db.py: 2 new tests (test_stale_claim_with_live_pid_extends_instead_of_reclaiming + test_stale_claim_reclaim_event_records_diagnostic_payload) and the existing test_stale_claim_reclaimed flipped to _pid_alive=False so it exercises the still-correct dead-PID reclaim path.
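To illustrate what a CAS-guarded UPDATE means here, the sketch below extends a claim only if the row still holds the lock and expiry that were read earlier. It uses `sqlite3` purely to stay self-contained; the `tasks` table, its columns, and `extend_claim` are hypothetical and are not taken from the real kanban schema.

```python
import sqlite3


def extend_claim(conn, task_id, expected_lock, expected_expires, new_expires):
    """Compare-and-swap style extension: succeeds only if the row is unchanged.

    Returns True when exactly one row was updated, False when another actor
    (a heartbeat, a reclaim, a completion) changed the row in the meantime.
    """
    cur = conn.execute(
        "UPDATE tasks SET claim_expires = ? "
        "WHERE id = ? AND status = 'running' "
        "AND claim_lock = ? AND claim_expires = ?",
        (new_expires, task_id, expected_lock, expected_expires),
    )
    conn.commit()
    return cur.rowcount == 1
```

A second call with the same stale `expected_expires` returns False, so two dispatcher passes racing on the same task cannot both extend it.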

Validation

| Test set | Before | After |
| --- | --- | --- |
| `tests/hermes_cli/test_kanban_db.py` (stale-claim/reclaim subset) | 2/2 | 4/4 |
| `tests/hermes_cli/test_kanban_db.py` + `test_kanban_core_functionality.py` + `tests/tools/test_kanban_tools.py` | 288/288 | 290/290 |

Closes #23025 via salvage of #23071. Original commit by @konsisumer cherry-picked with authorship preserved (re-attributed during salvage from the commit's original email address to the GitHub-noreply form for release-notes credit). AUTHOR_MAP entry added.

Changed files

hermes_cli/kanban_db.py (modified, +70/-4)
scripts/release.py (modified, +1/-0)
tests/hermes_cli/test_kanban_db.py (modified, +85/-5)

Bug Description

Kanban workers on the default board repeatedly enter a stall loop:

1. Worker spawns (status: running)
2. Worker produces no output: workspace stays empty, no files written
3. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock=gregs-MacBook-Pro.local:68653)
4. Dispatcher immediately respawns a new worker
5. New worker immediately hits the same pattern
6. This repeats 3-5 times with zero progress before the task either completes or is abandoned

Affected tasks: t_805fc503, t_57dabea4 (both on kimicoder / kimi-k2.6)

Profile config (kimicoder):

max_turns: 200
gateway_timeout: 1800
terminal.backend: local
terminal.timeout: 180

Evidence

Run history for t_805fc503:

Run 81: reclaimed (stale_lock)
Run 114: reclaimed (stale_lock)
Run 115: reclaimed (stale_lock)
Run 116: running (current, workspace empty after 10+ min)

Gateway log cycling:

2026-05-10 12:53:33 spawned=1 reclaimed=1
2026-05-10 13:08:35 spawned=1 reclaimed=1
2026-05-10 13:23:37 spawned=1 reclaimed=1
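A quick way to check whether the cycle lines up with the claim TTL is to compute the gaps between the logged timestamps. The snippet below uses only the three log lines quoted above.

```python
from datetime import datetime

stamps = [
    "2026-05-10 12:53:33",
    "2026-05-10 13:08:35",
    "2026-05-10 13:23:37",
]
times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]

# Seconds between consecutive spawn/reclaim log lines.
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(gaps)  # [902, 902]
```

Both gaps are 902 seconds, about two seconds over 15 minutes, which is consistent with each worker being reclaimed as soon as a 15-minute claim TTL lapses.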

Questions / Investigation needed

1. What is "stale_lock" actually checking? The lock appears stale even when the worker is actively running (PID alive, model calls being made). Is the lock TTL shorter than the time between gateway heartbeats?
2. Why does a live worker get marked stale? If the worker is alive and processing, the lock shouldn't be considered stale. Is there a race condition where the gateway marks a lock stale before the worker has a chance to heartbeat?
3. Why does the respawned worker immediately hit the same stale lock? If the previous worker was killed for being "stale" but the new worker starts fresh, what causes the new worker to also be marked stale within minutes?
4. Is kimi-k2.6 specifically affected? The pattern consistently shows kimi-k2.6 workers stalling. Could be model-specific (slow token generation causing lock TTL to expire between heartbeats), or could be a generic issue with long-running tasks.

Suggested fixes

1. Increase lock TTL or make it adaptive: If a worker is actively making model calls (input_tokens > 0 in last heartbeat), extend the lock TTL dynamically.
2. Add stall detection before lock expiry: If a task has been running for >X minutes with zero tool calls or zero output, trigger a mid-flight warning rather than waiting for the lock to expire.
3. Log why lock is considered stale: The stale_lock error message should include the actual TTL value and the timestamp of the last heartbeat, so we can diagnose whether it's a timing issue or a logic bug.
4. Consider: if worker_pid is alive and responsive, don't reclaim. The current logic seems to reclaim based on lock file age alone, not actual worker health.
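The logging suggestion could look roughly like the sketch below. The five payload keys are the ones the PR descriptions list for reclaimed events; the function names and the message format are hypothetical, and timestamps are assumed to be integer epoch seconds.

```python
def reclaim_payload(now, claim_expires, last_heartbeat_at, worker_pid, host_local):
    """Build a diagnostic payload for a reclaim event (hypothetical helper)."""
    return {
        "claim_expires": claim_expires,
        "last_heartbeat_at": last_heartbeat_at,
        "worker_pid": worker_pid,
        "host_local": host_local,
        "now": now,
    }


def summarize(payload):
    """Render the payload as a one-line, human-readable reason."""
    overdue = payload["now"] - payload["claim_expires"]
    hb = payload["last_heartbeat_at"]
    hb_text = "never" if hb is None else f"{payload['now'] - hb}s ago"
    return (
        f"claim expired {overdue}s ago; last heartbeat {hb_text}; "
        f"pid={payload['worker_pid']} host_local={payload['host_local']}"
    )


print(summarize(reclaim_payload(1000, 900, None, 68653, True)))
# claim expired 100s ago; last heartbeat never; pid=68653 host_local=True
```

With the expiry and the last heartbeat side by side, a reclaim caused by a slow but silent worker is distinguishable from one caused by a worker that disappeared.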

Impact
