hermes - ✅(Solved) Fix recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns [3 pull requests, 1 comment, 2 participants]

GitHub stats

  • Issue: NousResearch/hermes-agent#23025 (fetched 2026-05-11 03:31:39)
  • Comments: 1
  • Participants: 2
  • Timeline events: 12 (top: referenced ×4, cross-referenced ×3, labeled ×3, closed ×1)
  • Reactions: 0

Error Message

  1. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock=gregs-MacBook-Pro.local:68653)
  2. Log why lock is considered stale: The stale_lock error message should include the actual TTL value and the timestamp of the last heartbeat, so we can diagnose whether it's a timing issue or a logic bug.

Fix Action

Fix / Workaround

  1. Worker spawns (status: running)
  2. Worker produces no output — workspace stays empty, no files written
  3. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock=gregs-MacBook-Pro.local:68653)
  4. Dispatcher immediately respawns a new worker
  5. New worker immediately hits the same pattern
  6. This repeats 3-5 times with zero progress before the task either completes or is abandoned

Constant recurring problem. Every significant task on kimicoder hits this stall loop. Wastes compute, blocks progress, makes Kanban unreliable. Current workaround: manually reclaim + nudge repeatedly, but pattern always recurs.

PR fix notes

PR #23071: fix(kanban): extend stale claim instead of killing live worker

Description (problem / solution / changelog)

Stop reclaiming kanban tasks whose worker subprocess is still alive (#23025).

What changed and why

  • release_stale_claims now skips reclaim when the host-local worker_pid is alive, extending claim_expires by DEFAULT_CLAIM_TTL_SECONDS and emitting a new claim_extended event. Slow models (kimi-k2.6 in the report) can spend longer than the 15-min TTL inside a single tool-free LLM call, so kanban_heartbeat never fires; the previous behavior killed those healthy workers and respawned new ones that hit the same trap, producing the empty-workspace stall loop the reporter described.
  • enforce_max_runtime and detect_crashed_workers remain the upper bounds for genuinely wedged or dead workers — neither is touched here.
  • reclaimed events now carry claim_expires, last_heartbeat_at, worker_pid, host_local, and now, so operators can tell at a glance whether a kill was timing-driven or a worker that genuinely went away.
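The branch described above can be sketched as a pure decision function. This is a minimal illustration, not the code from hermes_cli/kanban_db.py: the Claim dataclass, the resolve_expired_claim name, and the injected pid_alive callable are hypothetical, and the 900-second constant is assumed from the "15-min TTL" wording. Only the event names and payload keys come from the PR description.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed value: the PR only states that the default claim TTL is 15 minutes.
DEFAULT_CLAIM_TTL_SECONDS = 15 * 60


@dataclass
class Claim:
    """Hypothetical in-memory stand-in for a running task row."""
    task_id: str
    claim_expires: float
    worker_pid: Optional[int]
    host_local: bool
    last_heartbeat_at: Optional[float] = None


def resolve_expired_claim(claim: Claim, now: float,
                          pid_alive: Callable[[int], bool]) -> dict:
    """Return the event a stale-claim sweep would emit for one claim."""
    if claim.claim_expires > now:
        return {"kind": "noop"}  # claim has not expired yet

    # Live host-local worker: extend the claim instead of killing the worker.
    if (claim.host_local and claim.worker_pid is not None
            and pid_alive(claim.worker_pid)):
        claim.claim_expires = now + DEFAULT_CLAIM_TTL_SECONDS
        return {"kind": "claim_extended", "claim_expires": claim.claim_expires}

    # Dead or non-host-local worker: reclaim, recording the diagnostic payload.
    return {
        "kind": "reclaimed",
        "claim_expires": claim.claim_expires,
        "last_heartbeat_at": claim.last_heartbeat_at,
        "worker_pid": claim.worker_pid,
        "host_local": claim.host_local,
        "now": now,
    }
```

Passing pid_alive in as a parameter mirrors how the PR's tests appear to simulate live and dead PIDs without spawning real processes.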

How to test

  • pytest tests/hermes_cli/test_kanban_db.py tests/tools/test_kanban_tools.py tests/stress/test_concurrency_reclaim_race.py -q --timeout=60 (112 passed locally).
  • New tests: test_stale_claim_with_live_pid_extends_instead_of_reclaiming (live PID → claim extended, no SIGTERM, claim_extended event emitted) and test_stale_claim_reclaim_event_records_diagnostic_payload (dead PID → reclaim event records expiry + heartbeat).
  • Existing test_stale_claim_reclaimed updated to simulate a dead PID, exercising the path that should still kill + reclaim.

What platforms tested on

  • macOS on darwin-arm64 (local)

Fixes #23025


Changed files

  • hermes_cli/kanban_db.py (modified, +70/-4)
  • tests/hermes_cli/test_kanban_db.py (modified, +85/-5)

PR #23108: fix(kanban): add heartbeat grace window to prevent reclaiming slow workers (#23025)

Description (problem / solution / changelog)

Bug Description

Kanban workers on the default board repeatedly entered a stall loop:

  1. Worker spawns (status: running)
  2. Worker produces no output
  3. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock)
  4. Dispatcher immediately respawns a new worker
  5. New worker immediately hits the same pattern

This was particularly bad for slow models like kimi-k2.6.

Root Cause

release_stale_claims() reclaimed any running task whose claim_expires had passed, regardless of whether the worker had heartbeated recently. The default claim TTL is 15 minutes, but slow models can take longer between heartbeats.

Fix

Add a 5-minute heartbeat grace window to release_stale_claims():

  • If a worker heartbeated within the last 5 minutes, skip reclaim even if claim_expires has passed
  • Workers that are truly stuck (no heartbeat for >5 min) are still reclaimed
  • This gives slow-model workers extra time while still catching dead workers
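The grace-window rule in this PR reduces to a small predicate. The sketch below is an illustration only: the function name, the constant name, and the use of epoch seconds are assumptions, and the real change lives inside release_stale_claims().

```python
from typing import Optional

# Assumed name; the PR states a 5-minute heartbeat grace window.
HEARTBEAT_GRACE_SECONDS = 5 * 60


def should_reclaim(claim_expires: float,
                   last_heartbeat_at: Optional[float],
                   now: float,
                   grace: float = HEARTBEAT_GRACE_SECONDS) -> bool:
    """Decide whether an expired claim should be reclaimed."""
    if claim_expires > now:
        return False  # claim still valid, nothing to do
    if last_heartbeat_at is not None and now - last_heartbeat_at <= grace:
        return False  # recent heartbeat: skip reclaim despite expiry
    return True  # no heartbeat at all, or the last one is too old
```

Note that this rule only helps workers that heartbeat at least occasionally. A worker inside a single tool-free LLM call that never heartbeats would still be reclaimed, which is the case the live-PID approach in #23071 and #23442 targets.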

Tests

Added tests/hermes_cli/test_kanban_stale_claim_grace_regression_23025.py with 3 tests:

  • Does NOT reclaim when worker heartbeated 2 minutes ago
  • DOES reclaim when worker last heartbeated 10 minutes ago
  • DOES reclaim when worker never heartbeated

All tests pass.

Fixes #23025

Changed files

  • hermes_cli/kanban_db.py (modified, +10/-1)
  • tests/hermes_cli/test_kanban_stale_claim_grace_regression_23025.py (added, +230/-0)

PR #23442: fix(kanban): extend stale claim instead of killing live worker (salvage #23071)

Description (problem / solution / changelog)

Summary

Stops the kanban dispatcher from killing healthy workers that are slow. Workers running slow models (kimi-k2.6 was the reported case) can spend longer than the 15-min DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call — they make no tool calls, so they don't heartbeat, so the dispatcher used to mark the claim stale and SIGTERM the worker mid-flight. The respawned worker hit the same trap, producing the empty-workspace stall loop reported in #23025.

How

release_stale_claims() now checks if the worker's host-local PID is alive before reclaiming. If alive: extend the claim by another DEFAULT_CLAIM_TTL_SECONDS and emit a claim_extended event. If dead (or non-host-local): reclaim as before.

Upper bounds are unchanged:

  • enforce_max_runtime still hard-caps task runtime per the max_runtime_seconds column (catches genuinely-stuck-but-PID-alive workers — deadlocks, infinite loops).
  • detect_crashed_workers still reaps workers whose PID has vanished.

The host-local check (lock.startswith(host_prefix) from _claimer_id().split(":", 1)[0]) means we only trust _pid_alive when the lock was set by THIS host. Cross-host claims (rare; happens if you migrate the kanban DB between machines) fall through to the normal reclaim path because we can't safely interpret a PID number from a different host.
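A minimal sketch of the host-local check and the liveness probe follows. The "hostname:pid" lock format is inferred from the reported stale_lock=gregs-MacBook-Pro.local:68653 value; the function bodies, the trailing ":" on the prefix, and the use of os.kill(pid, 0) are assumptions rather than the project's actual helpers. The probe is POSIX-only, since on Windows os.kill with signal 0 does not act as a harmless existence check.

```python
import os
import socket


def claimer_id() -> str:
    """Assumed lock format: '<hostname>:<pid>'."""
    return f"{socket.gethostname()}:{os.getpid()}"


def is_host_local(lock: str, this_claimer: str) -> bool:
    """True when the lock was written by a process on this host."""
    # The trailing ':' is this sketch's choice, so that host 'a' does not
    # match a lock owned by host 'ab'.
    host_prefix = this_claimer.split(":", 1)[0] + ":"
    return lock.startswith(host_prefix)


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True
```

A PID from another machine means nothing locally, and could even collide with an unrelated local process, which is why the probe is only trusted after is_host_local passes.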

Changes

  • hermes_cli/kanban_db.py::release_stale_claims: add the live-PID extension branch with a CAS-guarded UPDATE, run-row sync, and claim_extended event. The reclaim path's payload is also enriched with claim_expires, last_heartbeat_at, worker_pid, host_local, and now so operators can tell from task_events whether a kill was timing-driven or a genuinely-dead worker.
  • tests/hermes_cli/test_kanban_db.py: 2 new tests (test_stale_claim_with_live_pid_extends_instead_of_reclaiming + test_stale_claim_reclaim_event_records_diagnostic_payload) and the existing test_stale_claim_reclaimed flipped to _pid_alive=False so it exercises the still-correct dead-PID reclaim path.
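The "CAS-guarded UPDATE" mentioned above is a compare-and-swap expressed in SQL: the row is only updated if it still holds the values the sweeper observed. The sketch below shows the idea with the standard sqlite3 module. The table layout, column names, and function names are hypothetical and do not reflect the real kanban schema, and the run-row sync step is omitted.

```python
import json
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tasks (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            claim_lock TEXT,
            claim_expires REAL
        );
        CREATE TABLE task_events (
            task_id TEXT NOT NULL,
            kind TEXT NOT NULL,
            payload TEXT NOT NULL
        );
    """)
    return conn


def extend_claim_cas(conn: sqlite3.Connection, task_id: str, seen_lock: str,
                     seen_expires: float, now: float, ttl: float) -> bool:
    """Extend the claim only if the row still matches what we observed."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE tasks SET claim_expires = ? "
            "WHERE id = ? AND status = 'running' "
            "AND claim_lock = ? AND claim_expires = ?",
            (now + ttl, task_id, seen_lock, seen_expires),
        )
        if cur.rowcount != 1:
            return False  # another actor changed the row first
        conn.execute(
            "INSERT INTO task_events (task_id, kind, payload) VALUES (?, ?, ?)",
            (task_id, "claim_extended",
             json.dumps({"claim_expires": now + ttl, "now": now})),
        )
    return True
```

If a worker heartbeats or completes between the sweeper's read and its write, the WHERE clause no longer matches, the update touches zero rows, and no claim_extended event is recorded.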

Validation

Test scope | Before | After
tests/hermes_cli/test_kanban_db.py (stale-claim/reclaim subset) | 2/2 | 4/4
tests/hermes_cli/test_kanban_db.py + test_kanban_core_functionality.py + tests/tools/test_kanban_tools.py | 288/288 | 290/290

Closes #23025 via salvage. Salvage of #23071. Original commit by @konsisumer cherry-picked with authorship preserved (re-attributed during salvage from [email protected] to the GitHub-noreply form for release-notes credit). AUTHOR_MAP entry added.

Changed files

  • hermes_cli/kanban_db.py (modified, +70/-4)
  • scripts/release.py (modified, +1/-0)
  • tests/hermes_cli/test_kanban_db.py (modified, +85/-5)

Code Example

2026-05-10 12:53:33 spawned=1 reclaimed=1
2026-05-10 13:08:35 spawned=1 reclaimed=1
2026-05-10 13:23:37 spawned=1 reclaimed=1
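The spacing of these gateway log lines can be checked with a few lines of Python. This is a small sketch for reading the evidence, and the helper name is made up for illustration. It shows that consecutive reclaims are 902 seconds apart, which is consistent with a 900-second (15-minute) claim TTL plus a couple of seconds of delay whose source the log does not show.

```python
from datetime import datetime

LOG = """\
2026-05-10 12:53:33 spawned=1 reclaimed=1
2026-05-10 13:08:35 spawned=1 reclaimed=1
2026-05-10 13:23:37 spawned=1 reclaimed=1
"""


def reclaim_intervals(log_text: str) -> list:
    """Seconds between consecutive log lines that report a reclaim."""
    stamps = [
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in log_text.splitlines()
        if "reclaimed=1" in line
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


print(reclaim_intervals(LOG))  # [902.0, 902.0]
```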
Original issue report

Bug Description

Kanban workers on the default board repeatedly enter a stall loop:

  1. Worker spawns (status: running)
  2. Worker produces no output — workspace stays empty, no files written
  3. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock=gregs-MacBook-Pro.local:68653)
  4. Dispatcher immediately respawns a new worker
  5. New worker immediately hits the same pattern
  6. This repeats 3-5 times with zero progress before the task either completes or is abandoned

Affected tasks: t_805fc503, t_57dabea4 (both on kimicoder / kimi-k2.6)

Profile config (kimicoder):

  • max_turns: 200
  • gateway_timeout: 1800
  • terminal.backend: local
  • terminal.timeout: 180

Evidence

Run history for t_805fc503:

  • Run 81: reclaimed (stale_lock)
  • Run 114: reclaimed (stale_lock)
  • Run 115: reclaimed (stale_lock)
  • Run 116: running (current, workspace empty after 10+ min)

Gateway log cycling:

2026-05-10 12:53:33 spawned=1 reclaimed=1
2026-05-10 13:08:35 spawned=1 reclaimed=1
2026-05-10 13:23:37 spawned=1 reclaimed=1

Questions / Investigation needed

  1. What is "stale_lock" actually checking? The lock appears stale even when the worker is actively running (PID alive, model calls being made). Is the lock TTL shorter than the time between gateway heartbeats?

  2. Why does a live worker get marked stale? If the worker is alive and processing, the lock shouldn't be considered stale. Is there a race condition where the gateway marks a lock stale before the worker has a chance to heartbeat?

  3. Why does the respawned worker immediately hit the same stale lock? If the previous worker was killed for being "stale" but the new worker starts fresh, what causes the new worker to also be marked stale within minutes?

  4. Is kimi-k2.6 specifically affected? The pattern consistently shows kimi-k2.6 workers stalling. Could be model-specific (slow token generation causing lock TTL to expire between heartbeats), or could be a generic issue with long-running tasks.

Suggested fixes

  1. Increase lock TTL or make it adaptive: If a worker is actively making model calls (input_tokens > 0 in last heartbeat), extend the lock TTL dynamically.

  2. Add stall detection before lock expiry: If a task has been running for >X minutes with zero tool calls or zero output, trigger a mid-flight warning rather than waiting for the lock to expire.

  3. Log why lock is considered stale: The stale_lock error message should include the actual TTL value and the timestamp of the last heartbeat, so we can diagnose whether it's a timing issue or a logic bug.

  4. Consider: if worker_pid is alive and responsive, don't reclaim. The current logic seems to reclaim based on lock file age alone, not actual worker health.
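Suggested fix 3 could look roughly like the following. This is a hedged sketch of one possible message format: the function name and the exact layout are invented for illustration, and the merged fix instead records these values in the reclaimed event payload rather than in the error string.

```python
from datetime import datetime, timezone
from typing import Optional


def describe_stale_lock(lock: str, ttl_seconds: int,
                        last_heartbeat_at: Optional[float],
                        now: float) -> str:
    """Build a stale_lock message that carries the TTL and heartbeat age."""
    if last_heartbeat_at is None:
        heartbeat = "last_heartbeat=never"
    else:
        stamp = datetime.fromtimestamp(last_heartbeat_at,
                                       tz=timezone.utc).isoformat()
        age = now - last_heartbeat_at
        heartbeat = f"last_heartbeat={stamp} ({age:.0f}s ago)"
    return f"stale_lock={lock} ttl={ttl_seconds}s {heartbeat}"
```

With the TTL and heartbeat age in the message, a reader can tell whether the worker never heartbeated at all or simply heartbeated less often than the TTL allows.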

Impact

Constant recurring problem. Every significant task on kimicoder hits this stall loop. Wastes compute, blocks progress, makes Kanban unreliable. Current workaround: manually reclaim + nudge repeatedly, but pattern always recurs.
