hermes - ✅(Solved) Fix kanban dispatcher: macOS zombie detection is a no-op — _pid_alive returns True for defunct workers [2 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#20015 · Fetched 2026-05-06 06:39:14
Comments: 0 · Participants: 1 · Timeline events: 6 · Reactions: 0
Timeline (top): labeled ×3, cross-referenced ×2, closed ×1

_pid_alive() in hermes_cli/kanban_db.py only implements zombie detection on Linux (parsing /proc/<pid>/status for State: Z). On macOS, os.kill(pid, 0) returns success for defunct/zombie processes, so a worker that crashes immediately stays "alive" to the dispatcher until claim_expires times out (~15 min default).

Root Cause

  1. Run the kanban dispatcher on macOS with a ~5 min cadence.
  2. Assign a worker a task that causes an immediate crash (e.g., require a skill it doesn't have, or a missing credential that triggers an unhandled exception at startup).
  3. os.kill(pid, 0) succeeds against the defunct process because the process table entry still exists.
  4. The dispatcher sees the worker as alive and does NOT re-queue the task until claim_expires (~15 min later).
  5. This creates a zombie-respawn loop where the dispatcher tries again every N minutes, gets the same crash, and the task stays stuck until manual SQL intervention.
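The behaviour in step 3 can be reproduced on any POSIX system without the dispatcher. The following is a minimal sketch, not Hermes code: the helper names kill0_succeeds and demo_zombie_probe are hypothetical, and the 0.5 s sleep is an arbitrary grace period for the child to exit. It spawns a child that exits immediately, probes it with os.kill(pid, 0) while it is still unreaped, then reaps it and probes again.

```python
import os
import subprocess
import sys
import time


def kill0_succeeds(pid: int) -> bool:
    """Return True when os.kill(pid, 0) does not report 'no such process'."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def demo_zombie_probe() -> tuple:
    """Spawn a child that exits at once and probe it before and after reaping."""
    child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(1)"])
    time.sleep(0.5)  # give the child time to exit; it is deliberately not reaped yet
    before_reap = kill0_succeeds(child.pid)
    child.wait()  # reaping removes the process table entry
    after_reap = kill0_succeeds(child.pid)
    return before_reap, after_reap


if __name__ == "__main__":
    print(demo_zombie_probe())
```

The probe should succeed before the child is reaped and fail afterwards, which is why a signal-0 check alone cannot tell a crashed but unreaped worker from a running one.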

Tasks stuck in running for up to 15 minutes on macOS, requiring manual sqlite3 surgery to break the loop. With a 5-minute dispatcher cadence and default claim_expires of 15 minutes, users see 3+ wasted spawn attempts per stuck task.

PR fix notes

PR #20023: fix(kanban): detect darwin zombie workers

Description (problem / solution / changelog)

Fixes #20015

Summary

  • extend Kanban _pid_alive() zombie detection to macOS/Darwin using ps -o stat= after kill(pid, 0) succeeds
  • treat Darwin process states containing Z as dead so crash detection can reclaim defunct workers promptly
  • keep the existing Linux /proc/<pid>/status path unchanged

Scope

This is limited to Kanban dispatcher worker liveness checks. It does not change gateway scoped locks or process spawning semantics.

Verification

  • scripts/run_tests.sh tests/hermes_cli/test_kanban_core_functionality.py::test_pid_alive_helper tests/hermes_cli/test_kanban_core_functionality.py::test_pid_alive_detects_darwin_zombie tests/hermes_cli/test_kanban_core_functionality.py::test_detect_crashed_workers_reclaims -> 3 passed
  • git diff --check

Changed files

  • hermes_cli/kanban_db.py (modified, +24/-5)
  • tests/hermes_cli/test_kanban_core_functionality.py (modified, +16/-0)

PR #20188: fix(kanban): detect darwin zombie workers (salvages #20023)

Description (problem / solution / changelog)

Kanban dispatcher now correctly detects zombie worker processes on macOS, so crashed workers get reclaimed on the next tick instead of tying up their task for the full claim_expires window (~15 min).

Salvaged from #20023 (@LeonSGP43).

Root cause: _pid_alive() used os.kill(pid, 0), which succeeds for zombie processes because the process table entry still exists post-exit, pre-reap. On Linux it fell through to /proc/<pid>/status to read State: Z, but macOS has no /proc, so the zombie check was a documented no-op. A worker that crashed at startup (missing skill, bad credential, import error) stayed "alive" to the dispatcher until the claim TTL expired. With a ~5 min dispatcher cadence and a 15 min TTL, that meant 3+ wasted re-spawn attempts per stuck task, all of which crashed identically.

Changes

  • hermes_cli/kanban_db.py: after kill(pid, 0) succeeds on Darwin, shell out to ps -o stat= -p <pid> with a 1s timeout. Return False if ps exits non-zero (no such process) or if the BSD stat field contains Z. If the probe itself errors, keep the optimistic kill(0) answer — conservative default.
  • tests/hermes_cli/test_kanban_core_functionality.py: new test_pid_alive_detects_darwin_zombie covering the Darwin branch with a mocked ps returning Z+.
  • Linux /proc/<pid>/status path unchanged.
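The Darwin branch described in these changes can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: the function name pid_alive_sketch and its system and run parameters are hypothetical and exist only so the branch can be exercised without a real zombie or a real ps binary. Treating PermissionError as "alive" is also an assumption.

```python
import os
import platform
import subprocess


def pid_alive_sketch(pid, *, system=None, run=subprocess.run):
    """Signal-0 liveness check with an extra ps-based zombie probe on Darwin."""
    if pid <= 0:
        return False  # never signal pid 0 or negative pids (process groups)
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        pass  # assumption: the process exists but is owned by another user

    if (system or platform.system()) != "Darwin":
        return True  # other platforms are out of scope for this sketch

    try:
        result = run(
            ["ps", "-o", "stat=", "-p", str(pid)],
            capture_output=True,
            text=True,
            timeout=1,  # 1s timeout, as described in the change notes
        )
    except (OSError, subprocess.SubprocessError):
        return True  # probe failed: keep the optimistic kill(0) answer
    if result.returncode != 0:
        return False  # ps found no such process
    return "Z" not in result.stdout  # a BSD stat field containing Z means zombie
```

Injecting run keeps the sketch testable in the same spirit as the mocked ps in test_pid_alive_detects_darwin_zombie: a fake runner returning "Z+" should yield False, while a runner that raises should leave the kill(0) answer in place.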

Validation

Scenario | Before | After
macOS: worker crashes at startup | _pid_alive → True for ~15 min until claim_expires; dispatcher re-spawns 3+ times | _pid_alive → False on next tick; task reclaimed via crashed-worker path
Linux: zombie worker | /proc/<pid>/status peek returns False (unchanged) | unchanged
Windows / other POSIX | no zombie check (unchanged) | unchanged
ps probe fails (unexpected) | N/A | falls back to kill(0) answer (optimistic)

Targeted tests: 174/174 pass (test_kanban_core_functionality + test_kanban_db)

Closes #20015

Co-authored-by: LeonSGP43

Changed files

  • hermes_cli/kanban_db.py (modified, +24/-5)
  • tests/hermes_cli/test_kanban_core_functionality.py (modified, +16/-0)


Where

hermes_cli/kanban_db.py:2158-2173

The docstring at lines 2136-2144 even admits this:

On Linux we additionally peek at /proc/<pid>/status and treat State: Z
as dead. On other POSIX or on Windows the zombie check is a no-op.
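For comparison, the Linux peek that the docstring refers to can be sketched like this. It is a minimal illustration, not the code in kanban_db.py: the names state_from_status and linux_pid_is_zombie and the proc_root parameter are hypothetical. It relies only on the State: line of /proc/<pid>/status, whose first field is a one-letter state code.

```python
from pathlib import Path
from typing import Optional


def state_from_status(status_text: str) -> Optional[str]:
    """Extract the one-letter state code from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
    return None


def linux_pid_is_zombie(pid: int, proc_root: Path = Path("/proc")) -> bool:
    """Return True only when the status file positively reports state Z."""
    try:
        text = (proc_root / str(pid) / "status").read_text()
    except OSError:
        return False  # no readable /proc entry, so zombie state cannot be confirmed
    return state_from_status(text) == "Z"
```

On macOS the read_text call always fails because there is no /proc, so a check of this shape silently reports "not a zombie", which matches the no-op the docstring describes.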

Suggested Fix

On Darwin, use proc_pidinfo(PROC_PIDTASKINFO) or kqueue with EVFILT_PROC to detect zombie state. A simpler fallback: check if the process group leader is still alive, or verify that proc_pidinfo's pti_status field is not 0.

Environment

  • macOS (any version)
  • Hermes Agent v0.11.0 (a7fb79efb)

extent analysis

TL;DR

Implement a zombie detection mechanism on macOS using proc_pidinfo or kqueue to prevent tasks from getting stuck in the running state.

Guidance

  • On macOS, use proc_pidinfo(PROC_PIDTASKINFO) to detect the zombie state of a process, as os.kill(pid, 0) returns success for defunct processes.
  • Alternatively, consider using kqueue with EVFILT_PROC for a more event-driven approach to process state monitoring.
  • As a simpler fallback, check if the process group leader is still alive to infer the zombie state.
  • Verify the effectiveness of the chosen approach by testing it with a worker that crashes immediately and checking if the task is re-queued as expected.

Example

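import ctypes
import ctypes.util

# Assumed values from <sys/proc_info.h> and <sys/proc.h>; verify against your SDK.
# proc_taskinfo (PROC_PIDTASKINFO) has no status field, so this sketch reads
# pbi_status from proc_bsdinfo (PROC_PIDTBSDINFO) instead.
PROC_PIDTBSDINFO = 3
SZOMB = 5

libproc = ctypes.CDLL(ctypes.util.find_library("proc"))
proc_pidinfo = libproc.proc_pidinfo
# int proc_pidinfo(int pid, int flavor, uint64_t arg, void *buffer, int buffersize)
proc_pidinfo.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_uint64, ctypes.c_void_p, ctypes.c_int]
proc_pidinfo.restype = ctypes.c_int

def is_zombie(pid):
    # Assumes struct proc_bsdinfo begins with uint32 pbi_flags, then uint32 pbi_status.
    info = (ctypes.c_uint32 * 64)()
    written = proc_pidinfo(pid, PROC_PIDTBSDINFO, 0, info, ctypes.sizeof(info))
    # proc_pidinfo returns the number of bytes written; <= 0 signals failure.
    # Confirm on the target macOS version that the call still answers for an unreaped zombie.
    return written > 0 and info[1] == SZOMB

# Example usage:
print(is_zombie(12345))  # Replace 12345 with the actual PID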

Notes

The provided example is a basic illustration and may require additional error handling and parsing of the proc_pidinfo output.

Recommendation

The linked PRs (#20023 and its salvage #20188) address this in _pid_alive() with a ps -o stat= probe on Darwin, so prefer picking up that fix where it is available. Otherwise, a workaround based on proc_pidinfo or kqueue can detect zombie processes on macOS without waiting for a version update.
