hermes: [Bug] Kanban workers stuck in zombie state after SIGTERM; claim never released, task blocked forever

When a Kanban worker process receives SIGTERM (from a gateway restart, launchd/systemd cgroup cleanup, enforce_max_runtime, or _terminate_reclaimed_worker), the single-query signal handler (_signal_handler_q in cli.py) calls _agent.interrupt() and raises KeyboardInterrupt, but the Python process does not exit cleanly. It remains in the process table as a zombie (shown as <defunct> on macOS).

The dispatcher's detect_crashed_workers / release_stale_claims checks rely on os.kill(pid, 0), which succeeds without raising for zombie processes because they still have a process-table entry. The dispatcher therefore treats the worker as alive and keeps extending the claim. The task stays in the running state indefinitely and is never re-dispatched.

Root Cause

In cli.py lines 14144–14158, _signal_handler_q is registered for SIGTERM and SIGHUP in single-query mode (chat -q). When a Kanban worker receives the signal:

  1. _signal_handler_q calls _agent.interrupt(...) and sleeps for the grace window
  2. Raises KeyboardInterrupt
  3. The agent loop dies, but the process stays in the process table as a zombie
  4. _pid_alive() uses os.kill(pid, 0), which succeeds even for zombies
  5. The dispatcher extends the claim forever, so the task stays stuck permanently
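
The behaviour in steps 4 and 5 can be reproduced outside hermes. The following is a minimal sketch, not the project's code: pid_exists mirrors the os.kill(pid, 0) probe described above, while process_state, pid_really_alive, and spawn_unreaped_child are hypothetical helpers introduced only for illustration. It assumes a POSIX system with either Linux procfs or a ps binary available.

```python
import os
import subprocess
import sys
import time


def pid_exists(pid):
    """Naive probe: True if signal 0 can be delivered. Also True for zombies."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def process_state(pid):
    """Return the one-letter process state ('R', 'S', 'Z', ...) or None."""
    stat_path = f"/proc/{pid}/stat"
    if os.path.isdir("/proc/self"):
        # Linux: the state is the first field after the "(comm)" entry.
        try:
            with open(stat_path, "rb") as fh:
                data = fh.read().decode("ascii", "replace")
            return data.rsplit(")", 1)[1].split()[0]
        except (FileNotFoundError, ProcessLookupError, IndexError):
            return None
    # No procfs (e.g. macOS): fall back to ps.
    try:
        result = subprocess.run(
            ["ps", "-o", "stat=", "-p", str(pid)],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        return None
    return result.stdout.strip()[:1] or None


def pid_really_alive(pid):
    """Zombie-aware probe: the PID exists and the process has not exited."""
    return pid_exists(pid) and process_state(pid) != "Z"


def spawn_unreaped_child(timeout=5.0):
    """Start a child that exits at once and deliberately do not reap it."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    deadline = time.monotonic() + timeout
    while process_state(child.pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.05)
    return child


if __name__ == "__main__":
    child = spawn_unreaped_child()
    print("naive probe says alive:", pid_exists(child.pid))
    print("zombie-aware probe says alive:", pid_really_alive(child.pid))
    child.wait()  # reaping removes the process-table entry
    print("after reaping, naive probe:", pid_exists(child.pid))
```

A zombie disappears only when its parent reaps it, so if the dispatcher is the worker's parent, reaping (for example with os.waitpid and os.WNOHANG) may be an alternative to inspecting the process state. Whether that applies here depends on how hermes spawns workers, which the report does not state.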


Impact

  • Kanban tasks stuck in running state forever
  • Downstream dependent tasks never execute
  • Manual recovery required
  • Affects all gateway-managed kanban setups, especially macOS with launchd

Steps to Reproduce

  1. Set up kanban with dispatch_in_gateway: true
  2. Run hermes gateway restart while a kanban worker is running
  3. Observe: the worker process becomes <defunct>, task stays running forever

Proposed Fix

The signal handler should check for the HERMES_KANBAN_TASK environment variable and, when it is set, call block_task() to release the claim before the process dies. A fix is being tested locally.
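
One possible shape for such a handler is sketched below. This is an illustration under stated assumptions, not the fix being tested: the signature of block_task() is not shown in the report, so the sketch takes a hypothetical release_claim(task_id, reason) callable, and make_term_handler and exit_fn are names invented here. Only HERMES_KANBAN_TASK comes from the report.

```python
import os
import signal


def make_term_handler(release_claim, exit_fn=os._exit):
    """Build a SIGTERM/SIGHUP handler that releases the claim, then exits.

    release_claim: hypothetical callable (task_id, reason) standing in for
                   whatever block_task() actually requires.
    exit_fn:       injectable so the handler can be exercised in tests;
                   defaults to os._exit, which skips interpreter cleanup.
    """
    def handler(signum, frame):
        task_id = os.environ.get("HERMES_KANBAN_TASK")
        if task_id:
            try:
                release_claim(task_id, f"Worker interrupted by signal {int(signum)}")
            except Exception:
                # Releasing is best effort; never let it prevent the exit.
                pass
        # Conventional shell exit status for death by signal: 128 + signum.
        exit_fn(128 + int(signum))

    return handler


if __name__ == "__main__":
    def fake_release(task_id, reason):
        print(f"released {task_id}: {reason}")

    handler = make_term_handler(fake_release)
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGHUP, handler)
    os.environ.setdefault("HERMES_KANBAN_TASK", "demo-task")
    os.kill(os.getpid(), signal.SIGTERM)  # the handler runs and exits with 143
```

Releasing the claim before exiting means the task no longer depends on the dispatcher's liveness probe. The trade-off is that the handler does real work at signal time, so the release call should be short and should not wait on locks the interrupted code might hold.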

Workaround

hermes kanban block <task_id> "Worker was interrupted — manual recovery"
