hermes - ✅(Solved) Fix Gateway scoped locks treat zombie owner PIDs as live, blocking Telegram/Slack reconnect [1 pull request, 1 participant]

GitHub stats
NousResearch/hermes-agent#18822 (fetched 2026-05-03 04:54:05)
Comments
0
Participants
1
Timeline
6
Reactions
0
Timeline (top)
labeled ×5, cross-referenced ×1

Root Cause

Observed effect: a replacement gateway started only partially (webhook connected), while Telegram/Slack were refused because their scoped lock metadata pointed at an old PID that was already a zombie. Since zombie processes still have /proc/<pid> entries and os.kill(pid, 0) succeeds, the existing stale-lock check considered the lock live.

Fix Action

Fix / Workaround

I patched the /proc/<pid>/status state check in acquire_scoped_lock(...) to treat zombie/dead process states as stale, not just stopped states.

PR fix notes

PR #18832: fix(gateway): treat zombie PIDs as stale in scoped locks

Description (problem / solution / changelog)

Fix: Treat zombie PIDs as stale in gateway scoped locks

Problem

The gateway scoped-lock stale detection treats zombie gateway processes as still owning a live platform lock. When a gateway process becomes a zombie (it has exited but its parent has not yet reaped it), os.kill(pid, 0) still succeeds because the PID remains in the process table and /proc/<pid> still exists. This prevents a replacement gateway from reclaiming Telegram/Slack locks.
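The behavior described above can be observed directly on Linux. The following is a minimal sketch, not code from the Hermes repository: the helper names read_proc_state, make_zombie and signal_probe_says_alive are hypothetical. It forks a child that exits without being reaped, then compares the signal-0 probe with the State field in /proc/<pid>/status.

```python
import os
import time
from typing import Optional


def read_proc_state(pid: int) -> Optional[str]:
    """Return the one-letter State code from /proc/<pid>/status, or None."""
    try:
        with open(f"/proc/{pid}/status", encoding="utf-8") as fh:
            for line in fh:
                if line.startswith("State:"):
                    # The line looks like "State:\tZ (zombie)".
                    return line.split()[1]
    except OSError:
        return None
    return None


def make_zombie() -> int:
    """Fork a child that exits at once; the parent does not reap it yet."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without cleanup
    # Parent: wait up to about 2 s for the kernel to mark the child as a zombie.
    deadline = time.monotonic() + 2.0
    while read_proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    return pid


def signal_probe_says_alive(pid: int) -> bool:
    """Naive liveness probe: signal 0 only checks that the PID exists."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the PID exists but belongs to another user
    return True


if __name__ == "__main__":
    zombie = make_zombie()
    print("probe:", signal_probe_says_alive(zombie), "state:", read_proc_state(zombie))
    os.waitpid(zombie, 0)  # reap the child; the PID is released
    print("probe after reap:", signal_probe_says_alive(zombie))
```

While the child is unreaped, the probe is expected to report it as alive even though its state is Z; only after the parent reaps it does the probe fail.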

Root Cause

In gateway/status.py, acquire_scoped_lock() only checks for stopped/tracing-stop states ("T", "t") in /proc/<pid>/status. Zombie state ("Z") is not in the check tuple, so zombie processes pass the liveness check and keep their locks.

The same class of bug exists in get_running_pid(), which could report a zombie gateway as "running".

Fix

  1. acquire_scoped_lock (line 529): Added "Z" to the State check tuple: ("T", "t", "Z") — stopped, tracing stop, or zombie are all now treated as stale.

  2. get_running_pid (line 785+): Added explicit zombie check via /proc/<pid>/status State field. Zombies are skipped (treated as dead) rather than returned as a running gateway PID.

This matches the existing implementation in kanban_db.py:_is_pid_alive() which already handles zombie detection correctly.
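The two changes above share one idea: after the signal-0 probe, consult the State field and treat stopped, traced and zombie owners as gone. The sketch below illustrates that idea only; parse_state, owner_is_stale and STALE_STATES are hypothetical names, and the actual code in gateway/status.py may be structured differently.

```python
import os
from typing import Optional

# State codes that mean "cannot own a live connection".
# "T"/"t" were already checked according to the PR text; "Z" is the addition.
STALE_STATES = ("T", "t", "Z")


def parse_state(status_text: str) -> Optional[str]:
    """Extract the one-letter State code from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
    return None


def owner_is_stale(pid: int) -> bool:
    """Hypothetical helper: True if a lock owner should be treated as gone."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return True  # no such PID at all
    except PermissionError:
        pass  # the PID exists but belongs to another user
    try:
        with open(f"/proc/{pid}/status", encoding="utf-8") as fh:
            state = parse_state(fh.read())
    except OSError:
        # No readable /proc entry (for example on non-Linux systems):
        # fall back to the os.kill result and assume the owner is live.
        return False
    return state in STALE_STATES
```

Falling back to "live" when /proc cannot be read is a conservative choice made for this sketch, so that a platform without /proc never reclaims a lock from a running process.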

Testing

Reproduced by creating a scoped-lock file owned by a zombie PID, then calling acquire_scoped_lock(). Before fix: lock is treated as live (returns False, existing). After fix: zombie PID is detected, lock is treated as stale.

Fixes #18822

Changed files

  • gateway/status.py (modified, +19/-3)

Code Example

- if _state in ("T", "t"):  # stopped or tracing stop
+ if _state in ("T", "t", "Z", "X", "x"):
      stale = True

---

# Check if process is not actually runnable. Stopped processes
# (Ctrl+Z / SIGTSTP) and zombies still respond to
# os.kill(pid, 0), but they cannot own a live gateway
# connection. Treat both states as stale so replacement
# gateways can reclaim Telegram/Slack scoped locks after an
# interrupted restart.

---

acquire_with_zombie_lock_ok= True
existing_seen= False

---

[Telegram] Connected to Telegram (polling mode)
✓ telegram connected
[Slack] Socket Mode connected
✓ slack connected
Gateway running with 3 platform(s)

Bug Description

The gateway scoped-lock stale detection can treat a zombie gateway process as still owning a live platform lock. In a container/manual gateway setup, this prevented a new gateway instance from reconnecting Telegram and Slack after an interrupted/restarted gateway run.

Observed effect: a replacement gateway started only partially (webhook connected), while Telegram/Slack were refused because their scoped lock metadata pointed at an old PID that was already a zombie. Since zombie processes still have /proc/<pid> entries and os.kill(pid, 0) succeeds, the existing stale-lock check considered the lock live.

Environment

  • Hermes Agent v0.12.0 (2026.4.30)
  • Commit observed locally: f98b5d00a
  • Python: 3.11.15
  • Runtime: Linux container-style install, gateway run manually/nohup rather than systemd
  • Affected file/function: gateway/status.py, acquire_scoped_lock(...)

Steps to Reproduce

  1. Run the gateway with Telegram/Slack enabled.
  2. Interrupt/restart the gateway in a way that leaves the old gateway PID as a zombie/defunct process.
  3. Start a new gateway instance.
  4. The new instance sees the old scoped lock owner PID. Because os.kill(pid, 0) succeeds for zombies, the lock is treated as active instead of stale.

I also reproduced the lock behavior locally by writing a scoped-lock file owned by a real zombie PID, then calling acquire_scoped_lock(...).
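A local reproduction along these lines might look like the sketch below. It is an illustration under stated assumptions: the lock file layout ({"pid": <int>}), the file name telegram.lock, and the function lock_is_reclaimable are all hypothetical stand-ins, since the real lock format and the signature of acquire_scoped_lock(...) are not shown in the issue.

```python
import json
import os
import tempfile
import time


def proc_state(pid: int) -> str:
    """One-letter State code from /proc/<pid>/status, or "" if unavailable."""
    try:
        with open(f"/proc/{pid}/status", encoding="utf-8") as fh:
            for line in fh:
                if line.startswith("State:"):
                    return line.split()[1]
    except OSError:
        pass
    return ""


def spawn_zombie() -> int:
    """Fork a child that exits immediately and is left unreaped."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    deadline = time.monotonic() + 2.0
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    return pid


def lock_is_reclaimable(lock_path: str, stale_states=("T", "t", "Z")) -> bool:
    """Hypothetical stand-in for the stale check in acquire_scoped_lock."""
    with open(lock_path, encoding="utf-8") as fh:
        owner = json.load(fh)["pid"]  # assumed layout: {"pid": <int>}
    try:
        os.kill(owner, 0)
    except ProcessLookupError:
        return True  # owner PID no longer exists
    except PermissionError:
        pass
    return proc_state(owner) in stale_states


def reproduce() -> dict:
    """Write a lock owned by a zombie and evaluate both state tuples."""
    zombie = spawn_zombie()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "telegram.lock")  # hypothetical file name
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"pid": zombie}, fh)
        result = {
            "before_fix": lock_is_reclaimable(path, stale_states=("T", "t")),
            "after_fix": lock_is_reclaimable(path),
        }
    os.waitpid(zombie, 0)  # reap the zombie so it does not linger
    return result


if __name__ == "__main__":
    print(reproduce())
```

With the original ("T", "t") tuple the zombie-owned lock is not reclaimable; with "Z" included it is, which mirrors the before/after behavior reported in the issue.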

Expected Behavior

Scoped locks owned by zombie/dead gateway processes should be treated as stale, allowing the replacement gateway to reclaim Telegram/Slack locks and reconnect all platforms.

Actual Behavior

The old zombie PID was treated as a live lock owner. Telegram/Slack stayed disconnected until the stale lock files were removed manually.

Local Fix Tested

I patched the /proc/<pid>/status state check in acquire_scoped_lock(...) to treat zombie/dead process states as stale, not just stopped states:

- if _state in ("T", "t"):  # stopped or tracing stop
+ if _state in ("T", "t", "Z", "X", "x"):
      stale = True

With an expanded comment:

# Check if process is not actually runnable. Stopped processes
# (Ctrl+Z / SIGTSTP) and zombies still respond to
# os.kill(pid, 0), but they cannot own a live gateway
# connection. Treat both states as stale so replacement
# gateways can reclaim Telegram/Slack scoped locks after an
# interrupted restart.

Verification

Using the Hermes venv Python, because the system Python did not have all Hermes deps:

  • py_compile passed for gateway/status.py
  • Simulated scoped lock owned by zombie PID was reclaimed successfully:
      acquire_with_zombie_lock_ok= True
      existing_seen= False
  • After a controlled gateway restart, all platforms connected again:
      [Telegram] Connected to Telegram (polling mode)
      ✓ telegram connected
      [Slack] Socket Mode connected
      ✓ slack connected
      Gateway running with 3 platform(s)

Suggested Fix

Update acquire_scoped_lock(...) to treat non-runnable Linux /proc/<pid>/status states as stale, at minimum Z (zombie) in addition to the existing T/t stopped/traced states. Including X/x seems reasonable for dead processes if encountered.

extent analysis

TL;DR

Update the acquire_scoped_lock function to treat zombie processes as stale by checking for the "Z" state in the /proc/<pid>/status file.

Guidance

  • Modify the acquire_scoped_lock function to include a check for zombie processes by adding "Z" to the list of states that indicate a stale lock.
  • Verify the fix by testing the acquire_scoped_lock function with a simulated zombie PID and checking that the lock is reclaimed successfully.
  • Ensure that the updated function works correctly in different scenarios, such as when the gateway is restarted or interrupted.
  • Consider including additional states, such as "X" or "x", to handle dead processes if encountered.

Example

# Check if process is not actually runnable. Stopped processes
# (Ctrl+Z / SIGTSTP) and zombies still respond to
# os.kill(pid, 0), but they cannot own a live gateway
# connection. Treat both states as stale so replacement
# gateways can reclaim Telegram/Slack scoped locks after an
# interrupted restart.
if _state in ("T", "t", "Z", "X", "x"):
    stale = True

Notes

The suggested fix relies on the State field in /proc/<pid>/status, which is Linux-specific, so it may not apply to other operating systems. It may also need to be adapted for edge cases not covered by the reported scenario.

Recommendation

Apply the workaround by updating the acquire_scoped_lock function to treat zombie processes as stale, as this fix has been tested and verified to work correctly in the provided scenario.
