hermes - ✅(Solved) Fix Gateway drain hangs on wedged adapter websockets (WSL + Feishu/Weixin) — 90s systemctl stop traceback [2 pull requests, 2 comments, 2 participants]

NousResearch/hermes-agent#19937 (fetched 2026-05-05 06:04:14)
Error Message

hermes gateway stop

... 90s pause ...

Traceback ending in subprocess.TimeoutExpired: Command '['systemctl', '--user', 'stop', 'hermes-gateway.service']' timed out after 90 seconds

Root Cause

GatewayRunner.stop() drains active agents up to agent.restart_drain_timeout (default 60s), then calls _safe_adapter_disconnect() per adapter. Feishu / Weixin / Lark websocket closes are not bounded — on a wedged socket they can consume the remaining budget by themselves, and combined with the drain this exceeds the systemd TimeoutStopSec (drain + 30s) and the CLI's systemctl stop timeout (hard-coded 90s in hermes_cli/gateway.py).

When it blows past 90s, _run_systemctl(["stop", ...], timeout=90) raises TimeoutExpired and the CLI dumps a traceback. Meanwhile systemd eventually SIGKILLs the cgroup at its own TimeoutStopSec, so the service does actually go away — but the user sees an error instead of a clean stop.
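The timing argument above can be checked with a little arithmetic. The following is a minimal sketch, not code from the repository: the three constants are the values quoted in the issue, while the function name, the adapter count of 3, the 5s cap, and the assumption that disconnects run one after another are all hypothetical.

```python
import math

# Values quoted in the issue text.
DRAIN_TIMEOUT = 60          # agent.restart_drain_timeout default, seconds
SYSTEMD_STOP_MARGIN = 30    # TimeoutStopSec is described as drain + 30s
CLI_STOP_TIMEOUT = 90       # hard-coded in hermes_cli/gateway.py


def worst_case_stop_seconds(drain, per_adapter_cap, adapters):
    """Upper bound on stop time, assuming sequential adapter disconnects.

    Hypothetical helper for illustration only.
    """
    return drain + per_adapter_cap * adapters


if __name__ == "__main__":
    # Unbounded disconnect: a single wedged socket makes the bound infinite.
    unbounded = worst_case_stop_seconds(DRAIN_TIMEOUT, math.inf, 3)
    print(unbounded > CLI_STOP_TIMEOUT)   # True

    # Hypothetical 5s cap across 3 adapters: 60 + 15 = 75s, inside the 90s budget.
    bounded = worst_case_stop_seconds(DRAIN_TIMEOUT, 5, 3)
    print(bounded, bounded <= CLI_STOP_TIMEOUT)   # 75 True
```

The same bound also stays under the systemd limit of DRAIN_TIMEOUT + SYSTEMD_STOP_MARGIN = 90s, but only while the number of adapters times the cap stays below 30s.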

Fix Action

Fixed

PR fix notes

PR #19946: fix(gateway): bound per-adapter disconnect timeout to prevent shutdown hang

Description (problem / solution / changelog)

Summary

Bound per-adapter disconnect with a 5-second timeout to prevent wedged websockets (Feishu/Lark/Weixin on WSL with flaky networking) from blocking the entire gateway shutdown sequence.

Root cause

GatewayRunner.stop() iterates over all adapters and calls await adapter.disconnect() without any timeout. On WSL with flaky networking, Feishu/Lark/Weixin websocket closes can hang indefinitely on ConnectionResetError or DNS failures, consuming the entire drain budget and exceeding both systemd's TimeoutStopSec and the CLI's 90s _run_systemctl timeout — resulting in a raw Python traceback from systemctl --user stop.

Fix

  1. Per-adapter disconnect timeout (5s default): Both _safe_adapter_disconnect() and the stop() disconnect loop now wrap adapter.disconnect() with asyncio.wait_for(timeout=5). A wedged adapter is logged and abandoned rather than blocking the rest of shutdown.

  2. Friendly CLI message: systemd_stop() catches subprocess.TimeoutExpired and prints a helpful message with hermes gateway status instead of a raw traceback.

  3. Regression tests: Two new tests verify the timeout behavior — one for _safe_adapter_disconnect() and one for the stop() disconnect loop with a wedged + normal adapter pair.
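The first item in the list above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual diff: the adapter classes, the `name` attribute, the `disconnect_all` helper, and the constant name are hypothetical; only `asyncio.wait_for` around `adapter.disconnect()` and the 5-second default come from the PR description.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.sketch")

# Hypothetical constant name; the PR describes a 5-second default.
ADAPTER_DISCONNECT_TIMEOUT = 5.0


async def safe_adapter_disconnect(adapter, timeout=ADAPTER_DISCONNECT_TIMEOUT):
    """Disconnect one adapter, giving up after `timeout` seconds.

    Returns True on a clean disconnect, False if it timed out or raised.
    """
    try:
        await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.warning("adapter %s did not disconnect within %ss; abandoning",
                       adapter.name, timeout)
        return False
    except Exception:
        logger.exception("adapter %s raised during disconnect", adapter.name)
        return False


async def disconnect_all(adapters, timeout=ADAPTER_DISCONNECT_TIMEOUT):
    """Disconnect adapters one by one; a wedged one cannot block the rest."""
    results = {}
    for adapter in adapters:
        results[adapter.name] = await safe_adapter_disconnect(adapter, timeout)
    return results


class WedgedAdapter:
    """Stand-in for a websocket close that never returns."""
    name = "wedged"

    async def disconnect(self):
        await asyncio.sleep(3600)


class HealthyAdapter:
    """Stand-in for an adapter that closes immediately."""
    name = "healthy"

    async def disconnect(self):
        return None


if __name__ == "__main__":
    outcome = asyncio.run(
        disconnect_all([WedgedAdapter(), HealthyAdapter()], timeout=0.1)
    )
    print(outcome)   # {'wedged': False, 'healthy': True}
```

One caveat with this approach: `asyncio.wait_for` cancels the inner coroutine and waits for that cancellation to complete, so a `disconnect()` that suppresses cancellation could still stall. The wedged and healthy pair mirrors the shape of the regression test described in the third item.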

Testing

  • All 5 tests in tests/gateway/test_safe_adapter_disconnect.py pass (including 2 new)
  • All 9 tests in tests/gateway/test_gateway_shutdown.py pass (no regressions)
  • 3 changed files: gateway/run.py, hermes_cli/gateway.py, tests/gateway/test_safe_adapter_disconnect.py

Fixes Gateway drain hangs on wedged adapter websockets (WSL + Feishu/Weixin) — 90s systemctl stop traceback

Changed files

  • gateway/run.py (modified, +25/-3)
  • hermes_cli/gateway.py (modified, +11/-1)
  • tests/gateway/test_safe_adapter_disconnect.py (modified, +69/-1)

PR #19994: fix(gateway): cap adapter disconnect during stop

Description (problem / solution / changelog)

Summary

  • cap each gateway adapter disconnect() call during shutdown so one wedged websocket cannot consume the whole stop budget
  • log and continue when an adapter disconnect exceeds the cap
  • catch systemctl stop timeout in hermes gateway stop and print status/log guidance instead of surfacing a raw TimeoutExpired traceback
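The CLI-side change in the last bullet might look roughly like the sketch below. The function name, return shape, and message wording are assumptions made for illustration; the source only states that `systemd_stop()` catches `subprocess.TimeoutExpired` and prints status guidance. A sleeping child process stands in for a slow `systemctl --user stop`.

```python
import subprocess
import sys


def stop_with_friendly_timeout(cmd, timeout=90.0):
    """Run a stop command and return (ok, message) instead of raising.

    Hypothetical helper: a timeout is reported as guidance, not a traceback.
    """
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, (
            f"Gateway is still shutting down after {timeout:g}s. "
            "Run `hermes gateway status` to check on it."
        )
    except subprocess.CalledProcessError as exc:
        return False, f"Stop command failed with exit code {exc.returncode}."
    return True, "Gateway stopped."


if __name__ == "__main__":
    # Stand-in for a stop command that outlives the CLI timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    ok, message = stop_with_friendly_timeout(slow, timeout=0.5)
    print(ok, message)
```

Note that `subprocess.run` kills the child it launched when the timeout expires. In the real stop path that child is `systemctl` itself, not the gateway, so the service keeps shutting down on its own schedule.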

Closes #19937.

Verification

  • scripts/run_tests.sh tests/gateway/test_safe_adapter_disconnect.py tests/hermes_cli/test_gateway_service.py::TestSystemdServiceRefresh::test_systemd_stop_timeout_prints_status_guidance -> 5 passed
  • git diff --check -> passed

Local note

Running the full tests/hermes_cli/test_gateway_service.py file on this macOS environment still hits existing user-systemd preflight failures in unrelated start/restart tests. The new stop-timeout test and disconnect tests pass in isolation.

Scope

This does not alter drain timing or adapter-specific websocket code. It bounds the shared defensive disconnect path and makes the CLI stop timeout user-facing.

Changed files

  • gateway/run.py (modified, +27/-1)
  • hermes_cli/gateway.py (modified, +9/-1)
  • tests/gateway/test_safe_adapter_disconnect.py (modified, +20/-0)
  • tests/hermes_cli/test_gateway_service.py (modified, +24/-0)

Original issue

Gateway drain hangs on wedged adapter websockets (WSL + Feishu/Weixin)

Follow-up to #19876 / #19936 (planned-stop marker).

Symptom

On WSL with flaky networking, hermes gateway stop times out at 90s and prints a Python traceback from systemctl --user stop — even on the first invocation after a clean boot:

hermes gateway stop
# ... 90s pause ...
# Traceback ending in subprocess.TimeoutExpired: Command '['systemctl', '--user', 'stop', 'hermes-gateway.service']' timed out after 90 seconds

Originally reported by @lulu in the Discord bug channel; debug report includes repeated ConnectionResetError(104, 'Connection reset by peer') on lark_oapi websocket closes during shutdown, plus DNS failures (Failed to resolve 'open.feishu.cn'). The planned-stop marker in #19936 handles the service-manager-revives-us case, but does not shorten drain — so this traceback can still happen on slow drains.

What to do

  1. Bound each adapter disconnect with a short per-adapter timeout. _safe_adapter_disconnect() already exists (#14130 bounded it somewhat) — audit the Feishu / Lark / Weixin / iLink paths specifically and cap their socket-close at ~3–5s each. Any adapter that can't close cleanly in that window should be abandoned (log and move on), not block shutdown.
  2. Catch TimeoutExpired in systemd_stop() and surface a friendly message instead of a raw traceback. The marker from #19936 means the process IS exiting cleanly once it finishes drain; the 90s CLI timeout just means "the gateway is still working on it, check hermes gateway status."
  3. Optionally: shorten the default agent.restart_drain_timeout for the hermes gateway stop path specifically. Stop-for-real wants a shorter budget than restart-drain (where we want to let sessions finish gracefully).
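The optional third item could be approached along these lines. Everything here is hypothetical: the function name, the `reason` values, and the 15s stop budget are invented for illustration, and the PR notes on this page state that drain timing was not changed. Only the 60s default for `agent.restart_drain_timeout` comes from the issue.

```python
def drain_timeout_for(reason, restart_drain_timeout=60.0, stop_drain_timeout=15.0):
    """Pick a drain budget based on why the gateway is shutting down.

    Hypothetical helper: a restart lets sessions finish gracefully, while a
    real stop uses a shorter budget that never exceeds the restart budget.
    """
    if reason == "restart":
        return restart_drain_timeout
    if reason == "stop":
        return min(stop_drain_timeout, restart_drain_timeout)
    raise ValueError(f"unknown shutdown reason: {reason!r}")


if __name__ == "__main__":
    print(drain_timeout_for("restart"))   # 60.0
    print(drain_timeout_for("stop"))      # 15.0
```

Taking the minimum means a user who has already lowered the restart drain below the stop budget does not get a longer stop than they configured.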

Context

  • Debug report: https://paste.rs/zHNKJ (auto-deletes)
  • Stop path: hermes_cli/gateway.py::systemd_stop() → _run_systemctl(..., timeout=90)
  • Drain path: gateway/run.py::GatewayRunner.stop() → _drain_active_agents(timeout) → _safe_adapter_disconnect(...)
  • Related PRs: #14130 (bounded adapter disconnect), #11325 (extended TimeoutStopSec), #9128 (clean-shutdown marker), #19936 (planned-stop marker)

extent analysis

TL;DR

Bound each adapter disconnect with a short per-adapter timeout to prevent TimeoutExpired errors during gateway shutdown.

Guidance

  • Audit the Feishu, Lark, Weixin, and iLink paths in _safe_adapter_disconnect() and cap their socket-close at 3-5s each to prevent shutdown delays.
  • Catch TimeoutExpired in systemd_stop() and surface a friendly message instead of a raw traceback to improve user experience.
  • Consider shortening the default agent.restart_drain_timeout for the hermes gateway stop path to reduce shutdown time.

Example

No code snippet is provided. The fix modifies the existing _safe_adapter_disconnect() function rather than adding a new component.

Notes

The provided solution focuses on bounding adapter disconnect timeouts and handling TimeoutExpired errors. Additional modifications, such as shortening the default agent.restart_drain_timeout, may be considered to further improve shutdown performance.

Recommendation

Bound each adapter disconnect with a short per-adapter timeout. This directly addresses the root cause of the TimeoutExpired errors during gateway shutdown.
