hermes - ✅(Solved) Fix Gateway drain hangs on wedged adapter websockets (WSL + Feishu/Weixin) — 90s systemctl stop traceback [2 pull requests, 2 comments, 2 participants]

hermes · 2026-05-04 23:01:20

GitHub stats

NousResearch/hermes-agent#19937 • Fetched 2026-05-05 06:04:14

Author: teknium1
Participants: alt-glitch, teknium1
Timeline (top): labeled ×5, commented ×2, cross-referenced ×2, mentioned ×1

Error Message

hermes gateway stop

... 90s pause ...

Traceback ending in subprocess.TimeoutExpired: Command '['systemctl', '--user', 'stop', 'hermes-gateway.service']' timed out after 90 seconds

Root Cause

GatewayRunner.stop() drains active agents for up to agent.restart_drain_timeout (default 60s), then calls _safe_adapter_disconnect() for each adapter. The Feishu / Weixin / Lark websocket closes are not bounded, so a wedged socket can consume the remaining budget by itself. Combined with the drain, the stop then exceeds both the systemd TimeoutStopSec (drain + 30s) and the CLI's systemctl stop timeout (hard-coded to 90s in hermes_cli/gateway.py).

When it blows past 90s, _run_systemctl(["stop", ...], timeout=90) raises TimeoutExpired and the CLI dumps a traceback. Meanwhile systemd eventually SIGKILLs the cgroup at its own TimeoutStopSec, so the service does actually go away — but the user sees an error instead of a clean stop.
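The timing argument above can be checked with a little arithmetic. The sketch below is illustrative only: the constants mirror the figures quoted in this issue, the helper names are hypothetical, and it assumes adapters are disconnected one after another.

```python
# Illustrative arithmetic only; constants mirror the figures quoted in the issue.
DRAIN_TIMEOUT_S = 60.0        # agent.restart_drain_timeout default
SYSTEMD_STOP_MARGIN_S = 30.0  # TimeoutStopSec is described as drain + 30s
CLI_STOP_TIMEOUT_S = 90.0     # hard-coded systemctl stop timeout in the CLI


def worst_case_stop_seconds(drain_s: float, disconnect_s: list[float]) -> float:
    """Upper bound on stop time, assuming sequential adapter disconnects."""
    return drain_s + sum(disconnect_s)


def exceeds_cli_timeout(total_s: float) -> bool:
    """True when the stop would outlive the CLI's systemctl timeout."""
    return total_s > CLI_STOP_TIMEOUT_S
```

With a full 60s drain and three adapters capped at 5s each, the bound is 75s, which fits inside 90s. A single unbounded close that takes 45s pushes the total to 105s and triggers the traceback.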

Fix Action

Fixed

Fixed by PR: fix(gateway): bound per-adapter disconnect timeout to prevent shutdown hang (https://github.com/NousResearch/hermes-agent/pull/19946)
Fixed by PR: fix(gateway): cap adapter disconnect during stop (https://github.com/NousResearch/hermes-agent/pull/19994)

PR fix notes

PR #19946: fix(gateway): bound per-adapter disconnect timeout to prevent shutdown hang

Repository: NousResearch/hermes-agent
Author: liuhao1024
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/19946

Description (problem / solution / changelog)

Summary

Bound per-adapter disconnect with a 5-second timeout to prevent wedged websockets (Feishu/Lark/Weixin on WSL with flaky networking) from blocking the entire gateway shutdown sequence.

Root cause

GatewayRunner.stop() iterates over all adapters and calls await adapter.disconnect() without any timeout. On WSL with flaky networking, Feishu/Lark/Weixin websocket closes can hang indefinitely on ConnectionResetError or DNS failures, consuming the entire drain budget and exceeding both systemd's TimeoutStopSec and the CLI's 90s _run_systemctl timeout — resulting in a raw Python traceback from systemctl --user stop.

Fix

Per-adapter disconnect timeout (5s default): Both _safe_adapter_disconnect() and the stop() disconnect loop now wrap adapter.disconnect() with asyncio.wait_for(timeout=5). A wedged adapter is logged and abandoned rather than blocking the rest of shutdown.
Friendly CLI message: systemd_stop() catches subprocess.TimeoutExpired and prints a helpful message with hermes gateway status instead of a raw traceback.
Regression tests: Two new tests verify the timeout behavior — one for _safe_adapter_disconnect() and one for the stop() disconnect loop with a wedged + normal adapter pair.
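A minimal sketch of the per-adapter bound described in the fix, assuming each adapter exposes an async disconnect() method. The helper names, the constant name, and the dict-of-adapters shape are hypothetical and are not taken from the PR diff.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# 5s is the default quoted in the PR description; the constant name is assumed.
ADAPTER_DISCONNECT_TIMEOUT_S = 5.0


async def bounded_disconnect(name, adapter, timeout=ADAPTER_DISCONNECT_TIMEOUT_S):
    """Await adapter.disconnect(), giving up after `timeout` seconds.

    Returns True on a clean disconnect, False if the adapter was abandoned.
    """
    try:
        await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.warning("Adapter %s did not disconnect within %.1fs; abandoning", name, timeout)
        return False
    except Exception:
        # A failing adapter should not stop the rest of shutdown either.
        logger.exception("Adapter %s raised during disconnect", name)
        return False


async def disconnect_all(adapters, timeout=ADAPTER_DISCONNECT_TIMEOUT_S):
    """Disconnect adapters in turn; a wedged one cannot block the others."""
    results = {}
    for name, adapter in adapters.items():
        results[name] = await bounded_disconnect(name, adapter, timeout)
    return results
```

One caveat: asyncio.wait_for cancels the inner coroutine on timeout and waits for that cancellation to complete, so an adapter that suppresses cancellation could still stall this path.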

Testing

All 5 tests in tests/gateway/test_safe_adapter_disconnect.py pass (including 2 new)
All 9 tests in tests/gateway/test_gateway_shutdown.py pass (no regressions)
3 changed files: gateway/run.py, hermes_cli/gateway.py, tests/gateway/test_safe_adapter_disconnect.py

Fixes Gateway drain hangs on wedged adapter websockets (WSL + Feishu/Weixin) — 90s systemctl stop traceback

Changed files

gateway/run.py (modified, +25/-3)
hermes_cli/gateway.py (modified, +11/-1)
tests/gateway/test_safe_adapter_disconnect.py (modified, +69/-1)

PR #19994: fix(gateway): cap adapter disconnect during stop

Repository: NousResearch/hermes-agent
Author: LeonSGP43
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/19994

Description (problem / solution / changelog)

Summary

cap each gateway adapter disconnect() call during shutdown so one wedged websocket cannot consume the whole stop budget
log and continue when an adapter disconnect exceeds the cap
catch systemctl stop timeout in hermes gateway stop and print status/log guidance instead of surfacing a raw TimeoutExpired traceback

Closes #19937.
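The CLI-side change can be sketched roughly as follows. This is not the PR's code: stop_service is a hypothetical helper, and the message text is an assumption. Only subprocess.run, its timeout parameter, and subprocess.TimeoutExpired are standard library behavior.

```python
import subprocess


def stop_service(cmd, timeout=90.0):
    """Run a stop command; return True on success, False on failure or timeout.

    On timeout, print guidance instead of letting TimeoutExpired propagate.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        print(
            f"Gateway did not stop within {timeout:.0f}s; it may still be draining. "
            "Check `hermes gateway status` for the current state."
        )
        return False
    return completed.returncode == 0
```

In the gateway CLI the command would be the one from the traceback, `['systemctl', '--user', 'stop', 'hermes-gateway.service']`. The sketch takes the command as a parameter so it can be exercised without systemd.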

Verification

scripts/run_tests.sh tests/gateway/test_safe_adapter_disconnect.py tests/hermes_cli/test_gateway_service.py::TestSystemdServiceRefresh::test_systemd_stop_timeout_prints_status_guidance -> 5 passed
git diff --check -> passed

Local note

Running the full tests/hermes_cli/test_gateway_service.py file on this macOS environment still hits existing user-systemd preflight failures in unrelated start/restart tests. The new stop-timeout test and disconnect tests pass in isolation.

Scope

This does not alter drain timing or adapter-specific websocket code. It bounds the shared defensive disconnect path and makes the CLI stop timeout user-facing.

Changed files

gateway/run.py (modified, +27/-1)
hermes_cli/gateway.py (modified, +9/-1)
tests/gateway/test_safe_adapter_disconnect.py (modified, +20/-0)
tests/hermes_cli/test_gateway_service.py (modified, +24/-0)

Gateway drain hangs on wedged adapter websockets (WSL + Feishu/Weixin)

Follow-up to #19876 / #19936 (planned-stop marker).

Symptom

On WSL with flaky networking, hermes gateway stop times out at 90s and prints a Python traceback from systemctl --user stop — even on the first invocation after a clean boot:

hermes gateway stop
# ... 90s pause ...
# Traceback ending in subprocess.TimeoutExpired: Command '['systemctl', '--user', 'stop', 'hermes-gateway.service']' timed out after 90 seconds

Originally reported by @lulu in the Discord bug channel; debug report includes repeated ConnectionResetError(104, 'Connection reset by peer') on lark_oapi websocket closes during shutdown, plus DNS failures (Failed to resolve 'open.feishu.cn'). The planned-stop marker in #19936 handles the service-manager-revives-us case, but does not shorten drain — so this traceback can still happen on slow drains.

Root cause

What to do

Bound each adapter disconnect with a short per-adapter timeout. _safe_adapter_disconnect() already exists (#14130 bounded it somewhat) — audit the Feishu / Lark / Weixin / iLink paths specifically and cap their socket-close at ~3–5s each. Any adapter that can't close cleanly in that window should be abandoned (log and move on), not block shutdown.
Catch TimeoutExpired in systemd_stop() and surface a friendly message instead of a raw traceback. The marker from #19936 means the process IS exiting cleanly once it finishes drain; the 90s CLI timeout just means "the gateway is still working on it, check hermes gateway status."
Optionally: shorten the default agent.restart_drain_timeout for the hermes gateway stop path specifically. Stop-for-real wants a shorter budget than restart-drain (where we want to let sessions finish gracefully).
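The third, optional item could look something like the sketch below. The stop_drain_timeout setting and the DrainConfig type are hypothetical and do not exist in the project as far as this issue shows. Only the 60s restart default comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrainConfig:
    restart_drain_timeout: float = 60.0  # default quoted in the issue
    stop_drain_timeout: float = 15.0     # hypothetical setting, value assumed


def drain_budget(config: DrainConfig, *, restarting: bool) -> float:
    """Pick the drain budget: generous for restarts, short for a real stop."""
    if restarting:
        return config.restart_drain_timeout
    # Never let the stop budget exceed the restart budget.
    return min(config.stop_drain_timeout, config.restart_drain_timeout)
```

Clamping the stop budget to the restart budget keeps a misconfigured stop value from making stop slower than restart.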

Context

Debug report: https://paste.rs/zHNKJ (auto-deletes)
Stop path: hermes_cli/gateway.py::systemd_stop() → _run_systemctl(..., timeout=90)
Drain path: gateway/run.py::GatewayRunner.stop() → _drain_active_agents(timeout) → _safe_adapter_disconnect(...)
Related PRs: #14130 (bounded adapter disconnect), #11325 (extended TimeoutStopSec), #9128 (clean-shutdown marker), #19936 (planned-stop marker)

extent analysis

TL;DR

Bound each adapter disconnect with a short per-adapter timeout to prevent TimeoutExpired errors during gateway shutdown.

Guidance

Audit the Feishu, Lark, Weixin, and iLink paths in _safe_adapter_disconnect() and cap their socket-close at 3-5s each to prevent shutdown delays.
Catch TimeoutExpired in systemd_stop() and surface a friendly message instead of a raw traceback to improve user experience.
Consider shortening the default agent.restart_drain_timeout for the hermes gateway stop path to reduce shutdown time.

Example

The issue itself provides no code snippet: the fix is a change to the existing _safe_adapter_disconnect() function rather than a new component.

Notes

The provided solution focuses on bounding adapter disconnect timeouts and handling TimeoutExpired errors. Additional modifications, such as shortening the default agent.restart_drain_timeout, may be considered to further improve shutdown performance.

Recommendation

Apply the workaround by bounding each adapter disconnect with a short per-adapter timeout, as this directly addresses the root cause of the TimeoutExpired errors during gateway shutdown.
