hermes - ✅ (Solved) Fix QQBot WebSocket _open_ws() hangs indefinitely on stale CLOSE-WAIT connections, freezing entire gateway [2 pull requests, 1 participant]

GitHub stats
NousResearch/hermes-agent#18221 (fetched 2026-05-02 05:49:56)
Comments: 0 · Participants: 1 · Timeline events: 7 · Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×2, referenced ×1

The QQBot adapter's _open_ws() method can permanently block when cleaning up a stale WebSocket connection in CLOSE-WAIT state. This freezes the entire gateway process — it remains alive but stops processing events, writing logs, or responding to health checks.

This is distinct from but related to #17703, #14539, and #15490.

Environment

hermes-agent version: v0.11.0 (commit 454d883e)

Python: 3.12

OS: Ubuntu 22.04 (WSL2)

aiohttp: 3.11.x

Symptoms

Process state: S (sleeping) (not dead, not spinning)

TCP connections stuck in CLOSE-WAIT:

CLOSE-WAIT 25 0 172.24.40.238:35638 101.91.34.226:443
CLOSE-WAIT 32 0 172.24.40.238:47022 101.91.19.174:443

Logs stop updating entirely

Occurs reliably after ~10 minutes of stable operation

Process does NOT exit; must be killed manually
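The CLOSE-WAIT symptom can be checked mechanically from a socket listing. The following is a minimal sketch, assuming one connection per line in the shape shown in the symptom output (state, two queue counters, local address, peer address). The function name close_wait_peers is hypothetical, and the ESTAB line in the sample is made up for contrast.

```python
def close_wait_peers(listing: str) -> list[str]:
    """Return the peer address of every CLOSE-WAIT line in a socket listing."""
    peers = []
    for line in listing.splitlines():
        fields = line.split()
        # Assumed shape: STATE RECV-Q SEND-Q LOCAL:PORT PEER:PORT
        if len(fields) >= 5 and fields[0] == "CLOSE-WAIT":
            peers.append(fields[4])
    return peers


# The two CLOSE-WAIT lines come from the report; the ESTAB line is illustrative only.
sample = """\
CLOSE-WAIT 25 0 172.24.40.238:35638 101.91.34.226:443
CLOSE-WAIT 32 0 172.24.40.238:47022 101.91.19.174:443
ESTAB 0 0 172.24.40.238:51000 101.91.34.226:443
"""
print(close_wait_peers(sample))
```

A non-empty result that persists across several checks would match the stuck state described here.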

Root Cause

In _open_ws() (gateway/platforms/qqbot/adapter.py: 387-393):

async def _open_ws(self, gateway_url: str) -> None:
    if self._ws and not self._ws.closed:
        await self._ws.close()  # ← BLOCKS HERE
        self._ws = None
    if self._session and not self._session.closed:
        await self._session.close()  # ← ALSO BLOCKS
        self._session = None

When the QQ server sends a FIN (normal connection teardown), the old WebSocket enters CLOSE-WAIT state. At this point:

self._ws.closed returns False (close handshake incomplete)

await self._ws.close() attempts a graceful close

But _read_events() has already exited, so no one consumes the close frame from the read buffer

close() waits indefinitely for the close handshake to complete

The asyncio event loop is blocked; gateway freezes
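The sequence above can be illustrated without a network connection. This is a sketch under stated assumptions, not the adapter's code: StaleWebSocket is a hypothetical stand-in whose close() never returns, mimicking a close handshake that never completes, and close_is_stuck is a hypothetical probe that reports whether the close is still pending after a short wait.

```python
import asyncio


class StaleWebSocket:
    """Hypothetical stand-in for a WebSocket whose close handshake never completes."""

    closed = False

    async def close(self) -> None:
        # Wait on an event that nobody will ever set.
        await asyncio.Event().wait()


async def close_is_stuck(ws: StaleWebSocket, probe_seconds: float = 0.05) -> bool:
    """Return True if ws.close() is still pending after probe_seconds."""
    task = asyncio.ensure_future(ws.close())
    done, _pending = await asyncio.wait({task}, timeout=probe_seconds)
    if done:
        return False
    task.cancel()  # abandon the hung close so the demo can exit
    try:
        await task
    except asyncio.CancelledError:
        pass
    return True


print(asyncio.run(close_is_stuck(StaleWebSocket())))
```

A caller that awaits close() directly, with no timeout, would stay suspended at that line in the same way.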

Reproduction

Start hermes gateway run with QQBot enabled

Wait for QQ server to send a connection-level FIN (~10 minutes in our environment)

Observe process enters S state with CLOSE-WAIT connections

Suggested Fix

Option A: Add timeout to close operations

async def _open_ws(self, gateway_url: str) -> None:
    if self._ws:
        try:
            await asyncio.wait_for(self._ws.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass  # Force abandon
        self._ws = None
    if self._session:
        try:
            await asyncio.wait_for(self._session.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass
        self._session = None

Option B: Force close without handshake for dead connections

if self._ws:
    self._ws._closing = True
    self._ws = None

Related Issues

#17703 — QQBot stops reconnecting after failed reconnect leaves websocket closed

#14539 — QQ Bot adapter silently stops reconnecting without notifying gateway

#15490 — qqbot adapter silently dies on network outage during reconnect

PR #18172 — fix(qqbot): add gateway URL cache, retry, and rate-limit handling (addresses reconnect storms but not this cleanup hang)

Impact

In our production environment, this caused 50+ freeze-restart cycles within 10 hours. The issue was mitigated by an external watchdog that kills and restarts the gateway process.

This issue was created with the assistance of AI analysis. The root cause, diagnostic steps, and suggested fixes were identified through automated log analysis and code review.

I was troubled by repeated restarts, so I had my agent write an external monitoring script to mitigate and help diagnose this issue.

Fix Action

Fixed

PR fix notes

PR #18237: fix(qqbot): bound stale websocket cleanup

Description (problem / solution / changelog)

Summary

  • Bound QQBot WebSocket/session close operations with a timeout before reconnecting or disconnecting
  • Clear stale WebSocket resource references before awaiting graceful close so a hung close cannot freeze the adapter
  • Add a regression test for a WebSocket whose close() never completes

Root cause

QQAdapter._open_ws() and _cleanup() awaited self._ws.close() / self._session.close() directly. When a stale QQ WebSocket sits in an incomplete close handshake, graceful close can hang indefinitely and block the reconnect path.

Fix

Centralize WebSocket resource cleanup in _close_ws_resources() and wrap graceful close() calls with asyncio.wait_for(..., timeout=WS_CLOSE_TIMEOUT_SECONDS). If close times out, the adapter logs and abandons the stale resource instead of blocking the gateway.
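The PR diff is not reproduced on this page, so the following is only a sketch of what a centralized, bounded cleanup along these lines might look like. AdapterSketch and HungResource are hypothetical, the 5.0 second value for WS_CLOSE_TIMEOUT_SECONDS is an assumption, and the close_timeout parameter exists only so the demo finishes quickly.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

WS_CLOSE_TIMEOUT_SECONDS = 5.0  # assumed value; the PR's actual constant is not shown


class AdapterSketch:
    """Hypothetical stand-in for the adapter; only the cleanup path is modelled."""

    def __init__(self, ws=None, session=None, close_timeout=WS_CLOSE_TIMEOUT_SECONDS):
        self._ws = ws
        self._session = session
        self._close_timeout = close_timeout

    async def _close_ws_resources(self) -> None:
        # Detach the references first so a hung close cannot leave stale state behind.
        ws, self._ws = self._ws, None
        session, self._session = self._session, None
        for name, resource in (("websocket", ws), ("session", session)):
            if resource is None or resource.closed:
                continue
            try:
                await asyncio.wait_for(resource.close(), timeout=self._close_timeout)
            except asyncio.TimeoutError:
                logger.warning("Timed out closing stale %s; abandoning it", name)


class HungResource:
    """Hypothetical resource whose close() never completes."""

    closed = False

    async def close(self) -> None:
        await asyncio.Event().wait()


async def demo() -> bool:
    adapter = AdapterSketch(ws=HungResource(), session=HungResource(), close_timeout=0.05)
    await adapter._close_ws_resources()
    return adapter._ws is None and adapter._session is None


print(asyncio.run(demo()))
```

Clearing the references before awaiting means the adapter state is already clean even if the close is abandoned.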

Regression coverage

  • tests/gateway/test_qqbot.py::TestQQWebSocketCleanup::test_cleanup_abandons_hung_websocket_close
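The test body is not shown here. A regression test for this failure mode could be shaped roughly as below: a fake WebSocket whose close() never returns, and assertions that cleanup attempts the close, gives up, and returns promptly. NeverClosingWebSocket and bounded_close are hypothetical stand-ins, not the project's fixtures or the code under test.

```python
import asyncio
import time


class NeverClosingWebSocket:
    """Fake WebSocket: close() records the call, then never returns."""

    def __init__(self) -> None:
        self.closed = False
        self.close_calls = 0

    async def close(self) -> None:
        self.close_calls += 1
        await asyncio.Event().wait()


async def bounded_close(ws, timeout: float) -> bool:
    """Hypothetical code under test: True if close finished, False if abandoned."""
    try:
        await asyncio.wait_for(ws.close(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


def test_cleanup_abandons_hung_close() -> None:
    ws = NeverClosingWebSocket()
    started = time.monotonic()
    finished = asyncio.run(bounded_close(ws, timeout=0.05))
    elapsed = time.monotonic() - started
    assert finished is False    # the hung close was abandoned
    assert ws.close_calls == 1  # a graceful close was still attempted
    assert elapsed < 2.0        # cleanup returned promptly


test_cleanup_abandons_hung_close()
```

Using a short timeout in the test keeps the suite fast while still exercising the timeout branch.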

Testing

  • scripts/run_tests.sh tests/gateway/test_qqbot.py::TestQQWebSocketCleanup::test_cleanup_abandons_hung_websocket_close -q --tb=short
  • scripts/run_tests.sh tests/gateway/test_qqbot.py -q --tb=short

Closes #18221

Changed files

  • gateway/platforms/qqbot/adapter.py (modified, +26/-12)
  • gateway/platforms/qqbot/constants.py (modified, +1/-0)
  • tests/gateway/test_qqbot.py (modified, +28/-0)

PR #17246: fix: resolve 7 identified issues [automated]

Description (problem / solution / changelog)

Summary

This automated PR resolves 7 identified open issues with focus on bugs, cross-platform reliability, and operational hardening.

Fixed issues

  1. #18594 — get_hermes_home() now fails fast in profile-scoped subprocesses when HERMES_HOME is missing (prevents silent cross-profile writes).
  2. #18588 — context compression now retries on the main model when summary_model_override is unset and summary model path fails.
  3. #18586 — delegate_task now passes target_model into runtime provider resolution, fixing wrong api_mode/base_url for providers like opencode-go.
  4. #18187 — Discord adapter now closes any existing bot client before reconnecting, preventing duplicate websocket consumers and double responses.
  5. #18221 — QQBot _open_ws() now bounds stale websocket/session close operations with timeouts to avoid reconnect hangs.
  6. #18437 — Weixin direct send now avoids reusing live adapters across event loops; falls back safely to one-shot adapter/session path.
  7. #18485 — Slack channel directory now warns once per team and downgrades repeated failures to debug, reducing recurring gateway log noise.

Files changed

  • hermes_constants.py
  • agent/context_compressor.py
  • tools/delegate_tool.py
  • gateway/platforms/discord.py
  • gateway/platforms/qqbot/adapter.py
  • gateway/platforms/weixin.py
  • gateway/channel_directory.py

Notes

  • Commits were created with descriptive English messages.
  • Push was performed at the end after all fixes were committed.

Changed files

  • Dockerfile (modified, +2/-1)
  • acp_adapter/session.py (modified, +12/-0)
  • agent/auxiliary_client.py (modified, +280/-28)
  • agent/context_compressor.py (modified, +496/-52)
  • agent/title_generator.py (modified, +2/-2)
  • agent/transports/chat_completions.py (modified, +14/-0)
  • agent/usage_pricing.py (modified, +4/-0)
  • cli-config.yaml.example (modified, +5/-0)
  • cli.py (modified, +27/-3)
  • cron/scheduler.py (modified, +8/-2)
  • docker/entrypoint.sh (modified, +5/-1)
  • gateway/channel_directory.py (modified, +14/-4)
  • gateway/platforms/discord.py (modified, +33/-7)
  • gateway/platforms/email.py (modified, +12/-2)
  • gateway/platforms/feishu.py (modified, +34/-1)
  • gateway/platforms/qqbot/adapter.py (modified, +8/-2)
  • gateway/platforms/telegram_network.py (modified, +7/-2)
  • gateway/platforms/weixin.py (modified, +10/-1)
  • gateway/run.py (modified, +99/-32)
  • gateway/status.py (modified, +8/-1)
  • hermes_cli/auth.py (modified, +1/-1)
  • hermes_cli/commands.py (modified, +1/-1)
  • hermes_cli/config.py (modified, +271/-40)
  • hermes_cli/copilot_auth.py (modified, +1/-1)
  • hermes_cli/gateway.py (modified, +16/-13)
  • hermes_cli/main.py (modified, +69/-3)
  • hermes_cli/memory_setup.py (modified, +1/-1)
  • hermes_cli/model_switch.py (modified, +6/-1)
  • hermes_cli/models.py (modified, +59/-1)
  • hermes_cli/profiles.py (modified, +16/-3)
  • hermes_cli/runtime_provider.py (modified, +16/-13)
  • hermes_cli/setup.py (modified, +8/-2)
  • hermes_cli/slack_cli.py (modified, +1/-2)
  • hermes_cli/status.py (modified, +17/-2)
  • hermes_cli/web_server.py (modified, +1/-1)
  • hermes_constants.py (modified, +16/-3)
  • model_tools.py (modified, +44/-13)
  • run_agent.py (modified, +389/-82)
  • setup-hermes.sh (modified, +23/-12)
  • skills/red-teaming/godmode/scripts/load_godmode.py (modified, +9/-8)
  • tests/agent/test_context_compressor.py (modified, +389/-0)
  • tests/gateway/test_compress_command.py (modified, +49/-0)
  • tests/run_agent/test_413_compression.py (modified, +81/-1)
  • tests/run_agent/test_compression_boundary_hook.py (modified, +42/-0)
  • tests/run_agent/test_run_agent.py (modified, +100/-13)
  • tests/tools/test_skill_manager_tool.py (modified, +270/-0)
  • tools/approval.py (modified, +1/-1)
  • tools/delegate_tool.py (modified, +4/-1)
  • tools/environments/docker.py (modified, +36/-5)
  • tools/environments/local.py (modified, +7/-1)
  • tools/file_operations.py (modified, +70/-67)
  • tools/file_tools.py (modified, +4/-1)
  • tools/send_message_tool.py (modified, +66/-2)
  • tools/session_search_tool.py (modified, +2/-2)
  • tools/skill_manager_tool.py (modified, +82/-21)
  • tools/skills_tool.py (modified, +13/-1)
  • tools/terminal_tool.py (modified, +6/-0)
  • tools/tool_backend_helpers.py (modified, +15/-5)
  • tools/tts_tool.py (modified, +27/-16)
  • tools/voice_mode.py (modified, +23/-10)
  • tui_gateway/server.py (modified, +5/-3)
  • ui-tui/src/app/turnController.ts (modified, +1/-1)
  • ui-tui/src/app/useInputHandlers.ts (modified, +8/-3)
  • ui-tui/src/app/useSessionLifecycle.ts (modified, +1/-1)
  • ui-tui/src/gatewayTypes.ts (modified, +1/-0)
  • utils.py (modified, +9/-0)
  • uv.lock (modified, +161/-2)

extent analysis

TL;DR

Implement a timeout for the WebSocket close operation to prevent the gateway process from freezing.

Guidance

  • Wrap the close() calls in _open_ws() with a timeout so they cannot block indefinitely, as in suggested fix Option A.
  • Consider forcing a close without the handshake for dead connections, as in Option B, but note that it sets a private attribute (_closing) and drops the reference without explicitly closing the underlying connection.
  • Verify that the fix works by checking the process state and logs after applying the changes.
  • Monitor the gateway process for any further issues related to WebSocket connections and closing handshakes.

Example

async def _open_ws(self, gateway_url: str) -> None:
    if self._ws:
        try:
            await asyncio.wait_for(self._ws.close(), timeout=5.0)
        except asyncio.TimeoutError:
            pass  # Force abandon
        self._ws = None

Notes

The suggested fixes assume that the issue is caused by the WebSocket close operation blocking indefinitely. However, the root cause might be more complex, and additional debugging might be necessary.

Recommendation

Apply workaround: Implement a timeout for the WebSocket close operation, as shown in Option A, to prevent the gateway process from freezing. This approach is more conservative and less likely to introduce new issues compared to forcing a close without handshake.
