hermes - ✅(Solved) Fix qqbot adapter silently dies on network outage during reconnect; gateway has no task watchdog [1 pull request, 2 comments, 3 participants]

GitHub stats

NousResearch/hermes-agent#15490
Fetched 2026-04-26 05:27:06

  • Comments: 2
  • Participants: 3
  • Timeline: 7
  • Reactions: 0
  • Timeline (top): labeled ×4, commented ×2, cross-referenced ×1

When the host's network briefly goes down, the QQ bot platform adapter silently dies during a reconnect attempt. The Gateway parent task does not detect the failure or restart the adapter, so QQ stays offline indefinitely until the container is manually restarted. Telegram, in the same container subjected to the same network event, recovers automatically.

Error Message

2026-04-24 23:50:19 WARNING [QQBot:xxx] WebSocket error: WebSocket closed
2026-04-24 23:50:21 INFO    [QQBot:xxx] Reconnected
2026-04-24 23:50:21 INFO    [QQBot:xxx] Session resumed
2026-04-24 23:51:21 WARNING [QQBot:xxx] WebSocket error: WebSocket closed   # exact 60s cycle
2026-04-24 23:51:24 INFO    [QQBot:xxx] Reconnected
... [3 more cycles, all succeeding] ...
2026-04-24 23:54:31 INFO    [QQBot:xxx] Session resumed (seq=232)           # last qqbot log
                                                                             # ~1h 6m of zero qqbot activity at all
2026-04-25 01:00:30 WARNING [Telegram] network error, scheduling reconnect: httpx.ConnectError
... [Telegram retry loop runs to completion and recovers] ...
2026-04-25 01:57:29 INFO    [Telegram] Connected to Telegram (polling mode)
                                                                             # qqbot still silent — no reconnect attempted

Fix Action

Fixed

PR fix notes

PR #15510: fix: notify gateway when QQ bot adapter exhausts reconnect attempts

Description (problem / solution / changelog)

Fixes #15490

Problem

When the host network goes down, the QQ bot adapter silently dies during WebSocket reconnect. The gateway never learns the adapter failed and cannot restart it.

Root Cause

_listen_loop has 5 exit paths that return without calling _notify_fatal_error(), so the gateway is never told that the adapter has stopped. The Telegram adapter, by contrast, calls both _set_fatal_error() and _notify_fatal_error() on all of its exit paths.

Fix

All 5 exit paths now call _set_fatal_error() + _notify_fatal_error() before returning, matching the Telegram adapter pattern.
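The pattern described in this fix can be sketched roughly as follows. This is an illustration only, not the code from PR #15510: the class `SketchAdapter`, the signatures given to `_set_fatal_error` and `_notify_fatal_error`, the `on_fatal` callback, and the `max_attempts` parameter are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class SketchAdapter:
    """Hypothetical adapter; names and signatures are assumptions, not the Hermes API."""

    def __init__(self,
                 connect: Callable[[], Awaitable[None]],
                 on_fatal: Callable[[str], Awaitable[None]],
                 max_attempts: int = 3) -> None:
        self._connect = connect
        self._on_fatal = on_fatal
        self._max_attempts = max_attempts
        self.fatal_error: Optional[str] = None

    def _set_fatal_error(self, message: str) -> None:
        # Record the reason so that a status command could report it later.
        self.fatal_error = message

    async def _notify_fatal_error(self) -> None:
        # Tell the parent (the gateway, in the issue) that this adapter has stopped.
        await self._on_fatal(self.fatal_error or "unknown error")

    async def _fail(self, message: str) -> None:
        # One helper for every exit path, so neither call can be forgotten.
        self._set_fatal_error(message)
        await self._notify_fatal_error()

    async def listen_loop(self) -> None:
        last = "no attempts made"
        for attempt in range(1, self._max_attempts + 1):
            try:
                await self._connect()
                return  # a real loop would go on to read frames here
            except OSError as exc:
                last = f"attempt {attempt}: {exc!r}"
                await asyncio.sleep(0)  # stand-in for a real backoff delay
        await self._fail(f"reconnect attempts exhausted ({last})")


async def demo() -> List[str]:
    notified: List[str] = []

    async def always_down() -> None:
        raise ConnectionRefusedError("network unreachable")

    async def on_fatal(message: str) -> None:
        notified.append(message)

    adapter = SketchAdapter(always_down, on_fatal)
    await adapter.listen_loop()
    return notified
```

Running `asyncio.run(demo())` returns a single notification once the attempts are exhausted. Routing every exit through one helper such as `_fail` is a design choice of this sketch; the PR description only states that each exit path now makes both calls.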

Files Changed

  • gateway/platforms/qqbot/adapter.py — 20 lines added

Changed files

  • gateway/platforms/qqbot/adapter.py (modified, +20/-0)
  • plugins/memory/hindsight/__init__.py (modified, +7/-0)


Summary

When the host's network briefly goes down, the QQ bot platform adapter silently dies during a reconnect attempt. The Gateway parent task does not detect the failure or restart the adapter, so QQ stays offline indefinitely until the container is manually restarted. Telegram, in the same container subjected to the same network event, recovers automatically.

Environment

  • Hermes Agent v0.11.0 (nousresearch/hermes-agent:latest, image sha256 550ae16a17b3)
  • Docker on a NAS (China region), behind a clash HTTP/HTTPS proxy at http://<local-clash-proxy> set via HTTP_PROXY / HTTPS_PROXY env vars
  • Platforms enabled: telegram + qqbot (both reach their endpoints through the same proxy)

What happened (production observation)

  1. Host network started degrading; clash proxy began dropping idle WebSocket connections.
  2. QQ bot adapter lost its WS to wss://api.sgroup.qq.com/websocket every ~60 s. Each cycle the adapter logged WebSocket error: WebSocket closed, reconnected, sent Resume, and succeeded.
  3. After ~5 such cycles, the host network dropped fully for a short window.
  4. The 6th reconnect attempt triggered an exception at the httpx/httpcore layer (TCP/TLS handshake through proxy), which appears not to be caught by the qqbot adapter's reconnect coroutine.
  5. The qqbot task quietly exited — no traceback in agent.log, no ERROR entry, no further qqbot log lines for over an hour.
  6. Meanwhile Telegram experienced the same network event but its retry loop survived and reconnected automatically once network was back.
  7. hermes gateway status continued to report Gateway is running (PID alive) and Telegram kept serving. QQ remained permanently offline until docker restart.
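Step 5 is consistent with how asyncio handles fire-and-forget tasks: an exception raised inside a task is stored on the task object and only surfaces when something awaits the task, inspects it, or the task object is garbage-collected. The sketch below assumes the adapter runs as an `asyncio.Task`, which the issue does not confirm; `flaky_adapter`, `report_exit`, and `observed` are hypothetical names used only for the demonstration.

```python
import asyncio
import logging

logger = logging.getLogger("sketch.gateway")
observed = []  # what the done-callback saw; exists only for this demo


async def flaky_adapter() -> None:
    # Stand-in for an adapter whose reconnect raises an uncaught transport error.
    await asyncio.sleep(0)
    raise OSError("TLS handshake through proxy failed")


def report_exit(task: asyncio.Task) -> None:
    # Called when the task finishes for any reason; makes the failure visible.
    if task.cancelled():
        return
    exc = task.exception()
    if exc is not None:
        logger.error("platform task %s died", task.get_name(), exc_info=exc)
        observed.append((task.get_name(), type(exc).__name__))


async def main() -> None:
    task = asyncio.create_task(flaky_adapter(), name="qqbot")
    task.add_done_callback(report_exit)
    # The parent keeps running. Without the callback, the exception would stay
    # stored on the task for as long as the parent holds a reference to it.
    await asyncio.sleep(0.05)
```

After `asyncio.run(main())`, `observed` holds one entry naming the `qqbot` task and the `OSError`, and the traceback has been logged at ERROR level.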

Excerpted log (timestamps UTC)

2026-04-24 23:50:19 WARNING [QQBot:xxx] WebSocket error: WebSocket closed
2026-04-24 23:50:21 INFO    [QQBot:xxx] Reconnected
2026-04-24 23:50:21 INFO    [QQBot:xxx] Session resumed
2026-04-24 23:51:21 WARNING [QQBot:xxx] WebSocket error: WebSocket closed   # exact 60s cycle
2026-04-24 23:51:24 INFO    [QQBot:xxx] Reconnected
... [3 more cycles, all succeeding] ...
2026-04-24 23:54:31 INFO    [QQBot:xxx] Session resumed (seq=232)           # last qqbot log
                                                                             # ~1h 6m of zero qqbot activity at all
2026-04-25 01:00:30 WARNING [Telegram] network error, scheduling reconnect: httpx.ConnectError
... [Telegram retry loop runs to completion and recovers] ...
2026-04-25 01:57:29 INFO    [Telegram] Connected to Telegram (polling mode)
                                                                             # qqbot still silent — no reconnect attempted

Expected behavior

The Gateway should do at least one of the following:

  • Wrap each platform adapter's main loop in a supervisor that restarts the adapter on an unhandled exception (or at minimum logs the traceback at ERROR level so silent death is visible).
  • Have the qqbot adapter's reconnect coroutine catch transport-layer exceptions (httpx.ConnectError, httpcore.ConnectError, OSError, TLS handshake failures, proxy CONNECT failures) the same way it currently handles WebSocket closed.

Suspected fix locations

  • gateway/platforms/qqbot/adapter.py — broaden the except around the reconnect / Resume path to include httpx.ConnectError, httpcore.ConnectError, OSError, ssl.SSLError, etc.
  • gateway/run.py — add a per-platform task supervisor that restarts a dead adapter task, or at least emits a high-severity log + alert when a platform task exits unexpectedly.
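A per-platform supervisor of the kind suggested for gateway/run.py could look something like the sketch below. It is a hedged illustration, not Hermes code: `supervise`, `run_adapter`, and the restart and delay parameters are hypothetical, and the policy of treating a clean return as unexpected is an assumption.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("sketch.supervisor")


async def supervise(name: str,
                    run_adapter: Callable[[], Awaitable[None]],
                    max_restarts: int = 5,
                    base_delay: float = 1.0,
                    max_delay: float = 60.0) -> int:
    """Run `run_adapter`, restarting it whenever it raises or returns.

    Returns the number of restarts performed before giving up.
    """
    restarts = 0
    while True:
        try:
            await run_adapter()
            logger.error("%s exited without an error; treating as unexpected", name)
        except asyncio.CancelledError:
            raise  # shutdown was requested: never swallow cancellation
        except Exception:
            # logger.exception records the full traceback at ERROR level.
            logger.exception("%s crashed", name)
        if restarts >= max_restarts:
            logger.critical("%s gave up after %d restarts", name, restarts)
            return restarts
        # Exponential backoff, capped at max_delay seconds.
        delay = min(max_delay, base_delay * 2 ** restarts)
        restarts += 1
        await asyncio.sleep(delay)
```

A gateway might start one such coroutine per platform, for example `asyncio.create_task(supervise("qqbot", adapter_main))`, where `adapter_main` is a placeholder for the adapter's entry point. Capping the restart count keeps a permanently broken adapter from looping forever, at the cost of needing an alert when the cap is reached.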

The 60-second WebSocket cycle is likely a clash idle-connection timeout (client-side proxy issue, not your bug), but the silent death after that is the actual bug — a healthy adapter should not be killable by a transient network event.

Happy to provide more logs / the full silent-death window if useful.

extent analysis

TL;DR

The qqbot adapter's reconnect coroutine should be modified to catch transport-layer exceptions to prevent silent death during network reconnect attempts.

Guidance

  • Review the gateway/platforms/qqbot/adapter.py file to broaden the except block around the reconnect/Resume path to include exceptions like httpx.ConnectError, httpcore.ConnectError, OSError, and ssl.SSLError.
  • Consider adding a per-platform task supervisor in gateway/run.py to restart a dead adapter task or emit a high-severity log and alert when a platform task exits unexpectedly.
  • Verify that the qqbot adapter's reconnect coroutine is properly handling WebSocket closed errors and other transport-layer exceptions.
  • Test the modified code with simulated network failures to ensure the qqbot adapter can recover automatically.
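One way to exercise the last point without touching a real network is to inject a fake connection function that fails a fixed number of times. The sketch below uses only the standard library; `reconnect_with_retry`, `FlakyNetwork`, and the `TRANSIENT` tuple are hypothetical. In the real adapter the tuple would presumably also need httpx.ConnectError and httpcore.ConnectError, which are not OSError subclasses.

```python
import asyncio
import ssl
from typing import Awaitable, Callable, Tuple, Type

# Exceptions treated as transient. ssl.SSLError and ConnectionError are already
# OSError subclasses; they are listed separately only for readability.
TRANSIENT: Tuple[Type[BaseException], ...] = (
    OSError, ssl.SSLError, ConnectionError, asyncio.TimeoutError,
)


async def reconnect_with_retry(connect: Callable[[], Awaitable[None]],
                               transient: Tuple[Type[BaseException], ...] = TRANSIENT,
                               max_attempts: int = 10,
                               base_delay: float = 0.5,
                               max_delay: float = 30.0) -> int:
    """Call `connect` until it succeeds; return the attempt number that worked.

    Re-raises the last transient error once `max_attempts` is exhausted, so the
    caller can report a fatal error instead of exiting silently.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await connect()
            return attempt
        except transient:
            if attempt == max_attempts:
                raise
            # Exponential backoff, capped at max_delay seconds.
            await asyncio.sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")


class FlakyNetwork:
    """Simulated transport that fails a fixed number of times, then connects."""

    def __init__(self, failures: int) -> None:
        self.remaining = failures

    async def connect(self) -> None:
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionResetError("simulated outage")
```

With `FlakyNetwork(3)` and `base_delay=0.0`, the call succeeds on the fourth attempt. With more failures than `max_attempts`, the last `ConnectionResetError` propagates to the caller.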

Example

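import logging
import ssl
import httpcore
import httpx

async def try_reconnect(reconnect):
    try:
        await reconnect()  # placeholder for the adapter's reconnect and Resume logic
    except (httpx.ConnectError, httpcore.ConnectError, OSError, ssl.SSLError) as e:
        # Log the failure so the caller can retry or restart the adapter.
        logging.error("Reconnect failed: %s", e)
        return False
    return True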
Notes

The provided log excerpt and issue description suggest that the qqbot adapter's reconnect coroutine is not catching transport-layer exceptions, leading to silent death during network reconnect attempts. The proposed fix locations and example code snippet aim to address this issue. However, additional testing and verification are necessary to ensure the modified code works as expected.

Recommendation

Apply the workaround by modifying the gateway/platforms/qqbot/adapter.py file to catch transport-layer exceptions and retry or restart the adapter as needed. This should prevent the qqbot adapter from silently dying during network reconnect attempts.
