hermes: [Bug] QQBot: _listen_loop exits after MAX_RECONNECT_ATTEMPTS and never restarts; send() waits 15s for reconnection that cannot succeed [1 pull request]


Error Message

logger.error("[%s] Max reconnect attempts reached (QQCloseError)", self._log_tag)
return SendResult(success=False, error="Not connected", retryable=True)

Fix Action

Fixed

Summary

Two related bugs in the QQBot adapter cause the bot to silently die and become unresponsive after network outages, requiring manual Hermes restart to recover.


Bug 1: _listen_loop exits permanently after MAX_RECONNECT_ATTEMPTS (100 attempts)

File: gateway/platforms/qqbot/adapter.py

Location: Lines 567-569 and 578-580

if backoff_idx >= MAX_RECONNECT_ATTEMPTS:
    logger.error("[%s] Max reconnect attempts reached (QQCloseError)", self._log_tag)
    return   # <-- _running stays True, but loop never restarts

What happens:

  • When reconnection fails repeatedly (e.g., during a persistent network outage), backoff_idx keeps incrementing until it reaches MAX_RECONNECT_ATTEMPTS = 100
  • _listen_loop() exits via return, but self._running remains True
  • No code ever restarts the listen task — _listen_loop is a simple while self._running: loop, not a supervised task that gets respawned
  • The bot is now fully dead: no sending, no receiving, but the process is still running

Expected behavior: Even after 100 consecutive failures, _listen_loop should continue retrying (perhaps with a longer backoff or a warning). The bot should never silently die while Hermes is still running.
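The expected behavior above can be sketched roughly as follows. This is a minimal illustration, not the adapter's code: listen_loop, next_backoff, BACKOFF_SECONDS, and the connect / is_running / sleep parameters are hypothetical names introduced here, and the backoff schedule is invented. Only MAX_RECONNECT_ATTEMPTS = 100 comes from the issue. The idea is that hitting the limit logs a warning and restarts the backoff cycle instead of returning.

```python
import asyncio
import logging

logger = logging.getLogger("qqbot.sketch")

MAX_RECONNECT_ATTEMPTS = 100              # value quoted in the issue
BACKOFF_SECONDS = [1, 2, 5, 10, 30, 60]   # hypothetical schedule, not the adapter's


def next_backoff(attempt: int) -> float:
    """Delay for this attempt, capped at the last entry of the schedule."""
    return float(BACKOFF_SECONDS[min(attempt, len(BACKOFF_SECONDS) - 1)])


async def listen_loop(connect, is_running, sleep=asyncio.sleep,
                      max_attempts=MAX_RECONNECT_ATTEMPTS) -> int:
    """Retry connect() until it succeeds or is_running() becomes False.

    Returns the number of failed attempts. A real loop would go on to read
    events after connecting; this sketch returns so it can be tested.
    """
    attempt = 0
    failures = 0
    while is_running():
        try:
            await connect()
            return failures
        except ConnectionError:
            failures += 1
            attempt += 1
            if attempt >= max_attempts:
                # Instead of `return`: warn and start a new backoff cycle.
                logger.warning("%d reconnect failures so far; still retrying", failures)
                attempt = 0
            await sleep(next_backoff(attempt))
    return failures
```

With this shape the loop only ends when is_running() turns False or a connection succeeds, so the "still running but dead" state described above cannot arise from the attempt counter alone.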

Reference: OpenClaw's qqbot implementation uses a separate msgQueue + reconnect state machine that does not die in this way.


Bug 2: send() waits up to 15 seconds polling is_connected, which cannot become True after Bug 1

File: gateway/platforms/qqbot/adapter.py

Location: Lines 1899-1923 and 1939-1941

_RECONNECT_WAIT_SECONDS = 15.0

async def _wait_for_reconnection(self) -> bool:
    """... polls is_connected for up to _RECONNECT_WAIT_SECONDS ..."""
    waited = 0.0
    while waited < self._RECONNECT_WAIT_SECONDS:
        await asyncio.sleep(self._RECONNECT_POLL_INTERVAL)  # 0.5s
        waited += self._RECONNECT_POLL_INTERVAL
        if self.is_connected:
            return True
    return False

async def send(self, ...):
    if not self.is_connected:
        if not await self._wait_for_reconnection():  # Waits up to 15s
            return SendResult(success=False, error="Not connected", retryable=True)

What happens:

  • When send() is called while the WebSocket is disconnected, it waits up to 15 seconds polling is_connected
  • is_connected only becomes True when _listen_loop calls _mark_connected() (line 604 after successful reconnect)
  • After Bug 1, _listen_loop has exited — it will never call _mark_connected() again
  • The 15-second wait always fails, causing send() to return success=False
  • In the shutdown scenario (cron job at 23:50), the 15-second wait blocks the shutdown sequence before the machine powers off

Expected behavior: If is_connected is False and the reconnect is unlikely to succeed quickly (e.g., _running is True but _listen_loop has exited), send() should either: (a) fail fast and fall back to a REST-only path, or (b) use the standalone REST API (_send_c2c_text / _send_group_text) which does not depend on WebSocket.


Relationship to existing issues

  • Related to Issue #17703 (QQBot stops reconnecting after failed reconnect leaves websocket closed) — PR #17704 is open for a related but different root cause (_read_events() returning normally with a closed-but-present ws object)
  • Related to Issue #14539 (QQ Bot adapter silently stops reconnecting without notifying gateway)
  • Related to Issue #15490 (qqbot adapter silently dies on network outage during reconnect)
  • Related to Issue #11163 (Gateway silently drops messages when WebSocket platform adapter temporarily loses connection during reconnect cycle)

Environment

  • Hermes Agent version: current main branch
  • Platform: QQ Bot (qqbot adapter)
  • OS: Linux/WSL
  • Python 3.x with asyncio

Suggested fix direction

For Bug 1: After MAX_RECONNECT_ATTEMPTS is reached, instead of return, the loop should either:

  • Reset backoff_idx = 0 and continue (keep trying forever, maybe with a longer backoff cycle), OR
  • Exit the loop but set self._running = False and trigger a gateway-level alert that the adapter has died
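The second option (exit, clear _running, and alert) could look something like the sketch below. It assumes the listen loop runs as an asyncio.Task and uses the standard Task.add_done_callback hook. AdapterSketch, on_fatal, and _on_listen_done are hypothetical names for illustration; the issue does not say how the gateway should be notified.

```python
import asyncio
from typing import Callable, Optional


class AdapterSketch:
    """Hypothetical stand-in for the adapter; only the relevant parts."""

    def __init__(self, on_fatal: Callable[[str], None]) -> None:
        self._running = False
        self._listen_task: Optional[asyncio.Task] = None
        self._on_fatal = on_fatal  # hypothetical gateway alert hook

    async def _listen_loop(self) -> None:
        # Placeholder: behaves as if the reconnect budget were exhausted
        # immediately and the loop returned.
        return

    def start(self) -> None:
        """Must be called from inside a running event loop."""
        self._running = True
        loop = asyncio.get_running_loop()
        self._listen_task = loop.create_task(self._listen_loop())
        self._listen_task.add_done_callback(self._on_listen_done)

    def _on_listen_done(self, task: asyncio.Task) -> None:
        if task.cancelled():
            return  # treated here as a deliberate shutdown
        exc = task.exception()  # also marks the exception as retrieved
        if self._running:
            # The loop ended although nobody asked it to stop.
            self._running = False
            reason = repr(exc) if exc else "listen loop returned while running"
            self._on_fatal(reason)
```

The callback is the only place that needs to know the loop has died, so _running can no longer stay True after the task has finished.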

For Bug 2: send() should detect when _listen_loop has exited (e.g., _running is True but _listen_task is done) and fall back to a REST-only path immediately rather than polling is_connected for 15 seconds.
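A rough sketch of that detection follows, under stated assumptions: SenderSketch, _listener_is_dead, and _deliver_via_rest are hypothetical names, SendResult is a simplified stand-in for the type shown in the issue, and the REST fallback is a placeholder rather than the real _send_c2c_text / _send_group_text helpers. The point is only that send() checks whether the listen task is done before entering the 15-second poll.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class SendResult:
    """Simplified stand-in for the SendResult used in the issue."""
    success: bool
    error: Optional[str] = None
    retryable: bool = False


class SenderSketch:
    _RECONNECT_WAIT_SECONDS = 15.0
    _RECONNECT_POLL_INTERVAL = 0.5

    def __init__(self) -> None:
        self.is_connected = False
        self._running = True
        self._listen_task: Optional[asyncio.Task] = None

    def _listener_is_dead(self) -> bool:
        """True when no task exists that could ever mark us connected."""
        task = self._listen_task
        return task is None or task.done()

    async def _wait_for_reconnection(self) -> bool:
        waited = 0.0
        while waited < self._RECONNECT_WAIT_SECONDS:
            if self._listener_is_dead():
                return False  # polling further cannot succeed
            await asyncio.sleep(self._RECONNECT_POLL_INTERVAL)
            waited += self._RECONNECT_POLL_INTERVAL
            if self.is_connected:
                return True
        return False

    async def _deliver_via_rest(self, text: str) -> SendResult:
        # Hypothetical placeholder for a REST-only send path.
        return SendResult(success=True)

    async def send(self, text: str) -> SendResult:
        if not self.is_connected:
            if self._listener_is_dead():
                # Fail fast: skip the poll and go straight to REST.
                return await self._deliver_via_rest(text)
            if not await self._wait_for_reconnection():
                return SendResult(False, "Not connected", True)
        return await self._deliver_via_rest(text)
```

In the shutdown scenario described earlier, this would return almost immediately when the listener has already exited, rather than holding the shutdown sequence for the full wait.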
