hermes - [Bug]: QQBot adapter busy-loops after WebSocket reconnect failure, causing 100% CPU [1 pull request]


Error Message

I have not run hermes debug share yet because the issue was diagnosed after the original busy-loop process had already been restarted, and I want to avoid uploading potentially sensitive local config.

Relevant logs and runtime observations are included in the issue body above.

Failure point:

    2026-05-25 02:30:33,484 WARNING gateway.platforms.qqbot.adapter: [QQBot:1903695542] WebSocket error: WebSocket closed
    2026-05-25 02:30:33,488 INFO gateway.platforms.qqbot.adapter: [QQBot:1903695542] Reconnecting in 2s (attempt 1)...
    2026-05-25 02:30:35,522 WARNING gateway.platforms.qqbot.adapter: [QQBot:1903695542] Reconnect failed: Failed to get QQ Bot gateway URL:

High CPU observation:

    PID   PPID STAT %CPU %MEM ELAPSED       TIME       COMMAND
    1742  1    R    99.1 0.1  01-14:19:12   252:59.23 /Users/wangjiqing/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace

Runtime state file showed gateway_state: running while platforms.qqbot.state: disconnected.

Root Cause

gateway/platforms/qqbot/adapter.py::_read_events() returns normally when self._ws exists but is already closed. The outer _listen_loop() treats that normal return as a successful read cycle and immediately loops again. Because no exception is raised and no sleep/backoff is applied, a closed WebSocket can produce an infinite hot loop.

Fix Action

Fixed


Bug Description

Hermes Gateway can enter a sustained high-CPU busy loop after the QQBot WebSocket closes and a reconnect attempt fails.

Observed impact:

  • One Hermes Gateway Python process used about one full CPU core continuously.
  • The process was running as the macOS launchd service ai.hermes.gateway.
  • The gateway logs showed repeated QQBot WebSocket session timeouts, followed by one reconnect failure.
  • After the reconnect failure, the process continued running but appeared to spin in the asyncio loop rather than sleeping/backing off.

Suspected issue:

gateway/platforms/qqbot/adapter.py::_read_events() returns normally when self._ws exists but is already closed. The outer _listen_loop() treats that normal return as a successful read cycle and immediately loops again. Because no exception is raised and no sleep/backoff is applied, a closed WebSocket can produce an infinite hot loop.

I am not sure whether the final fix should be in _read_events(), _reconnect(), or _listen_loop(). I am reporting the observed behavior and suspected code path so maintainers familiar with the QQBot adapter / gateway reconnect state machine can confirm the right fix direction.

Steps to Reproduce

I do not have a minimal deterministic reproducer yet. This was observed in a long-running Hermes Gateway with the QQBot platform enabled.

Observed sequence:

  1. Run Hermes Gateway as the macOS launchd service ai.hermes.gateway.
  2. Keep the QQBot platform enabled and connected.
  3. QQBot WebSocket sessions periodically time out with code 4009 and normally reconnect.
  4. At one point, the WebSocket closed and the reconnect attempt failed.
  5. After that, no more QQBot reconnect/backoff logs appeared for several hours.
  6. The Hermes Gateway process kept running and consumed around 99%-100% CPU.

Failure point:

    2026-05-25 02:30:33,484 WARNING gateway.platforms.qqbot.adapter: [QQBot:1903695542] WebSocket error: WebSocket closed
    2026-05-25 02:30:33,488 INFO gateway.platforms.qqbot.adapter: [QQBot:1903695542] Reconnecting in 2s (attempt 1)...
    2026-05-25 02:30:35,522 WARNING gateway.platforms.qqbot.adapter: [QQBot:1903695542] Reconnect failed: Failed to get QQ Bot gateway URL:

Expected Behavior

When the QQBot WebSocket is closed or reconnect fails:

  • the listener should not spin;
  • it should raise a connection error and go through the existing backoff/reconnect path;
  • or it should mark the adapter disconnected/fatal and stop the listener;
  • or it should clear stale WebSocket state so the next pass cannot treat a closed socket as a normal completed read.
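The expected behavior above can be sketched as a listener that treats any end of the read loop, including a normal return, as a lost connection and sleeps before retrying. This is only an illustration: `listen_with_backoff`, `ConnectionLost`, and all parameters are hypothetical names, the delay values are assumptions (only the initial 2s delay appears in the logs), and the real `_listen_loop()` may be structured differently.

```python
import asyncio


class ConnectionLost(Exception):
    """Hypothetical error type meaning the read loop ended for any reason."""


async def listen_with_backoff(read_events, reconnect, *, base_delay=2.0,
                              max_delay=60.0, max_attempts=6,
                              sleep=asyncio.sleep):
    """Retry with exponential backoff; give up after max_attempts failures.

    Returns the list of delays that were slept, so callers can inspect it.
    """
    delays = []
    attempt = 0
    while attempt < max_attempts:
        try:
            await read_events()
            # A normal return still means the socket is gone, so it is
            # routed through the same backoff path as an exception.
            raise ConnectionLost("read loop ended")
        except ConnectionLost:
            delay = min(base_delay * (2 ** attempt), max_delay)
            delays.append(delay)
            await sleep(delay)  # never re-enter the read loop without sleeping
            attempt += 1
            if await reconnect():
                attempt = 0  # healthy again: reset the backoff
    # Out of attempts: stop the listener instead of spinning.
    return delays
```

The design choice here is that the loop has exactly one path back to `read_events()`, and that path always passes through a sleep, so a closed socket cannot turn into a hot loop.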

Actual Behavior

When self._ws is already closed before entering _read_events(), _read_events() appears to return normally.

The outer _listen_loop() then immediately re-enters it without sleep or backoff, causing a CPU busy loop.

Observed symptoms:

  • sustained 99%-100% CPU usage from one Hermes Gateway Python process;
  • no repeated reconnect logs;
  • no further QQBot connected/failed logs;
  • only memory monitor logs continued every 5 minutes;
  • macOS sample showed active Python/asyncio execution instead of blocking socket I/O.

Affected Component

Gateway

Messaging Platform (if gateway-related)

QQBot

Debug Report

I have not run `hermes debug share` yet because the issue was diagnosed after the original busy-loop process had already been restarted, and I want to avoid uploading potentially sensitive local config.

Relevant logs and runtime observations are included in the issue body above.

Failure point:

    2026-05-25 02:30:33,484 WARNING gateway.platforms.qqbot.adapter: [QQBot:1903695542] WebSocket error: WebSocket closed
    2026-05-25 02:30:33,488 INFO gateway.platforms.qqbot.adapter: [QQBot:1903695542] Reconnecting in 2s (attempt 1)...
    2026-05-25 02:30:35,522 WARNING gateway.platforms.qqbot.adapter: [QQBot:1903695542] Reconnect failed: Failed to get QQ Bot gateway URL:

High CPU observation:

    PID   PPID STAT %CPU %MEM ELAPSED       TIME       COMMAND
    1742  1    R    99.1 0.1  01-14:19:12   252:59.23 /Users/wangjiqing/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace

Runtime state file showed `gateway_state: running` while `platforms.qqbot.state: disconnected`.

Operating System

macOS 26.5, build 25F71

Python Version

3.11.15

Hermes Version

v0.14.0 (2026.5.16)

Additional Logs / Traceback (optional)

Additional relevant logs are included above in the Debug Report section.

One more related error from `gateway.error.log` showed a DNS/name resolution failure during an earlier QQBot startup/connect attempt:

    httpx.ConnectError: [Errno 8] nodename nor servname provided, or not known

    Traceback (most recent call last):
      File "/Users/wangjiqing/.hermes/hermes-agent/gateway/platforms/qqbot/adapter.py", line 311, in connect
        await self._ensure_token()
      File "/Users/wangjiqing/.hermes/hermes-agent/gateway/platforms/qqbot/adapter.py", line 402, in _ensure_token
        raise RuntimeError(f"Failed to get QQ Bot access token: {exc}") from exc
    RuntimeError: Failed to get QQ Bot access token: [Errno 8] nodename nor servname provided, or not known

I believe this network failure is probably only a trigger. The CPU burn seems to come from the adapter failing to settle into a sleeping/retry state after a closed WebSocket/reconnect failure.

Root Cause Analysis (optional)

Suspected code path:

File: gateway/platforms/qqbot/adapter.py

Relevant locations from my local checkout:

  • _listen_loop(): around line 475
  • _reconnect(): around line 653
  • _read_events(): around line 676

Current _read_events() control flow:

    async def _read_events(self) -> None:
        """Read WebSocket frames until connection closes."""
        if not self._ws:
            raise RuntimeError("WebSocket not connected")

        while self._running and self._ws and not self._ws.closed:
            msg = await self._ws.receive()
            ...

Why this can busy-loop:

  1. After reconnect failure, _reconnect() returns False.
  2. If the old self._ws remains set but is already closed, _read_events() sees self._ws is truthy.
  3. The loop condition is false because self._ws.closed is true.
  4. _read_events() returns normally without raising.
  5. _listen_loop() treats this normal return as a successful read cycle and immediately loops again without sleep/backoff.

This matches the observed symptoms:

  • sustained 99%-100% CPU;
  • no repeated reconnect logs;
  • no further QQBot connected/failed logs;
  • macOS sample showed active Python/asyncio execution instead of blocking socket I/O.
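The suspected path can be reproduced in isolation with a fake WebSocket. The sketch below copies only the `_read_events()` control flow quoted above; `FakeClosedWS`, `Adapter`, and `spin` are hypothetical stand-ins, and `spin` is a simplified substitute for `_listen_loop()`, whose actual body is not shown in this report. A pass limit keeps the demonstration finite.

```python
import asyncio


class FakeClosedWS:
    """Hypothetical stand-in for a WebSocket that is already closed."""
    closed = True

    async def receive(self):
        raise AssertionError("receive() is never reached on a closed socket")


class Adapter:
    """Stand-in holding only the state that _read_events() touches."""

    def __init__(self, ws):
        self._ws = ws
        self._running = True

    async def _read_events(self):
        # Same control flow as the snippet quoted from adapter.py.
        if not self._ws:
            raise RuntimeError("WebSocket not connected")
        while self._running and self._ws and not self._ws.closed:
            await self._ws.receive()
        # Falls through and returns None when the socket is already closed.


async def spin(adapter, limit):
    """Simplified outer loop: count passes that complete with no sleep."""
    passes = 0
    while adapter._running and passes < limit:
        await adapter._read_events()  # returns at once, nothing is awaited
        passes += 1
    return passes


if __name__ == "__main__":
    print(asyncio.run(spin(Adapter(FakeClosedWS()), 100_000)))
```

Every pass completes without raising and without reaching `receive()`, which is consistent with a process that burns CPU while logging nothing.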

I also checked git blame. The core _read_events() control flow appears to have existed since the initial QQBot adapter implementation. Later changes around log tags or membership-test style do not appear to change this closed-WebSocket return path.

Proposed Fix (optional)

I am not sure whether the final fix should be in _read_events(), _reconnect(), or _listen_loop(). Maintainers familiar with the QQBot adapter / gateway reconnect state machine should confirm the right direction.

One possible minimal fix is to make _read_events() explicitly reject an already-closed WebSocket:

    async def _read_events(self) -> None:
        """Read WebSocket frames until connection closes."""
        if not self._ws:
            raise RuntimeError("WebSocket not connected")
        if self._ws.closed:
            raise RuntimeError("WebSocket closed")

        while self._running and self._ws and not self._ws.closed:
            msg = await self._ws.receive()
            ...

        if self._running:
            raise RuntimeError("WebSocket closed")

Another possible hardening option is to clear stale closed WebSocket state when reconnect fails:

    except Exception as exc:
        logger.warning("[%s] Reconnect failed: %s", self._log_tag, exc)
        if self._ws and self._ws.closed:
            self._ws = None
        return False
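To illustrate why clearing the stale reference helps even without touching `_read_events()`, the sketch below pairs the unpatched control flow with a hypothetical `_drop_stale_ws()` helper that performs the same check as the hardening snippet. The class and helper names are assumptions made for this example, not part of the adapter.

```python
import asyncio


class FakeWS:
    """Hypothetical WebSocket stand-in exposing only a closed flag."""

    def __init__(self, closed):
        self.closed = closed


class Adapter:
    def __init__(self, ws):
        self._ws = ws
        self._running = True

    def _drop_stale_ws(self):
        """Hypothetical helper: forget a WebSocket that is already closed."""
        if self._ws and self._ws.closed:
            self._ws = None

    async def _read_events(self):
        # Unpatched control flow, as quoted in the root cause analysis.
        if not self._ws:
            raise RuntimeError("WebSocket not connected")
        while self._running and self._ws and not self._ws.closed:
            await self._ws.receive()


def error_after_cleanup(adapter):
    """Clear stale state, then report what _read_events() raises, if anything."""
    adapter._drop_stale_ws()
    try:
        asyncio.run(adapter._read_events())
    except RuntimeError as exc:
        return str(exc)
    return None
```

With the stale socket cleared, the existing `if not self._ws` guard raises instead of returning, so the failure becomes visible to whatever error handling surrounds the read call.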

A possible regression test would be to set adapter._ws to a fake object with closed = True and assert that await adapter._read_events() raises RuntimeError("WebSocket closed").
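A rough version of that regression test is sketched below. It runs against a stand-in class that contains only the proposed guard, because the real adapter's constructor and dependencies are not shown here; `FakeClosedWS`, `PatchedAdapter`, and `read_events_error` are hypothetical names. In the actual test suite the fake would be assigned to a real adapter instance instead.

```python
import asyncio


class FakeClosedWS:
    """Hypothetical fake: a WebSocket object that reports closed = True."""
    closed = True


class PatchedAdapter:
    """Stand-in carrying only the proposed _read_events() fix."""

    def __init__(self, ws):
        self._ws = ws
        self._running = True

    async def _read_events(self):
        if not self._ws:
            raise RuntimeError("WebSocket not connected")
        if self._ws.closed:
            # Proposed guard: reject an already-closed socket up front.
            raise RuntimeError("WebSocket closed")
        while self._running and self._ws and not self._ws.closed:
            await self._ws.receive()
        if self._running:
            # Proposed guard: a socket that closed mid-read is also an error.
            raise RuntimeError("WebSocket closed")


def read_events_error(adapter):
    """Return the RuntimeError message from _read_events(), or None."""
    try:
        asyncio.run(adapter._read_events())
    except RuntimeError as exc:
        return str(exc)
    return None


def test_closed_ws_raises():
    assert read_events_error(PatchedAdapter(FakeClosedWS())) == "WebSocket closed"


def test_missing_ws_raises():
    assert read_events_error(PatchedAdapter(None)) == "WebSocket not connected"
```

The two test functions follow the naming that pytest collects by default, but they can also be called directly.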

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR
