hermes - 💡(How to fix) Fix QQ Bot Reconnect Busy Loop Causes 100% CPU Spin

QQ Bot Reconnect Busy Loop Causes 100% CPU Spin

Version: 0.14.0
Python: 3.11.15
aiohttp: 3.13.4
Severity: Medium — resource waste (100% CPU for hours), no user-visible outage

Symptoms

  • Gateway process consuming 100% CPU continuously for 3+ hours
  • 0 syscalls in 3 seconds (strace -c): pure CPU-bound spin with no I/O
  • Memory stable at ~128MB RSS
  • Gateway log shows only memory_monitor entries — all QQ Bot reconnect logging went silent

Timeline

| Time | Event |
|---|---|
| T+0 | QQ Bot session timeout (code 4009), normal 30-minute cycle |
| T+2s | Reconnect attempt 1 scheduled (2s backoff) |
| T+4s | Reconnect failed: DNS resolution error [Errno -3] Temporary failure in name resolution |
| T+4s → T+3h | No QQ Bot log entries: complete logging silence despite process running |
| T+3h | py-spy confirms main thread spinning in _listen_loop / _read_events |

Root Cause Analysis

Reconnect Loop Structure

gateway/platforms/qqbot/adapter.py _listen_loop():

while self._running:
    try:
        connect_time = time.monotonic()
        await self._read_events()          # line 492
        backoff_idx = 0                     # line 493 ← RESET on clean return!
    except QQCloseError:
        ...  # handle, then _reconnect() with sleep
    except Exception:
        ...  # handle, then _reconnect() with sleep

_read_events():

async def _read_events(self) -> None:
    if not self._ws:
        raise RuntimeError("WebSocket not connected")  # raises → caught
    while self._running and self._ws and not self._ws.closed:  # line 659
        msg = await self._ws.receive()
        # ... handle frame types ...

The Bug

When self._ws is a closed aiohttp WebSocket (.closed == True but object is not None):

  1. _read_events() line 657 check passes — self._ws is not None
  2. Line 659 while condition fails — self._ws.closed is True → function returns immediately with no exception
  3. Back in _listen_loop, line 493: backoff_idx = 0 → counter incorrectly reset
  4. Loop back to line 492 → _read_events() → immediate silent return → tight zero-delay spin

No reconnect logic fires because no exception propagates. No logging because no error path is hit. Just pure CPU burn.
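
The spin can be reproduced in isolation. The following is a minimal sketch rather than the adapter's actual code: FakeClosedWS and SpinDemo are hypothetical stand-ins that copy only the loop shape quoted above, and the iteration cap exists purely so the demo terminates.

```python
import asyncio


class FakeClosedWS:
    """Hypothetical stand-in for a websocket that is closed but not None."""

    closed = True

    async def receive(self):
        raise AssertionError("receive() is never reached on a closed socket")


class SpinDemo:
    """Copies the shape of _listen_loop / _read_events from the report."""

    def __init__(self, max_iterations: int) -> None:
        self._running = True
        self._ws = FakeClosedWS()
        self.iterations = 0
        self.resets = 0
        self._max_iterations = max_iterations  # demo-only cap

    async def _read_events(self) -> None:
        if not self._ws:
            raise RuntimeError("WebSocket not connected")
        while self._running and self._ws and not self._ws.closed:
            await self._ws.receive()
        # Falls through without suspending and without raising.

    async def listen_loop(self) -> None:
        backoff_idx = 3  # pretend earlier failures had raised the backoff
        while self._running:
            await self._read_events()
            backoff_idx = 0  # reset on the "clean" return
            self.resets += 1
            self.iterations += 1
            if self.iterations >= self._max_iterations:
                self._running = False


def run_spin_demo(max_iterations: int = 10_000) -> SpinDemo:
    demo = SpinDemo(max_iterations)
    asyncio.run(demo.listen_loop())
    return demo


if __name__ == "__main__":
    demo = run_spin_demo()
    print(f"{demo.iterations} iterations, {demo.resets} backoff resets")
```

Without the cap, the loop would never exit. Because _read_events() returns without ever suspending, the coroutine also never hands control back to the event loop, which is consistent with the zero-syscall strace result.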

Contributing Factor

_mark_transport_disconnected() intentionally does NOT close the WebSocket or set it to None (docstring: "without stopping the reconnect loop"). After a QQCloseError, self._ws remains as the closed websocket object. If a subsequent reconnect attempt's _open_ws() fails after setting self._ws = None but the cleanup path leaves the websocket in a closed-but-not-None state, the condition for the tight loop is met.

Reproduction Scenario

  1. DNS failure causes _reconnect() to return False
  2. _open_ws() sets self._ws = None before attempting connection, but ws_connect() fails
  3. _read_events() raises RuntimeError → caught → reconnect with backoff → sleep → repeat
  4. DNS eventually recovers, reconnect succeeds, new WebSocket established
  5. QQ immediately closes the new connection (e.g., stale session/token)
  6. _read_events() catches CLOSE frame → QQCloseError → _mark_transport_disconnected() (does NOT null self._ws)
  7. Reconnect attempted; transient issue leaves self._ws as closed-but-not-None
  8. Tight loop begins: _read_events returns silently → backoff_idx = 0 → spin forever

Suggested Fix

When _read_events() returns without raising but the websocket is missing or already closed, trigger a reconnect instead of resetting backoff_idx:

# In _listen_loop, after line 492:
await self._read_events()
# Guard against silent return from closed websocket
if not self._ws or self._ws.closed:
    backoff_idx += 1
    if backoff_idx >= MAX_RECONNECT_ATTEMPTS:
        logger.error("[%s] Max reconnect attempts reached (silent close)", self._log_tag)
        self._mark_disconnected()
        return
    if not await self._reconnect(backoff_idx):
        continue
backoff_idx = 0

Alternative: make _read_events() raise an exception when the websocket is already closed on entry, instead of silently doing nothing.
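
A minimal sketch of that alternative, under stated assumptions: WebSocketClosedError, FakeWS, and Reader are hypothetical names introduced for illustration, and the real adapter may prefer to reuse an existing exception type. The only change to the quoted _read_events() is the extra closed-on-entry check.

```python
import asyncio


class WebSocketClosedError(RuntimeError):
    """Hypothetical exception type for a socket that is closed on entry."""


class FakeWS:
    """Hypothetical stand-in; receive() simulates the peer closing."""

    def __init__(self, closed: bool) -> None:
        self.closed = closed

    async def receive(self):
        self.closed = True
        return None


class Reader:
    def __init__(self, ws) -> None:
        self._running = True
        self._ws = ws

    async def _read_events(self) -> None:
        if not self._ws:
            raise RuntimeError("WebSocket not connected")
        if self._ws.closed:
            # Fail loudly so the caller's exception path (and its backoff) runs.
            raise WebSocketClosedError("WebSocket already closed on entry")
        while self._running and self._ws and not self._ws.closed:
            await self._ws.receive()


def entry_outcome(ws) -> str:
    """Report how _read_events() behaves for a given socket state."""
    try:
        asyncio.run(Reader(ws)._read_events())
    except WebSocketClosedError:
        return "closed"
    except RuntimeError:
        return "not connected"
    return "clean return"


if __name__ == "__main__":
    print(entry_outcome(FakeWS(closed=True)))
    print(entry_outcome(None))
    print(entry_outcome(FakeWS(closed=False)))
```

This covers only the closed-on-entry case. A socket that closes during the read loop still produces a normal return in this sketch, so the caller should not treat a clean return as proof of a healthy connection.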

Verification

  • py-spy sampling: 5 snapshots across 5 seconds confirmed main thread cycling between lines 490-492 and 659
  • strace -c on the spinning process: 0 syscalls in 3 seconds, confirming pure CPU-bound loop
  • After restart, CPU returned to idle (<1%)
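
The checks above were manual. A regression test for the suggested guard could look roughly like the sketch below. It is an assumption-laden simulation, not the adapter's test suite: GuardedLoop and FakeClosedWS are hypothetical, MAX_RECONNECT_ATTEMPTS is set to an arbitrary demo value, the exception handlers are omitted, and _reconnect() always fails to mimic a persistent DNS outage.

```python
import asyncio

MAX_RECONNECT_ATTEMPTS = 5  # assumed demo value; the real constant may differ


class FakeClosedWS:
    """Hypothetical stand-in for a closed-but-not-None websocket."""

    closed = True


class GuardedLoop:
    def __init__(self) -> None:
        self._running = True
        self._ws = FakeClosedWS()
        self.reconnect_calls: list[int] = []
        self.disconnected = False

    async def _read_events(self) -> None:
        if not self._ws:
            raise RuntimeError("WebSocket not connected")
        while self._running and self._ws and not self._ws.closed:
            await asyncio.sleep(0)  # placeholder for self._ws.receive()

    async def _reconnect(self, backoff_idx: int) -> bool:
        self.reconnect_calls.append(backoff_idx)
        await asyncio.sleep(0)  # the real method would sleep for the backoff
        return False  # simulate the reconnect still failing

    def _mark_disconnected(self) -> None:
        self.disconnected = True

    async def listen_loop(self) -> None:
        backoff_idx = 0
        while self._running:
            await self._read_events()
            # Guard against a silent return from a closed websocket.
            if not self._ws or self._ws.closed:
                backoff_idx += 1
                if backoff_idx >= MAX_RECONNECT_ATTEMPTS:
                    self._mark_disconnected()
                    return
                if not await self._reconnect(backoff_idx):
                    continue
            backoff_idx = 0


def run_guarded_loop() -> GuardedLoop:
    loop = GuardedLoop()
    asyncio.run(loop.listen_loop())
    return loop


if __name__ == "__main__":
    result = run_guarded_loop()
    print(result.reconnect_calls, result.disconnected)
```

With the guard in place the loop goes through _reconnect() on every pass and gives up after a bounded number of attempts, instead of resetting backoff_idx and spinning.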
