hermes - ✅(Solved) Fix [Bug]: QQ Bot WebSocket silently dies — adapter waits on dead connection for 18+ hours [1 pull requests, 2 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#19821 (fetched 2026-05-05 06:04:57)
Comments: 2
Participants: 2
Timeline: 7
Reactions: 0
Timeline (top): labeled ×4, commented ×2, cross-referenced ×1

Error Message

04-27 20:22  WebSocket error: WebSocket closed
04-27 20:22  Reconnecting in 2s (attempt 1)
04-27 20:22  WebSocket connected
04-27 20:22  Session resumed


PR fix notes

PR #19977: fix(qqbot): reconnect on missed heartbeat ACK

Description (problem / solution / changelog)

Summary

Fixes #19648.

QQ Bot already sends gateway heartbeats, but it did not track whether the gateway returned opcode 11 heartbeat ACKs. If the WebSocket stopped acknowledging heartbeats while the process stayed alive, the listener could remain blocked waiting for frames and systemd would still see the gateway as running.

This PR adds a small ACK watchdog to the existing heartbeat loop:

  • mark a heartbeat ACK as pending after sending opcode 1 with the latest sequence number
  • clear the pending state when opcode 11 arrives
  • if the next heartbeat tick sees the previous ACK still pending, close the WebSocket so the existing reconnect path recovers the session
  • reset ACK state when opening/cleaning up WebSocket resources
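
The four steps above can be sketched as a small state holder. This is an illustration only, not the code from the PR: the class name `AckWatchdog`, its method names, and the `{"op": ..., "d": ...}` payload shape are assumptions made for the example. Only opcode 1 (heartbeat) and opcode 11 (heartbeat ACK) come from the PR description.

```python
from dataclasses import dataclass
from typing import Optional

OP_HEARTBEAT = 1       # opcode named in the PR description
OP_HEARTBEAT_ACK = 11  # opcode named in the PR description


@dataclass
class AckWatchdog:
    """Hypothetical helper: tracks whether the last heartbeat was acknowledged."""

    ack_pending: bool = False
    last_seq: Optional[int] = None

    def reset(self) -> None:
        # Call when opening or cleaning up WebSocket resources.
        self.ack_pending = False

    def on_frame(self, op: int, seq: Optional[int] = None) -> None:
        # Remember the latest sequence number and clear the pending ACK.
        if seq is not None:
            self.last_seq = seq
        if op == OP_HEARTBEAT_ACK:
            self.ack_pending = False

    def next_heartbeat(self) -> Optional[dict]:
        # Returns the heartbeat payload to send, or None when the previous
        # heartbeat was never acknowledged and the caller should close the
        # socket so the reconnect path can take over.
        if self.ack_pending:
            return None
        self.ack_pending = True
        return {"op": OP_HEARTBEAT, "d": self.last_seq}
```

In a heartbeat loop, a `None` result from `next_heartbeat()` would be the signal to close the WebSocket instead of sending another heartbeat.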

Scope

Kept intentionally narrow to the QQ Bot adapter and its existing unit tests. No protocol rewrite, watchdog daemon, or systemd integration is added here.

Verification

scripts/run_tests.sh tests/gateway/test_qqbot.py
# 73 passed

I also ran git diff --check successfully. python -m ruff check ... was not available in this local environment (No module named ruff).

Changed files

  • gateway/platforms/qqbot/adapter.py (modified, +33/-0)
  • tests/gateway/test_qqbot.py (modified, +48/-0)


Bug Description

The QQ Bot WebSocket adapter enters a "zombie" state in which the TCP connection appears alive but the QQ server has silently dropped it: no close frame, no error, just radio silence. The adapter waits forever on a dead connection instead of triggering reconnection.

Gateway process stays alive → systemd doesn't restart → messages are lost for 18+ hours.

Environment

  • Hermes Agent v0.12.0
  • qqbot platform adapter
  • WSL Debian, systemd user service

Evidence from gateway.log

Normal behavior (every ~60s):

04-27 20:22  WebSocket error: WebSocket closed
04-27 20:22  Reconnecting in 2s (attempt 1)
04-27 20:22  WebSocket connected
04-27 20:22  Session resumed

This pattern runs continuously for days. 4604 disconnections in the log.

The failure (May 4):

05-02 21:33  session 9b5bfc4a created
             ... continuously resumed for 30 hours, seq 1 → 1752 ...
05-04 00:53  Last C2C message received
05-04 03:59  Last successful WebSocket resume (seq=1752)
             ===== SILENCE =====
05-04 04:00  Expected "WebSocket closed" → NEVER ARRIVED
05-04 07:07  Access token refreshed (token refresh still works!)
05-04 09:04  Access token refreshed
05-04 14:01  Access token refreshed
05-04 18:01  Access token refreshed
             ===== 18 HOURS, NO C2C MESSAGES, NO WEBSOCKET EVENTS =====
05-04 22:35  Auto-update triggered restart → new session → works again

Key observations:

  1. Token refresh thread kept working (07:07, 09:04, 14:01, 18:01, 22:02)
  2. WebSocket reconnection loop went completely silent after 03:59
  3. No WebSocket error or closed event received — reconnect logic never triggered
  4. This looks like a TCP half-open connection: server-side dropped, client-side thinks it's connected

Root Cause Hypothesis

The QQ WebSocket adapter lacks a ping/pong or application-level heartbeat. QQ server dropped the connection silently after the session aged past some internal limit (~30 hours in this case), but the TCP connection appeared open to the client. Without a close frame, the reconnect loop never fires.

Suggested Fix

Add a receive timeout or periodic WebSocket ping. If no event (message/heartbeat/pong) is received within N minutes, force disconnect and re-identify.
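
A receive timeout of this kind might look like the sketch below, assuming an asyncio-based listener. The function `listen_with_timeout` and its `recv`, `close`, and `handle` callables are hypothetical stand-ins for whatever the adapter actually uses; only `asyncio.wait_for` is a real library call.

```python
import asyncio
from typing import Awaitable, Callable


async def listen_with_timeout(
    recv: Callable[[], Awaitable[str]],
    close: Callable[[], Awaitable[None]],
    handle: Callable[[str], None],
    timeout_s: float,
) -> None:
    """Dispatch frames until the link stays silent for timeout_s, then close it."""
    while True:
        try:
            # Any frame (message, heartbeat ACK, ...) resets the window.
            frame = await asyncio.wait_for(recv(), timeout=timeout_s)
        except asyncio.TimeoutError:
            # Nothing arrived in time: treat the connection as dead and
            # return so the caller can reconnect and re-identify.
            await close()
            return
        handle(frame)
```

The timeout has to be longer than the longest quiet period a healthy connection can have, otherwise the listener will reconnect needlessly.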

Workaround

Using an external heartbeat monitor that parses gateway.log and restarts the Gateway when WebSocket events go silent for > 2 hours.
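
The reporter's monitor is not shown, so the following is only a guess at its core check. It assumes log lines start with an `MM-DD HH:MM` timestamp as in the excerpts above (the real gateway.log format may differ), and the marker strings, function names, and the choice to treat a log with no WebSocket events as silent are all assumptions.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Assumed line shape: "MM-DD HH:MM  message", as in the excerpts above.
LINE_RE = re.compile(r"^(\d{2})-(\d{2}) (\d{2}):(\d{2})\s+(.*)$")
# Assumed markers for lines that prove the WebSocket loop is still active.
WS_MARKERS = ("WebSocket", "Session resumed")


def last_ws_event(lines: Iterable[str], year: int) -> Optional[datetime]:
    """Return the timestamp of the most recent WebSocket-related log line."""
    latest = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        month, day, hour, minute, text = match.groups()
        if any(marker in text for marker in WS_MARKERS):
            ts = datetime(year, int(month), int(day), int(hour), int(minute))
            if latest is None or ts > latest:
                latest = ts
    return latest


def should_restart(
    lines: Iterable[str],
    now: datetime,
    max_silence: timedelta = timedelta(hours=2),
) -> bool:
    """True when no WebSocket event was logged within max_silence."""
    last = last_ws_event(lines, now.year)
    # A log with no WebSocket events at all is treated as silent.
    return last is None or now - last > max_silence
```

Because the timestamps carry no year, this sketch borrows the year from `now` and would misjudge logs that span a year boundary.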

extent analysis

TL;DR

Implement a receive timeout or periodic WebSocket ping to detect and recover from silent connection drops.

Guidance

  • Verify the hypothesis by checking if the QQ server has a session timeout limit and if it can be adjusted or worked around.
  • Consider adding a ping/pong mechanism to the WebSocket connection to detect when the server has dropped the connection.
  • Implement a receive timeout to force disconnect and reconnection if no events are received within a certain time frame (e.g., N minutes).
  • As a temporary workaround, use an external heartbeat monitor to parse the gateway.log and restart the Gateway when WebSocket events go silent for an extended period.

Example

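import time

import websocket  # provided by the websocket-client package

GATEWAY_URL = "ws://qq-server.com"  # placeholder, not the real QQ gateway URL
PING_INTERVAL = 60   # seconds between WebSocket ping frames
RECV_TIMEOUT = 300   # seconds of total silence before forcing a reconnect


def connect():
    # The socket timeout bounds each blocking receive to one ping interval
    return websocket.create_connection(GATEWAY_URL, timeout=PING_INTERVAL)


ws = connect()
last_ping = 0.0
last_rx = time.monotonic()

while True:
    try:
        # Send a real ping frame (not the text "ping") once per interval
        if time.monotonic() - last_ping >= PING_INTERVAL:
            ws.ping()
            last_ping = time.monotonic()
        # control_frame=True makes pong frames count as received traffic
        opcode, frame = ws.recv_data_frame(control_frame=True)
        last_rx = time.monotonic()
        if opcode == websocket.ABNF.OPCODE_CLOSE:
            raise websocket.WebSocketConnectionClosedException("closed by peer")
        if opcode == websocket.ABNF.OPCODE_TEXT:
            print("Received:", frame.data.decode("utf-8"))
        continue
    except websocket.WebSocketTimeoutException:
        # Quiet for one interval; only give up after RECV_TIMEOUT of silence
        if time.monotonic() - last_rx < RECV_TIMEOUT:
            continue
    except (websocket.WebSocketException, OSError):
        pass  # closed or broken socket: fall through to reconnect
    # Force disconnect and reconnection
    print("Connection silent or closed: reconnecting...")
    ws.close()
    ws = connect()
    last_ping = 0.0
    last_rx = time.monotonic()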

Notes

The provided code snippet is a basic example and may need to be adapted to the specific use case and WebSocket library being used. The receive timeout and ping/pong mechanism should be adjusted according to the QQ server's session timeout limit and the application's requirements.

Recommendation

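Apply the suggested fix: add a receive timeout or a periodic liveness check so that silent connection drops are detected and the existing reconnect path runs. This targets the failure directly, whereas the external log monitor only restarts the Gateway after messages have already been lost. PR #19977 takes this route at the application level by closing the WebSocket when a heartbeat ACK (opcode 11) is still pending at the next heartbeat tick.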
