hermes - ✅(Solved) Fix QQ Bot: WebSocket disconnects every ~60s, gateway eventually hangs [1 pull requests, 1 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#19648 (fetched 2026-05-05 06:05:39)
Comments: 1
Participants: 2
Timeline events: 7
Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×2, commented ×1

Error Message

16:00:45  WARNING  WebSocket error: WebSocket closed
16:00:49  INFO     WebSocket connected
16:00:49  INFO     Reconnected
16:00:49  INFO     Resume sent (session_id=..., seq=N)
16:00:50  INFO     Session resumed
# ... ~60 seconds later ...
16:01:48  WARNING  WebSocket error: WebSocket closed

Root Cause

The server closes the QQ Bot WebSocket (wss://api.sgroup.qq.com/websocket) approximately every 60 seconds. The adapter reconnects within 2-4 seconds each time, but the likely root cause is a missing heartbeat ACK: the QQ Bot gateway protocol uses a heartbeat mechanism (similar to Discord opcodes 10/11), and the server closes idle connections.

Fix Action

Workaround

Added RuntimeMaxSec=14400 to systemd unit to force restart every 4 hours:

[Service]
RuntimeMaxSec=14400

PR fix notes

PR #19977: fix(qqbot): reconnect on missed heartbeat ACK

Description (problem / solution / changelog)

Summary

Fixes #19648.

QQ Bot already sends gateway heartbeats, but it did not track whether the gateway returned opcode 11 heartbeat ACKs. If the WebSocket stopped acknowledging heartbeats while the process stayed alive, the listener could remain blocked waiting for frames and systemd would still see the gateway as running.

This PR adds a small ACK watchdog to the existing heartbeat loop:

  • mark a heartbeat ACK as pending after sending opcode 1 with the latest sequence number
  • clear the pending state when opcode 11 arrives
  • if the next heartbeat tick sees the previous ACK still pending, close the WebSocket so the existing reconnect path recovers the session
  • reset ACK state when opening/cleaning up WebSocket resources
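
The four steps above can be sketched as a small state holder. This is an illustration only, not the code from the PR: the class name HeartbeatAckWatchdog and its method names are hypothetical, and the frame shape (a dict with an "op" field) is assumed from the opcode numbers mentioned in the description.

```python
from dataclasses import dataclass

OP_HEARTBEAT = 1       # opcode sent by the client (per the PR description)
OP_HEARTBEAT_ACK = 11  # opcode returned by the gateway (per the PR description)


@dataclass
class HeartbeatAckWatchdog:
    """Hypothetical helper that tracks whether the last heartbeat was ACKed."""

    ack_pending: bool = False

    def reset(self) -> None:
        # Call when opening or cleaning up WebSocket resources.
        self.ack_pending = False

    def ack_missed(self) -> bool:
        # Call at the start of each heartbeat tick. True means the previous
        # heartbeat was never acknowledged and the socket should be closed
        # so the existing reconnect path can take over.
        return self.ack_pending

    def heartbeat_payload(self, last_seq):
        # Build the heartbeat frame and mark its ACK as pending.
        self.ack_pending = True
        return {"op": OP_HEARTBEAT, "d": last_seq}

    def on_frame(self, frame: dict) -> None:
        # Clear the pending state when the gateway acknowledges.
        if frame.get("op") == OP_HEARTBEAT_ACK:
            self.ack_pending = False
```

In a heartbeat loop, each tick would first check ack_missed() and close the WebSocket if it returns True; otherwise it would send heartbeat_payload(last_seq). The receive path would pass every decoded frame to on_frame().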

Scope

Kept intentionally narrow to the QQ Bot adapter and its existing unit tests. No protocol rewrite, watchdog daemon, or systemd integration is added here.

Verification

scripts/run_tests.sh tests/gateway/test_qqbot.py
# 73 passed

I also ran git diff --check successfully. python -m ruff check ... was not available in this local environment (No module named ruff).

Changed files

  • gateway/platforms/qqbot/adapter.py (modified, +33/-0)
  • tests/gateway/test_qqbot.py (modified, +48/-0)

Bug Description

Three related issues in the QQ Bot platform adapter:

1. WebSocket disconnects every ~60 seconds (heartbeat missing)

The server closes the QQ Bot WebSocket (wss://api.sgroup.qq.com/websocket) approximately every 60 seconds. The adapter reconnects within 2-4 seconds each time, but the likely root cause is a missing heartbeat ACK: the QQ Bot gateway protocol uses a heartbeat mechanism (similar to Discord opcodes 10/11), and the server closes idle connections.

Log pattern (repeats continuously for hours):

16:00:45  WARNING  WebSocket error: WebSocket closed
16:00:49  INFO     WebSocket connected
16:00:49  INFO     Reconnected
16:00:49  INFO     Resume sent (session_id=..., seq=N)
16:00:50  INFO     Session resumed
# ... ~60 seconds later ...
16:01:48  WARNING  WebSocket error: WebSocket closed

No heartbeat/ping/pong log entries appear at all.
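
The disconnect cadence can be checked directly from the timestamps. The sketch below is only an illustration: the function name disconnect_gaps is hypothetical, the log format is assumed to match the excerpt above (time in the first column), and all lines are assumed to fall on the same day.

```python
from __future__ import annotations

from datetime import datetime

LOG = """\
16:00:45  WARNING  WebSocket error: WebSocket closed
16:00:49  INFO     WebSocket connected
16:00:49  INFO     Reconnected
16:00:49  INFO     Resume sent (session_id=..., seq=N)
16:00:50  INFO     Session resumed
16:01:48  WARNING  WebSocket error: WebSocket closed
"""


def disconnect_gaps(log_text: str) -> list[float]:
    """Return the seconds between consecutive 'WebSocket closed' lines."""
    times = [
        datetime.strptime(line.split()[0], "%H:%M:%S")
        for line in log_text.splitlines()
        if "WebSocket closed" in line
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


print(disconnect_gaps(LOG))  # [63.0]
```

Run over a longer log, a list of gaps that all sit near 60 seconds would support the idea that the server closes the connection on a fixed timeout rather than at random.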

2. Gateway process silently hangs after extended operation

After hours of the disconnect/reconnect cycle, the gateway process stops producing any logs while remaining "alive" to systemd (state: S sleeping, PID persists). There is no crash, no traceback, and no OOM; the event loop simply dies silently.

Reproduction timeline:

  • Gateway started at 11:26
  • Last log entry at 16:12:52 ("Session resumed")
  • Zero output for 43 minutes until forced restart at 16:55
  • Process: S (sleeping), 9 threads, ~200MB RSS
  • systemd sees "active (running)", so Restart=always does NOT trigger

3. No defense against silent hangs

Restart=always only triggers on process exit, not on hang. The gateway could benefit from:

  • Built-in aliveness probe / internal watchdog
  • systemd WatchdogSec= support (sd_notify)
  • Or at minimum, documentation recommending RuntimeMaxSec as a safety net
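
As a rough sketch of the WatchdogSec= option, a process can notify systemd by writing a datagram to the socket named in the NOTIFY_SOCKET environment variable. The helper names below (sd_notify, pet_watchdog_if_healthy) and the idea of gating the notification on the age of the last received frame are assumptions for illustration; nothing like this exists in the gateway today.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string to systemd; return False when not run under it."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def pet_watchdog_if_healthy(last_frame_age: float, max_age: float) -> bool:
    """Notify the watchdog only if a frame arrived recently (hypothetical)."""
    if last_frame_age > max_age:
        # Stay silent so that systemd's watchdog timeout expires.
        return False
    sd_notify("WATCHDOG=1")
    return True
```

With WatchdogSec= set in the unit and notifications permitted (for example Type=notify or NotifyAccess=main), a hung event loop would stop sending WATCHDOG=1 and systemd would treat the service as failed, at which point the restart policy can apply.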

Environment

  • Hermes: git-installed at /usr/local/lib/hermes-agent
  • OS: Ubuntu 24.04, systemd
  • QQ App ID: 1903947039
  • Config: QQ_ALLOW_ALL_USERS=true, approvals.mode: off
  • Proxy: Mihomo (Clash Meta) running locally

Workaround

Added RuntimeMaxSec=14400 to systemd unit to force restart every 4 hours:

[Service]
RuntimeMaxSec=14400

Extended analysis

TL;DR

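Make the heartbeat handling robust: the adapter already sends gateway heartbeats, so the missing piece is tracking heartbeat ACKs (opcode 11) and reconnecting when one is missed, which is what PR #19977 adds.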

Guidance

  • Check the QQ Bot gateway protocol documentation for the expected heartbeat message format and interval.
  • Confirm that the adapter sends heartbeats (opcode 1) at that interval and tracks the opcode 11 ACKs; per PR #19977, heartbeats were already being sent but ACKs were not tracked.
  • Log heartbeats and ACKs so that sending and receipt can be verified.
  • Review the gateway process code for causes of the silent hang, such as a listener blocked waiting for frames, infinite loops, or unhandled exceptions.

Example

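import json
import time

import websocket  # third-party package: websocket-client

ws = websocket.create_connection("wss://api.sgroup.qq.com/websocket")

# Assumption: the first frame is a Hello (opcode 10) whose payload carries
# heartbeat_interval in milliseconds, as in Discord-style gateways.
hello = json.loads(ws.recv())
interval = (hello.get("d") or {}).get("heartbeat_interval", 30000) / 1000
last_seq = None  # update from the "s" field of received dispatch frames

# Illustration only: a real client must also identify or resume, read frames
# concurrently, and verify that each heartbeat gets an opcode 11 ACK.
while True:
    ws.send(json.dumps({"op": 1, "d": last_seq}))  # opcode 1 = heartbeat
    time.sleep(interval)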

Notes

RuntimeMaxSec is a stopgap, not a fix: it only bounds how long a hang can last (up to 4 hours with the value above) and also restarts a healthy gateway on the same schedule. A reliable gateway needs correct heartbeat ACK handling and a way to detect silent hangs.

Recommendation

Add RuntimeMaxSec=14400 to the systemd unit file as a temporary safety net, and prioritize the heartbeat fix (PR #19977 adds a reconnect on a missed heartbeat ACK) as the more permanent solution.
