hermes - [Bug]: QQBot gateway can stop heartbeating after reconnect and loop on 4009 Session timed out



Bug Description

I was running the Hermes gateway in a Docker container with the QQBot adapter enabled. After the bot had been running for a while, it stopped responding to messages in QQ, and QQ reported the bot as offline when I tried to send it a message. Because the bot had previously connected and handled messages successfully during this run, this appears to be a reconnect/heartbeat lifecycle issue rather than a startup or authentication failure.

The QQBot adapter appears able to lose its heartbeat task during or after a WebSocket disconnect/reconnect window. If the heartbeat task exits in that gap, a resumed or newly established QQ WebSocket may receive no heartbeats, so the QQ gateway can eventually terminate the session with close code 4009 (Session timed out). After that, the adapter may settle into a loop of closed-socket errors, and the bot shows as offline on the QQ side.

Steps to Reproduce

No deterministic local reproducer yet. Observed in a long-running Docker deployment of nousresearch/hermes-agent with QQBot enabled.

The observed sequence was:

  1. Run Hermes gateway with the QQBot adapter enabled.
  2. Let the gateway run for some time.
  3. The bot stops responding in QQ.
  4. QQ reports the bot as offline when sending a message.
  5. Gateway logs first show 4009 Session timed out, then repeated WebSocket error: WebSocket closed warnings.

The failure appears to surface after the gateway has been running for some time, likely around a disconnect/reconnect event rather than at startup.

Expected Behavior

After a transient QQBot WebSocket disconnect:

  • The adapter reconnects (resume or re-identify as appropriate).
  • The heartbeat loop continues to run across the reconnect gap and resumes sending heartbeats on the active socket.
  • The bot stays online and continues to receive messages.

Actual Behavior

  • Gateway logs a 4009 Session timed out close from the QQ gateway.
  • Subsequent activity is dominated by repeated WebSocket error: WebSocket closed warnings.
  • QQ reports the bot as offline; messages sent to the bot are not handled.

Affected Component

Gateway / QQBot adapter (gateway/platforms/qqbot/adapter.py)

Messaging Platform

QQBot

Debug Report

Not included in this report.

Operating System

Docker container

Python Version

3.13.5

Hermes Version

Docker image: nousresearch/hermes-agent

Image revision label: dd0923bb89ed2dd56f82cb63656a1323f6f42e6f

Additional Logs / Traceback

Gateway logs first showed repeated QQ gateway close events:

WARNING gateway.platforms.qqbot.adapter: [QQBot:<app_id>] WebSocket closed: code=4009 reason=Session timed out

After some time, the logs transitioned into a steady stream of generic socket errors:

WARNING gateway.platforms.qqbot.adapter: [QQBot:<app_id>] WebSocket error: WebSocket closed
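To check whether another deployment shows the same signature, the two log patterns above can be matched in order. This is a small sketch, not part of Hermes: the function name `matches_failure_signature` and the `min_repeats` threshold are assumptions chosen for illustration, and the patterns are taken only from the two log lines quoted above.

```python
import re
from typing import Iterable

# Patterns copied from the two log lines quoted in this report.
CLOSE_4009 = re.compile(r"WebSocket closed: code=4009\b")
GENERIC_CLOSED = "WebSocket error: WebSocket closed"


def matches_failure_signature(lines: Iterable[str], min_repeats: int = 3) -> bool:
    """Return True if a 4009 close is later followed by at least
    min_repeats generic closed-socket warnings (threshold is arbitrary)."""
    seen_4009 = False
    repeats = 0
    for line in lines:
        if CLOSE_4009.search(line):
            seen_4009 = True
        elif seen_4009 and GENERIC_CLOSED in line:
            repeats += 1
    return seen_4009 and repeats >= min_repeats
```

It can be fed the lines of a saved gateway log, for example `matches_failure_signature(open("gateway.log"))`, where the file name is a placeholder.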

Root Cause Analysis

Based on code inspection of gateway/platforms/qqbot/adapter.py, the _running flag appears to be overloaded: it serves both as the adapter worker's long-lived lifecycle flag and as the "current socket is connected" flag. During a reconnect, _running is flipped to False to signal that the current socket is no longer usable, but the heartbeat loop also uses _running as its loop condition. As a result, the heartbeat task appears able to exit during the reconnect gap.

The listener path appears able to recover and establish a new or resumed WebSocket, but with the heartbeat task already gone, no heartbeats are sent on the new connection. The QQ gateway then times the session out with 4009, which matches the observed log sequence.

For completeness, the adapter reconnect handling should also:

  • On op 7 (Reconnect): preserve session_id and seq so the next connect can Resume.
  • On op 9 (Invalid Session): clear the resume state and Identify again on the next connect.
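The suspected failure mode can be reproduced in isolation with a toy model. The sketch below is not the Hermes adapter code: `OverloadedFlagAdapter`, `heartbeat_loop`, `reconnect`, and the timings are hypothetical stand-ins, and a counter replaces the real heartbeat send. Only the `_running` name comes from the analysis above.

```python
import asyncio


class OverloadedFlagAdapter:
    """Toy model: one flag is both lifecycle and socket-connected state."""

    def __init__(self) -> None:
        self._running = True
        self.heartbeats_sent = 0

    async def heartbeat_loop(self, interval: float) -> None:
        # The loop condition is the same flag the reconnect path clears.
        while self._running:
            self.heartbeats_sent += 1  # stand-in for a real heartbeat send
            await asyncio.sleep(interval)

    async def reconnect(self, gap: float) -> None:
        self._running = False     # "current socket is unusable"
        await asyncio.sleep(gap)  # reconnect gap, longer than the interval
        self._running = True      # new or resumed socket is up


async def demo() -> tuple[bool, int]:
    adapter = OverloadedFlagAdapter()
    task = asyncio.create_task(adapter.heartbeat_loop(interval=0.01))
    await asyncio.sleep(0.035)
    await adapter.reconnect(gap=0.05)
    sent_before = adapter.heartbeats_sent
    await asyncio.sleep(0.05)  # time spent on the "new" socket
    # The task has already exited, so nothing is sent after the reconnect.
    return task.done(), adapter.heartbeats_sent - sent_before


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model the heartbeat task wakes once during the gap, sees the flag cleared, and returns. Setting the flag back to True afterwards does not restart it, which is consistent with a session that later times out for lack of heartbeats.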

Proposed Fix

Separate lifecycle from connection state in the adapter:

  • Introduce a distinct long-lived flag, for example _gateway_should_run, that governs the listener and heartbeat tasks for the lifetime of the adapter.
  • Keep _running, or an equivalently scoped flag, as the "current socket is connected" indicator.
  • Let the heartbeat task live across reconnect gaps: it should skip sends while the socket is disconnected or closed and resume sending once a new/resumed socket is available.
  • Ensure op 7 preserves session_id / seq for Resume, and op 9 clears resume state and triggers a fresh Identify.
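The flag split proposed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual patch: `SplitFlagAdapter`, `on_socket_open`, `on_socket_closed`, and `stop` are hypothetical names, and a counter again stands in for the real heartbeat send. `_gateway_should_run` and `_running` follow the naming suggested in the list.

```python
import asyncio


class SplitFlagAdapter:
    """Toy model with lifecycle and connection state kept separate."""

    def __init__(self) -> None:
        self._gateway_should_run = True  # lifetime of the adapter
        self._running = False            # current socket is connected
        self.heartbeats_sent = 0

    async def heartbeat_loop(self, interval: float) -> None:
        # Lives until the adapter is stopped, not until a socket drops.
        while self._gateway_should_run:
            if self._running:
                self.heartbeats_sent += 1  # stand-in for a real send
            # While disconnected, the send is skipped and the loop idles.
            await asyncio.sleep(interval)

    def on_socket_open(self) -> None:
        self._running = True

    def on_socket_closed(self) -> None:
        self._running = False

    def stop(self) -> None:
        self._gateway_should_run = False
        self._running = False


async def demo() -> tuple[bool, int]:
    adapter = SplitFlagAdapter()
    task = asyncio.create_task(adapter.heartbeat_loop(interval=0.01))
    adapter.on_socket_open()
    await asyncio.sleep(0.035)
    adapter.on_socket_closed()           # transient disconnect
    await asyncio.sleep(0.05)            # reconnect gap
    alive_during_gap = not task.done()
    sent_before = adapter.heartbeats_sent
    adapter.on_socket_open()             # new or resumed socket
    await asyncio.sleep(0.05)
    sent_after = adapter.heartbeats_sent - sent_before
    adapter.stop()
    await task                           # loop exits on its next wake-up
    return alive_during_gap, sent_after
```

The design choice here is that only `stop()` ends the heartbeat task; a socket close merely pauses sends. A real adapter would also need to pick up the heartbeat interval announced by the new connection, which this sketch does not model.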

Are you willing to submit a PR for this?

I'd like to fix this myself and submit a PR.

Verification Status

A local patch along these lines is in place, with unit tests covering:

  • Heartbeat surviving a transient disconnected state.
  • op 7 Reconnect behavior preserving resume state.
  • op 9 Invalid Session clearing resume state.
  • read_events converting op 7 into the reconnect flow.
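The two resume-state cases in the list above can be illustrated with a small state holder and pytest-style tests. This is a sketch, not the tests from the local patch: `ResumeState`, `handle_control_op`, and the returned "resume"/"identify" labels are hypothetical, and only the op 7 and op 9 semantics are taken from this report.

```python
from dataclasses import dataclass
from typing import Optional

OP_RECONNECT = 7        # per this report: keep session_id and seq
OP_INVALID_SESSION = 9  # per this report: clear resume state


@dataclass
class ResumeState:
    session_id: Optional[str] = None
    seq: Optional[int] = None

    def can_resume(self) -> bool:
        return self.session_id is not None and self.seq is not None

    def clear(self) -> None:
        self.session_id = None
        self.seq = None


def handle_control_op(op: int, state: ResumeState) -> str:
    """Return which handshake the next connect should use."""
    if op == OP_INVALID_SESSION:
        state.clear()
        return "identify"
    if op == OP_RECONNECT:
        # State is left untouched so the next connect can Resume.
        return "resume" if state.can_resume() else "identify"
    raise ValueError(f"not a reconnect-related op: {op}")


def test_op7_preserves_resume_state() -> None:
    state = ResumeState(session_id="abc", seq=42)
    assert handle_control_op(OP_RECONNECT, state) == "resume"
    assert (state.session_id, state.seq) == ("abc", 42)


def test_op9_clears_resume_state() -> None:
    state = ResumeState(session_id="abc", seq=42)
    assert handle_control_op(OP_INVALID_SESSION, state) == "identify"
    assert not state.can_resume()
```

The test functions use plain `assert`, so they run under pytest or when called directly.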

Live verification on a patched container is in progress; I will follow up once the long-running deployment confirms whether the bot stays online across reconnects.
