hermes: Gateway status should expose platform health and recover stale adapters

Problem

A gateway process can be alive while a platform adapter is effectively dead. We saw this with QQBot: hermes gateway status showed a running Gateway PID, but QQBot had already lost its WebSocket and stopped receiving messages after a reconnect failure.

This makes health checks misleading: users see "gateway running" even though one configured platform is disconnected/stale.
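A minimal Python sketch of the distinction, assuming a hypothetical `PlatformHealth` record and `summarize` helper (neither is Hermes's actual data model or API). It reports process liveness and transport health as two separate facts, so a running PID cannot mask a dead adapter:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlatformHealth:
    """Hypothetical per-platform transport snapshot (illustrative only)."""
    name: str
    state: str  # "connected", "disconnected", or "reconnecting"
    last_inbound_at: Optional[float] = None  # Unix time of last inbound event
    last_error: Optional[str] = None


def summarize(process_running: bool, platforms: list[PlatformHealth]) -> dict:
    """Report process liveness and transport health as separate fields."""
    unhealthy = [p.name for p in platforms if p.state != "connected"]
    return {
        "process_running": process_running,
        "all_transports_healthy": process_running and not unhealthy,
        "unhealthy_platforms": unhealthy,
    }


# The failure described above: PID alive, QQBot transport down.
report = summarize(
    True,
    [PlatformHealth("qqbot", "disconnected", last_error="gateway URL fetch failed")],
)
print(report)
```

A status command built on something like this could still print "gateway running" while also flagging `qqbot` as disconnected.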

Requested improvements

  1. Expose per-platform transport health in hermes gateway status

    • show each configured/connected platform state, e.g. qqbot: connected/disconnected/reconnecting;
    • include last inbound event time, last successful heartbeat/reconnect, and last error where available;
    • distinguish "process running" from "all platform transports healthy".
  2. Add a Gateway/platform watchdog

    • detect platform adapters that are disconnected/stale for more than a configurable threshold while the gateway process is still alive;
    • for restartable background installs, restart the gateway or platform adapter;
    • notify configured home channels before/after recovery attempts.
  3. Improve QQBot reconnect logging

    • for gateway URL fetch failures, log exception type/repr and HTTP/network context where safe;
    • avoid empty messages such as "Reconnect failed: Failed to get QQ Bot gateway URL:" that end without any actionable detail.
  4. Expand regression coverage

    • gateway URL temporary failure;
    • closed stale WebSocket after reconnect failure;
    • op7/4009 server reconnects;
    • gateway PID alive but platform transport disconnected/stale.
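For item 3, one possible way to avoid empty log messages is to log `repr()` of the exception and its cause rather than `str()`, which is empty for exceptions raised without arguments. The sketch below is illustrative only: the logger name and both helper functions are hypothetical and are not taken from the Hermes code base.

```python
import logging

logger = logging.getLogger("qqbot.reconnect")  # hypothetical logger name


def describe_exception(exc: BaseException) -> str:
    """Return a non-empty description even when str(exc) is empty."""
    parts = [repr(exc)]  # repr() always includes the exception type
    cause = exc.__cause__ or exc.__context__
    if cause is not None:
        parts.append(f"caused by {cause!r}")
    return "; ".join(parts)


def log_gateway_url_failure(exc: BaseException, attempt: int) -> str:
    """Log a reconnect failure with type/repr detail and return the message."""
    message = (
        f"Reconnect failed while fetching gateway URL "
        f"(attempt {attempt}): {describe_exception(exc)}"
    )
    logger.warning("%s", message)
    return message


# Simulate a fetch failure whose underlying exception has an empty str().
try:
    try:
        raise TimeoutError()
    except TimeoutError as inner:
        raise RuntimeError("Failed to get QQ Bot gateway URL") from inner
except RuntimeError as err:
    line = log_gateway_url_failure(err, attempt=1)
```

Here `str(TimeoutError())` would be an empty string, which is the kind of value that produces a log line ending in a bare colon; the `repr()` form keeps the exception type visible.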

Context

PR #40198 fixes the immediate QQBot reconnect-loop bug for #31101. This issue tracks the broader observability and auto-recovery work so it can be developed separately without bloating the focused reconnect fix.

Observed live failure pattern

Gateway process: running
QQBot: WebSocket closed
QQBot: Reconnect failed while fetching gateway URL
After that: no new QQBot inbound messages until gateway restart

Expected behavior: status should reveal the platform-level failure, and watchdog/recovery should prevent a permanently silent QQBot while the gateway PID remains alive.
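The watchdog half of this expectation could look roughly like the following Python sketch. Everything here is an assumption for illustration: `find_stale`, `watchdog_tick`, the `recover` and `notify` callbacks, and the `other_platform` name are hypothetical, and a real implementation would need to hook into the gateway's own adapter lifecycle.

```python
import time
from typing import Callable, Optional


def find_stale(
    last_activity: dict[str, Optional[float]], now: float, threshold_s: float
) -> list[str]:
    """Return platforms with no activity (inbound event or heartbeat)
    within the last threshold_s seconds."""
    stale = []
    for name, seen_at in last_activity.items():
        if seen_at is None or now - seen_at > threshold_s:
            stale.append(name)
    return stale


def watchdog_tick(
    last_activity: dict[str, Optional[float]],
    threshold_s: float,
    recover: Callable[[str], None],
    notify: Callable[[str], None],
    now: Optional[float] = None,
) -> list[str]:
    """Run one watchdog pass: notify, attempt recovery, notify again."""
    now = time.time() if now is None else now
    stale = find_stale(last_activity, now, threshold_s)
    for name in stale:
        notify(f"{name}: no activity for more than {threshold_s:.0f}s, attempting recovery")
        recover(name)
        notify(f"{name}: recovery attempt finished")
    return stale


# One pass with fixed timestamps: qqbot last seen 300s ago, threshold 120s.
events: list[str] = []
restarted: list[str] = []
stale = watchdog_tick(
    {"qqbot": 1000.0, "other_platform": 1290.0},
    threshold_s=120,
    recover=restarted.append,
    notify=events.append,
    now=1300.0,
)
```

Passing `now` explicitly keeps the check deterministic in tests. Using heartbeat time as well as inbound event time matters because a platform with no user traffic is quiet, not necessarily stale.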
