hermes - ✅(Solved) Fix QQBot adapter does not notify gateway on reconnect exhaustion; Telegram retry state not reflected in runtime status [1 pull requests, 1 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#29005 (fetched 2026-05-20 04:00:40)
Comments: 1 · Participants: 2 · Timeline events: 7 · Reactions: 0
Timeline (top): labeled ×5, commented ×1, cross-referenced ×1

Error Message

When platform adapters lose network connectivity and exhaust their internal reconnection retries, they silently stop without notifying the gateway's fatal-error handling mechanism. This leaves the gateway process alive but the platform permanently unresponsive. In the Telegram adapter, the runtime status also remains connected throughout retries 1-10 of _handle_polling_network_error(); the fatal error is only set at attempt 11, so external monitors cannot detect degraded connectivity during the retry window.

Root Cause

File: gateway/platforms/qqbot/adapter.py

At lines ~608 and ~620, when attempt >= MAX_RECONNECT_ATTEMPTS:

# Current (buggy):
self._mark_disconnected()

# Should be:
self._set_fatal_error(f"QQBot reconnect exhausted after {MAX_RECONNECT_ATTEMPTS} attempts")
await self._notify_fatal_error()

Without _notify_fatal_error(), the gateway's _platform_reconnect_watcher() never attempts to restart the QQBot adapter.
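The escalation described above can be sketched in isolation. This is a minimal illustration, not the real QQAdapter: the class, `_connect_once`, the `on_fatal` handler and the tiny retry budget are all hypothetical. Only the names `_set_fatal_error`, `_notify_fatal_error` and `MAX_RECONNECT_ATTEMPTS` come from the issue.

```python
import asyncio

MAX_RECONNECT_ATTEMPTS = 3  # illustrative; the issue's log shows a budget of 100


class SketchAdapter:
    """Hypothetical stand-in for a platform adapter (not the real QQAdapter)."""

    def __init__(self, fatal_error_handler=None):
        self.fatal_error = None
        self._fatal_error_handler = fatal_error_handler

    def _set_fatal_error(self, message):
        self.fatal_error = message

    async def _notify_fatal_error(self):
        # Tolerate adapters that were constructed without a handler.
        if self._fatal_error_handler is not None:
            await self._fatal_error_handler(self)

    async def _connect_once(self):
        # Always fails in this sketch, to force retry exhaustion.
        raise ConnectionError("DNS resolution failed")

    async def listen_loop(self):
        attempt = 0
        while True:
            try:
                await self._connect_once()
                return
            except ConnectionError:
                attempt += 1
                if attempt >= MAX_RECONNECT_ATTEMPTS:
                    # Escalate instead of only marking the adapter disconnected.
                    self._set_fatal_error(
                        f"QQBot reconnect exhausted after {MAX_RECONNECT_ATTEMPTS} attempts"
                    )
                    await self._notify_fatal_error()
                    return
                await asyncio.sleep(0)  # real code would back off here


failed_platforms = []


async def on_fatal(adapter):
    # Stand-in for whatever the gateway does when it receives the event.
    failed_platforms.append(adapter.fatal_error)


adapter = SketchAdapter(fatal_error_handler=on_fatal)
asyncio.run(adapter.listen_loop())
print(failed_platforms)
```

The point of the sketch is that the handler is awaited exactly once when the budget runs out, so something outside the adapter learns that it has stopped.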

File: gateway/platforms/telegram.py

During _handle_polling_network_error(), the runtime status remains connected during retries 1-10. The fatal error is only set at attempt 11. Consider writing a retrying state during the retry window so external monitors can detect degraded connectivity earlier.
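To illustrate why an intermediate state helps monitors, here is a hedged sketch of a health check over a parsed status document. The nested `platforms.<name>.state` layout is an assumption based on how the PR refers to `platforms.telegram.state`, and the function name and the ok/degraded/down labels are hypothetical.

```python
import json


def platform_health(status, platform):
    """Map a platform's state string to a coarse health label."""
    state = status.get("platforms", {}).get(platform, {}).get("state")
    if state == "connected":
        return "ok"
    if state == "retrying":
        return "degraded"
    # Missing, disconnected, or any other state is treated as down here.
    return "down"


# Assumed shape of the runtime status document, for illustration only.
sample = json.loads('{"platforms": {"telegram": {"state": "retrying"}}}')
print(platform_health(sample, "telegram"))
```

Without a retrying state, a check like this would keep reporting "ok" for the whole retry window.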

Fix Action

Fixed

PR fix notes

PR #29007: fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005)

Description (problem / solution / changelog)

What does this PR do?

Fixes #29005 — both halves of the reported regression:

P0 — QQBot: the three MAX_RECONNECT_ATTEMPTS exit paths in QQAdapter._listen_loop ended with a bare _mark_disconnected() and a return. That writes platform_state=disconnected to the runtime status file and kills the listener task — but does not fire the fatal_error_handler that GatewayRunner._platform_reconnect_watcher is wired through. With no fatal-error event, the watcher never adds the platform to _failed_platforms and the adapter stays permanently dead while the gateway process keeps running. Reporter's log shows exactly this: QQBot exhausts its 100-attempt budget at 07:33 and from then on nothing tries to bring it back. Fix replaces each _mark_disconnected() with _set_fatal_error(..., retryable=True) + await _notify_fatal_error(), so the same gateway watcher that already handles Telegram / WhatsApp / Slack picks the platform up with exponential back-off (cap 5 min, circuit-broken after 10 consecutive failures).
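The watcher's retry schedule mentioned above (exponential back-off, 5 min cap, circuit broken after 10 consecutive failures) could look roughly like the following. The function name, the 10 s base delay and the doubling factor are assumptions; only the cap and the failure limit come from the PR text.

```python
def restart_delay(failures, base=10.0, cap=300.0, max_failures=10):
    """Seconds to wait before the next restart attempt, or None once the circuit is open.

    `failures` is the number of consecutive failed restarts so far.
    """
    if failures >= max_failures:
        return None  # circuit broken: stop retrying this platform
    return min(cap, base * 2 ** failures)


schedule = [restart_delay(n) for n in range(11)]
print(schedule)
```

With these assumed parameters the delay reaches the 5 min cap on the sixth attempt and stays there until the circuit opens.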

P1 — Telegram: during the polling reconnect ladder (_handle_polling_network_error, default 10 attempts with exponential back-off capped at 60 s ≈ 55 min total), platform_state stays at connected because _mark_connected() was called at startup. Anything reading platforms.telegram.state (Prometheus exporter, kanban dashboard, status MCP) has no way to tell the bot has been unreachable for several minutes. The fatal-state flip only happens at attempt 11. Fix writes platform_state=retrying before each back-off sleep with the attempt counter + last error in error_message, and writes connected back on successful reconnect.
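A minimal sketch of the retrying write described above, under stated assumptions: the status fields and message format follow the PR's description, while the free-standing functions, the injectable `sleep`, the 5 s base delay and the in-memory `status_log` are hypothetical stand-ins for the real adapter methods and status file. The fatal escalation on the final attempt is omitted.

```python
import asyncio

MAX_ATTEMPTS = 10
status_log = []  # stands in for the runtime status file


def write_runtime_status_safe(**fields):
    status_log.append(fields)


def backoff_seconds(attempt, base=5, cap=60):
    # Exponential back-off capped at 60 s; the base value is an assumption.
    return min(cap, base * 2 ** (attempt - 1))


async def handle_polling_network_error(attempt, error, start_polling, sleep=asyncio.sleep):
    delay = backoff_seconds(attempt)
    # Surface the degraded state before sleeping, so monitors see it immediately.
    write_runtime_status_safe(
        platform_state="retrying",
        error_code="telegram_network_error",
        error_message=f"Network error retry {attempt}/{MAX_ATTEMPTS} (next in {delay}s): {error}",
    )
    await sleep(delay)
    await start_polling()
    # Polling restarted without raising: clear the retrying state.
    write_runtime_status_safe(platform_state="connected")


async def _no_sleep(_seconds):
    pass


async def _start_ok():
    pass


asyncio.run(handle_polling_network_error(1, "DNS failure", _start_ok, sleep=_no_sleep))
print(status_log)
```

If `start_polling()` raises, the connected write is skipped and the status stays at retrying, which is the behaviour a monitor would want.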

Both halves reuse the existing _set_fatal_error / _write_runtime_status_safe helpers and the existing retrying state string the gateway's fatal-error handler already emits — no new schema, no new env vars, no public-API change.

Related Issue

Fixes #29005.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/platforms/qqbot/adapter.py (+36/-3) — _listen_loop:
    • Replace _mark_disconnected() with _set_fatal_error("qq_reconnect_exhausted", ..., retryable=True) + await _notify_fatal_error() at all three MAX_RECONNECT_ATTEMPTS exits (rate-limit branch, generic QQCloseError branch, generic Exception branch).
    • Single shared error code; messages differentiate the root cause so log greps still tell the three sites apart.
  • gateway/platforms/telegram.py (+27) — _handle_polling_network_error:
    • Before sleeping: _write_runtime_status_safe(platform_state="retrying", error_code="telegram_network_error", error_message="Network error retry N/MAX (next in Xs): <err>").
    • After successful start_polling(): write platform_state="connected" to clear the retrying state.
  • tests/gateway/test_platform_reconnect_fatal_notify.py (+288, new) — 7 regression tests across 2 classes:
    • TestQQBotReconnectExhaustionNotifies (3 cases) — handler awaited with retryable=True; static guard over _listen_loop source proving at least 3 _set_fatal_error + 3 _notify_fatal_error escalations and the qq_reconnect_exhausted code remain; handler-less adapters don't crash.
    • TestTelegramPollingRetryStatus (4 cases) — first-attempt retrying write happens; the write carries the N/MAX counter + telegram_network_error code; successful reconnect emits retrying then connected in order; final-attempt fatal escalation still fires (no regression for #3173).

No production code outside the two adapter files modified.
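The "static guard" idea from the test list can be sketched with the standard `ast` module: parse the source text and count calls to a given method. The sample source below is a made-up stand-in for `_listen_loop`, including its branch conditions and messages, and the helper name is hypothetical. A real test would need to obtain the actual function source (for example via `inspect.getsource`, which is an assumption about how the PR does it).

```python
import ast

# Illustrative stand-in for the real _listen_loop body; it is only parsed, never run.
SAMPLE_SOURCE = '''
async def _listen_loop(self):
    if rate_limited:
        self._set_fatal_error("qq_reconnect_exhausted", "rate limited", retryable=True)
        await self._notify_fatal_error()
        return
    if close_error:
        self._set_fatal_error("qq_reconnect_exhausted", "close error", retryable=True)
        await self._notify_fatal_error()
        return
    self._set_fatal_error("qq_reconnect_exhausted", "unexpected error", retryable=True)
    await self._notify_fatal_error()
'''


def count_method_calls(source, method_name):
    """Count attribute-style calls such as self.<method_name>(...) in the source."""
    tree = ast.parse(source)
    return sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == method_name
    )


print(count_method_calls(SAMPLE_SOURCE, "_set_fatal_error"))
print(count_method_calls(SAMPLE_SOURCE, "_notify_fatal_error"))
```

A guard like this fails if a later refactor silently drops one of the escalation sites, even when no behavioural test exercises that branch.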

How to Test

  1. Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
  2. Run the new tests on their own:
    scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py -v
    Expected: 7 passed.
  3. Run the surrounding suite to confirm no cross-file regressions:
    scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py \
      tests/gateway/test_telegram_network_reconnect.py \
      tests/gateway/test_qqbot.py
    Expected: 166 passed.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(qqbot): ..., fix(telegram): ..., test(gateway): ...)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A (no public-API change; inline comments call out the gateway-watcher contract + #29005)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (no new config)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — both adapters' status-write helpers are platform-agnostic; fix is reachable on every OS that runs the gateway
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py -v
4 workers [7 items]
============================== 7 passed in 1.17s ===============================

$ scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py \
    tests/gateway/test_telegram_network_reconnect.py \
    tests/gateway/test_qqbot.py
4 workers [166 items]
======================== 166 passed, 1 warning in 2.20s ========================

Made with Cursor

Changed files

  • gateway/platforms/qqbot/adapter.py (modified, +36/-3)
  • gateway/platforms/telegram.py (modified, +27/-0)
  • tests/gateway/test_platform_reconnect_fatal_notify.py (added, +288/-0)

Code Example

06:38  DNS resolution fails
06:38  Telegram starts retrying (1/10, 2/10, ...)
06:38  QQBot starts retrying (1/100, 2/100, ...)
07:33  Both adapters exhaust retries and stop
07:33  No more platform activity; gateway process still alive

Suggested Fixes

| Priority | File | Change |
|----------|------|--------|
| P0 | qqbot/adapter.py | On reconnect exhaustion: _set_fatal_error() + _notify_fatal_error() |
| P1 | telegram.py | Write retrying state during network error retries |

Environment

  • hermes-agent version: latest main (May 20, 2026)
  • OS: macOS 26.5
  • Python: 3.12
