hermes - ✅(Solved) Fix QQBot adapter does not notify gateway on reconnect exhaustion; Telegram retry state not reflected in runtime status [1 pull requests, 1 comments, 2 participants]

jfdnet · 2026-05-20T00:53:17Z

GitHub stats

NousResearch/hermes-agent#29005 • Fetched 2026-05-20 04:00:40

Author: jfdnet
Participants: alt-glitch, jfdnet
Timeline (top): labeled ×5, commented ×1, cross-referenced ×1

Error Message

When platform adapters lose network connectivity and exhaust their internal reconnection retries, they silently stop without notifying the gateway's fatal-error handling mechanism. This leaves the gateway process alive but the platform permanently unresponsive.

Separately, during _handle_polling_network_error() in telegram.py, the runtime status remains connected during retries 1-10, and the fatal error is only set at attempt 11. Writing a retrying state during the retry window (the P1 suggestion) would let external monitors detect degraded connectivity earlier.

Root Cause

File: gateway/platforms/qqbot/adapter.py

At lines ~608 and ~620, when attempt >= MAX_RECONNECT_ATTEMPTS:

# Current (buggy):
self._mark_disconnected()

# Should be:
self._set_fatal_error(f"QQBot reconnect exhausted after {MAX_RECONNECT_ATTEMPTS} attempts")
await self._notify_fatal_error()

Without _notify_fatal_error(), the gateway's _platform_reconnect_watcher() never attempts to restart the QQBot adapter.
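The escalation path described above can be sketched in a few lines. This is a minimal illustration, not the real QQAdapter: `SketchAdapter`, `_connect_once`, the `on_fatal` handler, and the demo value of `MAX_RECONNECT_ATTEMPTS` are all hypothetical, and only the method names `_set_fatal_error` / `_notify_fatal_error` and the `qq_reconnect_exhausted` code come from the issue and PR.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

MAX_RECONNECT_ATTEMPTS = 3  # demo value; the reporter's log shows a 100-attempt budget


class SketchAdapter:
    """Hypothetical stand-in for a platform adapter, not the real QQAdapter."""

    def __init__(
        self,
        fatal_error_handler: Optional[Callable[["SketchAdapter"], Awaitable[None]]] = None,
    ) -> None:
        self.fatal_error_handler = fatal_error_handler
        self.fatal_error_code: Optional[str] = None
        self.fatal_error_message: Optional[str] = None
        self.fatal_error_retryable = False

    def _set_fatal_error(self, code: str, message: str, *, retryable: bool) -> None:
        # Record why the adapter gave up so the handler can decide what to do.
        self.fatal_error_code = code
        self.fatal_error_message = message
        self.fatal_error_retryable = retryable

    async def _notify_fatal_error(self) -> None:
        # Handler-less adapters must not crash, so a missing handler is a no-op.
        if self.fatal_error_handler is not None:
            await self.fatal_error_handler(self)

    async def _connect_once(self) -> None:
        # Simulated outage: every connection attempt fails.
        raise ConnectionError("DNS resolution failed")

    async def _listen_loop(self) -> None:
        attempt = 0
        while True:
            try:
                await self._connect_once()
                return
            except ConnectionError as exc:
                attempt += 1
                if attempt >= MAX_RECONNECT_ATTEMPTS:
                    # Escalate instead of only marking the adapter disconnected.
                    self._set_fatal_error(
                        "qq_reconnect_exhausted",
                        f"reconnect exhausted after {attempt} attempts: {exc}",
                        retryable=True,
                    )
                    await self._notify_fatal_error()
                    return
                await asyncio.sleep(0)  # real code would back off here


async def demo() -> List[str]:
    failed_platforms: List[str] = []

    async def on_fatal(adapter: SketchAdapter) -> None:
        # Stand-in for the gateway side: queue retryable failures for restart.
        if adapter.fatal_error_retryable:
            failed_platforms.append("qqbot")

    await SketchAdapter(fatal_error_handler=on_fatal)._listen_loop()
    return failed_platforms
```

Running `asyncio.run(demo())` leaves the platform queued for a restart. With a bare "mark disconnected and return", the handler would never be called and the list would stay empty, which is the failure mode the issue reports.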

File: gateway/platforms/telegram.py

During _handle_polling_network_error(), the runtime status remains connected during retries 1-10. The fatal error is only set at attempt 11. Consider writing a retrying state during the retry window so external monitors can detect degraded connectivity earlier.

Fix Action

Fixed

Fixed by PR: fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005) (https://github.com/NousResearch/hermes-agent/pull/29007)

PR fix notes

PR #29007: fix(gateway): notify on reconnect exhaustion + surface telegram retrying state (#29005)

Repository: NousResearch/hermes-agent
Author: xxxigm
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/29007

Description (problem / solution / changelog)

What does this PR do?

Fixes #29005 — both halves of the reported regression:

P0 — QQBot: the three MAX_RECONNECT_ATTEMPTS exit paths in QQAdapter._listen_loop ended with a bare _mark_disconnected() and a return. That writes platform_state=disconnected to the runtime status file and kills the listener task — but does not fire the fatal_error_handler that GatewayRunner._platform_reconnect_watcher is wired through. With no fatal-error event, the watcher never adds the platform to _failed_platforms and the adapter stays permanently dead while the gateway process keeps running. Reporter's log shows exactly this: QQBot exhausts its 100-attempt budget at 07:33 and from then on nothing tries to bring it back. Fix replaces each _mark_disconnected() with _set_fatal_error(..., retryable=True) + await _notify_fatal_error(), so the same gateway watcher that already handles Telegram / WhatsApp / Slack picks the platform up with exponential back-off (cap 5 min, circuit-broken after 10 consecutive failures).
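For intuition about the watcher-side policy mentioned above (exponential back-off capped at 5 minutes, circuit-broken after 10 consecutive failures), here is a small sketch of such a delay schedule. The 30-second base delay and the function name `next_retry_delay` are assumptions made for illustration; the PR text states only the cap and the failure limit.

```python
from typing import List, Optional

BASE_DELAY_S = 30.0            # assumption: the PR does not state the base delay
MAX_DELAY_S = 300.0            # "cap 5 min" from the PR description
MAX_CONSECUTIVE_FAILURES = 10  # "circuit-broken after 10 consecutive failures"


def next_retry_delay(consecutive_failures: int) -> Optional[float]:
    """Seconds to wait before the next restart attempt, or None once the circuit is open."""
    if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        return None  # stop retrying; operator intervention is needed
    return min(BASE_DELAY_S * (2 ** consecutive_failures), MAX_DELAY_S)


# Delay after 0, 1, 2, ... consecutive failures.
schedule: List[Optional[float]] = [
    next_retry_delay(n) for n in range(MAX_CONSECUTIVE_FAILURES + 1)
]
```

Under these assumed values the delay doubles from 30 s until it hits the 300 s cap, then stays there until the tenth consecutive failure opens the circuit.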

P1 — Telegram: during the polling reconnect ladder (_handle_polling_network_error, default 10 attempts with exponential back-off capped at 60 s ≈ 55 min total), platform_state stays at connected because _mark_connected() was called at startup. Anything reading platforms.telegram.state (Prometheus exporter, kanban dashboard, status MCP) has no way to tell the bot has been unreachable for several minutes. The fatal-state flip only happens at attempt 11. Fix writes platform_state=retrying before each back-off sleep with the attempt counter + last error in error_message, and writes connected back on successful reconnect.
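The Telegram half can be sketched as a retry loop that reports its state before each sleep. This is an illustrative sketch only: `reconnect_with_status`, the injected `write_status` / `sleep` callables, and the 1-second base delay are hypothetical, while the `retrying` / `connected` states, the `telegram_network_error` code, the 60 s cap, and the message shape are taken from the PR description.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

MAX_ATTEMPTS = 10    # default attempt budget from the PR description
BASE_DELAY_S = 1.0   # assumption: the base delay is not stated in the PR
MAX_DELAY_S = 60.0   # "capped at 60 s"


async def reconnect_with_status(
    start_polling: Callable[[], Awaitable[None]],
    write_status: Callable[..., None],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    last_error: str = "network down",
) -> bool:
    """Retry start_polling(), publishing 'retrying' before every back-off sleep."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        delay = min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)
        write_status(
            platform_state="retrying",
            error_code="telegram_network_error",
            error_message=(
                f"Network error retry {attempt}/{MAX_ATTEMPTS} "
                f"(next in {delay:.0f}s): {last_error}"
            ),
        )
        await sleep(delay)
        try:
            await start_polling()
        except OSError as exc:
            last_error = str(exc)
            continue
        # Success: clear the retrying state so monitors see a healthy platform.
        write_status(platform_state="connected", error_code=None, error_message=None)
        return True
    return False  # budget exhausted; the caller escalates to a fatal error


async def demo() -> Tuple[bool, List[Dict[str, Any]]]:
    writes: List[Dict[str, Any]] = []
    calls = {"n": 0}

    async def flaky_start_polling() -> None:
        # Fails twice, then succeeds on the third call.
        calls["n"] += 1
        if calls["n"] < 3:
            raise OSError("Temporary failure in name resolution")

    async def no_sleep(_seconds: float) -> None:
        return None  # skip real waiting in the demo

    ok = await reconnect_with_status(
        flaky_start_polling, lambda **fields: writes.append(fields), sleep=no_sleep
    )
    return ok, writes
```

In the demo, the status sequence is three `retrying` writes followed by one `connected` write, so a reader of the status file sees the outage from the first failed attempt rather than only at the fatal flip.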

Both halves reuse the existing _set_fatal_error / _write_runtime_status_safe helpers and the existing retrying state string the gateway's fatal-error handler already emits — no new schema, no new env vars, no public-API change.
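Since no schema change is involved, a status reader only needs to treat `retrying` as a distinct condition. The sketch below assumes the runtime status is a JSON document with a nested `platforms.<name>.state` field, as the `platforms.telegram.state` path in the description suggests. The exact file layout, the `classify` helper, and the "ok / degraded / down" labels are hypothetical.

```python
import json
from typing import Any, Dict


def classify(status: Dict[str, Any], platform: str) -> str:
    """Map a platform's reported state to a coarse health label."""
    state = status.get("platforms", {}).get(platform, {}).get("state")
    if state == "connected":
        return "ok"
    if state == "retrying":
        return "degraded"  # reachable again soon, or about to go fatal
    return "down"          # disconnected, fatal, missing, or unknown


# Hypothetical status payload, shaped after the fields named in the PR.
sample = json.loads(
    '{"platforms": {"telegram": {"state": "retrying", '
    '"error_code": "telegram_network_error", '
    '"error_message": "Network error retry 3/10 (next in 4s): timed out"}}}'
)
health = classify(sample, "telegram")
```

Treating missing or unknown states as "down" is a deliberately conservative choice for a monitor; a dashboard might prefer a separate "unknown" label.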

Related Issue

Fixes #29005.

Type of Change

- [x] 🐛 Bug fix (non-breaking change that fixes an issue)
- [ ] ✨ New feature (non-breaking change that adds functionality)
- [ ] 🔒 Security fix
- [ ] 📝 Documentation update
- [x] ✅ Tests (adding or improving test coverage)
- [ ] ♻️ Refactor (no behavior change)
- [ ] 🎯 New skill (bundled or hub)

Changes Made

- gateway/platforms/qqbot/adapter.py (+36/-3), _listen_loop:
  - Replace _mark_disconnected() with _set_fatal_error("qq_reconnect_exhausted", ..., retryable=True) + await _notify_fatal_error() at all three MAX_RECONNECT_ATTEMPTS exits (rate-limit branch, generic QQCloseError branch, generic Exception branch).
  - Single shared error code; messages differentiate the root cause so log greps still tell the three sites apart.
- gateway/platforms/telegram.py (+27), _handle_polling_network_error:
  - Before sleeping: _write_runtime_status_safe(platform_state="retrying", error_code="telegram_network_error", error_message="Network error retry N/MAX (next in Xs): <err>").
  - After successful start_polling(): write platform_state="connected" to clear the retrying state.
- tests/gateway/test_platform_reconnect_fatal_notify.py (+288, new), 7 regression tests across 2 classes:
  - TestQQBotReconnectExhaustionNotifies (3 cases): handler awaited with retryable=True; static guard over _listen_loop source proving at least 3 _set_fatal_error + 3 _notify_fatal_error escalations and the qq_reconnect_exhausted code remain; handler-less adapters don't crash.
  - TestTelegramPollingRetryStatus (4 cases): first-attempt retrying write happens; the write carries the N/MAX counter + telegram_network_error code; successful reconnect emits retrying then connected in order; final-attempt fatal escalation still fires (no regression for #3173).

No production code outside the two adapter files modified.
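The "static guard over _listen_loop source" idea can be illustrated with a plain substring count. This is a sketch of the technique, not the PR's actual test: `has_required_escalations` and `SAMPLE_SOURCE` are hypothetical, and a real test would more likely obtain the text from something like `inspect.getsource(QQAdapter._listen_loop)`, which is an assumption here.

```python
# Stand-in for the source text of a listener loop with three exhaustion exits.
SAMPLE_SOURCE = '''
async def _listen_loop(self):
    if rate_limited and attempt >= MAX_RECONNECT_ATTEMPTS:
        self._set_fatal_error("qq_reconnect_exhausted", "rate limited", retryable=True)
        await self._notify_fatal_error()
        return
    if close_error and attempt >= MAX_RECONNECT_ATTEMPTS:
        self._set_fatal_error("qq_reconnect_exhausted", "close error", retryable=True)
        await self._notify_fatal_error()
        return
    if other_error and attempt >= MAX_RECONNECT_ATTEMPTS:
        self._set_fatal_error("qq_reconnect_exhausted", "unexpected error", retryable=True)
        await self._notify_fatal_error()
        return
'''


def has_required_escalations(source: str, minimum: int = 3) -> bool:
    """True if the source still contains enough escalation calls and the shared code."""
    return (
        source.count("_set_fatal_error(") >= minimum
        and source.count("_notify_fatal_error(") >= minimum
        and "qq_reconnect_exhausted" in source
    )


guard_ok = has_required_escalations(SAMPLE_SOURCE)

# Simulate a regression: one exit goes back to a bare _mark_disconnected().
regressed = SAMPLE_SOURCE.replace("_notify_fatal_error(", "_mark_disconnected(", 1)
guard_catches_regression = not has_required_escalations(regressed)
```

A guard like this is coarse (it cannot tell which branch lost its escalation), but it is cheap and fails loudly if a later refactor quietly drops one of the three notifications.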

How to Test

1. Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"

2. Run the new tests on their own:

   scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py -v

   Expected: 7 passed.

3. Run the surrounding suite to confirm no cross-file regressions:

   scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py \
     tests/gateway/test_telegram_network_reconnect.py \
     tests/gateway/test_qqbot.py

   Expected: 166 passed.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(qqbot): ..., fix(telegram): ..., test(gateway): ...)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix (no unrelated commits)
I've run scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py and all tests pass
I've added tests for my changes
I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — N/A (no public-API change; inline comments call out the gateway-watcher contract + #29005)
I've updated cli-config.yaml.example if I added/changed config keys — N/A (no new config)
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — both adapters' status-write helpers are platform-agnostic; fix is reachable on every OS that runs the gateway
I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

$ scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py -v
4 workers [7 items]
============================== 7 passed in 1.17s ===============================

$ scripts/run_tests.sh tests/gateway/test_platform_reconnect_fatal_notify.py \
    tests/gateway/test_telegram_network_reconnect.py \
    tests/gateway/test_qqbot.py
4 workers [166 items]
======================== 166 passed, 1 warning in 2.20s ========================

Made with Cursor

Changed files

gateway/platforms/qqbot/adapter.py (modified, +36/-3)
gateway/platforms/telegram.py (modified, +27/-0)
tests/gateway/test_platform_reconnect_fatal_notify.py (added, +288/-0)


Bug Summary

Observed Behavior

06:38  DNS resolution fails
06:38  Telegram starts retrying (1/10, 2/10, ...)
06:38  QQBot starts retrying (1/100, 2/100, ...)
07:33  Both adapters exhaust retries and stop
07:33  No more platform activity; gateway process still alive

Suggested Fixes

| Priority | File | Change |
| --- | --- | --- |
| P0 | `qqbot/adapter.py` | On reconnect exhaustion: `_set_fatal_error()` + `_notify_fatal_error()` |
| P1 | `telegram.py` | Write `retrying` state during network error retries |

Environment

hermes-agent version: latest main (May 20, 2026)
OS: macOS 26.5
Python: 3.12
