hermes - ✅(Solved) Fix Gateway reconnect watcher permanently stops retryable platforms after 20 failed attempts [2 pull requests, 2 comments, 3 participants]

NousResearch/hermes-agent#17063 (fetched 2026-04-29 06:37:27)

The gateway-level platform reconnect watcher removes retryable platform failures from _failed_platforms after 20 failed attempts. For long-running messaging adapters such as Telegram, this converts a transient network/proxy outage into a permanent outage until the gateway is manually restarted.

This is distinct from #11614: #11614 covers the gateway exiting when all platforms fail. This bug happens while the gateway process stays alive: the retry queue drops the platform after the hard cap, so the platform never reconnects even after the network recovers.

Root Cause

In gateway/run.py, _platform_reconnect_watcher uses a fixed _MAX_ATTEMPTS = 20 and deletes the platform from _failed_platforms once info["attempts"] >= _MAX_ATTEMPTS.

That behavior is correct for non-retryable errors such as bad auth tokens, but wrong for retryable network failures. A retryable platform should continue retrying indefinitely with capped backoff, or otherwise escalate to a supervisor/restart path. It should not silently stop.
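
To make the failure mode concrete, here is a minimal sketch of the give-up logic described above. Only _MAX_ATTEMPTS, _failed_platforms, and info["attempts"] come from the issue; the function name watcher_tick_buggy and the reconnect callable are hypothetical stand-ins, and the real watcher in gateway/run.py is a background task with backoff timing that this sketch leaves out.

```python
_MAX_ATTEMPTS = 20  # the fixed cap named in the issue


def watcher_tick_buggy(failed_platforms, reconnect):
    """One pass over the retry queue, mimicking the pre-fix give-up branch.

    `reconnect` is a hypothetical callable: platform name -> bool (True on success).
    """
    for platform in list(failed_platforms):
        info = failed_platforms[platform]
        if info["attempts"] >= _MAX_ATTEMPTS:
            # The bug: the platform is dropped even though the error is retryable.
            del failed_platforms[platform]
            continue
        info["attempts"] += 1
        if reconnect(platform):
            del failed_platforms[platform]


# Simulate a long outage: every reconnect attempt fails.
queue = {"telegram": {"attempts": 0}}
ticks = 0
while queue:
    watcher_tick_buggy(queue, reconnect=lambda platform: False)
    ticks += 1

print(ticks, queue)
```

Once the queue is empty, nothing in this loop will ever try Telegram again, even if reconnect would now succeed. That is the silent stop the issue describes.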

Fix Action

Fix / Workaround

Local patch validated

A local patch changed the watcher to keep retryable failures queued indefinitely with 300s capped backoff. The focused test suite passed (venv/bin/python -m pytest tests/gateway/test_platform_reconnect.py -q reported 14 passed).

PR fix notes

PR #17216: fix: gateway reconnect watcher retries indefinitely instead of giving up after 20 attempts

Description (problem / solution / changelog)

Problem

The platform reconnect watcher in gateway/run.py permanently removes retryable platforms from _failed_platforms after 20 failed attempts (_MAX_ATTEMPTS = 20). For long-running gateways (days/weeks), this converts transient network/proxy outages into permanent disconnections requiring manual hermes gateway restart.

Observed timeline (from a real gateway):

  • Telegram hit repeated httpx.ConnectError during proxy-backed Bot API calls
  • Gateway retried for ~2 hours (20 attempts with exponential backoff)
  • Gateway logged Giving up reconnecting telegram after 20 attempts and removed Telegram from _failed_platforms
  • Telegram remained disconnected until manual restart, even after network recovered

Fixes #17063

Fix

Instead of deleting the platform from the retry queue after 20 attempts, reset the attempt counter and continue retrying at the backoff cap (5 minutes). This ensures long-running gateways eventually recover from transient outages.

Before: Platform permanently abandoned after 20 failed attempts.
After: Platform retries every 5 minutes indefinitely (until gateway restart or successful reconnect).

Changes

  • gateway/run.py: Replace del self._failed_platforms[platform] with info["attempts"] = 0 and schedule next retry at backoff cap
  • Change log level from WARNING to INFO (this is expected behavior, not an error)
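
The change described in this PR can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual diff: the function name on_max_attempts_reached, the next_retry key, and the now parameter are hypothetical, while _MAX_ATTEMPTS, info["attempts"] = 0, and the 5-minute cap come from the description above.

```python
import time

_MAX_ATTEMPTS = 20
_BACKOFF_CAP = 300  # seconds (5 minutes)


def on_max_attempts_reached(failed_platforms, platform, now=None):
    """Keep the platform queued instead of deleting it.

    The `next_retry` key and the `now` parameter are hypothetical names
    used only for this sketch.
    """
    now = time.monotonic() if now is None else now
    info = failed_platforms[platform]
    if info["attempts"] >= _MAX_ATTEMPTS:
        # Previously: del failed_platforms[platform]
        info["attempts"] = 0
        # Schedule the next retry at the backoff cap.
        info["next_retry"] = now + _BACKOFF_CAP
    return info


queue = {"telegram": {"attempts": 20, "next_retry": 0.0}}
on_max_attempts_reached(queue, "telegram", now=1000.0)
print(queue)
```

Note that resetting the counter discards the running attempt count, whereas PR #17219 lets the counter keep incrementing so attempt-count telemetry is preserved.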

Tests

  • Existing gateway reconnect tests should still pass
  • The fix is minimal (4 lines changed) and preserves all existing behavior except the permanent abandonment

Changed files

  • gateway/run.py (modified, +10/-4)

PR #17219: fix(gateway): don't give up on retryable platform reconnect (#17063)

Description (problem / solution / changelog)

What does this PR do?

_platform_reconnect_watcher had a fixed _MAX_ATTEMPTS = 20 and deleted the platform from _failed_platforms once that count was hit, even when the underlying error was a retryable network/proxy outage. For long-running messaging adapters (Telegram, Slack, Discord), the cap combined with the 5-minute capped backoff meant that a multi-hour proxy interruption was silently converted into a permanent outage that only hermes gateway restart could recover from.

The reporter observed exactly this on Telegram (#17063): httpx.ConnectError against the Bot API proxy → reconnect queue retried 20 times → Giving up reconnecting telegram after 20 attempts → Telegram stayed offline despite the proxy coming back later. Distinct from #11614, which is about the gateway exiting when all platforms fail at startup; here the gateway itself stays alive but silently loses one platform.

The fix drops the give-up branch entirely and lets retryable failures keep retrying at the 300s capped backoff indefinitely. The non-retryable fast-path (adapter.has_fatal_error and not adapter.fatal_error_retryable) is the correct "stop trying" gate and is unchanged — bad auth tokens and revoked credentials still drop out of the queue immediately.

Related Issue

Fixes #17063

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/run.py, _platform_reconnect_watcher: removed the _MAX_ATTEMPTS constant and the attempts >= _MAX_ATTEMPTS give-up branch. Updated the docstring to describe the two real exit conditions (successful reconnect, non-retryable fatal error). Dropped the now-meaningless /<MAX_ATTEMPTS> suffix from the per-attempt info log; the attempt counter still increments forever per platform, so attempt-count telemetry is preserved.
  • tests/gateway/test_platform_reconnect.py — replaced the pre-existing test_reconnect_gives_up_after_max_attempts (which codified the buggy behavior) with two regression tests: (1) a retryable failure at attempts=20 stays queued and attempts becomes 21, and (2) a non-retryable failure at attempts=25 is still removed (the fix must not soften that path).
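
The two regression tests described above might look roughly like this. It is a self-contained sketch, not the code in tests/gateway/test_platform_reconnect.py: handle_failed_attempt is a hypothetical stand-in for the watcher's failure handling, the queue is a plain dict keyed by a string rather than Platform.TELEGRAM, and the adapter is faked with SimpleNamespace. Only the attribute names has_fatal_error and fatal_error_retryable and the attempt counts come from the PR text.

```python
from types import SimpleNamespace


def handle_failed_attempt(failed_platforms, platform, adapter):
    """Hypothetical stand-in for the post-fix watcher's failure handling."""
    if adapter.has_fatal_error and not adapter.fatal_error_retryable:
        # Non-retryable (e.g. bad auth token): drop out of the queue.
        del failed_platforms[platform]
        return
    # Retryable: stay queued and keep counting attempts, with no upper cap.
    failed_platforms[platform]["attempts"] += 1


def test_retryable_stays_queued_past_old_cap():
    queue = {"telegram": {"attempts": 20}}
    adapter = SimpleNamespace(has_fatal_error=True, fatal_error_retryable=True)
    handle_failed_attempt(queue, "telegram", adapter)
    assert queue["telegram"]["attempts"] == 21


def test_non_retryable_is_still_removed():
    queue = {"telegram": {"attempts": 25}}
    adapter = SimpleNamespace(has_fatal_error=True, fatal_error_retryable=False)
    handle_failed_attempt(queue, "telegram", adapter)
    assert "telegram" not in queue


# Run directly; under pytest the two test_ functions are collected automatically.
test_retryable_stays_queued_past_old_cap()
test_non_retryable_is_still_removed()
```

The second test matters as much as the first: it guards against a fix that removes the cap by also weakening the non-retryable exit.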

How to Test

  1. Reproduce the issue manually: configure Telegram with a proxy you can take offline; wait for the gateway to mark Telegram retryable; hold the proxy down for 30 + 60 + 120 + 240 + 16 * 300 ≈ 90 minutes worth of attempts.
  2. Before this fix: gateway logs Giving up reconnecting telegram after 20 attempts; even after the proxy comes back, Telegram stays disconnected until hermes gateway restart.
  3. After this fix: gateway keeps logging Reconnecting telegram (attempt N)... at the 300s capped interval; once the proxy is reachable the next attempt succeeds and Telegram comes back online without operator intervention.
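
The wait-time arithmetic in step 1 can be checked with a few lines. The formula below is an assumption chosen to match the 30s, 60s, 120s, 240s, 300s schedule quoted on this page; the helper name backoff_seconds is hypothetical.

```python
def backoff_seconds(attempt, base=30, cap=300):
    """Delay before the given 1-based attempt, doubling from `base` up to `cap`."""
    return min(cap, base * 2 ** (attempt - 1))


delays = [backoff_seconds(n) for n in range(1, 21)]
total = sum(delays)

print(delays[:6])         # [30, 60, 120, 240, 300, 300]
print(total, total / 60)  # 5250 87.5
```

This counts only the waits between attempts. Time spent inside each reconnect attempt adds to it, which may be why the observed outage in the issue ran closer to two hours before the watcher gave up.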

Automated:

pytest tests/gateway/test_platform_reconnect.py tests/gateway/test_telegram_network_reconnect.py tests/gateway/test_platform_base.py -q

Result on macOS 15.6 / Python 3.14: 15 + 87 = 102 passed, 2 skipped. The new test fails on origin/main (the buggy Giving up reconnecting telegram after 20 attempts log line shows up in the failure).

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/gateway/test_platform_reconnect.py tests/gateway/test_telegram_network_reconnect.py tests/gateway/test_platform_base.py -q and the touched surface (102 tests) all passes. Full pytest tests/ -q not run; the gateway-platform reconnect path has dedicated test files which are exercised in full above.
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.6 (Python 3.14)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A (updated _platform_reconnect_watcher's docstring to describe the two real exit conditions; no user-facing docs reference the old 20-attempt limit)
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A (N/A — no config keys touched; the watcher remains a fixed-policy background task)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A (N/A)
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A (N/A — pure asyncio/logging, identical across platforms)
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A (N/A — gateway internal, not a tool)

Screenshots / Logs

$ pytest tests/gateway/test_platform_reconnect.py -q
...............                                                          [100%]
15 passed, 2 warnings in 3.18s

Without the fix, test_reconnect_retryable_keeps_trying_past_old_max_cap fails with the exact symptom the reporter described:

WARNING  gateway.run:run.py:2819 Giving up reconnecting telegram after 20 attempts
AssertionError: assert <Platform.TELEGRAM: 'telegram'> in {}

Changed files

  • gateway/run.py (modified, +14/-14)
  • tests/gateway/test_platform_reconnect.py (modified, +64/-5)
Original issue text

Observed timeline from a real gateway

  • 2026-04-28 17:50 CST: Telegram inbound messages were still being received.
  • 2026-04-28 18:11 CST: Telegram polling hit repeated httpx.ConnectError during proxy-backed Bot API calls.
  • 2026-04-28 18:19 CST: Telegram adapter marked a retryable telegram_network_error and queued the platform for gateway-level reconnection.
  • 2026-04-28 18:19-20:17 CST: gateway reconnect watcher retried Telegram repeatedly.
  • 2026-04-28 20:22 CST: gateway logged Giving up reconnecting telegram after 20 attempts and removed Telegram from _failed_platforms.
  • After that point, Telegram remained disconnected until a manual hermes gateway restart, even though the proxy/network was later reachable again.

Expected behavior

  • Retryable failures keep retrying with capped backoff, e.g. 30s -> 60s -> 120s -> 240s -> 300s cap.
  • Non-retryable failures remain removed from the retry queue.
  • Logs should not say "giving up" for retryable network failures unless there is another recovery path.

Suggested fix

Remove the fixed max-attempt removal for retryable failures and only remove the platform when the adapter reports has_fatal_error and not fatal_error_retryable.

A regression test can seed _failed_platforms[Platform.TELEGRAM] with attempts = 20, return a retryable connect failure, and assert the platform remains queued with attempts == 21.

Local patch validated

A local patch changed the watcher to keep retryable failures queued indefinitely with 300s capped backoff. The focused test suite passed:

venv/bin/python -m pytest tests/gateway/test_platform_reconnect.py -q

Result: 14 passed.

extent analysis

TL;DR

Remove the fixed max-attempt removal for retryable failures in the _platform_reconnect_watcher to prevent silent disconnection of platforms like Telegram after a transient network outage.

Guidance

  • Review the _platform_reconnect_watcher logic in gateway/run.py to understand how retryable failures are currently handled.
  • Modify the condition for removing a platform from _failed_platforms to only consider non-retryable failures, using the suggested fix: has_fatal_error and not fatal_error_retryable.
  • Implement a capped backoff strategy for retryable failures, such as 30s -> 60s -> 120s -> 240s -> 300s cap, to prevent overwhelming the system with retries.
  • Create a regression test to validate the new behavior, seeding _failed_platforms with a retryable failure and asserting the platform remains queued.

Example

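import time

def handle_reconnect_failure(failed_platforms, platform, adapter):
    """Update the retry queue after one failed reconnect attempt (illustrative helper)."""
    info = failed_platforms[platform]
    if adapter.has_fatal_error and not adapter.fatal_error_retryable:
        # Non-retryable (e.g. bad auth token): remove platform from the queue.
        del failed_platforms[platform]
        return
    # Retryable: count the attempt and back off 30s, 60s, 120s, 240s, then 300s.
    info["attempts"] += 1
    backoff = min(300, 30 * 2 ** (info["attempts"] - 1))
    # Schedule retry; "next_retry" is an illustrative key name.
    info["next_retry"] = time.monotonic() + backoff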

Notes

The suggested fix assumes that the has_fatal_error and fatal_error_retryable flags are correctly set by the adapter. Additional logging and monitoring may be necessary to detect and handle cases where the adapter reports incorrect flags.

Recommendation

Apply the suggested workaround by modifying the _platform_reconnect_watcher logic to remove the fixed max-attempt removal for retryable failures, as this will prevent silent disconnection of platforms like Telegram after a transient network outage.
