hermes - ✅ (Solved) fix(gateway): retryable reconnects stop after prolonged network loss [1 pull request, 1 participant]

GitHub stats
NousResearch/hermes-agent#12607 (fetched 2026-04-20 12:17:59)
Comments: 0, Participants: 1, Timeline: 0, Reactions: 0
After network connectivity was lost for a couple of hours, the Telegram gateway stopped reconnecting and never recovered on its own. Restarting the gateway was required to restore connectivity.

Expected Behavior

The gateway should keep retrying retryable reconnect failures with backoff until connectivity returns or the adapter reports a non-retryable error.

Root Cause

The reconnect watcher gave up after 20 failed attempts, even when the adapter marked the failure as retryable. Transient infrastructure issues such as DNS failures therefore became permanent until the gateway was restarted.

PR fix notes

PR #13197: fix(gateway): keep retryable platform reconnects queued

Description (problem / solution / changelog)

The reconnect watcher previously gave up after 20 failed attempts, even when the adapter marked the failure as retryable. That caused transient infrastructure issues like DNS failures to become permanent until the gateway was restarted.

Keep retryable failures in the reconnect queue and continue applying backoff until the platform recovers or returns a non-retryable error. Add a regression test covering retryable failures past the previous 20-attempt limit.
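
The queueing rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: `ReconnectResult`, `process_attempt`, and the dict-based queue are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ReconnectResult:
    """Hypothetical outcome of one reconnect attempt."""
    ok: bool
    retryable: bool = True


def process_attempt(queue: dict, platform: str, result: ReconnectResult) -> str:
    """Update the reconnect queue after one attempt and return the action taken.

    `queue` maps a platform name to its number of failed attempts so far.
    """
    if result.ok:
        queue.pop(platform, None)  # recovered: nothing left to retry
        return "connected"
    if not result.retryable:
        queue.pop(platform, None)  # permanent failure: stop retrying
        return "dropped"
    # Retryable failure: stay queued no matter how many attempts have failed.
    queue[platform] = queue.get(platform, 0) + 1
    return "requeued"


def simulate_outage(failures: int) -> dict:
    """Feed `failures` retryable failures for one platform into a fresh queue."""
    queue: dict = {}
    for _ in range(failures):
        process_attempt(queue, "telegram", ReconnectResult(ok=False))
    return queue
```

Under these assumptions, 25 consecutive retryable failures leave the platform in the queue with a count of 25, and only a success or a non-retryable result removes it.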

What does this PR do?

Fixes a gateway reconnect bug where retryable platform failures were eventually dropped after repeated attempts instead of staying in the reconnect queue. This allows transient outages such as prolonged network loss or DNS failures to recover automatically once connectivity returns.

Related Issue

Fixes #12607

Related: #11241, which keeps retryable fatal failures in-process. This PR fixes the remaining reconnect path where retryable failures were still dropped after 20 attempts.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Updated gateway/run.py so retryable reconnect failures remain queued instead of being dropped after the previous 20-attempt limit.
  • Preserved reconnect backoff behavior until the platform recovers or reports a non-retryable error.
  • Tightened reconnect logging so periodic warnings start at the 10th attempt while normal retry logs include the retry delay.
  • Added a regression test in tests/gateway/test_platform_reconnect.py covering retryable failures beyond the old limit.
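
As an illustration of the backoff and logging behavior listed above, here is a hedged sketch. The delay values, the "every 10th attempt" warning cadence, and all function names are assumptions made for the example; the PR text only states that periodic warnings start at the 10th attempt and that normal retry logs include the delay.

```python
import logging

BASE_DELAY = 30.0   # assumed first retry delay, in seconds
MAX_DELAY = 300.0   # assumed backoff cap, in seconds
WARN_EVERY = 10     # assumed cadence: warn on attempts 10, 20, 30, ...


def backoff_delay(attempt: int) -> float:
    """Exponential backoff for a 1-based attempt number, capped at MAX_DELAY.

    The exponent is clamped so that an unbounded attempt counter can never
    overflow the float conversion during a very long outage.
    """
    return min(BASE_DELAY * 2 ** min(attempt - 1, 10), MAX_DELAY)


def log_level_for(attempt: int) -> int:
    """Return WARNING on every WARN_EVERY-th attempt, INFO otherwise."""
    return logging.WARNING if attempt % WARN_EVERY == 0 else logging.INFO


def log_retry(logger: logging.Logger, platform: str, attempt: int) -> float:
    """Log one failed attempt together with the delay before the next one."""
    delay = backoff_delay(attempt)
    logger.log(
        log_level_for(attempt),
        "%s reconnect attempt %d failed; retrying in %.0fs",
        platform, attempt, delay,
    )
    return delay
```

Because the delay is capped, retrying indefinitely costs at most one attempt per `MAX_DELAY` seconds, which is what makes removing the attempt limit safe in this sketch.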

How to Test

  1. Run scripts/run_tests.sh tests/gateway/test_platform_reconnect.py.
  2. Confirm the reconnect watcher tests pass, including the retryable failure case beyond 20 attempts.
  3. Optionally reproduce manually by forcing repeated retryable reconnect failures, then restoring connectivity and confirming the gateway resumes reconnecting without a restart.
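
For the manual reproduction in step 3, a small simulation along the following lines may help. It is a sketch under stated assumptions and does not reproduce the repository's test: `RetryableError`, `FlakyAdapter`, and `reconnect_until_connected` are hypothetical names, and a fixed delay stands in for the real backoff to keep the example short.

```python
class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (DNS errors, timeouts)."""


class FlakyAdapter:
    """Fake adapter that fails `failures` times, then connects."""

    def __init__(self, failures: int) -> None:
        self.remaining = failures
        self.connected = False

    def connect(self) -> None:
        if self.remaining > 0:
            self.remaining -= 1
            raise RetryableError("simulated DNS failure")
        self.connected = True


def reconnect_until_connected(adapter, sleep, delay: float = 1.0) -> int:
    """Retry retryable failures without an attempt limit; return the failure count.

    Any other exception propagates, which stands in for a non-retryable error.
    `sleep` is injected so the simulation does not actually wait.
    """
    failures = 0
    while True:
        try:
            adapter.connect()
            return failures
        except RetryableError:
            failures += 1
            sleep(delay)


def simulate(failures: int):
    """Run the loop against a flaky adapter and record every requested delay."""
    delays = []
    adapter = FlakyAdapter(failures)
    count = reconnect_until_connected(adapter, delays.append)
    return adapter.connected, count, delays
```

Calling `simulate(25)` exercises more failures than the old 20-attempt limit and should still end with the adapter connected.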

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Debian 13 (x86_64)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Targeted regression test run:

14 passed in 0.74s

Note: upstream main is currently red in GitHub Actions, including at least one unrelated failing test (tests/test_mcp_serve.py::TestEventBridgePollE2E::test_poll_detects_new_message_after_db_write) reproduced on clean upstream/main outside this PR.

Changed files

  • gateway/run.py (modified, +13/-13)
  • tests/gateway/test_platform_reconnect.py (modified, +10/-6)

Summary

After network connectivity was lost for a couple of hours, the Telegram gateway stopped reconnecting and never recovered on its own. Restarting the gateway was required to restore connectivity.

Reproduction

  1. Start the gateway with Telegram configured and connected.
  2. Simulate a prolonged infrastructure outage, for example by disabling networking or causing repeated DNS failures for a few hours.
  3. Wait for the reconnect loop to exhaust repeated retryable failures.
  4. Restore network connectivity.

Expected Behavior

The gateway should keep retrying retryable reconnect failures with backoff until connectivity returns or the adapter reports a non-retryable error.

Actual Behavior

The reconnect watcher gives up after repeated retryable failures, so the Telegram adapter remains disconnected permanently until the gateway is restarted.

Notes

This appears to affect transient infrastructure failures such as DNS resolution problems. The failure should remain queued for reconnect instead of being dropped after a fixed attempt limit.

extent analysis

TL;DR

The fix is to keep retryable reconnect failures queued, with backoff, instead of dropping them after a fixed attempt limit (20 attempts before PR #13197).

Guidance

  • Check the reconnect watcher for a fixed attempt limit; before PR #13197 it gave up after 20 failed attempts, even for retryable failures.
  • Preserve the existing reconnect backoff so that unbounded retries do not overwhelm the system during prolonged outages.
  • Keep retryable failures in the reconnect queue so the gateway retries them once connectivity is restored.
  • Verify that the Telegram adapter correctly reports non-retryable errors to prevent infinite reconnect loops.
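
The last point depends on how failures are classified. One possible way to separate retryable from non-retryable errors, using only Python's built-in exception types, is sketched below. The `AuthError` class and the choice of which exceptions count as retryable are assumptions for illustration, not the Telegram adapter's actual logic.

```python
import socket


class AuthError(Exception):
    """Hypothetical non-retryable failure, such as a revoked bot token."""


# Transient network conditions: DNS resolution failures, dropped or refused
# connections, and timeouts. All three are OSError subclasses in Python 3.
RETRYABLE_ERRORS = (socket.gaierror, ConnectionError, TimeoutError)


def is_retryable(exc: BaseException) -> bool:
    """Return True when the failure is expected to clear up on its own."""
    return isinstance(exc, RETRYABLE_ERRORS)
```

An allow-list like this treats unknown exceptions as non-retryable, which avoids infinite loops on unexpected failures at the cost of occasionally requiring a manual restart.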

Example

No code snippet is provided due to the lack of specific implementation details in the issue.

Notes

The solution may require modifications to the gateway's reconnect logic and potentially the Telegram adapter's error reporting mechanism. The exact changes will depend on the specific implementation and requirements of the system.

Recommendation

Apply the fix from PR #13197, which keeps retryable reconnect failures queued with backoff instead of dropping them after 20 attempts. This lets the gateway recover from transient infrastructure failures without a restart.
