hermes - ✅(Solved) fix(gateway): retryable reconnects stop after prolonged network loss [1 pull request, 1 participant]

denhubr · 2026-04-19T15:55:22Z

After network connectivity was lost for a couple of hours, the Telegram gateway stopped reconnecting and never recovered on its own. Restarting the gateway was required to restore connectivity.

# PR #13197: fix(gateway): keep retryable platform reconnects queued

- Repository: NousResearch/hermes-agent
- Author: denhubr
- State: open | merged: False
- Link: https://github.com/NousResearch/hermes-agent/pull/13197

## Description (problem / solution / changelog)

The reconnect watcher previously gave up after 20 failed attempts, even when the adapter marked the failure as retryable. That caused transient infrastructure issues like DNS failures to become permanent until the gateway was restarted.

Keep retryable failures in the reconnect queue and continue applying backoff until the platform recovers or returns a non-retryable error.

Add a regression test covering retryable failures past the previous 20-attempt limit.

## What does this PR do?

Fixes a gateway reconnect bug where retryable platform failures were eventually dropped after repeated attempts instead of staying in the reconnect queue. This allows transient outages such as prolonged network loss or DNS failures to recover automatically once connectivity returns.

## Related Issue

Fixes #12607

Related: #11241, which keeps retryable fatal failures in-process. This PR fixes the remaining reconnect path where retryable failures were still dropped after 20 attempts.

## Type of Change

- [x] 🐛 Bug fix (non-breaking change that fixes an issue)
- [ ] ✨ New feature (non-breaking change that adds functionality)
- [ ] 🔒 Security fix
- [ ] 📝 Documentation update
- [x] ✅ Tests (adding or improving test coverage)
- [ ] ♻️ Refactor (no behavior change)
- [ ] 🎯 New skill (bundled or hub)

## Changes Made

- Updated `gateway/run.py` so retryable reconnect failures remain queued instead of being dropped after the previous 20-attempt limit.
- Preserved reconnect backoff behavior until the platform recovers or reports a non-retryable error.
- Tightened reconnect logging so periodic warnings start at the 10th attempt while normal retry logs include the retry delay.
- Added a regression test in `tests/gateway/test_platform_reconnect.py` covering retryable failures beyond the old limit.

## How to Test

1. Run `scripts/run_tests.sh tests/gateway/test_platform_reconnect.py`.
2. Confirm the reconnect watcher tests pass, including the retryable failure case beyond 20 attempts.
3. Optionally reproduce manually by forcing repeated retryable reconnect failures, then restoring connectivity and confirming the gateway resumes reconnecting without a restart.

## Checklist

### Code

- [x] I've read the [Contributing Guide](https://github.com/NousResearch/hermes-agent/blob/main/CONTRIBUTING.md)
- [x] My commit messages follow [Conventional Commits](https://www.conventionalcommits.org/) (`fix(scope):`, `feat(scope):`, etc.)
- [x] I searched for [existing PRs](https://github.com/NousResearch/hermes-agent/pulls) to make sure this isn't a duplicate
- [x] My PR contains **only** changes related to this fix/feature (no unrelated commits)
- [ ] I've run `pytest tests/ -q` and all tests pass
- [x] I've added tests for my changes (required for bug fixes, strongly encouraged for features)
- [x] I've tested on my platform: Debian 13 (x86_64)

### Documentation & Housekeeping

- [x] I've updated relevant documentation (README, `docs/`, docstrings), or N/A
- [x] I've updated `cli-config.yaml.example` if I added/changed config keys, or N/A
- [x] I've updated `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows, or N/A
- [x] I've considered cross-platform impact (Windows, macOS) per the [compatibility guide](https://github.com/NousResearch/hermes-agent/blob/main/CONTRIBUTING.md#cross-platform-compatibility), or N/A
- [x] I've updated tool descriptions/schemas if I changed tool behavior, or N/A

## Screenshots / Logs

Targeted regression test run: `14 passed in 0.74s`

Note: upstream `main` is currently red in GitHub Actions, including at least one unrelated failing test (`tests/test_mcp_serve.py::TestEventBridgePollE2E::test_poll_detects_new_message_after_db_write`) reproduced on clean `upstream/main` outside this PR.

## Changed files

- `gateway/run.py` (modified, +13/-13)
- `tests/gateway/test_platform_reconnect.py` (modified, +10/-6)

Summary

After network connectivity was lost for a couple of hours, the Telegram gateway stopped reconnecting and never recovered on its own. Restarting the gateway was required to restore connectivity.

Reproduction

1. Start the gateway with Telegram configured and connected.
2. Simulate a prolonged infrastructure outage, for example by disabling networking or causing repeated DNS failures for a few hours.
3. Wait for the reconnect loop to exhaust repeated retryable failures.
4. Restore network connectivity.
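
For a quicker, in-process approximation of the "repeated DNS failures" step, one option is to patch name resolution in a test rather than taking the network down for hours. The sketch below is an assumption about how that could be done, not something taken from the issue or the PR; `simulated_dns_outage` and `can_resolve` are hypothetical helper names. It only affects code in the same Python process that resolves hosts through `socket.getaddrinfo`.

```python
import socket
from contextlib import contextmanager
from unittest import mock


@contextmanager
def simulated_dns_outage():
    """Make every lookup through socket.getaddrinfo fail while the block runs."""

    def failing_getaddrinfo(*args, **kwargs):
        # Same exception type a real resolver failure raises.
        raise socket.gaierror(socket.EAI_NONAME, "simulated DNS outage")

    with mock.patch("socket.getaddrinfo", side_effect=failing_getaddrinfo):
        yield


def can_resolve(host: str) -> bool:
    """Return True if the host resolves, False on a DNS error."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    with simulated_dns_outage():
        print("during outage:", can_resolve("127.0.0.1"))
    print("after outage:", can_resolve("127.0.0.1"))
```

Leaving the `with` block restores the real resolver, which stands in for step 4 (restoring connectivity).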

Expected Behavior

The gateway should keep retrying retryable reconnect failures with backoff until connectivity returns or the adapter reports a non-retryable error.
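
The expected behavior can be illustrated with a minimal sketch of one watcher pass. This is not the code in `gateway/run.py`; every name here (`ConnectError`, `PendingReconnect`, `backoff_delay`, `process_queue`) and the 30-second base and 300-second cap are assumptions chosen for illustration. The point it shows is that only success or a non-retryable error removes a platform from the queue, and there is no attempt limit.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class ConnectError(Exception):
    """Hypothetical adapter error carrying a retryable flag."""

    def __init__(self, message: str, retryable: bool):
        super().__init__(message)
        self.retryable = retryable


@dataclass
class PendingReconnect:
    attempts: int = 0
    next_retry_at: float = 0.0


def backoff_delay(attempts: int, base: float = 30.0, cap: float = 300.0) -> float:
    """Exponential backoff capped at `cap` seconds (values are assumptions).

    The exponent is clamped so the delay stays computable even after
    thousands of attempts, which matters once there is no attempt limit.
    """
    exponent = min(max(attempts - 1, 0), 10)
    return min(cap, base * (2 ** exponent))


def process_queue(
    queue: Dict[str, PendingReconnect],
    connect: Callable[[str], None],
    now: float,
) -> None:
    """Run one watcher pass over every platform that is due for a retry."""
    for platform in list(queue):
        entry = queue[platform]
        if now < entry.next_retry_at:
            continue  # still backing off
        try:
            connect(platform)
        except ConnectError as exc:
            if not exc.retryable:
                del queue[platform]  # permanent failure: stop retrying
                continue
            entry.attempts += 1  # retryable: stay queued, no attempt cap
            entry.next_retry_at = now + backoff_delay(entry.attempts)
        else:
            del queue[platform]  # connected: nothing left to retry
```

With this shape, a two-hour outage just accumulates attempts at the capped delay, and the first pass after connectivity returns clears the entry.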

Actual Behavior

The reconnect watcher gives up after repeated retryable failures, so the Telegram adapter remains disconnected permanently until the gateway is restarted.

Notes

This appears to affect transient infrastructure failures such as DNS resolution problems. The failure should remain queued for reconnect instead of being dropped after a fixed attempt limit.

TL;DR

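The reconnect watcher gave up after 20 failed attempts, even when the failure was marked retryable, so a prolonged outage left the Telegram gateway disconnected until it was restarted. PR #13197 keeps retryable failures in the reconnect queue and continues applying backoff until the platform recovers or returns a non-retryable error.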

Guidance

Review the reconnect watcher in `gateway/run.py`: per the PR description, it dropped a platform from the reconnect queue after 20 failed attempts, even when the adapter marked the failure as retryable.
Keep the existing backoff in place for retryable failures so that retries stay spaced out through a prolonged outage.
Leave retryable failures in the reconnect queue until the platform recovers, rather than removing them at a fixed attempt limit.
Verify that the Telegram adapter reports non-retryable errors correctly, since without an attempt cap that signal is what ends the reconnect loop.
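
As a rough illustration of the last point, an adapter could classify exceptions along these lines. The issue does not show how the Telegram adapter actually decides what is retryable, so the `RETRYABLE_ERRORS` tuple and `is_retryable` helper below are hypothetical; only the exception classes themselves are standard-library names.

```python
import socket

# Assumption: transient network conditions are worth retrying.
# socket.gaierror covers DNS resolution failures; ConnectionError covers
# refused, reset, and aborted connections; TimeoutError covers timeouts.
RETRYABLE_ERRORS = (socket.gaierror, ConnectionError, TimeoutError)


def is_retryable(exc: BaseException) -> bool:
    """Return True for failures that may clear up once the network recovers."""
    return isinstance(exc, RETRYABLE_ERRORS)
```

Anything outside that tuple, such as a configuration or authentication error, would be treated as non-retryable and end the loop instead of being retried indefinitely.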

Example

The issue and PR description do not include the diff itself, so no code from the fix is reproduced here. The changes are in `gateway/run.py` and `tests/gateway/test_platform_reconnect.py`.

Notes

Per the PR's changed-files list, the fix touches only `gateway/run.py` (+13/-13) and `tests/gateway/test_platform_reconnect.py` (+10/-6); no change to the Telegram adapter's error reporting is listed. The PR also tightens reconnect logging so that periodic warnings start at the 10th attempt and normal retry logs include the retry delay.

Recommendation

Apply the change from PR #13197 (listed as open and not merged in the PR details above) so that retryable reconnect failures stay queued with backoff. This lets the gateway recover from transient infrastructure failures on its own. Until that change is in place, restarting the gateway is the only reported way to restore connectivity after a prolonged outage.
