hermes - ✅(Solved) Fix Gateway exits when Telegram disconnects, killing embedded cron ticker [1 pull request, 1 comment, 1 participant]

GitHub stats

NousResearch/hermes-agent#11614 · Fetched 2026-04-18 05:59:51

  • Comments: 1
  • Participants: 1
  • Timeline: 2 (commented ×1, cross-referenced ×1)
  • Reactions: 0

Root Cause

gateway/run.py lines 1051-1058: when all platforms fail with a retryable error, the process exits immediately instead of letting _platform_reconnect_watcher handle reconnection.

Fix Action

Fixed

PR fix notes

PR #11691: fix(gateway): keep cron alive during reconnect backoff

Description (problem / solution / changelog)

What does this PR do?

Keeps the gateway process alive when the last connected platform fails with a retryable fatal error and has already been queued for background reconnection. This preserves the embedded cron ticker while _platform_reconnect_watcher handles reconnect backoff.

Related Issue

Fixes #11614

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✅ Tests (adding or improving test coverage)

Changes Made

  • remove the forced shutdown path in gateway/run.py when all adapters are down but platforms remain queued for reconnect
  • keep the existing reconnect watcher responsible for recovering the failed platform in the background
  • update tests/gateway/test_platform_reconnect.py to assert the gateway stays alive and preserves the reconnect queue instead of stopping immediately

How to Test

  1. Reproduce the retryable fatal-error path where the final connected platform disconnects and is placed into _failed_platforms.
  2. Run ./.venv/bin/python -m pytest -o addopts='' tests/gateway/test_platform_reconnect.py -q.
  3. Confirm the reconnect test asserts runner.stop() is not called and the gateway remains alive while reconnect backoff continues.
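The assertion in step 3 can be illustrated with a small, self-contained sketch. It does not reproduce the project's real test or runner: SketchRunner, on_fatal_error, and the shape of the _failed_platforms entries are hypothetical stand-ins, and only the idea (stop() is never awaited, the platform stays queued) comes from the PR description.

```python
import asyncio
from unittest.mock import AsyncMock


class SketchRunner:
    """Hypothetical stand-in for the gateway runner; not the real class."""

    def __init__(self) -> None:
        self.adapters = {"telegram": object()}  # currently connected platforms
        self._failed_platforms = {}             # platforms queued for reconnect
        self.stop = AsyncMock()                 # lets the test observe shutdown calls

    async def on_fatal_error(self, platform: str, retryable: bool) -> None:
        # Drop the failed adapter from the connected set.
        self.adapters.pop(platform, None)
        if retryable:
            # Queue the platform for the background reconnect watcher.
            self._failed_platforms[platform] = {"attempts": 0}
        # Post-fix behavior as described in the PR: no self.stop() call here,
        # even though no adapter is connected any more.


async def check_gateway_stays_alive() -> SketchRunner:
    runner = SketchRunner()
    await runner.on_fatal_error("telegram", retryable=True)
    runner.stop.assert_not_awaited()                # gateway was not shut down
    assert "telegram" in runner._failed_platforms   # reconnect queue preserved
    return runner


runner = asyncio.run(check_gateway_stays_alive())
```

Using AsyncMock for stop() keeps the check independent of what a real shutdown would do; the test only observes whether shutdown was requested.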

Notes

The issue already points to _platform_reconnect_watcher as the intended recovery path. This patch keeps that behavior active instead of exiting the process and interrupting embedded cron jobs during the reconnect window.
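To make the reconnect window concrete, here is a minimal asyncio sketch of a ticker task that keeps firing while a separate watcher retries a connection. It is an illustration under assumptions, not the gateway's code: cron_ticker, reconnect_watcher, and fake_connect are hypothetical names, and the delays are scaled down to milliseconds so the example finishes quickly.

```python
import asyncio


async def cron_ticker(ticks: list, interval: float, stop: asyncio.Event) -> None:
    """Record a tick every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        ticks.append("tick")
        await asyncio.sleep(interval)


async def reconnect_watcher(delays: list, connect, stop: asyncio.Event) -> int:
    """Retry `connect` after each delay; return the number of attempts made."""
    attempts = 0
    for delay in delays:
        await asyncio.sleep(delay)  # backoff wait; the ticker runs meanwhile
        attempts += 1
        if await connect():
            break
    stop.set()  # end the demo once reconnected (or out of delays)
    return attempts


async def main():
    ticks = []
    stop = asyncio.Event()
    outcomes = iter([False, False, True])  # connection succeeds on the third try

    async def fake_connect() -> bool:
        return next(outcomes)

    ticker = asyncio.create_task(cron_ticker(ticks, 0.01, stop))
    attempts = await reconnect_watcher([0.03, 0.06, 0.12], fake_connect, stop)
    await ticker  # let the ticker observe `stop` and exit cleanly
    return attempts, len(ticks)


attempts, tick_count = asyncio.run(main())
```

Because both coroutines share one event loop, the watcher's sleeps never block the ticker. Exiting the process during backoff would end both.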

Changed files

  • gateway/run.py (modified, +8/-16)
  • tests/gateway/test_platform_reconnect.py (modified, +8/-7)
  • tests/gateway/test_runner_fatal_adapter.py (modified, +5/-4)

RAW_BUFFER

Bug Description

When the Telegram connection fails after all retry attempts, the Gateway calls await self.stop() and exits. This kills the embedded cron ticker (run.py ~9590), so scheduled cron jobs miss execution during the restart window.

Root Cause

gateway/run.py lines 1051-1058: when all platforms fail with a retryable error, the process exits immediately instead of letting _platform_reconnect_watcher handle reconnection.

Fix Applied

Removed the if adapter.fatal_error_retryable: ... await self.stop() branch. The gateway now stays alive on both retryable and non-retryable errors: _platform_reconnect_watcher handles reconnection with exponential backoff (30s → 60s → 120s → 240s, then capped at 300s) while the cron ticker continues running.
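The backoff schedule quoted above is a doubling delay with a ceiling. A minimal sketch follows, assuming a 30-second base and a 300-second cap taken from that schedule; reconnect_delay is a hypothetical helper, not a function from gateway/run.py.

```python
def reconnect_delay(attempt: int, base: float = 30.0, cap: float = 300.0) -> float:
    """Delay in seconds before reconnect attempt number `attempt` (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 16)
    # Double the base delay on each attempt, but never exceed the cap.
    return min(base * (2 ** exponent), cap)


schedule = [reconnect_delay(n) for n in range(6)]
print(schedule)  # [30.0, 60.0, 120.0, 240.0, 300.0, 300.0]
```

The fifth attempt would be 480 seconds uncapped, which is why the listed schedule jumps from 240s to the 300s cap.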

Changed in gateway/run.py lines 1047-1058:

  • Before: retryable errors → await self.stop() → process exits, cron killed
  • After: all errors → warning log → stay alive, cron keeps running

Environment

  • Hermes Agent (NousResearch/hermes-agent)
  • Gateway running as systemd service with Restart=on-failure
  • Single platform: Telegram only

extent analysis

TL;DR

Remove the if adapter.fatal_error_retryable: ... await self.stop() branch in gateway/run.py to prevent the process from exiting when all platforms fail with a retryable error.

Guidance

  • Identify the lines of code responsible for the issue (1051-1058 in gateway/run.py) and verify that the if adapter.fatal_error_retryable branch is removed.
  • Ensure that the _platform_reconnect_watcher is handling reconnection with exponential backoff as intended.
  • Test the Gateway's behavior when all platforms fail with a retryable error to confirm that the process stays alive and the cron ticker continues running.
  • Review the systemd service configuration to ensure that the Restart=on-failure setting is still appropriate given the changes made to the Gateway's error handling.

Example

No code snippet is provided as the issue already includes the necessary information about the code changes made.

Notes

This fix assumes that the _platform_reconnect_watcher is correctly implemented to handle reconnection with exponential backoff. If this is not the case, additional changes may be necessary.

Recommendation

Apply the fix by removing the if adapter.fatal_error_retryable: ... await self.stop() branch. This lets the Gateway stay alive and the cron ticker continue running even when all platforms fail with a retryable error.
