hermes - 💡(How to fix) Fix [Bug] Telegram adapter auto-pause never auto-recovers after transient DNS failure [2 pull requests]

Error Message

The core problem: when the network recovers (DNS resolves again), the adapter stays paused indefinitely and never auto-recovers. The user must manually run /platform resume telegram or restart the gateway, which is not obvious from the error message.

The user receives a misleading error message, "fix the underlying issue then run /platform resume telegram", even when the underlying issue (the DNS failure) has already been resolved.

Root Cause

File: gateway/run.py lines 5500-5501 and 2603-2638

_BACKOFF_CAP = 300  # 5 minutes max between retries
_PAUSE_AFTER_FAILURES = 10  # circuit-breaker threshold

The _pause_failed_platform() method sets info["paused"] = True and pushes next_retry to now + 300s. The reconnect watcher (_platform_reconnect_watcher) skips platforms that are paused:

# gateway/run.py — reconnect watcher loop
if info.get("paused"):
    # circuit breaker: don't hammer a known-bad platform
    continue

The logic gap: the circuit breaker correctly stops hammering a failed endpoint, but nothing ever checks whether that endpoint has become reachable again. On a flaky network, for example a machine behind the GFW, DNS can fail for minutes and then recover, yet the adapter never wakes up.
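The gap can be reproduced in a small, self-contained simulation. This is only a sketch and not the code in gateway/run.py: the two constants and the "paused" / "next_retry" keys come from the issue, while watcher_tick, simulate, the "attempts" and "connected" keys, and the exponential backoff formula are assumptions made for illustration.

```python
_BACKOFF_CAP = 300          # seconds; value quoted in the issue
_PAUSE_AFTER_FAILURES = 10  # value quoted in the issue


def watcher_tick(platforms, now, try_connect):
    """One pass of a simplified reconnect watcher (hypothetical structure)."""
    for info in platforms.values():
        if info["connected"]:
            continue
        if info.get("paused"):
            # The gap: a paused platform is skipped on every pass,
            # so nothing below can ever clear the flag.
            continue
        if now < info["next_retry"]:
            continue
        if try_connect():
            info.update(connected=True, attempts=0)
            continue
        info["attempts"] += 1
        if info["attempts"] >= _PAUSE_AFTER_FAILURES:
            info["paused"] = True
            info["next_retry"] = now + _BACKOFF_CAP
        else:
            # Assumed exponential backoff, capped at _BACKOFF_CAP.
            info["next_retry"] = now + min(2 ** info["attempts"], _BACKOFF_CAP)


def simulate():
    """Fail until the breaker trips, then 'restore' the network and keep ticking."""
    platforms = {"telegram": {"connected": False, "paused": False,
                              "attempts": 0, "next_retry": 0.0}}
    info = platforms["telegram"]
    network_up = False
    calls = 0

    def try_connect():
        nonlocal calls
        calls += 1
        return network_up

    now = 0.0
    while not info["paused"]:
        watcher_tick(platforms, now, try_connect)
        now = max(now + 1.0, info["next_retry"])
    calls_before_pause = calls

    network_up = True  # the network has recovered
    for _ in range(20):
        now += _BACKOFF_CAP
        watcher_tick(platforms, now, try_connect)
    return calls_before_pause, calls - calls_before_pause, info["connected"]


print(simulate())
```

In this model the breaker trips after 10 connection attempts, and no further attempt is made after the network comes back, because the paused check runs before any retry logic.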

Bug Description

The gateway uses a DoH (DNS-over-HTTPS) fallback to resolve api.telegram.org, which works well when the DoH providers (Google/Cloudflare) are reachable. However, when both system DNS and the DoH providers are blocked or unreachable, the fallback chain degrades as follows:

1. System DNS (socket.getaddrinfo) fails, and "getaddrinfo failed" is reported in the logs.
2. DoH to dns.google and cloudflare-dns.com also fails (network-level block).
3. The gateway falls back to the hardcoded seed IP 149.154.167.220.
4. The TCP connection to the seed IP also fails.
5. After 10 consecutive failures, the Telegram adapter is auto-paused (gateway/run.py: _PAUSE_AFTER_FAILURES = 10).
6. The paused adapter stops all retry attempts permanently and requires a manual /platform resume telegram to recover.
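The resolution stages of that chain can be sketched as an ordered list of resolvers with a seed-IP fallback. This is an illustrative sketch, not the gateway's implementation: resolve_with_fallback, system_dns and the Resolver type are hypothetical names, and the DoH stage is left as a pluggable callable rather than an actual HTTPS query. Only the seed IP and the use of socket.getaddrinfo come from the issue.

```python
import socket
from typing import Callable, List, Sequence, Tuple

SEED_IPS = ["149.154.167.220"]  # seed fallback IP quoted in the logs

# A resolver takes a hostname and returns a (possibly empty) list of IPs.
Resolver = Callable[[str], List[str]]


def system_dns(host: str) -> List[str]:
    """Resolve through the OS resolver; return [] instead of raising."""
    try:
        infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    return sorted({info[4][0] for info in infos})


def resolve_with_fallback(
    host: str,
    resolvers: Sequence[Tuple[str, Resolver]],
    seed_ips: Sequence[str],
) -> Tuple[str, List[str]]:
    """Return (stage_label, ips) from the first stage that yields addresses."""
    for label, resolver in resolvers:
        ips = resolver(host)
        if ips:
            return label, ips
    # Every resolver came back empty: use the hardcoded seed addresses.
    return "seed", list(seed_ips)


# Simulate the failure mode from the issue: system DNS and DoH both return nothing.
blocked = [("system", lambda h: []), ("doh", lambda h: [])]
print(resolve_with_fallback("api.telegram.org", blocked, SEED_IPS))
```

With both resolvers blocked, the call falls through to the seed IP, which matches the "using seed fallback IPs" log line. If that address is also unreachable, every connection attempt counts toward the 10-failure threshold.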

Steps to Reproduce

1. Run the hermes gateway with the Telegram adapter configured.
2. Simulate a DNS failure (e.g., block port 53, or use a network that cannot resolve api.telegram.org).
3. Observe the logs:

DoH discovery yielded no usable IPs (system DNS: unknown); using seed fallback IPs 149.154.167.220
Primary api.telegram.org connection failed ([Errno 11001] getaddrinfo failed); trying fallback IPs 149.154.167.220
Fallback IP 149.154.167.220 failed: All connection attempts failed

4. After 10 attempts, the adapter pauses with: telegram paused after 10 consecutive failures (telegram connect timed out after 30s) — fix the underlying issue then run /platform resume telegram to retry
5. When the network recovers (DNS resolves again), the adapter stays paused and does not auto-recover.

Expected Behavior

When network connectivity recovers (system DNS can resolve api.telegram.org again), the Telegram adapter should automatically reconnect without manual intervention. The circuit breaker (pause after 10 failures) should only stop hammering a permanently failed endpoint, not a temporarily unreachable one that has since recovered.

Actual Behavior

Telegram adapter goes into paused state after 10 consecutive failures
It stays paused even after network recovers
User receives misleading error message: "fix the underlying issue then run /platform resume telegram" even when the underlying issue (DNS failure) has already been resolved
Requires manual /platform resume telegram or gateway restart to recover

Proposed Fix

When a platform is in the paused state, the reconnect watcher should still periodically poll system DNS to detect whether the endpoint has become reachable again. Specifically:

1. Add a DNS probe phase for paused platforms (e.g., every 5 minutes) that checks whether the platform's host can be resolved.
2. If system DNS resolves successfully, auto-resume the platform (reset the attempt counter and schedule an immediate reconnect).

This is a targeted fix: the circuit breaker still protects against hammering a permanently unreachable endpoint, but recovered endpoints auto-heal.

Affected file: gateway/run.py, _platform_reconnect_watcher() method
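A minimal sketch of such a probe, under stated assumptions: maybe_auto_resume, host_resolves, _PROBE_INTERVAL and the "next_probe" and "attempts" keys are hypothetical names, and only "paused", "next_retry" and the 5-minute interval come from the issue. The resolver is injectable so the logic can be exercised without network access.

```python
import socket

_PROBE_INTERVAL = 300  # seconds; the "every 5 minutes" suggested in the proposal


def host_resolves(host, port=443):
    """Return True if system DNS can resolve the host (blocking call)."""
    try:
        return bool(socket.getaddrinfo(host, port, type=socket.SOCK_STREAM))
    except OSError:  # includes socket.gaierror
        return False


def maybe_auto_resume(info, host, now, resolves=host_resolves):
    """Probe DNS for a paused platform and un-pause it if the host resolves.

    Returns True only when the platform was resumed on this call.
    """
    if not info.get("paused"):
        return False
    if now < info.get("next_probe", 0.0):
        return False  # throttle: at most one probe per interval
    info["next_probe"] = now + _PROBE_INTERVAL
    if not resolves(host):
        return False
    info["paused"] = False
    info["attempts"] = 0       # reset the failure counter
    info["next_retry"] = now   # schedule an immediate reconnect
    return True


# Hypothetical use inside the watcher loop, replacing the bare `continue`:
#     if info.get("paused") and not maybe_auto_resume(info, host, now):
#         continue
```

A successful DNS lookup does not guarantee that the TCP connection will succeed, so in this sketch a resumed platform simply goes back through the normal retry path and is paused again if it reaches the failure threshold. If the watcher runs on an asyncio event loop, the blocking socket.getaddrinfo call would need to be replaced with loop.getaddrinfo or moved to a worker thread.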

OS / Environment

OS: Windows 10 (native, Git Bash / MSYS shell)
Hermes version: latest (as of May 30, 2026)
Telegram adapter with no proxy configured
Network: ISP-level DNS occasionally fails for api.telegram.org
