hermes: [Bug] Telegram adapter auto-pause never auto-recovers after transient DNS failure [2 pull requests]

Fix Action

Fixed


Bug Description

Gateway uses DoH (DNS-over-HTTPS) fallback to resolve api.telegram.org, which works well when DoH providers (Google/Cloudflare) are reachable. However, when both system DNS and DoH providers are blocked/unreachable, the fallback chain degrades as follows:

  1. System DNS (socket.getaddrinfo) fails → getaddrinfo failed reported in logs
  2. DoH to dns.google and cloudflare-dns.com also fails (network-level block)
  3. Falls back to hardcoded seed IP 149.154.167.220
  4. TCP connection to seed IP also fails
  5. After 10 consecutive failures, Telegram adapter is auto-paused (gateway/run.py:_PAUSE_AFTER_FAILURES=10)
  6. Paused adapter stops all retry attempts permanently; requires manual /platform resume telegram to recover
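The first three steps of the chain above can be sketched roughly as follows. This is an illustration only, not the gateway's actual code: `resolve_candidates`, `system_dns`, and `SEED_IPS` are hypothetical names, and the DoH resolver is passed in as a callable so the sketch stays free of any assumed HTTP client.

```python
import socket
from typing import Callable, List, Optional

# Seed fallback IP as quoted in the issue's logs.
SEED_IPS = ["149.154.167.220"]


def system_dns(host: str) -> List[str]:
    """Resolve via the OS resolver; return [] on failure instead of raising."""
    try:
        infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    return sorted({info[4][0] for info in infos})


def resolve_candidates(
    host: str,
    system: Callable[[str], List[str]] = system_dns,
    doh: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: try system DNS, then DoH, then the seed IPs."""
    for resolver in (system, doh):
        if resolver is None:
            continue
        ips = resolver(host)
        if ips:
            return ips
    # Both resolvers came back empty: fall back to the hardcoded seed list.
    return list(SEED_IPS)
```

When both resolvers return nothing, the only candidate left is the seed IP, which is why a network that also blocks that address drives the adapter straight into the failure counter.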

The core problem: When the network recovers (DNS resolves again), the adapter remains permanently paused and never auto-recovers. The user must manually run /platform resume telegram or restart the gateway — which is not obvious from the error message.

Steps to Reproduce

  1. Run hermes gateway with Telegram adapter configured
  2. Simulate DNS failure (e.g., block port 53, or use a network that has no DNS resolution for api.telegram.org)
  3. Observe logs:
DoH discovery yielded no usable IPs (system DNS: unknown); using seed fallback IPs 149.154.167.220
Primary api.telegram.org connection failed ([Errno 11001] getaddrinfo failed); trying fallback IPs 149.154.167.220
Fallback IP 149.154.167.220 failed: All connection attempts failed
  4. After 10 attempts: telegram paused after 10 consecutive failures (telegram connect timed out after 30s) — fix the underlying issue then run /platform resume telegram to retry
  5. When network recovers (DNS resolves again), the adapter stays paused — no auto-recovery
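For step 2, blocking port 53 is not always practical. One possible alternative for a unit test is to patch `socket.getaddrinfo` inside the process, as in the sketch below. `dns_blocked` and `can_resolve` are hypothetical helpers, not part of hermes, and the patch only affects system DNS lookups made in the same Python process; the DoH and seed-IP paths would need to be blocked separately.

```python
import socket
from contextlib import contextmanager
from unittest import mock


@contextmanager
def dns_blocked(hostname: str):
    """Make socket.getaddrinfo fail for one hostname; pass through otherwise."""
    real_getaddrinfo = socket.getaddrinfo

    def fake_getaddrinfo(host, *args, **kwargs):
        if host == hostname:
            # Same errno and message as in the issue's Windows logs.
            raise socket.gaierror(11001, "getaddrinfo failed")
        return real_getaddrinfo(host, *args, **kwargs)

    with mock.patch("socket.getaddrinfo", side_effect=fake_getaddrinfo):
        yield


def can_resolve(host: str) -> bool:
    """Return True if the system resolver can resolve host."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False
```

Leaving the `with dns_blocked(...)` block restores the real resolver, which corresponds to the "network recovers" step.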

Expected Behavior

When network connectivity recovers (system DNS can resolve api.telegram.org again), the Telegram adapter should automatically reconnect without manual intervention. The circuit breaker (pause after 10 failures) should only stop hammering a permanently failed endpoint, not a temporarily unreachable one that has since recovered.

Actual Behavior

  • Telegram adapter goes into paused state after 10 consecutive failures
  • It stays paused even after network recovers
  • User receives misleading error message: "fix the underlying issue then run /platform resume telegram" even when the underlying issue (DNS failure) has already been resolved
  • Requires manual /platform resume telegram or gateway restart to recover

Root Cause Analysis

File: gateway/run.py lines 5500-5501 and 2603-2638

_BACKOFF_CAP = 300  # 5 minutes max between retries
_PAUSE_AFTER_FAILURES = 10  # circuit-breaker threshold

The _pause_failed_platform() method sets info["paused"] = True and pushes next_retry to now + 300s. The reconnect watcher (_platform_reconnect_watcher) skips platforms that are paused:

# gateway/run.py — reconnect watcher loop
if info.get("paused"):
    # circuit breaker: don't hammer a known-bad platform
    continue

The logic gap: the circuit breaker correctly stops hammering a failed endpoint, but it never detects when the endpoint becomes reachable again. On a machine behind the Great Firewall (GFW), the network may be flaky: DNS fails for minutes, then recovers, but the adapter never wakes up.
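The gap can be reproduced in a small stand-alone model. The two constants and the `paused` / `next_retry` keys come from the issue; `record_failure`, `due_platforms`, the `attempts` key, and the exponential backoff formula are assumptions made for illustration and may not match gateway/run.py.

```python
_BACKOFF_CAP = 300          # 5 minutes max between retries
_PAUSE_AFTER_FAILURES = 10  # circuit-breaker threshold


def record_failure(info: dict, now: float) -> None:
    """Hypothetical stand-in for the failure path described in the issue."""
    info["attempts"] = info.get("attempts", 0) + 1
    if info["attempts"] >= _PAUSE_AFTER_FAILURES:
        info["paused"] = True
    # Assumed backoff: exponential, capped at _BACKOFF_CAP seconds.
    info["next_retry"] = now + min(2 ** info["attempts"], _BACKOFF_CAP)


def due_platforms(failed: dict, now: float) -> list:
    """Return the platforms the watcher would retry on this tick."""
    due = []
    for name, info in failed.items():
        if info.get("paused"):
            # Circuit breaker: a paused platform is never looked at again.
            continue
        if now >= info.get("next_retry", 0):
            due.append(name)
    return due
```

In this model, once `paused` is set, `due_platforms` returns nothing for that platform no matter how large `now` becomes, so `next_retry` is effectively dead state.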

Proposed Fix

When a platform is in paused state, the reconnect watcher should still periodically poll system DNS to detect if the endpoint has become reachable again. Specifically:

  1. Add a DNS probe phase for paused platforms (e.g., every 5 minutes) that checks if the platform's host can be resolved
  2. If system DNS resolves successfully, auto-resume the platform (reset attempt counter, schedule immediate reconnect)
  3. This is a targeted fix — the circuit breaker still protects against hammering a permanently unreachable endpoint, but recovered endpoints auto-heal

Affected file: gateway/run.py, _platform_reconnect_watcher() method
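A minimal sketch of the proposed probe, assuming the watcher keeps a per-platform `info` dict like the one described above. `maybe_auto_resume`, `host_resolves`, `_PROBE_INTERVAL`, and the `next_probe` / `attempts` keys are hypothetical names introduced here; the real fix in the linked pull requests may look different.

```python
import socket
from typing import Callable

_PROBE_INTERVAL = 300  # seconds; mirrors the "every 5 minutes" suggestion


def host_resolves(host: str) -> bool:
    """Return True if system DNS can resolve host."""
    try:
        socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
        return True
    except OSError:  # includes socket.gaierror
        return False


def maybe_auto_resume(
    info: dict,
    host: str,
    now: float,
    probe: Callable[[str], bool] = host_resolves,
) -> bool:
    """Probe a paused platform; clear the pause if DNS works again.

    Returns True only when the platform was resumed on this call.
    """
    if not info.get("paused"):
        return False
    if now < info.get("next_probe", 0.0):
        return False  # rate limit: at most one probe per interval
    info["next_probe"] = now + _PROBE_INTERVAL
    if not probe(host):
        return False
    info["paused"] = False
    info["attempts"] = 0       # reset the attempt counter
    info["next_retry"] = now   # schedule an immediate reconnect
    return True
```

The watcher would call this in place of the bare `continue` for paused platforms. A successful DNS lookup does not guarantee that the TCP connection will succeed, so in this sketch the breaker can simply trip again after another run of failures; the probe costs one lookup per interval rather than a full connect attempt.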

OS / Environment

  • OS: Windows 10 (native, Git Bash / MSYS shell)
  • Hermes version: latest (as of May 30, 2026)
  • Telegram adapter with no proxy configured
  • Network: ISP-level DNS occasionally fails for api.telegram.org
