openclaw - ✅(Solved) Fix [Bug] Telegram fetch transport lacks circuit breaker/backoff: 1500+ ENETUNREACH/hour during offline window starves event loop and trips TUI stream watchdog [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#77900 (fetched 2026-05-06 06:19:33)
Comments: 1
Participants: 2
Timeline: 2
Reactions: 2
Timeline (top): commented ×1, cross-referenced ×1

When api.telegram.org is unreachable (offline, regional block, blackholed route), the gateway emits 1,500+ ENETUNREACH errors per hour. Each Telegram fetch runs the 3-attempt cascade in extensions/telegram/src/fetch.ts (DNS-resolved → IPv4-only → hardcoded fallback IP 149.154.167.220) without any global circuit-breaker state, so each long-poll cycle wastes three connection attempts and immediately re-fires.

Multiplied across getUpdates polling and concurrent media downloads, this produces sustained event-loop starvation (eventLoopDelayMaxMs > 3000ms), which in turn causes the TUI streaming watchdog (dist/tui-…js, default ~30s) to falsely report "backend may have dropped this run silently — send a new message to resync" on otherwise-healthy Codex/OpenAI streams.

The streaming watchdog is doing its job — it's reporting that no frames arrived for 30s. The frames didn't arrive because the gateway loop was blocked by Telegram retries, not because the upstream model failed.

Error Message

ENETUNREACH 149.154.167.220:443

Root Cause

The streaming watchdog is doing its job — it's reporting that no frames arrived for 30s. The frames didn't arrive because the gateway loop was blocked by Telegram retries, not because the upstream model failed.

Fix / Workaround

Add a per-host circuit breaker in the Telegram fetch dispatcher path (Hystrix/Polly pattern, Google SRE Ch. 22) and pair it with exponential backoff plus jitter on the getUpdates polling loop.

PR fix notes

PR #78097: fix: cool down unhealthy telegram transports

Description (problem / solution / changelog)

Summary

Adds a Telegram-local cooldown for repeatedly failing Bot API transport attempts.

The existing fallback cascade can become sticky on the pinned Telegram IP. If that route is blackholed, every poll cycle keeps paying another failed network attempt. This patch does the small thing instead of building circuit-breaker cosplay: each managed transport attempt tracks transient failures, opens a short cooldown after repeated failures, and short-circuits that attempt until the cooldown expires.

Changes

  • Track per-attempt health inside resolveTelegramTransport().
  • After 5 fallback-eligible failures, mark the attempt unhealthy for 10s; repeated opens back off up to 60s.
  • Do not charge caller-provided dispatchers against managed transport health.
  • Preserve fallback progression when earlier attempts are cooling down.
  • Add regression coverage for a sticky pinned fallback that stops issuing fetches during cooldown.
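
The cooldown behaviour listed above can be sketched roughly as follows. This is not the code from PR #78097: the AttemptHealth class, its method names, and the injected clock are hypothetical, and only the numbers (5 failures, 10s initial cooldown, 60s cap) come from the PR description. The doubling between 10s and 60s and the counter restart after each open are assumptions, since the PR only says that repeated opens back off up to 60s.

```typescript
// Hypothetical per-attempt health tracker; names are illustrative only.
class AttemptHealth {
  private failures = 0;      // fallback-eligible failures since the last open or success
  private opens = 0;         // cooldowns opened without a success in between
  private cooldownUntil = 0; // epoch ms; 0 means "not cooling down"
  private threshold: number;
  private baseCooldownMs: number;
  private maxCooldownMs: number;
  private now: () => number;

  constructor(
    threshold = 5,
    baseCooldownMs = 10_000,
    maxCooldownMs = 60_000,
    now: () => number = Date.now, // injectable clock so the sketch is testable
  ) {
    this.threshold = threshold;
    this.baseCooldownMs = baseCooldownMs;
    this.maxCooldownMs = maxCooldownMs;
    this.now = now;
  }

  // True while this attempt should be skipped without touching the network.
  isCoolingDown(): boolean {
    return this.now() < this.cooldownUntil;
  }

  // Record one fallback-eligible failure (for example ENETUNREACH).
  recordFailure(): void {
    this.failures += 1;
    if (this.failures < this.threshold) return;
    // Assumed schedule: 10s, 20s, 40s, then capped at 60s.
    const cooldownMs = Math.min(this.maxCooldownMs, this.baseCooldownMs * 2 ** this.opens);
    this.cooldownUntil = this.now() + cooldownMs;
    this.opens += 1;
    this.failures = 0;
  }

  // Any successful response clears the failure history.
  recordSuccess(): void {
    this.failures = 0;
    this.opens = 0;
    this.cooldownUntil = 0;
  }
}
```

A caller would check isCoolingDown() before issuing a managed transport attempt and move on to the next attempt in the cascade when it returns true, which matches the "preserve fallback progression" item above.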

Testing

  • git diff --check fork/main...HEAD — passed.
  • PATH="/tmp/openclaw-pnpm-shim:$PATH" pnpm test extensions/telegram/src/fetch.test.ts extensions/telegram/src/polling-session.test.ts extensions/telegram/src/monitor.test.ts -- --reporter=dot — passed, 3 files, 85 tests.
  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/check-changed.mjs — passed.

Fixes #77900

Changed files

  • extensions/telegram/src/fetch.test.ts (modified, +31/-0)
  • extensions/telegram/src/fetch.ts (modified, +107/-17)

Code Example

delay_ms = min(MAX_DELAY, BASE * 2^consecutive_failures)
delay_ms += random_jitter(0, delay_ms / 2)

Repro

  1. Have Telegram channel enabled in openclaw.json with active long-poll (entries.telegram.enabled: true)
  2. Disconnect WiFi (or block 149.154.167.220:443 at the firewall) while leaving DNS resolution working
  3. Within ~5 minutes, gateway logs fill with ENETUNREACH 149.154.167.220:443 (~25/min)
  4. After ~10–15 minutes, liveness warnings appear with eventLoopDelayMaxMs > 1000ms despite no real workload
  5. TUI sessions running on healthy providers (e.g. openai-codex/gpt-5.5) trip the streaming watchdog despite the upstream stream being alive

Evidence (one user's gateway, last 60 min)

  • 1,535 ENETUNREACH 149.154.167.220:443 errors in 60 min (~25/min)
  • eventLoopDelayMaxMs = 9646.9, 3116.4, 2803.9, 1722.8 across multiple windows
  • TUI watchdog fired on a session backed by openai-codex/gpt-5.5 while OpenAI itself was healthy (verified by openclaw agent --agent main --message "respond with PONG" returning PONG in the same time window from a separate shell)
  • Standalone curl --max-time 5 https://149.154.167.220/ returns code 000 (immediate failure), while curl https://api.telegram.org/ returns 302 — DNS-based path works, the hardcoded fallback IP route does not

Source references

  • dist/fetch-B1BHd80D.js:135, TELEGRAM_FALLBACK_IPS = ["149.154.167.220"] (single hardcoded IP, no rotation, no health awareness)
  • dist/fetch-B1BHd80D.js:383-426, createTelegramTransportAttempts: cascade builds 3 attempts per request with no per-host circuit-breaker state
  • dist/fetch-B1BHd80D.js:149-155, FALLBACK_RETRY_ERROR_CODES correctly identifies ENETUNREACH / EHOSTUNREACH etc. as retry triggers, but the cascade still re-runs on every getUpdates cycle
  • No matches in the codebase for circuitBreaker, markUnhealthy, or exponentialBackoff — the pattern is genuinely missing

Adjacent issues (related, different scope)

These came up in search; this issue is differentiated by being about the inbound getUpdates fetch transport cascade, not outbound actions or webhooks:

  • #56096 — sendChatAction infinite retry loop (outbound action; same family, different code path)
  • #76087 — restart-sentinel continuation stuck after transient sendMessage failure (outbound)
  • #73255 (CLOSED) — deleteWebhook ENETUNREACH on startup (one-shot, not poll loop)
  • #69165 — outbound send queue with per-chat backoff (outbound only)
  • #77634 — Discord fetch-timeout blocks event loop (same pattern, different channel — strong evidence this is generalized)
  • #58519 — Slack Socket Mode event-loop starvation (same pattern, different channel)
  • #62615 — gateway-side circuit breaker for unhealthy sessions (this issue is about a host-level breaker, complementary)
  • #41899 — Plugin Circuit Breaker framework feature request (this issue would land naturally inside that framework)

Suggested fix (first-principles)

Add a per-host circuit breaker in the Telegram fetch dispatcher path (Hystrix/Polly pattern, Google SRE Ch. 22):

  1. Track consecutive transient failures per (host, error-class) key
  2. After N consecutive failures within window W → state=OPEN: short-circuit further requests with host_unhealthy for cooldown T
  3. After T → state=HALF_OPEN: allow one probe; success → CLOSED + reset, failure → back to OPEN with longer T (exponential, capped)
  4. Reset to CLOSED on any successful response

Defaults that would match the telemetry above: N=5, W=30s, T=10s initial doubling to max 60s. All configurable under channels.telegram.transport.circuitBreaker.{ failureThreshold, window, cooldown, maxCooldown }.
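
To make the proposed defaults concrete, here is a sketch of how the options could be typed and merged with user overrides. The four key names follow the channels.telegram.transport.circuitBreaker path suggested above; expressing the values in milliseconds and the helpers resolveBreakerConfig and cooldownForOpen are assumptions for illustration, not existing openclaw APIs.

```typescript
// Proposed option shape; all durations are assumed to be milliseconds.
interface CircuitBreakerConfig {
  failureThreshold: number; // N: consecutive failures before opening
  window: number;           // W: period in which the failures must occur
  cooldown: number;         // T: initial OPEN duration
  maxCooldown: number;      // cap for the doubled cooldown
}

const DEFAULT_BREAKER_CONFIG: CircuitBreakerConfig = {
  failureThreshold: 5,
  window: 30_000,
  cooldown: 10_000,
  maxCooldown: 60_000,
};

// Hypothetical helper: merge user overrides over the defaults and validate them.
function resolveBreakerConfig(overrides: Partial<CircuitBreakerConfig> = {}): CircuitBreakerConfig {
  const merged: CircuitBreakerConfig = { ...DEFAULT_BREAKER_CONFIG, ...overrides };
  const keys: (keyof CircuitBreakerConfig)[] = ["failureThreshold", "window", "cooldown", "maxCooldown"];
  for (const key of keys) {
    const value = merged[key];
    if (typeof value !== "number" || !isFinite(value) || value <= 0) {
      throw new RangeError("circuitBreaker." + key + " must be a positive number");
    }
  }
  if (merged.maxCooldown < merged.cooldown) {
    throw new RangeError("circuitBreaker.maxCooldown must be >= cooldown");
  }
  return merged;
}

// Cooldown for the n-th consecutive open (0-based): T doubled each time, capped.
function cooldownForOpen(config: CircuitBreakerConfig, openCount: number): number {
  return Math.min(config.maxCooldown, config.cooldown * 2 ** openCount);
}
```

With the defaults, consecutive opens would cool down for 10s, 20s, 40s, and then 60s, since the next doubling (80s) exceeds the cap.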

Pair with exponential backoff + jitter on the getUpdates polling loop itself:

delay_ms = min(MAX_DELAY, BASE * 2^consecutive_failures)
delay_ms += random_jitter(0, delay_ms / 2)

Suggested defaults: BASE=500ms, MAX_DELAY=60s. Reset on first successful response. Jitter prevents thundering-herd on shared networks.
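
The formula above translates fairly directly into code. The sketch below is one possible reading of it: the function name nextPollDelayMs and the injectable random source are assumptions added for illustration and testability, and the defaults are the BASE=500ms and MAX_DELAY=60s suggested above.

```typescript
// Delay before the next getUpdates attempt, following the formula literally:
//   delay = min(MAX_DELAY, BASE * 2^consecutiveFailures)
//   delay += jitter in [0, delay / 2)
function nextPollDelayMs(
  consecutiveFailures: number,
  baseMs = 500,
  maxDelayMs = 60_000,
  random: () => number = Math.random, // returns a value in [0, 1)
): number {
  const failures = Math.max(0, Math.floor(consecutiveFailures));
  // 2 ** failures may overflow to Infinity for large counts; Math.min still caps it.
  const capped = Math.min(maxDelayMs, baseMs * 2 ** failures);
  const jitter = random() * (capped / 2);
  return capped + jitter;
}
```

Because the jitter is added after the cap, the longest possible delay with these defaults is just under 90s rather than 60s. A polling loop would call this only after a failed cycle and reset its failure counter to zero on the first successful response, so a healthy long-poll re-fires without delay.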

The same component should be reusable by Discord (#77634), Slack (#58519), and future channels — strong fit for the framework in #41899.

Out of scope (cross-reference, would file separately if useful)

  • TUI watchdog should reset on a gateway→TUI keepalive frame, not only on stream content. This would let the watchdog stay at 30s without false-positiving during legitimate upstream silence. Related to #68596 (configurable threshold), #69978 (suppress duplicates), #67052 (TUI streaming indicator stays active).
  • Sequential stream-prep (~27s observed: core-plugin-tools:8s + system-prompt:7s + stream-setup:7s) makes each turn vulnerable to short network blips during prep. core-plugin-tools has no dependency on session-resource-loader/system-prompt and could parallelize. Plugin tool definitions are also invariant per gateway lifetime and would benefit from caching.
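
The first bullet can be illustrated with a watchdog whose idle timer is reset by any frame, including a keepalive. This is only a sketch: the Frame shape, the "keepalive" kind, and the StreamWatchdog class are hypothetical, because the issue does not describe the actual gateway-to-TUI frame protocol.

```typescript
// Hypothetical frame types; the real TUI protocol is not shown in the issue.
type Frame = { kind: "content"; text: string } | { kind: "keepalive" };

class StreamWatchdog {
  private lastFrameAt: number;
  private contentFrames = 0;
  private timeoutMs: number;
  private now: () => number;

  constructor(timeoutMs = 30_000, now: () => number = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastFrameAt = now();
  }

  // Every frame proves the gateway is alive, so every frame resets the timer.
  onFrame(frame: Frame): void {
    this.lastFrameAt = this.now();
    if (frame.kind === "content") this.contentFrames += 1;
  }

  // True once no frame of any kind has arrived for timeoutMs.
  isStalled(): boolean {
    return this.now() - this.lastFrameAt >= this.timeoutMs;
  }

  contentFrameCount(): number {
    return this.contentFrames;
  }
}
```

Note that a keepalive only helps if the gateway event loop is free to send it. During the starvation described in this issue the keepalive would also be delayed, so this change complements the transport fix rather than replacing it.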

Happy to test a candidate fix in a real environment with offline-Telegram + active Codex traffic.

Environment

  • openclaw 2026.5.3-1 (also reproduces in 2026.5.4 per source-equivalent inspection)
  • macOS 15.x (darwin 25.3.0)
  • Node 22
  • Channels enabled: telegram, whatsapp, discord

extent analysis

TL;DR

Implement a per-host circuit breaker in the Telegram fetch dispatcher path to prevent repeated connection attempts when api.telegram.org is unreachable.

Guidance

  • Introduce a circuit breaker with a failure threshold, window, cooldown, and max cooldown to track consecutive transient failures per host.
  • Implement exponential backoff with jitter on the getUpdates polling loop to prevent thundering-herd effects.
  • Consider making the circuit breaker component reusable for other channels like Discord and Slack.
  • Review and adjust the TUI watchdog to reset on gateway-TUI keepalive frames to prevent false positives.

Example

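// Simplified example of a circuit breaker (CLOSED -> OPEN -> HALF_OPEN).
// Durations are in milliseconds. The `window` option is stored for
// configuration parity only; this sketch counts consecutive failures and
// does not enforce the time window.
class CircuitBreaker {
  constructor(failureThreshold, window, cooldown, maxCooldown, now = Date.now) {
    this.failureThreshold = failureThreshold;
    this.window = window;
    this.cooldown = cooldown;         // initial cooldown
    this.maxCooldown = maxCooldown;   // upper bound for the doubled cooldown
    this.currentCooldown = cooldown;
    this.now = now;                   // injectable clock, defaults to Date.now
    this.state = 'CLOSED';
    this.consecutiveFailures = 0;
    this.openedAt = 0;
  }

  // Call before making a request; throws if the request should be skipped
  beforeRequest() {
    if (this.state === 'HALF_OPEN') {
      throw new Error('Probe request already in flight');
    }
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.currentCooldown) {
        throw new Error('Host is currently unhealthy');
      }
      // Cooldown elapsed: let exactly one probe request through
      this.state = 'HALF_OPEN';
    }
  }

  // Call after a failed request
  afterFailure() {
    this.consecutiveFailures++;
    if (this.state === 'HALF_OPEN') {
      // Failed probe: reopen with a doubled cooldown, capped at maxCooldown
      this.currentCooldown = Math.min(this.maxCooldown, this.currentCooldown * 2);
      this.open();
    } else if (this.consecutiveFailures >= this.failureThreshold) {
      this.open();
    }
  }

  // Call after a successful request
  afterSuccess() {
    this.consecutiveFailures = 0;
    this.currentCooldown = this.cooldown;
    this.state = 'CLOSED';
  }

  // Enter OPEN and remember when, so beforeRequest can time the cooldown
  open() {
    this.state = 'OPEN';
    this.openedAt = this.now();
  }
}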

Notes

The provided example is a simplified illustration of a circuit breaker. A real-world implementation should consider additional factors like timer management, error handling, and configuration options.

Recommendation

Apply the suggested fix by implementing a per-host circuit breaker and exponential backoff with jitter to prevent repeated connection attempts and event-loop starvation.
