openclaw - ✅(Solved) Fix [Bug] Telegram fetch transport lacks circuit breaker/backoff: 1500+ ENETUNREACH/hour during offline window starves event loop and trips TUI stream watchdog [1 pull requests, 1 comments, 2 participants]

artyomx33 · 2026-05-05T15:11:14Z



GitHub stats

openclaw/openclaw#77900 • Fetched 2026-05-06 06:19:33

Author: artyomx33
Participants: artyomx33, clawsweeper[bot]
Timeline (top): commented ×1, cross-referenced ×1

When api.telegram.org is unreachable (offline, regional block, blackholed route), the gateway emits 1,500+ ENETUNREACH errors per hour. Each Telegram fetch runs the 3-attempt cascade in extensions/telegram/src/fetch.ts (DNS-resolved → IPv4-only → hardcoded fallback IP 149.154.167.220) without any global circuit-breaker state, so each long-poll cycle wastes three connection attempts and immediately re-fires.

Multiplied across getUpdates polling and concurrent media downloads, this produces sustained event-loop starvation (eventLoopDelayMaxMs > 3000ms), which in turn causes the TUI streaming watchdog (dist/tui-…js, default ~30s) to falsely report backend may have dropped this run silently — send a new message to resync on otherwise-healthy Codex/OpenAI streams.

The streaming watchdog is doing its job — it's reporting that no frames arrived for 30s. The frames didn't arrive because the gateway loop was blocked by Telegram retries, not because the upstream model failed.

Error Message

ENETUNREACH 149.154.167.220:443

Fix / Workaround

Add a per-host circuit breaker in the Telegram fetch dispatcher path (Hystrix/Polly pattern, Google SRE Ch. 22) that tracks consecutive transient failures per (host, error-class) key.

PR fix notes

PR #78097: fix: cool down unhealthy telegram transports

Repository: openclaw/openclaw
Author: bryce-d-greybeard
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/78097

Description (problem / solution / changelog)

Summary

Adds a Telegram-local cooldown for repeatedly failing Bot API transport attempts.

The existing fallback cascade can become sticky on the pinned Telegram IP. If that route is blackholed, every poll cycle keeps paying another failed network attempt. This patch does the small thing instead of building circuit-breaker cosplay: each managed transport attempt tracks transient failures, opens a short cooldown after repeated failures, and short-circuits that attempt until the cooldown expires.

Changes

Track per-attempt health inside resolveTelegramTransport().
After 5 fallback-eligible failures, mark the attempt unhealthy for 10s; repeated opens back off up to 60s.
Do not charge caller-provided dispatchers against managed transport health.
Preserve fallback progression when earlier attempts are cooling down.
Add regression coverage for a sticky pinned fallback that stops issuing fetches during cooldown.
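
The cooldown behaviour listed above can be sketched roughly as follows. This is not the code from PR #78097: every function and field name here is hypothetical, only the numbers (5 failures, 10s cooldown, 60s cap) come from the PR description, and the sketch assumes that every thrown error is fallback-eligible and that the cooldown doubles on each repeated open.

```javascript
// Hypothetical names throughout; only the constants mirror the PR description.
const FAILURE_THRESHOLD = 5;
const BASE_COOLDOWN_MS = 10_000;
const MAX_COOLDOWN_MS = 60_000;

function createAttemptHealth() {
  return { failures: 0, opens: 0, coolingUntil: 0 };
}

function isCoolingDown(health, now) {
  return now < health.coolingUntil;
}

function recordFailure(health, now) {
  health.failures += 1;
  if (health.failures < FAILURE_THRESHOLD) return;
  // Threshold reached: open a cooldown that doubles on each repeated open, capped.
  const cooldown = Math.min(MAX_COOLDOWN_MS, BASE_COOLDOWN_MS * 2 ** health.opens);
  health.opens += 1;
  health.failures = 0;
  health.coolingUntil = now + cooldown;
}

function recordSuccess(health) {
  health.failures = 0;
  health.opens = 0;
  health.coolingUntil = 0;
}

// Walk the cascade in order, skipping attempts that are cooling down so that
// later fallbacks still get a chance. Each attempt is { health, run }.
async function fetchWithCooldown(attempts, now = Date.now) {
  let lastError = new Error("all transport attempts are cooling down");
  for (const attempt of attempts) {
    if (isCoolingDown(attempt.health, now())) continue;
    try {
      const result = await attempt.run();
      recordSuccess(attempt.health);
      return result;
    } catch (err) {
      recordFailure(attempt.health, now());
      lastError = err;
    }
  }
  throw lastError;
}
```

When every attempt is cooling down, the function fails immediately without touching the network, which is the property the regression test in the PR is described as covering.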

Testing

git diff --check fork/main...HEAD — passed.
PATH="/tmp/openclaw-pnpm-shim:$PATH" pnpm test extensions/telegram/src/fetch.test.ts extensions/telegram/src/polling-session.test.ts extensions/telegram/src/monitor.test.ts -- --reporter=dot — passed, 3 files, 85 tests.
PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/check-changed.mjs — passed.

Fixes #77900

Changed files

extensions/telegram/src/fetch.test.ts (modified, +31/-0)
extensions/telegram/src/fetch.ts (modified, +107/-17)



Repro

1. Enable the Telegram channel in openclaw.json with active long-poll (entries.telegram.enabled: true)
2. Disconnect WiFi (or block 149.154.167.220:443 at the firewall) while leaving DNS resolution working
3. Within ~5 minutes, gateway logs fill with ENETUNREACH 149.154.167.220:443 (~25/min)
4. After ~10–15 minutes, liveness warnings appear with eventLoopDelayMaxMs > 1000ms despite no real workload
5. TUI sessions running on healthy providers (e.g. openai-codex/gpt-5.5) trip the streaming watchdog despite the upstream stream being alive

Evidence (one user's gateway, last 60 min)

1,535 ENETUNREACH 149.154.167.220:443 errors in 60 min (~25/min)
eventLoopDelayMaxMs = 9646.9, 3116.4, 2803.9, 1722.8 across multiple windows
TUI watchdog fired on a session backed by openai-codex/gpt-5.5 while OpenAI itself was healthy (verified by openclaw agent --agent main --message "respond with PONG" returning PONG in the same time window from a separate shell)
Standalone curl --max-time 5 https://149.154.167.220/ returns code 000 (immediate failure), while curl https://api.telegram.org/ returns 302 — DNS-based path works, the hardcoded fallback IP route does not

Source references

dist/fetch-B1BHd80D.js:135 — TELEGRAM_FALLBACK_IPS = ["149.154.167.220"] (single hardcoded IP, no rotation, no health awareness)
dist/fetch-B1BHd80D.js:383-426 — createTelegramTransportAttempts: cascade builds 3 attempts per request with no per-host circuit-breaker state
dist/fetch-B1BHd80D.js:149-155 — FALLBACK_RETRY_ERROR_CODES correctly identifies ENETUNREACH / EHOSTUNREACH etc. as retry triggers, but the cascade still re-runs on every getUpdates cycle
No matches in the codebase for circuitBreaker, markUnhealthy, or exponentialBackoff — the pattern is genuinely missing

Adjacent issues (related, different scope)

These came up in search; this issue is differentiated by being about the inbound getUpdates fetch transport cascade, not outbound actions or webhooks:

#56096 — sendChatAction infinite retry loop (outbound action; same family, different code path)
#76087 — restart-sentinel continuation stuck after transient sendMessage failure (outbound)
#73255 (CLOSED) — deleteWebhook ENETUNREACH on startup (one-shot, not poll loop)
#69165 — outbound send queue with per-chat backoff (outbound only)
#77634 — Discord fetch-timeout blocks event loop (same pattern, different channel — strong evidence this is generalized)
#58519 — Slack Socket Mode event-loop starvation (same pattern, different channel)
#62615 — gateway-side circuit breaker for unhealthy sessions (this issue is about a host-level breaker, complementary)
#41899 — Plugin Circuit Breaker framework feature request (this issue would land naturally inside that framework)

Suggested fix (first-principles)

Add a per-host circuit breaker in the Telegram fetch dispatcher path (Hystrix/Polly pattern, Google SRE Ch. 22):

Track consecutive transient failures per (host, error-class) key
After N consecutive failures within window W → state=OPEN: short-circuit further requests with host_unhealthy for cooldown T
After T → state=HALF_OPEN: allow one probe; success → CLOSED + reset, failure → back to OPEN with longer T (exponential, capped)
Reset to CLOSED on any successful response

Defaults that would match the telemetry above: N=5, W=30s, T=10s initial doubling to max 60s. All configurable under channels.telegram.transport.circuitBreaker.{ failureThreshold, window, cooldown, maxCooldown }.
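
A minimal sketch of how those defaults might be resolved from configuration. The key path and the four option names come from the issue text; the resolver function, the use of milliseconds as the unit, and the validation rules are assumptions made for illustration.

```javascript
// Defaults taken from the issue text (N=5, W=30s, T=10s, max 60s).
// Expressing them in milliseconds is an assumption.
const CIRCUIT_BREAKER_DEFAULTS = Object.freeze({
  failureThreshold: 5, // N
  window: 30_000, // W
  cooldown: 10_000, // initial T
  maxCooldown: 60_000, // cap for T
});

// Hypothetical helper: merge user overrides over the defaults and reject
// values that could never produce a working breaker.
function resolveCircuitBreakerConfig(config = {}) {
  const user = config?.channels?.telegram?.transport?.circuitBreaker ?? {};
  const resolved = { ...CIRCUIT_BREAKER_DEFAULTS };
  for (const key of Object.keys(CIRCUIT_BREAKER_DEFAULTS)) {
    const value = user[key];
    if (value === undefined) continue;
    if (!Number.isFinite(value) || value <= 0) {
      throw new RangeError(`circuitBreaker.${key} must be a positive number`);
    }
    resolved[key] = value;
  }
  if (resolved.maxCooldown < resolved.cooldown) {
    throw new RangeError("circuitBreaker.maxCooldown must be >= cooldown");
  }
  return resolved;
}
```

Calling it with an empty object returns the defaults unchanged, so an installation that never sets the key would still get the suggested behaviour.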

Pair with exponential backoff + jitter on the getUpdates polling loop itself:

delay_ms = min(MAX_DELAY, BASE * 2^consecutive_failures)
delay_ms += random_jitter(0, delay_ms / 2)

Suggested defaults: BASE=500ms, MAX_DELAY=60s. Reset on first successful response. Jitter prevents thundering-herd on shared networks.
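
The formula above translates fairly directly into JavaScript. The function name and the injectable random source are assumptions, and returning 0 when there are no failures is one reading of "reset on first successful response".

```javascript
const BASE_MS = 500;
const MAX_DELAY_MS = 60_000;

// Hypothetical helper. `random` is injectable so the jitter can be pinned in
// tests; it must return a value in [0, 1) like Math.random.
function nextPollDelayMs(consecutiveFailures, random = Math.random) {
  if (consecutiveFailures <= 0) return 0; // healthy: poll again immediately
  const capped = Math.min(MAX_DELAY_MS, BASE_MS * 2 ** consecutiveFailures);
  const jitter = random() * (capped / 2);
  return capped + jitter;
}
```

Because the jitter is added after the cap, as in the formula, the longest possible delay is just under 1.5 × MAX_DELAY (90s with the suggested defaults). Applying the cap after the jitter instead would keep the delay at or below 60s.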

The same component should be reusable by Discord (#77634), Slack (#58519), and future channels — strong fit for the framework in #41899.

Out of scope (cross-reference, would file separately if useful)

TUI watchdog should reset on a gateway→TUI keepalive frame, not only on stream content. This would let the watchdog stay at 30s without false-positiving during legitimate upstream silence. Related to #68596 (configurable threshold), #69978 (suppress duplicates), #67052 (TUI streaming indicator stays active).
Sequential stream-prep (~27s observed: core-plugin-tools:8s + system-prompt:7s + stream-setup:7s) makes each turn vulnerable to short network blips during prep. core-plugin-tools has no dependency on session-resource-loader/system-prompt and could parallelize. Plugin tool definitions are also invariant per gateway lifetime and would benefit from caching.
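
The keepalive idea in the first item could look something like the sketch below. The class, the frame shape, and the "content" and "keepalive" type names are all hypothetical; the actual TUI watchdog code is not shown in this issue.

```javascript
// Hypothetical watchdog: any frame, content or keepalive, counts as liveness.
class StreamWatchdog {
  constructor(timeoutMs = 30_000, now = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now; // injectable clock, useful for tests
    this.lastFrameAt = now();
  }

  // Call for every frame received from the gateway.
  onFrame(frame) {
    if (frame.type === "content" || frame.type === "keepalive") {
      this.lastFrameAt = this.now();
    }
  }

  // True when no frame of any kind has arrived within the timeout.
  isStalled() {
    return this.now() - this.lastFrameAt >= this.timeoutMs;
  }
}
```

With this shape the threshold can stay at 30s: a quiet upstream model no longer trips the watchdog as long as the gateway keeps sending keepalives, while a gateway that stops sending anything is still reported.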

Happy to test a candidate fix in a real environment with offline-Telegram + active Codex traffic.

Environment

openclaw 2026.5.3-1 (also reproduces in 2026.5.4 per source-equivalent inspection)
macOS 15.x (darwin 25.3.0)
Node 22
Channels enabled: telegram, whatsapp, discord

extent analysis

TL;DR

Implement a per-host circuit breaker in the Telegram fetch dispatcher path to prevent repeated connection attempts when api.telegram.org is unreachable.

Guidance

Introduce a circuit breaker with a failure threshold, window, cooldown, and max cooldown to track consecutive transient failures per host.
Implement exponential backoff with jitter on the getUpdates polling loop to prevent thundering-herd effects.
Consider making the circuit breaker component reusable for other channels like Discord and Slack.
Review and adjust the TUI watchdog to reset on gateway-TUI keepalive frames to prevent false positives.

Example

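// Simplified example of a circuit breaker (CLOSED -> OPEN -> HALF_OPEN -> CLOSED)
class CircuitBreaker {
  constructor(failureThreshold, window, cooldown, maxCooldown, now = Date.now) {
    this.failureThreshold = failureThreshold;
    this.window = window; // ms in which the failure streak must occur
    this.cooldown = cooldown; // initial OPEN duration in ms
    this.maxCooldown = maxCooldown; // cap for the doubled cooldown in ms
    this.now = now; // injectable clock
    this.state = 'CLOSED';
    this.consecutiveFailures = 0;
    this.streakStartedAt = 0;
    this.currentCooldown = cooldown;
    this.openedAt = 0;
  }

  // Call before making a request
  beforeRequest() {
    if (this.state === 'CLOSED') return;
    const cooling = this.now() - this.openedAt < this.currentCooldown;
    if (this.state === 'HALF_OPEN' || cooling) {
      throw new Error('Host is currently unhealthy');
    }
    this.state = 'HALF_OPEN'; // cooldown elapsed: let exactly one probe through
  }

  // Call after a failed request
  afterFailure() {
    const t = this.now();
    if (this.state === 'HALF_OPEN') {
      // Probe failed: reopen with a longer, capped cooldown
      this.currentCooldown = Math.min(this.maxCooldown, this.currentCooldown * 2);
      this.open(t);
      return;
    }
    // A streak that does not fit inside the window starts over
    if (this.consecutiveFailures === 0 || t - this.streakStartedAt > this.window) {
      this.consecutiveFailures = 0;
      this.streakStartedAt = t;
    }
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.failureThreshold) this.open(t);
  }

  // Call after a successful request
  afterSuccess() {
    this.consecutiveFailures = 0;
    this.currentCooldown = this.cooldown;
    this.state = 'CLOSED';
  }

  // Enter OPEN and start the cooldown clock
  open(t) {
    this.state = 'OPEN';
    this.openedAt = t;
    this.consecutiveFailures = 0;
  }
}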

Notes

The provided example is a simplified illustration of a circuit breaker. A real-world implementation should consider additional factors like timer management, error handling, and configuration options.

Recommendation

Apply the suggested fix by implementing a per-host circuit breaker and exponential backoff with jitter to prevent repeated connection attempts and event-loop starvation.
