openclaw - ✅(Solved) Fix [Bug/Design]: Telegram fetch stickyAttemptIndex is monotonic — gateway never recovers from transient network failures without restart [1 pull requests, 1 comments, 2 participants]

openclaw/openclaw#77088 · Fetched 2026-05-05 05:52:26

In extensions/telegram/src/fetch.ts, the stickyAttemptIndex closure variable is monotonically non-decreasing — once promoted to a fallback transport (IPv4-only, then pinned fallback IP 149.154.167.220), the fetch stack never returns to the default transport even after the upstream network fully recovers. Combined with connections=10 per origin and keepAliveMaxTimeout=600000ms, a transient network blip reliably degrades the Telegram fetch stack into a stuck state until the whole gateway process is restarted.

On my box (macOS 26.2, Node 25.5.0, openclaw 2026.4.29, behind GFW with an occasional DC4 blackhole) the gateway saturates roughly once every 12–24 hours with this exact pattern:

  • [telegram] sendChatAction failed: Network request for 'sendChatAction' failed! (20+ repeats)
  • [telegram] fetch fallback: enabling sticky IPv4-only dispatcher (once)
  • [telegram] fetch fallback: DNS-resolved IP unreachable; trying alternative Telegram API IP (once)
  • eventLoopDelayMaxMs=19981.8 eventLoopUtilization=1 cpuCoreRatio=1.004
  • [ws] handshake timeout — in-process WebSocket clients can't connect to the gateway anymore
  • lsof -p <pid> shows 9+ ESTABLISHED sockets to api.telegram.org:443, all presumably stale because the origin pool keepalive is 10 minutes

Once in this state, the process never recovers even when upstream Telegram connectivity is restored — only launchctl kickstart -k fixes it.

PR fix notes

PR #77157: fix(telegram): recover sticky fetch fallback after transient failures

Description (problem / solution / changelog)

Summary

  • Problem: Telegram fetch sticky fallback only promoted from primary to IPv4/pinned-IP transports and never returned to primary.
  • Why it matters: transient Telegram egress failures could leave a gateway on degraded transport until restart.
  • What changed: after repeated successful sticky fallback requests, Telegram fetch performs one primary recovery probe and resets/demotes sticky state only on successful transport recovery.
  • What did NOT change (scope boundary): no config knobs, fallback IP changes, dispatcher pool changes, or Telegram send/polling API changes.
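The "What changed" bullet above can be read as a small state machine. The sketch below is one hedged interpretation of it, not the code merged in PR #77157: `Attempt`, `createRecoveringFetch`, `demoRecovery`, and the threshold of 3 successes are all assumptions made for illustration, and the real transports are replaced by fakes.

```typescript
// Hypothetical model of one transport attempt (primary, IPv4-only, pinned IP).
type Attempt = { name: string; send: () => Promise<string> };

// Assumed value: the PR says "repeated" successes without giving a number.
const RECOVERY_PROBE_AFTER_SUCCESSES = 3;

function createRecoveringFetch(attempts: Attempt[]) {
  let stickyIndex = 0;
  let stickySuccesses = 0;

  // Try attempts[from] first, then walk forward; report which index answered.
  const runFrom = async (from: number): Promise<{ index: number; body: string }> => {
    let lastError: unknown = new Error("no transport attempts configured");
    for (let i = from; i < attempts.length; i += 1) {
      try {
        return { index: i, body: await attempts[i].send() };
      } catch (err) {
        lastError = err;
      }
    }
    throw lastError;
  };

  const fetchOnce = async (): Promise<string> => {
    const probing = stickyIndex > 0 && stickySuccesses >= RECOVERY_PROBE_AFTER_SUCCESSES;
    // Clear the streak synchronously so concurrent calls do not all probe.
    if (probing) stickySuccesses = 0;

    const { index, body } = await runFrom(probing ? 0 : stickyIndex);

    if (probing) {
      // The probe restarted the walk at the primary; whichever attempt
      // answered becomes the new sticky index (0 means fully recovered).
      stickyIndex = index;
    } else if (index > stickyIndex) {
      stickyIndex = index; // promotion after a failure
      stickySuccesses = 0;
    } else if (stickyIndex > 0) {
      stickySuccesses += 1; // clean success on the fallback
    }
    return body;
  };

  return { fetchOnce, getStickyIndex: () => stickyIndex };
}

// Simulated outage: the primary fails once, then optionally recovers.
async function demoRecovery(primaryRecovers: boolean): Promise<{ calls: string[]; sticky: number }> {
  let primaryHealthy = false;
  const calls: string[] = [];
  const transport = createRecoveringFetch([
    {
      name: "primary",
      send: async () => {
        calls.push("primary");
        if (!primaryHealthy) throw new Error("unreachable");
        return "ok";
      },
    },
    { name: "ipv4", send: async () => { calls.push("ipv4"); return "ok"; } },
  ]);

  await transport.fetchOnce(); // primary fails, ipv4 answers, index becomes 1
  primaryHealthy = primaryRecovers;
  for (let i = 0; i < RECOVERY_PROBE_AFTER_SUCCESSES; i += 1) await transport.fetchOnce();
  await transport.fetchOnce(); // this request doubles as the recovery probe
  return { calls, sticky: transport.getStickyIndex() };
}
```

With `demoRecovery(true)` the last request goes to the primary and the sticky index returns to 0. With `demoRecovery(false)` the probe fails, the same request falls back to the IPv4 fake, and the index stays at 1, which matches the bounded-probe behavior the PR describes.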

Change Type

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #77088
  • Related #N/A
  • This PR fixes a bug or regression

Root Cause

  • Root cause: stickyAttemptIndex was monotonic and had no success-path recovery logic.
  • Missing detection / guardrail: existing tests asserted sticky promotion but not recovery after transient failure.
  • Contributing context (if known): Telegram transport fallback is useful for persistent IPv6/DNS issues, but needed a bounded recovery path.

Regression Test Plan

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/telegram/src/fetch.test.ts
  • Scenario the test should lock in: sticky IPv4 and pinned-IP fallback recover to primary after repeated successes, while a failed primary probe keeps fallback sticky.
  • Why this is the smallest reliable guardrail: it tests the transport state machine directly with mocked fetch dispatchers.
  • Existing test that already covers this (if any): existing sticky fallback tests covered promotion only.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Telegram transport can recover from sticky IPv4/pinned-IP fallback without restarting the gateway after the primary path becomes healthy again.

Diagram

Before:
primary failure -> sticky fallback -> remains degraded until restart

After:
primary failure -> sticky fallback -> recovery probe -> primary restored when healthy

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 24.13.0
  • Model/provider: N/A
  • Integration/channel (if any): Telegram
  • Relevant config (redacted): default Telegram fetch transport with fallback enabled

Steps

  1. Simulate a transient primary Telegram fetch failure.
  2. Observe sticky fallback promotion to IPv4 or pinned-IP dispatcher.
  3. Let fallback requests succeed enough to trigger recovery probing.
  4. Simulate primary recovery.

Expected

  • Sticky fallback resets to primary after a successful primary recovery probe.

Actual

  • Before this fix, sticky fallback remained degraded until process restart.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

pnpm test extensions/telegram/src/fetch.test.ts
pnpm test extensions/telegram/src/polling-transport-state.test.ts extensions/telegram/src/polling-session.test.ts
pnpm test extensions/telegram/src/send.test.ts
pnpm check:changed

Human Verification

What you personally verified (not just CI), and how:

  • Verified scenarios: IPv4 sticky fallback recovery, pinned-IP sticky fallback recovery, failed primary recovery probe retaining fallback.
  • Edge cases checked: caller-provided dispatchers do not advance recovery state; all-attempt failure still leaves armed fallback sticky.
  • What you did not verify: live overnight flaky-network saturation behind GFW.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: Primary transport could still be unhealthy when probed.
    • Mitigation: the probe is bounded to one normal request after repeated sticky successes; if primary fails, the same request falls back and sticky fallback remains active.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/telegram/src/fetch.test.ts (modified, +84/-15)
  • extensions/telegram/src/fetch.ts (modified, +55/-1)

Raw issue body

Bug type

Design flaw / stability

Summary

In extensions/telegram/src/fetch.ts, the stickyAttemptIndex closure variable is monotonically non-decreasing — once promoted to a fallback transport (IPv4-only, then pinned fallback IP 149.154.167.220), the fetch stack never returns to the default transport even after the upstream network fully recovers. Combined with connections=10 per origin and keepAliveMaxTimeout=600000ms, a transient network blip reliably degrades the Telegram fetch stack into a stuck state until the whole gateway process is restarted.

On my box (macOS 26.2, Node 25.5.0, openclaw 2026.4.29, behind GFW with an occasional DC4 blackhole) the gateway saturates roughly once every 12–24 hours with this exact pattern:

  • [telegram] sendChatAction failed: Network request for 'sendChatAction' failed! (20+ repeats)
  • [telegram] fetch fallback: enabling sticky IPv4-only dispatcher (once)
  • [telegram] fetch fallback: DNS-resolved IP unreachable; trying alternative Telegram API IP (once)
  • eventLoopDelayMaxMs=19981.8 eventLoopUtilization=1 cpuCoreRatio=1.004
  • [ws] handshake timeout — in-process WebSocket clients can't connect to the gateway anymore
  • lsof -p <pid> shows 9+ ESTABLISHED sockets to api.telegram.org:443, all presumably stale because the origin pool keepalive is 10 minutes

Once in this state, the process never recovers even when upstream Telegram connectivity is restored — only launchctl kickstart -k fixes it.

Root cause (as I read the code)

File: extensions/telegram/src/fetch.ts (reading from the compiled dist/extensions/telegram/fetch-*.js in 2026.4.29 — original TS file on disk unavailable).

Three compounding design choices:

1. stickyAttemptIndex is monotonic. From resolveTelegramTransport:

let stickyAttemptIndex = 0;
const promoteStickyAttempt = (nextIndex, err, reason) => {
  if (nextIndex <= stickyAttemptIndex || nextIndex >= transportAttempts.length) return false;
  // ...
  stickyAttemptIndex = nextIndex;  // only goes UP, never down
  return true;
};

const resolvedFetch = (async (input, init) => {
  const startIndex = Math.min(stickyAttemptIndex, transportAttempts.length - 1);
  // ... tries startIndex first, on failure walks forward through the list
  for (let nextIndex = startIndex + 1; nextIndex < transportAttempts.length; nextIndex += 1) {
    promoteStickyAttempt(nextIndex, err);
    // ...
  }
});

There is no path back to stickyAttemptIndex = 0 — no success counter, no time-based reset, no periodic probe of the primary transport. Once a single transient failure walks the index to 2 (pinned fallback IP), every subsequent request for the lifetime of the process uses only the fallback IP. If that IP later goes soft-bad (still answering TLS handshake but slow) or the pinned dispatcher's keepalive pool fills with dead sockets, the stack has nowhere to escape to.
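To make the latch concrete, here is a simplified model of the behavior described above. It is a sketch under stated assumptions, not the compiled code: `createMonotonicFetch`, `demoMonotonic`, and the two fake transports are hypothetical, and the extra conditions inside the real `promoteStickyAttempt` (elided with `// ...` in the excerpt) are left out.

```typescript
// Hypothetical stand-in for one transport attempt.
type Attempt = { name: string; send: () => Promise<string> };

function createMonotonicFetch(attempts: Attempt[]) {
  let stickyAttemptIndex = 0;
  const used: string[] = []; // which transport answered each request

  const fetchOnce = async (): Promise<string> => {
    let lastError: unknown = new Error("no transport attempts configured");
    // Start at the sticky index and only ever walk forward.
    for (let i = stickyAttemptIndex; i < attempts.length; i += 1) {
      try {
        const body = await attempts[i].send();
        used.push(attempts[i].name);
        return body;
      } catch (err) {
        lastError = err;
        // Promotion only goes up: the next attempt becomes the new start.
        if (i + 1 < attempts.length) stickyAttemptIndex = i + 1;
      }
    }
    throw lastError;
  };

  return { fetchOnce, used, getStickyIndex: () => stickyAttemptIndex };
}

async function demoMonotonic(): Promise<{ used: string[]; sticky: number }> {
  let primaryHealthy = false; // simulate a transient outage
  const transport = createMonotonicFetch([
    {
      name: "primary",
      send: async () => {
        if (!primaryHealthy) throw new Error("unreachable");
        return "ok";
      },
    },
    { name: "ipv4", send: async () => "ok" },
  ]);

  await transport.fetchOnce(); // primary fails, ipv4 answers, index becomes 1
  primaryHealthy = true;       // the network recovers
  await transport.fetchOnce(); // still starts at ipv4
  await transport.fetchOnce(); // and again
  return { used: transport.used, sticky: transport.getStickyIndex() };
}
```

In this model the primary is never contacted again after the first failure, even though it is healthy for the second and third requests.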

2. Connection pool too wide, keepalive too long.

const TELEGRAM_DISPATCHER_KEEP_ALIVE_MAX_TIMEOUT_MS = 6e5;    // 10 minutes
const TELEGRAM_DISPATCHER_CONNECTIONS_PER_ORIGIN = 10;

With 10 connections per origin, when the upstream flaps it's common for several sockets to go into "ESTABLISHED but dead" state (the remote silently dropped them, kernel hasn't noticed). They then occupy slots in the origin pool for up to 10 minutes, during which sendChatAction requests sitting in that agent block on socket acquisition or stall inside await. Across multiple concurrent sessions this drives eventLoopUtilization to 1.0 and produces the multi-second eventLoopDelayMaxMs I quoted above.

3. TELEGRAM_FALLBACK_IPS = ["149.154.167.220"] is a single-point-of-failure fallback. Once the stack has promoted to the fallback-IP attempt, if that single pinned IP also degrades there is no further option in the list — the code returns to the top of the loop with stickyAttemptIndex = 2 and repeats the same broken path forever.

Steps to reproduce

Hard to reproduce deterministically on a clean network, but reliably happens on a host behind a flaky egress (e.g. behind the GFW, or any ISP where DC4 149.154.166.0/24 intermittently blackholes). After ~12h of uptime:

  1. Let the gateway run overnight with a Telegram bot channel configured and at least one active embedded-agent session using it.
  2. During a period where api.telegram.org's DNS result is unreachable for ~30s (a typical GFW flutter), observe the two fetch fallback log lines fire.
  3. Upstream connectivity restores within a minute.
  4. The gateway stays on the fallback path and gradually accumulates [telegram] sendChatAction failed spam and eventLoopDelayMaxMs > 5s until [ws] handshake timeout starts appearing and the gateway becomes unresponsive.

Expected behavior

  • After N consecutive successful fetches (e.g. 5), stickyAttemptIndex should decay back toward 0 so that the cheapest/primary transport is re-probed when the network recovers.
  • Or: a periodic background probe of the primary dispatcher (every 60–120s while sticky > 0) that resets the index on success.
  • Additionally, connections per origin should be lower (2–4 seems plenty for a Telegram bot) and keepAliveMaxTimeout should be much shorter (30–60s) to bound the dead-socket problem.
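The second option (a periodic probe of the primary while sticky > 0) could look roughly like the following. This is a minimal sketch, assuming the sticky state is reachable through two callbacks; `createPrimaryProber`, `maybeProbe`, `probePrimary`, and `demoProber` are hypothetical names, and the clock is passed in so the logic can be exercised without real timers.

```typescript
type ProbeOutcome = "skipped" | "recovered" | "still-down";

function createPrimaryProber(opts: {
  intervalMs: number;                   // e.g. 60_000 to 120_000, per the bullet above
  getStickyIndex: () => number;
  resetSticky: () => void;
  probePrimary: () => Promise<boolean>; // resolves true when the primary answers
}) {
  let lastProbeAt = Number.NEGATIVE_INFINITY;
  let inFlight = false;

  // Call from a timer, or opportunistically before each request.
  return async function maybeProbe(nowMs: number): Promise<ProbeOutcome> {
    if (opts.getStickyIndex() === 0) return "skipped"; // already on primary
    if (inFlight || nowMs - lastProbeAt < opts.intervalMs) return "skipped";

    inFlight = true;
    lastProbeAt = nowMs;
    try {
      // A rejected probe counts as "primary still down".
      const healthy = await opts.probePrimary().catch(() => false);
      if (!healthy) return "still-down";
      opts.resetSticky();
      return "recovered";
    } finally {
      inFlight = false;
    }
  };
}

async function demoProber(): Promise<{ results: ProbeOutcome[]; sticky: number }> {
  let sticky = 2; // pretend we are pinned to the fallback IP
  let healthy = false;
  const maybeProbe = createPrimaryProber({
    intervalMs: 60_000,
    getStickyIndex: () => sticky,
    resetSticky: () => { sticky = 0; },
    probePrimary: async () => healthy,
  });

  const results: ProbeOutcome[] = [];
  results.push(await maybeProbe(0));       // primary still down
  results.push(await maybeProbe(30_000));  // too soon after the last probe
  healthy = true;
  results.push(await maybeProbe(60_000));  // primary answers, index resets
  results.push(await maybeProbe(120_000)); // nothing to do any more
  return { results, sticky };
}
```

In a running gateway this would presumably be driven by something like `setInterval(() => void maybeProbe(Date.now()), 15_000)`. Note that this sketch probes on the first tick after promotion rather than waiting a full interval; whether that is desirable is a design choice the issue leaves open.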

Actual behavior

Once promoted, the stack stays promoted forever; event loop saturates; gateway requires a manual launchctl kickstart -k.

Suggested fix (willing to PR if direction is agreed)

Minimally invasive change in resolveTelegramTransport:

let stickyAttemptIndex = 0;
let consecutiveSuccessOnSticky = 0;
const STICKY_RESET_THRESHOLD = 5;  // or make this configurable

const demoteStickyAttempt = () => {
  if (stickyAttemptIndex === 0) return;
  consecutiveSuccessOnSticky += 1;
  if (consecutiveSuccessOnSticky >= STICKY_RESET_THRESHOLD) {
    log.info(`telegram fetch stack: resetting sticky index ${stickyAttemptIndex} -> 0 after ${consecutiveSuccessOnSticky} consecutive successes`);
    stickyAttemptIndex = 0;
    consecutiveSuccessOnSticky = 0;
  }
};

// in the success branch of resolvedFetch, after a clean response on the start attempt:
demoteStickyAttempt();

// in promoteStickyAttempt, reset the counter:
stickyAttemptIndex = nextIndex;
consecutiveSuccessOnSticky = 0;

Orthogonally:

  • expose connections / keepAliveMaxTimeout / the reset threshold / the fallback-IP list as channels.telegram.network.* config knobs (or environment vars following the existing OPENCLAW_TELEGRAM_* pattern), so users behind hostile networks can tune without patching dist/.
  • consider adding a second fallback IP in the DC5 range (the current list has one) so that when DC4 is blackholed there is still a second option if .220 degrades.
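As a rough illustration of the config-knob idea, the sketch below reads the four settings from environment variables and falls back to the values quoted in this issue. The variable names are hypothetical (only the `OPENCLAW_TELEGRAM_*` prefix comes from the text above), as are `TelegramNetworkConfig`, `parsePositiveInt`, and `loadTelegramNetworkConfig`.

```typescript
interface TelegramNetworkConfig {
  connectionsPerOrigin: number;
  keepAliveMaxTimeoutMs: number;
  stickyResetThreshold: number;
  fallbackIps: string[];
}

// Defaults taken from the values quoted in this issue.
const DEFAULTS: TelegramNetworkConfig = {
  connectionsPerOrigin: 10,
  keepAliveMaxTimeoutMs: 600_000,
  stickyResetThreshold: 5,
  fallbackIps: ["149.154.167.220"],
};

// Accept only positive integers; anything else keeps the default.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

// Hypothetical env var names; pass process.env in a real Node process.
function loadTelegramNetworkConfig(
  env: Record<string, string | undefined>,
): TelegramNetworkConfig {
  const ips = (env.OPENCLAW_TELEGRAM_FALLBACK_IPS ?? "")
    .split(",")
    .map((ip) => ip.trim())
    .filter((ip) => ip.length > 0);

  return {
    connectionsPerOrigin: parsePositiveInt(
      env.OPENCLAW_TELEGRAM_CONNECTIONS,
      DEFAULTS.connectionsPerOrigin,
    ),
    keepAliveMaxTimeoutMs: parsePositiveInt(
      env.OPENCLAW_TELEGRAM_KEEPALIVE_MAX_TIMEOUT_MS,
      DEFAULTS.keepAliveMaxTimeoutMs,
    ),
    stickyResetThreshold: parsePositiveInt(
      env.OPENCLAW_TELEGRAM_STICKY_RESET_THRESHOLD,
      DEFAULTS.stickyResetThreshold,
    ),
    fallbackIps: ips.length > 0 ? ips : [...DEFAULTS.fallbackIps],
  };
}
```

Invalid or missing values silently keep the current defaults, so an operator who sets nothing sees no behavior change. A stricter variant might log a warning on unparseable input instead.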

Related issues (different angle, same blast radius)

  • #45759 Telegram typing keepalive loop lacks circuit breaker
  • #56096 Telegram sendChatAction infinite retry loop with no backoff
  • #76852 Periodic getMe 10s fetch-timeout storm on networks without IPv6 egress
  • #55347 Native gateway self-healing

The common root across several of these is that the Telegram subsystem has no feedback loop from "upstream is healthy again" back into its internal state — every failure mode is latched.

Environment

  • openclaw 2026.4.29
  • Node v25.5.0
  • macOS 26.2 (arm64)
  • Behind a network that occasionally blackholes the 149.154.166.0/23 DC4 range

extent analysis

TL;DR

The most likely fix involves modifying the stickyAttemptIndex logic in resolveTelegramTransport to allow it to decay back to 0 after a series of successful fetches, preventing the fetch stack from getting stuck on a fallback transport.

Guidance

  • Introduce a consecutiveSuccessOnSticky counter to track the number of successful fetches on the sticky attempt index.
  • Implement a demoteStickyAttempt function to reset stickyAttemptIndex to 0 after a threshold of consecutive successes (e.g., 5).
  • Reset the consecutiveSuccessOnSticky counter when promoting the sticky attempt index.
  • Consider exposing configuration knobs for connections, keepAliveMaxTimeout, and the reset threshold to allow users to tune these settings without patching the code.
  • Adding a second fallback IP in a different range can provide an alternative when the primary fallback IP degrades.

Example

let stickyAttemptIndex = 0;
let consecutiveSuccessOnSticky = 0;
const STICKY_RESET_THRESHOLD = 5;

const demoteStickyAttempt = () => {
  if (stickyAttemptIndex === 0) return;
  consecutiveSuccessOnSticky += 1;
  if (consecutiveSuccessOnSticky >= STICKY_RESET_THRESHOLD) {
    // Reset stickyAttemptIndex to 0 after threshold successes
    stickyAttemptIndex = 0;
    consecutiveSuccessOnSticky = 0;
  }
};

Notes

  • The provided fix assumes that the issue is primarily caused by the monotonic nature of stickyAttemptIndex and the lack of a feedback loop to reset it when the network recovers.
  • Additional issues related to connection pooling and keepalive timeouts may require separate fixes or configuration adjustments.
  • The effectiveness of the proposed fix may depend on the specific network conditions and usage patterns.

Recommendation

Apply the suggested fix to modify the `stickyAttempt
