openclaw - ✅(Solved) Fix [Bug] Telegram polling silently dies for 30+ min with no error and no auto-recovery; pollingStallThresholdMs watchdog suppressed by unrelated API activity [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#78422 · Fetched 2026-05-07 03:37:04
Comments: 1 · Participants: 2 · Timeline: 5 · Reactions: 2
Timeline (top): cross-referenced ×2, closed ×1, commented ×1, subscribed ×1

The Telegram channel's getUpdates long-poll silently stopped delivering inbound updates for 39 minutes on a healthy gateway. No errors, no warnings, no recovery — the polling watchdog (pollingStallThresholdMs, default 120000ms) never fired. A manual openclaw gateway restart recovered it instantly.

The behavior is reproducible across two unrelated hosts (macOS arm64 + Windows 11 x64) running the same OpenClaw build, and reading dist/monitor-polling.runtime-DjS2STzm.js shows the watchdog's stall-detect logic uses an || that lets unrelated API activity suppress stall detection even when getUpdates is dead.

Error Message

[Bug] Telegram polling silently dies for 30+ min with no error and no auto-recovery; pollingStallThresholdMs watchdog suppressed by unrelated API activity

If getUpdates is in a tight retry-error loop (server returning 502s, transient socket close), every error bumps lastApiActivityAt, which again silences the watchdog. The watchdog should treat only successful API activity as a liveness signal, never error events.

Both hypothesized failure modes (a getUpdates request hung in flight, or a poll loop that the liveness tracker no longer sees) are consistent with the observed evidence: no error logs, no recovery, immediate fix on restart.

Suggested change: stop bumping lastApiActivityAt from noteGetUpdatesError. Errors are not liveness; they signal that the runner is failing, and a tight error-retry loop would otherwise suppress the watchdog perpetually.

  • Add a [telegram][diag] log line on stall detection even if no restart fires (currently [telegram][diag] polling cycle finished/error reason=... only logs on cycle exit)

Root Cause

Root cause hypothesis: || in detectStall allows unrelated API activity to suppress watchdog

Fix Action

Fix / Workaround

Happy to test patches against the affected hosts.

PR fix notes

PR #78646: fix(telegram): keep polling watchdog on getUpdates liveness

Description (problem / solution / changelog)

Summary

  • Problem: Telegram polling stall recovery treated unrelated outbound Bot API activity as liveness for inbound getUpdates polling.
  • Why it matters: active sendMessage traffic could mask a wedged inbound polling loop, leaving Telegram replies silent until a manual restart.
  • What changed: make the stall watchdog depend on completed/stuck getUpdates liveness only, while keeping unrelated API elapsed time in diagnostics.
  • What did NOT change (scope boundary): this does not redesign Telegram transport rebuild behavior beyond ensuring the watchdog fires when inbound polling is stale.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #78422
  • Related #78473
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: TelegramPollingLivenessTracker.detectStall() returned no stall when either getUpdates elapsed time or generic Bot API elapsed time was still within the threshold.
  • Missing detection / guardrail: tests covered stale polling and stale unrelated API calls, but not the case where stale getUpdates coincides with recent or in-flight non-polling API traffic.
  • Contributing context (if known): outbound Telegram API success proves the Bot API path is alive, but it does not prove inbound long-polling is still progressing.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/telegram/src/polling-liveness.test.ts, extensions/telegram/src/polling-session.test.ts
  • Scenario the test should lock in: stale getUpdates still triggers watchdog restart even when sendMessage recently succeeded or a non-getUpdates API call is in flight.
  • Why this is the smallest reliable guardrail: the regression is in the polling liveness decision and session watchdog behavior, so targeted tracker/session tests catch it without live Telegram credentials.
  • Existing test that already covers this (if any): existing stale polling tests covered the baseline restart path but not unrelated API masking.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Telegram polling recovery now restarts stale inbound polling even if unrelated outbound Telegram API calls are active or recently succeeded.

Diagram (if applicable)

Before:
stale getUpdates + recent sendMessage -> watchdog suppressed -> inbound polling stays wedged

After:
stale getUpdates + recent sendMessage -> watchdog restart -> polling cycle recovers
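
The before/after behavior can be illustrated with a minimal TypeScript sketch. It is not the code from the PR: `LivenessSnapshot`, `StallReport`, and `detectStallAfterFix` are hypothetical names, and the message is a simplified imitation of the real log line. It only shows the idea described in the summary: the decision uses getUpdates elapsed time alone, while `apiElapsedMs` is carried along for diagnostics.

```typescript
// Hypothetical shapes for illustration; the real tracker in
// extensions/telegram/src/polling-liveness.ts keeps more state than this.
interface LivenessSnapshot {
  now: number;
  inFlightGetUpdates: number;
  lastGetUpdatesStartedAt: number | null;
  lastGetUpdatesFinishedAt: number;
  lastApiActivityAt: number;
}

interface StallReport {
  message: string;
  getUpdatesElapsedMs: number;
  apiElapsedMs: number; // diagnostics only, never part of the decision
}

function detectStallAfterFix(s: LivenessSnapshot, thresholdMs: number): StallReport | null {
  const inFlight = s.inFlightGetUpdates > 0;
  const startedAt = s.lastGetUpdatesStartedAt;
  const getUpdatesElapsedMs = inFlight
    ? (startedAt != null ? s.now - startedAt : 0)
    : s.now - s.lastGetUpdatesFinishedAt;
  const apiElapsedMs = s.now - s.lastApiActivityAt;

  // Only getUpdates liveness gates the watchdog; recent sendMessage traffic cannot suppress it.
  if (getUpdatesElapsedMs <= thresholdMs) return null;

  // Simplified, assumed wording; the real message carries more diag fields.
  const kind = inFlight ? "active getUpdates stuck" : "no completed getUpdates";
  const seconds = Math.floor(getUpdatesElapsedMs / 1000);
  return {
    message: `Polling stall detected (${kind} for ${seconds}s); forcing restart. [diag apiElapsedMs=${apiElapsedMs}]`,
    getUpdatesElapsedMs,
    apiElapsedMs,
  };
}

// Stale getUpdates (180s in flight) with a sendMessage success 30s ago.
const report = detectStallAfterFix(
  { now: 180_000, inFlightGetUpdates: 1, lastGetUpdatesStartedAt: 0, lastGetUpdatesFinishedAt: 0, lastApiActivityAt: 150_000 },
  120_000,
);
console.log(report === null ? "no stall" : report.message);
```

Keeping `apiElapsedMs` in the report preserves the debugging signal without letting it influence the restart decision.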

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Ubuntu 24.04.4 LTS
  • Runtime/container: Node 22 / pnpm
  • Model/provider: N/A
  • Integration/channel (if any): Telegram plugin polling watchdog
  • Relevant config (redacted): targeted regression tests do not require a live token; live proof used TELEGRAM_BOT_TOKEN env fallback from a local redacted token file

Steps

  1. Create a stale getUpdates liveness state.
  2. Record unrelated Telegram API activity such as sendMessage success or an in-flight non-getUpdates API call.
  3. Fire the polling stall watchdog.

Expected

  • Watchdog reports a polling stall and restarts the polling cycle.

Actual

  • Before this fix, recent unrelated API activity suppressed the watchdog.
  • After this fix, stale getUpdates liveness controls the watchdog and restart proceeds.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Validation on the rebased branch:

pnpm exec oxfmt --check --threads=1 CHANGELOG.md extensions/telegram/src/polling-liveness.ts extensions/telegram/src/polling-liveness.test.ts extensions/telegram/src/polling-session.test.ts
All matched files use the correct format.

pnpm test extensions/telegram/src/polling-liveness.test.ts extensions/telegram/src/polling-session.test.ts -- --reporter=verbose
Test Files 2 passed (2)
Tests 23 passed (23)

Real behavior proof

  • Behavior or issue addressed: Telegram polling watchdog recovery should fire from stale getUpdates liveness even when unrelated outbound Bot API calls are active.
  • Real environment tested: Ubuntu 24.04.4 LTS, PR branch fix/telegram-polling-watchdog-getupdates, commit e301533582, Node v22.22.1, pnpm 10.33.2, real Telegram Bot API token from a local redacted token file, and a private DM chat with the bot.
  • Exact steps or command run after this patch: Called real Telegram Bot API getMe, read a recent private DM via getUpdates, sent a disabled-notification proof message with sendMessage, exercised the PR liveness code to verify stale getUpdates returns STALL after outbound Telegram activity, then ran an isolated source-mode Gateway on port 19986 across the watchdog window with TELEGRAM_BOT_TOKEN supplied via env fallback.
  • Evidence after fix: Copied live output from Ubuntu 24.04.4 LTS, token omitted:
telegram_getMe=ok botId=8656041674 username=set
telegram_recent_chat=found chatType=private updateId=198414331
telegram_sendMessage=ok source=updates chatId=6599824666 messageId=70
live_sendMessage_stale_getUpdates=STALL
live_liveness_message=Polling stall detected (active getUpdates stuck for 120s); forcing restart. [diag inFlight=1 outcome=started startedAt=0 finishedAt=n/a durationMs=n/a offset=123 apiElapsedMs=60001]

Telegram client also showed the real round trip:

[5/6/2026 3:30 PM] Crazy Cat: test
[5/6/2026 3:30 PM] Orinclaw Assistant: OpenClaw PR #78646 live watchdog proof 2026-05-06T22:30:53.827Z

Additional isolated Gateway live proof after the same patch:

branch=fix/telegram-polling-watchdog-getupdates
commit=e301533582
os=Ubuntu 24.04.4 LTS
mode=isolated source-mode Gateway, port 19986, real Telegram bot token from env fallback
telegram_provider_start=[default] starting provider (@orinclaw_ai_bot)
inbound_updates=real pending Telegram DM updates consumed by Gateway poller
window=2026-05-06T22:51:45+00:00..2026-05-06T22:55:47+00:00
samples=5
health=live on every sample
ready=true failing=[] on every sample
polling_stall_count=0
getupdates_conflict_count=0
telegram_provider_start_count=1
final_health={"ok":true,"status":"live"}
final_ready={"ready":true,"failing":[]}
shutdown=clean SIGINT after validation

Before-fix long-lived reproduction on parent commit d05415d603:

scenario=active getUpdates started at t=0, unrelated non-getUpdates API success every 30s, watchdog threshold=120000ms
sample_0 t=0s result=NO_STALL
sample_1 t=30s result=NO_STALL
sample_2 t=60s result=NO_STALL
sample_3 t=90s result=NO_STALL
sample_4 t=120s result=NO_STALL
sample_135s t=135s result=NO_STALL
final_expected=STALL
final_actual=NO_STALL
reproduced_bug=stale getUpdates exceeded threshold but watchdog stayed suppressed by unrelated API liveness
  • Observed result after fix: The bot successfully handled real getMe, getUpdates, and sendMessage; the watchdog returned STALL after real outbound Telegram activity; the isolated Gateway stayed live/ready across the watchdog window with one Telegram provider start, zero false Polling stall detected logs, and zero getUpdates conflict logs.
  • What was not tested: I did not run a multi-hour production soak or model-response verification in the isolated test home. The isolated home intentionally had no OpenAI auth, so agent replies failed after Telegram polling consumed inbound DM updates; that auth failure is separate from Telegram polling liveness.
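
The before-fix reproduction log above can be approximated with a small, self-contained simulation. This is a sketch under assumptions, not the script used for the PR: `preFixGate`, `postFixGate`, and `simulate` are hypothetical names, the pre-fix gate is the `||` condition quoted in the issue, and the post-fix gate is reduced to "getUpdates elapsed exceeds the threshold".

```typescript
type Gate = (getUpdatesElapsedMs: number, apiElapsedMs: number, thresholdMs: number) => boolean;

// Pre-fix gate as quoted in the issue: either fresh signal suppresses the stall.
const preFixGate: Gate = (g, a, t) => !(g <= t || a <= t);
// Post-fix behavior as described in the PR: only getUpdates staleness matters.
const postFixGate: Gate = (g, _a, t) => g > t;

function simulate(
  gate: Gate,
  sampleTimesSec: number[],
  thresholdMs: number = 120_000,
  apiPeriodSec: number = 30,
): string[] {
  const getUpdatesStartedAtMs = 0; // active getUpdates started at t=0 and never returns
  return sampleTimesSec.map((tSec) => {
    const nowMs = tSec * 1000;
    // Most recent unrelated API success at or before `now`, on a fixed cadence.
    const lastApiMs = Math.floor(tSec / apiPeriodSec) * apiPeriodSec * 1000;
    const stalled = gate(nowMs - getUpdatesStartedAtMs, nowMs - lastApiMs, thresholdMs);
    return `t=${tSec}s result=${stalled ? "STALL" : "NO_STALL"}`;
  });
}

const sampleTimes = [0, 30, 60, 90, 120, 135];
console.log("pre-fix:\n" + simulate(preFixGate, sampleTimes).join("\n"));
console.log("post-fix:\n" + simulate(postFixGate, sampleTimes).join("\n"));
```

With these inputs the pre-fix gate reports NO_STALL at every sample, including t=135s, while the getUpdates-only gate reports STALL at t=135s.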

Human Verification (required)

  • Verified scenarios: stale getUpdates with recent non-polling API success, stale getUpdates with recent in-flight non-polling API activity, stale getUpdates with newer in-flight non-polling activity, existing stale polling restart paths, pre-fix long-lived suppression reproduction, real Telegram Bot API getMe/getUpdates/sendMessage, and an isolated live Gateway Telegram polling run across the watchdog window.
  • Edge cases checked: diagnostic output keeps apiElapsedMs for debugging while not using generic API liveness to suppress stale polling recovery.
  • What was not tested: I did not run a multi-hour production soak or model-response verification in the isolated test home; live Telegram polling startup, real update consumption, and watchdog-window stability were verified with a real token.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: Telegram polling may restart while outbound API traffic is healthy.
    • Mitigation: this is intentional; outbound API health is not inbound getUpdates health, and the watchdog threshold/throttling still bounds restarts.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/telegram/src/polling-liveness.test.ts (modified, +10/-6)
  • extensions/telegram/src/polling-liveness.ts (modified, +31/-20)
  • extensions/telegram/src/polling-session.test.ts (modified, +19/-22)

Code Example

{
    "enabled": true,
    "dmPolicy": "pairing",
    "groupPolicy": "allowlist",
    "streaming": { "mode": "off", "preview": { "toolProgress": false } },
    "timeoutSeconds": 180,
    "retry": { "attempts": 5, "minDelayMs": 500, "maxDelayMs": 30000, "jitter": 0.2 }
    // pollingStallThresholdMs not overridden; using default 120000
}

---

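detectStall(params) {
    const now = params.now ?? this.#now();
    // Time the current getUpdates request has been in flight (0 if none is tracked).
    const activeElapsed = this.#inFlightGetUpdates > 0 && this.#lastGetUpdatesStartedAt != null
        ? now - this.#lastGetUpdatesStartedAt : 0;
    // Time since the last getUpdates finished (0 while one is in flight).
    const idleElapsed = this.#inFlightGetUpdates > 0
        ? 0
        : now - (this.#lastGetUpdatesFinishedAt ?? this.#lastGetUpdatesAt);
    const elapsed = this.#inFlightGetUpdates > 0 ? activeElapsed : idleElapsed;
    // Time since the most recent generic Bot API activity, including in-flight calls.
    const apiElapsed = now - (this.#latestInFlightApiStartedAt == null
        ? this.#lastApiActivityAt
        : Math.max(this.#lastApiActivityAt, this.#latestInFlightApiStartedAt));
    // Reported bug: either fresh value returns null, so unrelated API traffic hides a dead getUpdates.
    if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;
    ...
}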

---

// Both the success path and the error path refresh lastApiActivityAt.
noteGetUpdatesSuccess(result, at) { ...; this.#lastApiActivityAt = at; ... }
noteGetUpdatesError(err, at)      { ...; this.#lastApiActivityAt = at; ... }

---

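// Gate suggested in the issue: && instead of ||, so a stale getUpdates alone is enough to report a stall.
if (elapsed <= params.thresholdMs && apiElapsed <= params.thresholdMs) return null;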

[Bug] Telegram polling silently dies for 30+ min with no error and no auto-recovery; pollingStallThresholdMs watchdog suppressed by unrelated API activity

Summary

The Telegram channel's getUpdates long-poll silently stopped delivering inbound updates for 39 minutes on a healthy gateway. No errors, no warnings, no recovery — the polling watchdog (pollingStallThresholdMs, default 120000ms) never fired. A manual openclaw gateway restart recovered it instantly.

The behavior is reproducible across two unrelated hosts (macOS arm64 + Windows 11 x64) running the same OpenClaw build, and reading dist/monitor-polling.runtime-DjS2STzm.js shows the watchdog's stall-detect logic uses an || that lets unrelated API activity suppress stall detection even when getUpdates is dead.

Environment

  • OpenClaw: 2026.5.4 (325df3e) (also seen on 2026.5.3-1 before upgrade)
  • OS A: Windows 11 x64, Node v22.22.2, gateway started via Task Scheduler gateway.cmd
  • OS B: macOS Darwin 25.4.0 arm64, Node v25.9.0
  • Mode: polling (no webhook configured)
  • Network: US home broadband, no proxy, api.telegram.org directly reachable
  • Account config (excerpt):
    {
      "enabled": true,
      "dmPolicy": "pairing",
      "groupPolicy": "allowlist",
      "streaming": { "mode": "off", "preview": { "toolProgress": false } },
      "timeoutSeconds": 180,
      "retry": { "attempts": 5, "minDelayMs": 500, "maxDelayMs": 30000, "jitter": 0.2 }
      // pollingStallThresholdMs not overridden — using default 120000
    }

Incident timeline (Windows host, all times EDT)

| Time | Evidence |
| --- | --- |
| 01:02:42 | Last Telegram-channel embedded run start event recorded by gateway. |
| 01:02:42 → 01:45 | 39 min of silence. Gateway healthy: probe ok, port listening, crons firing on schedule. Heartbeats normal. No errors, no crashes, no warnings. Watchdog never logged a stall. Webhook heartbeats showed webhooks=0/0/0. compliance.jsonl has zero inbound from sender during this window. |
| 01:41 | User sends a Telegram DM from a paired account. Message never reaches the gateway. Telegram client shows it as delivered to the bot. |
| 01:45 | openclaw gateway restart issued. |
| 01:45 → now | Polling resumes immediately. Inbound updates start flowing again, including the previously-missed 01:41 message via offset replay. |

The same pattern was observed on the macOS host on a separate occasion the same week.

Expected vs actual

Expected: With pollingStallThresholdMs=120000 (2 min default), if getUpdates stops completing for >2 min, the polling watchdog restarts the polling runner and recovers.

Actual: Polling stayed wedged for 39 minutes (~19× the threshold). No Polling stall detected log. No transport rebuild. No restart cycle.

Root cause hypothesis: || in detectStall allows unrelated API activity to suppress watchdog

In extensions/telegram/src/polling-liveness.ts (bundled at dist/monitor-polling.runtime-DjS2STzm.js lines 78–87):

detectStall(params) {
    const now = params.now ?? this.#now();
    const activeElapsed = this.#inFlightGetUpdates > 0 && this.#lastGetUpdatesStartedAt != null
        ? now - this.#lastGetUpdatesStartedAt : 0;
    const idleElapsed = this.#inFlightGetUpdates > 0
        ? 0
        : now - (this.#lastGetUpdatesFinishedAt ?? this.#lastGetUpdatesAt);
    const elapsed = this.#inFlightGetUpdates > 0 ? activeElapsed : idleElapsed;
    const apiElapsed = now - (this.#latestInFlightApiStartedAt == null
        ? this.#lastApiActivityAt
        : Math.max(this.#lastApiActivityAt, this.#latestInFlightApiStartedAt));
    if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;
    ...
}

Two issues compound here:

1. || instead of &&

The stall is suppressed if either getUpdates-elapsed or any-API-elapsed is below threshold. So any non-getUpdates API call (e.g. an outbound sendMessage from a cron, or a getMe health check) within the last 2 min keeps apiElapsed low and silences the watchdog even when getUpdates is fully dead.

This contradicts the watchdog's documented purpose: detect getUpdates liveness death and restart polling. Outbound-sender health does not imply inbound is working — they're independent code paths from Telegram's perspective.
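
To make the effect of the operator concrete, here is a small TypeScript sketch that tabulates three candidate gates over the four fresh/stale combinations. The names (`gates`, `stallTable`, `FRESH`, `STALE`) are illustrative and do not come from the codebase; only the `||` condition is taken from the excerpt above.

```typescript
const THRESHOLD_MS = 120_000;

// Each function returns true when the watchdog would report a stall.
const gates = {
  // Gate quoted above: no stall if either value is fresh.
  current: (g: number, a: number): boolean => !(g <= THRESHOLD_MS || a <= THRESHOLD_MS),
  // `&&` variant: no stall only when both values are fresh.
  andVariant: (g: number, a: number): boolean => !(g <= THRESHOLD_MS && a <= THRESHOLD_MS),
  // getUpdates-only variant: apiElapsed is not consulted at all.
  getUpdatesOnly: (g: number, _a: number): boolean => g > THRESHOLD_MS,
};

const FRESH = 30_000;
const STALE = 180_000;

const cases: Array<[string, number, number]> = [
  ["getUpdates fresh, API fresh", FRESH, FRESH],
  ["getUpdates fresh, API stale", FRESH, STALE],
  ["getUpdates stale, API fresh", STALE, FRESH], // the incident case
  ["getUpdates stale, API stale", STALE, STALE],
];

function stallTable() {
  return cases.map(([label, g, a]) => ({
    label,
    current: gates.current(g, a),
    andVariant: gates.andVariant(g, a),
    getUpdatesOnly: gates.getUpdatesOnly(g, a),
  }));
}

console.table(stallTable());
```

In the incident row the current gate reports no stall while both alternatives do. The two alternatives differ only in the "getUpdates fresh, API stale" row, where the `&&` variant would also restart; whether that state is reachable depends on tracker internals not shown here.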

2. noteGetUpdatesError and noteGetUpdatesSuccess both bump lastApiActivityAt

noteGetUpdatesSuccess(result, at) { ...; this.#lastApiActivityAt = at; ... }
noteGetUpdatesError(err, at)      { ...; this.#lastApiActivityAt = at; ... }

If getUpdates is in a tight retry-error loop (server returning 502s, transient socket close), every error bumps lastApiActivityAt, again silencing the watchdog. The watchdog should treat successful API activity as the only liveness signal, not error events.
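
A minimal sketch of that error-loop effect, assuming a stripped-down clock that tracks only `lastApiActivityAt`. `ApiActivityClock` and `runErrorLoop` are hypothetical names used for illustration; the flag simply switches between the behavior quoted above and the proposed one.

```typescript
class ApiActivityClock {
  lastApiActivityAt = 0;
  private readonly errorsCountAsActivity: boolean;

  constructor(errorsCountAsActivity: boolean) {
    this.errorsCountAsActivity = errorsCountAsActivity;
  }

  noteGetUpdatesSuccess(at: number): void {
    this.lastApiActivityAt = at;
  }

  noteGetUpdatesError(at: number): void {
    // Behavior quoted above when the flag is true; proposed behavior when false.
    if (this.errorsCountAsActivity) this.lastApiActivityAt = at;
  }

  apiElapsed(now: number): number {
    return now - this.lastApiActivityAt;
  }
}

// Simulate getUpdates failing every `retryEveryMs` for `durationMs`, then read apiElapsed.
function runErrorLoop(clock: ApiActivityClock, durationMs: number, retryEveryMs: number): number {
  for (let t = retryEveryMs; t <= durationMs; t += retryEveryMs) {
    clock.noteGetUpdatesError(t);
  }
  return clock.apiElapsed(durationMs);
}

const TEN_MINUTES_MS = 600_000;
console.log("errors count as activity:", runErrorLoop(new ApiActivityClock(true), TEN_MINUTES_MS, 5_000));
console.log("errors ignored:", runErrorLoop(new ApiActivityClock(false), TEN_MINUTES_MS, 5_000));
```

After ten minutes of failures every 5 seconds, the first clock still reports an elapsed time of 0 ms, well under a 120000 ms threshold, while the second reports the full 600000 ms.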

Why "no logs, no errors" in the incident

The absence of logs is explainable if getUpdates was in either of two states:

  • (a) hung in-flight on a TCP zombie socket — inFlightGetUpdates>0 but the request never returns. lastApiActivityAt is whatever the last completed activity was. After 2 min, apiElapsed > thresholdMs → stall should fire — unless any other internal bot.api.* call (including grammY's own internal getMe / setMyCommands / etc.) was made within the window, suppressing the check; or
  • (b) returning empty arrays via grammY's own internal long-poll loop after the runner's internal task got into a bad state where the per-cycle bot.api.config.use middleware no longer sees the calls. Then lastApiActivityAt stays at whatever it was when the bot was last "alive," and again any other API touch hides the stall.

Both are consistent with the observed evidence (no error logs, no recovery, immediate fix on restart).

Suggested fix

  1. Change || to && on the stall-detect gate, so the watchdog fires whenever getUpdates itself is stale, regardless of unrelated API activity:

    if (elapsed <= params.thresholdMs && apiElapsed <= params.thresholdMs) return null;

    Or — preferably — drop apiElapsed from the gate entirely. Inbound-update liveness is the only thing the watchdog is supposed to protect. Outbound API health is orthogonal.

  2. Stop bumping lastApiActivityAt from noteGetUpdatesError. Errors are not liveness; they are signals the runner is failing. Otherwise a tight error-retry loop perpetually suppresses the watchdog.

  3. Add a 30-minute "last-resort" tier independent of the per-cycle watchdog: if lastTransportActivityAt is older than TELEGRAM_POLLING_STALE_TRANSPORT_MS (30 min, the same threshold collectTelegramPollingRuntimeIssues already uses for status-issue surfacing), force a session-level restart via setMyCommands-style health probe + runner.stop(). The status-issue threshold and the watchdog disagree by 15× — nothing acts on the 30-min staleness today; it only surfaces in openclaw channels status.

  4. Emit a structured warning the first time the watchdog detects a stall, even if it ends up being a false alarm. Right now stall events are logged via opts.log to whatever stdout/stderr the gateway has — under Task Scheduler on Windows that's discarded by default, which contributed to debugging difficulty here.
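
One way to picture the two-tier idea from item 3 is a small decision function. This is a sketch under assumptions rather than a proposal for the actual implementation: `RecoveryAction`, `TierInput`, and `chooseRecoveryAction` are hypothetical, the constant values are the ones cited in this report, and checking the 30-minute tier first is an illustrative choice.

```typescript
// Values cited in the report; the names of the two types and the function are hypothetical.
const TELEGRAM_POLLING_STALE_TRANSPORT_MS = 30 * 60 * 1000; // 30 min status-issue threshold
const DEFAULT_POLLING_STALL_THRESHOLD_MS = 120_000; // default pollingStallThresholdMs

type RecoveryAction = "none" | "restart-polling-cycle" | "restart-session";

interface TierInput {
  now: number;
  getUpdatesElapsedMs: number; // staleness of inbound polling
  lastTransportActivityAt: number; // last time the transport showed any activity
}

function chooseRecoveryAction(
  input: TierInput,
  stallThresholdMs: number = DEFAULT_POLLING_STALL_THRESHOLD_MS,
): RecoveryAction {
  const transportElapsedMs = input.now - input.lastTransportActivityAt;
  // Last-resort tier: independent of the per-cycle watchdog.
  if (transportElapsedMs > TELEGRAM_POLLING_STALE_TRANSPORT_MS) return "restart-session";
  // Per-cycle tier: stale inbound polling restarts only the polling cycle.
  if (input.getUpdatesElapsedMs > stallThresholdMs) return "restart-polling-cycle";
  return "none";
}

// Example: the 39-minute incident would have hit the last-resort tier.
const incidentMs = 39 * 60 * 1000;
console.log(chooseRecoveryAction({ now: incidentMs, getUpdatesElapsedMs: incidentMs, lastTransportActivityAt: 0 }));
```

The point of the second tier is that some component acts on the 30-minute staleness rather than only surfacing it in status output.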

Repro outline

I don't have a deterministic in-vitro repro for the underlying network stall yet (it's a transient zombie-socket condition that surfaced naturally over multiple weeks). I can repro the watchdog suppression trivially by:

  1. Start gateway with Telegram polling enabled.
  2. While polling is running, drop the upstream TCP connection mid-getUpdates (e.g. block port 443 to api.telegram.org, or use pfctl/netsh to drop SYN-ACK).
  3. Concurrently fire any non-getUpdates API call every 60s — e.g. a cron that does bot.api.getMe() or sendMessage to a chat the network reaches. (You can simulate this with a separate process holding the same token, but in practice gateway-internal traffic alone will trigger it in many deployments.)
  4. Observe: the inbound poll is dead but pollingStallThresholdMs never fires, no transport rebuild, no [telegram] Polling stall detected (...) log.

A cleaner unit-test repro: in polling-liveness.test.ts, set inFlightGetUpdates=1, lastGetUpdatesStartedAt=now-180000, but call noteApiCallStarted() at now-30000. detectStall({ thresholdMs: 120000 }) returns null → bug. Should return a stall.
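
The unit-test scenario can be reproduced outside the codebase with a stand-in tracker. `MiniLivenessTracker` below is a hypothetical, reduced model written for this report: it copies only the gate quoted earlier and the two calls the scenario needs, so it demonstrates the logic rather than the real class or its test harness.

```typescript
// Hypothetical stand-in for the real tracker; mirrors only the quoted gate.
class MiniLivenessTracker {
  private inFlightGetUpdates = 0;
  private lastGetUpdatesStartedAt: number | null = null;
  private lastGetUpdatesFinishedAt = 0;
  private lastApiActivityAt = 0;
  private latestInFlightApiStartedAt: number | null = null;

  noteGetUpdatesStarted(at: number): void {
    this.inFlightGetUpdates += 1;
    this.lastGetUpdatesStartedAt = at;
  }

  // Records the start of a non-getUpdates API call that is still in flight.
  noteApiCallStarted(at: number): void {
    this.latestInFlightApiStartedAt = at;
  }

  detectStall(params: { now: number; thresholdMs: number }): string | null {
    const { now, thresholdMs } = params;
    const startedAt = this.lastGetUpdatesStartedAt;
    const elapsed = this.inFlightGetUpdates > 0
      ? (startedAt != null ? now - startedAt : 0)
      : now - this.lastGetUpdatesFinishedAt;
    const inFlightApiAt = this.latestInFlightApiStartedAt;
    const apiElapsed = now - (inFlightApiAt == null
      ? this.lastApiActivityAt
      : Math.max(this.lastApiActivityAt, inFlightApiAt));
    // Gate under test: the `||` lets a fresh apiElapsed hide a stale getUpdates.
    if (elapsed <= thresholdMs || apiElapsed <= thresholdMs) return null;
    return `stall: getUpdates elapsed ${elapsed}ms`;
  }
}

const now = 1_000_000;
const tracker = new MiniLivenessTracker();
tracker.noteGetUpdatesStarted(now - 180_000); // getUpdates in flight for 180s
tracker.noteApiCallStarted(now - 30_000); // unrelated call started 30s ago
const result = tracker.detectStall({ now, thresholdMs: 120_000 });
console.log(result === null ? "null (stall suppressed)" : result);
// prints "null (stall suppressed)"
```

A regression test for the fix would assert the opposite: with the same inputs, `detectStall` returns a stall.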

Cross-references / not duplicates

  • #50040 — "Telegram delivery reliability: polling stalls can lead to silent outbound message loss" — related class but inverse direction (outbound)
  • #71066 — "Telegram subsystem: getUpdates polling silently non-functional" — region-specific and log-spammy (sticky IPv4 fallback warnings every 7s); this report is the silent case with no logs
  • #75498 — "macOS 26.4.1 / 2026.4.29 Telegram Web UI-only replies with partial streaming, polling stall, and session modelOverride pollution" — co-mentions polling stall and the same modelOverrideSource:"auto" persistence we hit; we're filing the modelOverride half separately
  • #73602, #73323, #61616 — adjacent stall-class reports on WSL2 / Windows / cross-OS

What I'm asking for

  • Confirm or refute the || analysis in polling-liveness.ts
  • Decide whether the watchdog should care about apiElapsed at all; my read is no
  • Add a [telegram][diag] log line on stall detection even if no restart fires (currently [telegram][diag] polling cycle finished/error reason=... only logs on cycle exit)

Happy to test patches against the affected hosts.
