openclaw - ✅(Solved) Fix Telegram polling silently wedges after stall — transport rebuild never starts new polling cycle (5.4 + 5.5) [1 pull requests, 1 comments, 2 participants]

openclaw · 2026-05-06 12:59:34


GitHub stats

openclaw/openclaw#78473•Fetched 2026-05-07 03:36:32


Author

Rigbay

Participants

clawsweeper[bot]

Rigbay

Timeline (top)

commented ×1, cross-referenced ×1

Error Message

…then silence. No new polling cycle starts and no error is logged. #runPollingCycle() either never re-enters or hangs in a state that doesn't surface diagnostics. Suggested fix (item 2 in the issue): add error/timeout handling in the transport-rebuild path so silent failures surface as logs.

Fix Action

Workaround

Wait for self-recovery, or run openclaw update --tag <new-version> to replace the npm package and force a fresh JS file load.

PR fix notes

PR #78646: fix(telegram): keep polling watchdog on getUpdates liveness

Repository: openclaw/openclaw
Author: ai-hpc
State: closed | merged: True
Link: https://github.com/openclaw/openclaw/pull/78646

Description (problem / solution / changelog)

Summary

Problem: Telegram polling stall recovery treated unrelated outbound Bot API activity as liveness for inbound getUpdates polling.
Why it matters: active sendMessage traffic could mask a wedged inbound polling loop, leaving Telegram replies silent until a manual restart.
What changed: make the stall watchdog depend on completed/stuck getUpdates liveness only, while keeping unrelated API elapsed time in diagnostics.
What did NOT change (scope boundary): this does not redesign Telegram transport rebuild behavior beyond ensuring the watchdog fires when inbound polling is stale.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #78422
Related #78473
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: TelegramPollingLivenessTracker.detectStall() returned no stall when either getUpdates elapsed time or generic Bot API elapsed time was still within the threshold.
Missing detection / guardrail: tests covered stale polling and stale unrelated API calls, but not the case where stale getUpdates coincides with recent or in-flight non-polling API traffic.
Contributing context (if known): outbound Telegram API success proves the Bot API path is alive, but it does not prove inbound long-polling is still progressing.
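
The root cause above can be illustrated with a minimal sketch of a liveness tracker whose stall decision depends only on getUpdates progress, while the generic API clock is kept for diagnostics. This is not the OpenClaw implementation: only detectStall, noteApiCallSuccess and thresholdMs appear in the source, and every other name here (PollingLivenessSketch, noteGetUpdatesFinished, StallReport) is a hypothetical stand-in.

```typescript
// Hypothetical sketch. Only detectStall, noteApiCallSuccess and thresholdMs
// are names from the issue; everything else is illustrative.
type StallReport = { message: string; apiElapsedMs: number };

class PollingLivenessSketch {
  private lastGetUpdatesDoneAt: number;
  private lastApiSuccessAt: number;

  constructor(now: number) {
    this.lastGetUpdatesDoneAt = now;
    this.lastApiSuccessAt = now;
  }

  // Assumed hook: called when a getUpdates request completes.
  noteGetUpdatesFinished(now: number): void {
    this.lastGetUpdatesDoneAt = now;
  }

  // Called on any successful Bot API call, including outbound sendMessage.
  noteApiCallSuccess(now: number): void {
    this.lastApiSuccessAt = now;
  }

  detectStall(params: { now: number; thresholdMs: number }): StallReport | null {
    const elapsed = params.now - this.lastGetUpdatesDoneAt;
    const apiElapsed = params.now - this.lastApiSuccessAt;
    // Only inbound polling liveness decides whether there is a stall.
    if (elapsed <= params.thresholdMs) return null;
    return {
      message: `Polling stall detected (no completed getUpdates for ${(elapsed / 1000).toFixed(2)}s)`,
      // Generic API liveness is reported for debugging, never used to suppress.
      apiElapsedMs: apiElapsed,
    };
  }
}
```

With this shape, a sendMessage success at 100s does not prevent a stall report at 130s when the threshold is 120s; the report simply carries apiElapsedMs=30000 as context.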

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: extensions/telegram/src/polling-liveness.test.ts, extensions/telegram/src/polling-session.test.ts
Scenario the test should lock in: stale getUpdates still triggers watchdog restart even when sendMessage recently succeeded or a non-getUpdates API call is in flight.
Why this is the smallest reliable guardrail: the regression is in the polling liveness decision and session watchdog behavior, so targeted tracker/session tests catch it without live Telegram credentials.
Existing test that already covers this (if any): existing stale polling tests covered the baseline restart path but not unrelated API masking.
If no new test is added, why not: N/A

User-visible / Behavior Changes

Telegram polling recovery now restarts stale inbound polling even if unrelated outbound Telegram API calls are active or recently succeeded.

Diagram (if applicable)

Before:
stale getUpdates + recent sendMessage -> watchdog suppressed -> inbound polling stays wedged

After:
stale getUpdates + recent sendMessage -> watchdog restart -> polling cycle recovers

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: Ubuntu 24.04.4 LTS
Runtime/container: Node 22 / pnpm
Model/provider: N/A
Integration/channel (if any): Telegram plugin polling watchdog
Relevant config (redacted): targeted regression tests do not require a live token; live proof used TELEGRAM_BOT_TOKEN env fallback from a local redacted token file

Steps

Create a stale getUpdates liveness state.
Record unrelated Telegram API activity such as sendMessage success or an in-flight non-getUpdates API call.
Fire the polling stall watchdog.

Expected

Watchdog reports a polling stall and restarts the polling cycle.

Actual

Before this fix, recent unrelated API activity suppressed the watchdog.
After this fix, stale getUpdates liveness controls the watchdog and restart proceeds.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Validation on the rebased branch:

pnpm exec oxfmt --check --threads=1 CHANGELOG.md extensions/telegram/src/polling-liveness.ts extensions/telegram/src/polling-liveness.test.ts extensions/telegram/src/polling-session.test.ts
All matched files use the correct format.

pnpm test extensions/telegram/src/polling-liveness.test.ts extensions/telegram/src/polling-session.test.ts -- --reporter=verbose
Test Files 2 passed (2)
Tests 23 passed (23)

Real behavior proof

Behavior or issue addressed: Telegram polling watchdog recovery should fire from stale getUpdates liveness even when unrelated outbound Bot API calls are active.
Real environment tested: Ubuntu 24.04.4 LTS, PR branch fix/telegram-polling-watchdog-getupdates, commit e301533582, Node v22.22.1, pnpm 10.33.2, real Telegram Bot API token from a local redacted token file, and a private DM chat with the bot.
Exact steps or command run after this patch: Called real Telegram Bot API getMe, read a recent private DM via getUpdates, sent a disabled-notification proof message with sendMessage, exercised the PR liveness code to verify stale getUpdates returns STALL after outbound Telegram activity, then ran an isolated source-mode Gateway on port 19986 across the watchdog window with TELEGRAM_BOT_TOKEN supplied via env fallback.
Evidence after fix: Copied live output from Ubuntu 24.04.4 LTS, token omitted:

telegram_getMe=ok botId=8656041674 username=set
telegram_recent_chat=found chatType=private updateId=198414331
telegram_sendMessage=ok source=updates chatId=6599824666 messageId=70
live_sendMessage_stale_getUpdates=STALL
live_liveness_message=Polling stall detected (active getUpdates stuck for 120s); forcing restart. [diag inFlight=1 outcome=started startedAt=0 finishedAt=n/a durationMs=n/a offset=123 apiElapsedMs=60001]

Telegram client also showed the real round trip:

[5/6/2026 3:30 PM] Crazy Cat: test
[5/6/2026 3:30 PM] Orinclaw Assistant: OpenClaw PR #78646 live watchdog proof 2026-05-06T22:30:53.827Z

Additional isolated Gateway live proof after the same patch:

branch=fix/telegram-polling-watchdog-getupdates
commit=e301533582
os=Ubuntu 24.04.4 LTS
mode=isolated source-mode Gateway, port 19986, real Telegram bot token from env fallback
telegram_provider_start=[default] starting provider (@orinclaw_ai_bot)
inbound_updates=real pending Telegram DM updates consumed by Gateway poller
window=2026-05-06T22:51:45+00:00..2026-05-06T22:55:47+00:00
samples=5
health=live on every sample
ready=true failing=[] on every sample
polling_stall_count=0
getupdates_conflict_count=0
telegram_provider_start_count=1
final_health={"ok":true,"status":"live"}
final_ready={"ready":true,"failing":[]}
shutdown=clean SIGINT after validation

Before-fix long-lived reproduction on parent commit d05415d603:

scenario=active getUpdates started at t=0, unrelated non-getUpdates API success every 30s, watchdog threshold=120000ms
sample_0 t=0s result=NO_STALL
sample_1 t=30s result=NO_STALL
sample_2 t=60s result=NO_STALL
sample_3 t=90s result=NO_STALL
sample_4 t=120s result=NO_STALL
sample_135s t=135s result=NO_STALL
final_expected=STALL
final_actual=NO_STALL
reproduced_bug=stale getUpdates exceeded threshold but watchdog stayed suppressed by unrelated API liveness
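
The before-fix log above can be replayed with plain arithmetic. The sketch below assumes the scenario exactly as stated (getUpdates started at t=0 and never completes, an unrelated API success every 30s, threshold 120000ms) and evaluates both the pre-fix condition quoted in the issue and a polling-only condition. The function and variable names are illustrative, not taken from the repository.

```typescript
const thresholdMs = 120_000;
const sampleTimesS = [0, 30, 60, 90, 120, 135];

// Pre-fix condition as quoted in the issue: either clock being fresh suppresses the stall.
const oldNoStall = (elapsed: number, apiElapsed: number): boolean =>
  elapsed <= thresholdMs || apiElapsed <= thresholdMs;

// Polling-only condition, matching the behaviour the PR describes.
const newNoStall = (elapsed: number): boolean => elapsed <= thresholdMs;

function replay(): { t: number; before: string; after: string }[] {
  return sampleTimesS.map((t) => {
    const elapsed = t * 1000; // getUpdates has been in flight since t=0
    const apiElapsed = (t % 30) * 1000; // time since the last 30s-interval API success
    return {
      t,
      before: oldNoStall(elapsed, apiElapsed) ? "NO_STALL" : "STALL",
      after: newNoStall(elapsed) ? "NO_STALL" : "STALL",
    };
  });
}

for (const row of replay()) {
  console.log(`t=${row.t}s before=${row.before} after=${row.after}`);
}
```

At t=135s the getUpdates clock is 135000ms (past the threshold) while the API clock is only 15000ms, so the old condition still reports NO_STALL and the polling-only condition reports STALL.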

Observed result after fix: The bot successfully handled real getMe, getUpdates, and sendMessage; the watchdog returned STALL after real outbound Telegram activity; the isolated Gateway stayed live/ready across the watchdog window with one Telegram provider start, zero false Polling stall detected logs, and zero getUpdates conflict logs.
What was not tested: I did not run a multi-hour production soak or model-response verification in the isolated test home. The isolated home intentionally had no OpenAI auth, so agent replies failed after Telegram polling consumed inbound DM updates; that auth failure is separate from Telegram polling liveness.

Human Verification (required)

Verified scenarios: stale getUpdates with recent non-polling API success, stale getUpdates with recent in-flight non-polling API activity, stale getUpdates with newer in-flight non-polling activity, existing stale polling restart paths, pre-fix long-lived suppression reproduction, real Telegram Bot API getMe/getUpdates/sendMessage, and an isolated live Gateway Telegram polling run across the watchdog window.
Edge cases checked: diagnostic output keeps apiElapsedMs for debugging while not using generic API liveness to suppress stale polling recovery.
What was not tested: I did not run a multi-hour production soak or model-response verification in the isolated test home; live Telegram polling startup, real update consumption, and watchdog-window stability were verified with a real token.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: Telegram polling may restart while outbound API traffic is healthy.
- Mitigation: this is intentional; outbound API health is not inbound getUpdates health, and the watchdog threshold/throttling still bounds restarts.

Changed files

CHANGELOG.md (modified, +1/-0)
extensions/telegram/src/polling-liveness.test.ts (modified, +10/-6)
extensions/telegram/src/polling-liveness.ts (modified, +31/-20)
extensions/telegram/src/polling-session.test.ts (modified, +19/-22)

Code Example

if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;

---

[telegram] Polling stall detected (no completed getUpdates for 149.99s); forcing restart.
[telegram] Polling runner stop timed out after 15s; forcing restart cycle.
[telegram][diag] polling cycle finished reason=polling stall detected
[telegram] Telegram polling runner stopped (...); restarting in 2.22s.
[telegram][diag] rebuilding transport for next polling cycle


Two related bugs in dist/monitor-polling.runtime-*.js reproduced in 2026.5.4 and 2026.5.5.

Symptom

Gateway running, telegram channel reports running, connected, mode:polling, works via openclaw channels status --probe
ZERO TCP from gateway PID to 149.154.x or 91.108.x (Telegram backbone)
pending_update_count > 0 at telegram side, growing over time
No getUpdates / polling log entries for hours
Outbound sendMessage works fine (state-drift: gateway reports healthy while inbound is dead)
Multiple gateway restarts (systemctl --user restart openclaw-gateway) re-enter the same wedged state
Self-recovery eventually (~75 min in one case, indeterminate in another) — mechanism unclear; possibly when the npm package is replaced (e.g. openclaw update)
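
One way to confirm the "pending_update_count > 0 and growing" symptom from outside the gateway is to sample the Bot API's getWebhookInfo method, which reports pending_update_count. The sketch below is an assumption about how one might script that check, not something the issue author describes running; fetchPendingUpdateCount and backlogIsGrowing are hypothetical helper names, and the global fetch requires Node 18 or later.

```typescript
type WebhookInfoResponse = {
  ok: boolean;
  result?: { pending_update_count: number };
};

// Reads the current backlog size from the Telegram Bot API.
async function fetchPendingUpdateCount(token: string): Promise<number> {
  const res = await fetch(`https://api.telegram.org/bot${token}/getWebhookInfo`);
  const body = (await res.json()) as WebhookInfoResponse;
  if (!body.ok || !body.result) throw new Error("getWebhookInfo failed");
  return body.result.pending_update_count;
}

// True when the samples never decrease and the last one exceeds the first,
// i.e. nothing appears to be consuming updates.
function backlogIsGrowing(samples: readonly number[]): boolean {
  let first: number | undefined;
  let previous: number | undefined;
  for (const count of samples) {
    if (previous !== undefined && count < previous) return false; // backlog drained
    if (first === undefined) first = count;
    previous = count;
  }
  return first !== undefined && previous !== undefined && previous > first;
}
```

Sampling fetchPendingUpdateCount a few times, some minutes apart, and passing the results to backlogIsGrowing gives a rough signal that inbound polling is dead even while the gateway reports healthy.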

Bug 1 — masked stall detection

File: dist/monitor-polling.runtime-DjS2STzm.js (5.4) / monitor-polling.runtime-DBv9gGnS.js (5.5)

Line 84:

if (elapsed <= params.thresholdMs || apiElapsed <= params.thresholdMs) return null;

apiElapsed is updated by noteApiCallSuccess() on ANY successful API call (including outbound sendMessage). Result: stall-detection is suppressed during normal outbound activity, even when getUpdates has hung indefinitely. Should likely be && or just if (elapsed <= params.thresholdMs) return null; — polling-elapsed alone determines the polling stall.
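
To make the difference between the three candidate conditions concrete, the sketch below evaluates each of them over the four combinations of fresh and stale clocks. It is pure boolean logic built from the quoted line; the case names and the 5000ms/150000ms values are arbitrary examples, and the 120000ms threshold is borrowed from the PR's reproduction.

```typescript
type Clocks = { elapsed: number; apiElapsed: number };
const T = 120_000; // thresholdMs

// Each function answers "is the stall suppressed?" (the `return null` branch).
const suppressors = {
  or: (c: Clocks) => c.elapsed <= T || c.apiElapsed <= T, // shipped in 5.4/5.5
  and: (c: Clocks) => c.elapsed <= T && c.apiElapsed <= T, // first suggestion
  pollingOnly: (c: Clocks) => c.elapsed <= T, // second suggestion
};

const cases: Record<string, Clocks> = {
  bothFresh: { elapsed: 5_000, apiElapsed: 5_000 },
  pollingStaleApiFresh: { elapsed: 150_000, apiElapsed: 5_000 },
  pollingFreshApiStale: { elapsed: 5_000, apiElapsed: 150_000 },
  bothStale: { elapsed: 150_000, apiElapsed: 150_000 },
};

type Row = { or: boolean; and: boolean; pollingOnly: boolean };

// Returns, per case, whether each variant would report a stall.
function stallTable(): Record<string, Row> {
  const table: Record<string, Row> = {};
  for (const [name, c] of Object.entries(cases)) {
    table[name] = {
      or: !suppressors.or(c),
      and: !suppressors.and(c),
      pollingOnly: !suppressors.pollingOnly(c),
    };
  }
  return table;
}

console.log(stallTable());
```

The `||` form reports a stall only when both clocks are stale, which is the masking described above. The `&&` and polling-only forms both catch the pollingStaleApiFresh case; they differ only on pollingFreshApiStale, where `&&` would report a stall and polling-only would not. Whether that case can occur in practice depends on whether a completed getUpdates also refreshes apiElapsed, which the issue does not state.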

Bug 2 — transport-rebuild silent failure

When stall IS detected (e.g. before any outbound activity occurs), the recovery sequence logs:

[telegram] Polling stall detected (no completed getUpdates for 149.99s); forcing restart.
[telegram] Polling runner stop timed out after 15s; forcing restart cycle.
[telegram][diag] polling cycle finished reason=polling stall detected
[telegram] Telegram polling runner stopped (...); restarting in 2.22s.
[telegram][diag] rebuilding transport for next polling cycle

…then silence. No new polling cycle starts, no error logged. #runPollingCycle() either never re-enters or hangs in a state that doesn't surface diagnostics.

Cost / impact

Inbound messaging is completely down for 1–3 hours per occurrence. Two occurrences in a single day (2026-05-06).

Trigger

Both occurrences followed an external disruption (network blip from Docker WSL toggle reset; auth-profile failure from Anthropic billing exhaustion). The disruption is recoverable in itself; the polling-restart code path doesn't survive it.

Workaround

Wait for self-recovery, or run openclaw update --tag <new-version> to replace the npm package and force a fresh JS file load.

Suggested fix

Drop the apiElapsed check in detectStall — or use && — so stall-detection isn't masked by outbound activity.
Add error/timeout handling in the transport-rebuild path so silent failures surface as logs.
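
The second suggestion could look roughly like the sketch below: wrap the rebuild step in a timeout so that a hang or rejection always produces a log line instead of silence. This is a generic pattern, not the project's code. withTimeout, its parameters, and the log line format are all hypothetical, and how the real transport-rebuild path is structured is not shown in the issue.

```typescript
// Hypothetical helper: races `work` against a timer and logs any failure.
async function withTimeout<T>(
  label: string,
  ms: number,
  work: () => Promise<T>,
  log: (line: string) => void,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins; a hung `work` loses to the timer.
    return await Promise.race([work(), timeout]);
  } catch (err) {
    // Surface the failure so the caller's silence is never the only signal.
    log(`[telegram] ${label} failed: ${err instanceof Error ? err.message : String(err)}`);
    throw err;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A caller would wrap the rebuild step, for example withTimeout("transport rebuild", 30_000, rebuildTransport, console.error), where rebuildTransport stands in for whatever function performs the rebuild, and then decide whether to retry or escalate when it throws. Note that the timeout only stops waiting; it does not cancel the underlying work.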

Versions affected

openclaw 2026.5.4
openclaw 2026.5.5

Environment

Node v24.13.0 (nvm), Ubuntu (WSL2 on Windows 11)
Gateway managed by systemd-user

