openclaw - 💡(How to fix) Fix Telegram polling watchdog livelock: rebuild loop fires before runner first getUpdates [3 comments, 3 participants]

openclaw/openclaw#74344 · Fetched 2026-04-30 06:25:17

On a fleet of 18 OpenClaw v2026.4.15 agents (3 pods × 6 each, Docker on macOS, Telegram polling enabled), a brief upstream Telegram blip on 2026-04-27 02:50 UTC triggered a self-sustaining "polling stall storm" lasting ~3h 13m and affecting every agent. The polling watchdog detected stalls every ~90s and rebuilt the polling runner before the new runner could complete its first getUpdates, putting each agent into a rebuild loop.

The watchdog constants are compiled into the runtime as immutable literals — there is no process.env override, no channels.telegram.config.polling.stallThresholdMs field, and no other documented escape hatch. Once the loop starts, the only intervention is a full container kill.

This is the same symptom class as #43233 / #54513 / #41704 / #49461 / #56061, plus the additional concern that the watchdog itself is the trigger when downstream startup work takes longer than 90s (which is exactly the pattern in the recently-closed #73645 — sidecars.channels stalling 145–625s on startup).

Bug type

Bug + small enhancement request

Steps to reproduce

  1. Run OpenClaw v2026.4.15 with a Telegram channel configured.
  2. Trigger any condition that delays the first getUpdates past 90 seconds — examples observed in the wild:
    • Slow startup (#73645 — sidecars stall 145–625s on cold start).
    • Brief Telegram API blip / proxy TCP drop / NAT timeout during boot.
    • Heavy concurrent agent load on a host (e.g. 6 agents booting simultaneously).
  3. Observe [telegram] Polling stall detected (no getUpdates for 90.01s); forcing restart.
  4. The watchdog rebuilds the polling runner. The new runner needs another full startup window to issue its first getUpdates.
  5. The 90s window expires before the new runner's first call lands. Watchdog rebuilds again. Loop.
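
The loop in steps 3 to 5 can be reproduced with a toy model. The sketch below is not OpenClaw's watchdog implementation. It assumes the watchdog ticks every POLL_WATCHDOG_INTERVAL_MS, measures the time since the current runner was built while no getUpdates has landed, and that a rebuild restarts the startup work from zero. `simulateWatchdog`, `startupMs`, and `maxMs` are hypothetical names.

```javascript
// Toy model of the rebuild loop; all timings in milliseconds.
const POLL_STALL_THRESHOLD_MS = 9e4;
const POLL_WATCHDOG_INTERVAL_MS = 3e4;

// Returns how many rebuilds happen before the first getUpdates lands,
// or Infinity if the runner never gets there within maxMs.
function simulateWatchdog(startupMs, thresholdMs = POLL_STALL_THRESHOLD_MS, maxMs = 36e5) {
  let runnerBuiltAt = 0; // when the current runner was (re)built
  let rebuilds = 0;
  for (let now = POLL_WATCHDOG_INTERVAL_MS; now <= maxMs; now += POLL_WATCHDOG_INTERVAL_MS) {
    // Assumption: the first getUpdates lands startupMs after the runner is built.
    if (now - runnerBuiltAt >= startupMs) return rebuilds;
    // No getUpdates yet: the watchdog reports a stall once the threshold is exceeded.
    if (now - runnerBuiltAt > thresholdMs) {
      rebuilds += 1;
      runnerBuiltAt = now; // the rebuild restarts startup work from zero
    }
  }
  return Infinity;
}

console.log(simulateWatchdog(60e3));       // 0: startup fits inside the window
console.log(simulateWatchdog(145e3));      // Infinity: every runner is rebuilt first
console.log(simulateWatchdog(145e3, 3e5)); // 0: a 300 s threshold lets startup finish
```

In this model, any startup time longer than the threshold plus one watchdog interval never recovers on its own, which matches the livelock described in step 5.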

Observed evidence (2026-04-27 02:50–06:03 UTC)

  • All 18 agents hit simultaneously, ~74 stall-and-rebuild cycles each across the 3h window.
  • No host-level network failure during the storm (Tailscale healthy throughout, direct curl to api.telegram.org and to xiaomimimo.com responded normally from each host).
  • Self-resolved at ~06:03 UTC without operator intervention.
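
As a rough cross-check on these figures, the storm window and the cycle count imply an average cycle length noticeably longer than the 90 s threshold itself. The sketch below only does that arithmetic. Attributing the difference to watchdog tick alignment, stop grace, and rebuild time is an assumption, not something the evidence above establishes.

```javascript
// Storm window from the report: 02:50 to 06:03 UTC on 2026-04-27.
const start = Date.UTC(2026, 3, 27, 2, 50); // months are 0-based
const end = Date.UTC(2026, 3, 27, 6, 3);
const cyclesPerAgent = 74; // approximate, from the report

const windowSeconds = (end - start) / 1000;
const secondsPerCycle = windowSeconds / cyclesPerAgent;

console.log(windowSeconds);              // 11580 (3 h 13 min)
console.log(secondsPerCycle.toFixed(1)); // 156.5
```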

Compiled constants in v2026.4.15

/app/dist/extensions/telegram/monitor-polling.runtime-eDxUeolT.js:

const POLL_STALL_THRESHOLD_MS = 9e4;   // 90 000 ms
const POLL_WATCHDOG_INTERVAL_MS = 3e4;  // 30 000 ms
const POLL_STOP_GRACE_MS = 15e3;        // 15 000 ms

Running grep -c process.env against this file returns 0, and no camelCase pollingStallThresholdMs config field exists. The literal 90000 does not appear because the build toolchain emitted the value as 9e4. Together with the absence of any process.env lookup, this indicates a pure constant with no runtime resolution path.
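
The same check can be scripted so that it is easy to repeat against a newer build. The sketch below is a hypothetical helper, not an OpenClaw tool: `inspectWatchdogSource` is an invented name, and it only looks for constants named like the three quoted above and for any `process.env` reference in the text it is given.

```javascript
// Inspect bundled source text for watchdog constants and env lookups.
function inspectWatchdogSource(source) {
  const constants = {};
  const pattern = /const\s+(POLL_[A-Z_]+_MS)\s*=\s*([^;]+);/g;
  for (const match of source.matchAll(pattern)) {
    const [, name, expr] = match;
    const trimmed = expr.trim();
    // A plain numeric literal (e.g. 9e4 or 90000) has no runtime resolution.
    const isLiteral = /^\d+(\.\d+)?(e\d+)?$/i.test(trimmed);
    constants[name] = { expr: trimmed, isLiteral, value: isLiteral ? Number(trimmed) : null };
  }
  const envReferences = (source.match(/process\.env/g) || []).length;
  return { constants, envReferences };
}

const sample = `
const POLL_STALL_THRESHOLD_MS = 9e4;
const POLL_WATCHDOG_INTERVAL_MS = 3e4;
const POLL_STOP_GRACE_MS = 15e3;
`;
const report = inspectWatchdogSource(sample);
console.log(report.envReferences);                           // 0
console.log(report.constants.POLL_STALL_THRESHOLD_MS.value); // 90000
```

Against a real build, the `source` argument would be the contents of the bundled file, for example read with `fs.readFileSync(path, "utf8")`.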

Expected behavior

Either of these would prevent the storm class:

  1. Env-var override: const POLL_STALL_THRESHOLD_MS = Number(process.env.POLL_STALL_THRESHOLD_MS) || 9e4; lets operators raise the threshold during known slow-startup conditions or post-incident.
  2. Config-schema field: channels.telegram.config.polling.stallThresholdMs (with the same fallback) lets agents declare a higher threshold per-instance.

Either is a small change. The 90s default is reasonable for steady-state operation but too aggressive against the 145–625s sidecars startup window observed in #73645.
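
A slightly more defensive version of the two options could share one resolver. This is a minimal sketch under stated assumptions: `resolveStallThresholdMs` and `parsePositiveMs` are hypothetical helpers, the config shape mirrors the field path proposed above (which does not exist in v2026.4.15), and the precedence of env var over config over default is only one possible choice.

```javascript
const DEFAULT_POLL_STALL_THRESHOLD_MS = 9e4;

// Accept only finite, positive numbers; anything else is ignored.
function parsePositiveMs(raw) {
  if (raw === undefined || raw === null || raw === "") return undefined;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : undefined;
}

// Precedence (an assumption): env var, then config field, then the default.
function resolveStallThresholdMs(env = process.env, config = {}) {
  const fromEnv = parsePositiveMs(env.POLL_STALL_THRESHOLD_MS);
  const fromConfig = parsePositiveMs(
    config?.channels?.telegram?.config?.polling?.stallThresholdMs
  );
  return fromEnv ?? fromConfig ?? DEFAULT_POLL_STALL_THRESHOLD_MS;
}

console.log(resolveStallThresholdMs({}, {}));                                // 90000
console.log(resolveStallThresholdMs({ POLL_STALL_THRESHOLD_MS: "700000" })); // 700000
console.log(resolveStallThresholdMs({ POLL_STALL_THRESHOLD_MS: "-5" }));     // 90000
```

The explicit positivity check matters because the one-line `Number(...) || 9e4` form accepts a negative value such as "-5", which would make every elapsed time exceed the threshold.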

OpenClaw version

openclaw-fleet:patched (a local pin of v2026.4.15, built 2026-04-16). I checked v2026.4.26's release notes and don't see a polling-watchdog change there, so the constants are presumably still hardcoded. (Happy to verify this on the latest tag if the maintainers confirm a different file path.)

Operating system

macOS (Apple Silicon, Docker Desktop). Symptom class is OS-agnostic per #43233 comment thread (reproduced on macOS, Linux, Windows+WSL2).

Install method

Docker image, custom build pinned to v2026.4.15.

Cross-references

  • #43233: original "Polling stall detected ... forcing restart" thread (CLOSED). Confirms the symptom is the same class.
  • #54513: "Telegram polling has no stall detection (unlike Slack health-monitor)" (CLOSED). The original request that motivated the watchdog, and the context for its introduction.
  • #59332: "Feature: Telegram polling watchdog — auto-restart connection on stall without killing gateway" (CLOSED). The watchdog itself.
  • #73645: sidecars.channels stalls 145–625s on startup (CLOSED 2026-04-28). The proximate trigger when startup delays exceed the watchdog's 90s threshold.
  • #41704 / #49461 / #56061 / #56065: adjacent symptom classes (proxy TCP drops, NAT timeouts, dead sockets, lost messages).

Workaround

For our fleet, the operational workaround is to avoid simultaneous restarts — if all 6 agents on a host are restarted at once, the host CPU/IO contention can push individual startup times into the watchdog window. We've implemented a Phase 3.2 / 3.3 alert pair (Loki + Grafana) so the next storm is detected within ~10 minutes instead of ~3 hours. But there's no in-OpenClaw mitigation available against the current build.
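
The staggering itself can be as simple as restarting agents one at a time with a pause between them. The sketch below assumes a hypothetical `restartAgent(name)` function supplied by the operator's own tooling (for example a wrapper around the container runtime). Neither that function nor the agent names or gap value come from OpenClaw.

```javascript
// Restart agents sequentially, waiting gapMs between restarts so that
// startup CPU/IO load from one agent does not overlap with the next.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function staggeredRestart(agents, restartAgent, gapMs, wait = sleep) {
  const order = [];
  for (let i = 0; i < agents.length; i += 1) {
    await restartAgent(agents[i]);
    order.push(agents[i]);
    // No need to wait after the last agent.
    if (i < agents.length - 1) await wait(gapMs);
  }
  return order;
}

// Usage with a stand-in restart function (hypothetical agent names).
const demo = staggeredRestart(
  ["agent-1", "agent-2", "agent-3"],
  async (name) => console.log(`restarting ${name}`),
  10 // ms for the demo; a real gap would be closer to a full startup window
);
```

The `wait` parameter is injectable so the sequencing can be tested without real delays.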

What I'm requesting

  • Confirmation of whether either of the two override paths above (env var or config field) is acceptable. I'm happy to send a PR if either is welcome.
  • If neither is welcome, an alternative: please document in the OpenClaw config schema docs that the threshold is intentionally non-overridable, so future operators don't waste investigation time looking for a setting that doesn't exist.

extent analysis

TL;DR

The most likely fix is to introduce an env-var override or a config-schema field to allow operators to adjust the polling stall threshold.

Guidance

  • Introduce an env-var override by initializing the POLL_STALL_THRESHOLD_MS constant from process.env.POLL_STALL_THRESHOLD_MS, falling back to 9e4 when the variable is unset, zero, or not a number.
  • Add a config-schema field channels.telegram.config.polling.stallThresholdMs to allow per-instance configuration of the threshold.
  • Verify the change by testing with a higher threshold value and observing whether the polling watchdog still rebuilds the runner before its first getUpdates.
  • Until a fix ships, avoid restarting multiple agents on the same host simultaneously, since CPU/IO contention can push startup times into the watchdog window.

Example

const POLL_STALL_THRESHOLD_MS = Number(process.env.POLL_STALL_THRESHOLD_MS) || 9e4;

This code snippet demonstrates how to introduce an env-var override for the polling stall threshold.

Notes

Raising the threshold also delays detection of genuine polling stalls, so test the chosen value against both slow-startup and real-stall conditions before deploying it to production.

Recommendation

Introduce an env-var override or a config-schema field so that operators can adjust the polling stall threshold. Either change gives operators a way to break the rebuild loop without significant modifications to the existing codebase.
