openclaw - ✅(Solved) Fix Telegram supervisor leaves channel in stopped, disconnected state after stop()-timeout during polling-stall recovery [1 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#75519 (fetched 2026-05-02 05:33:34)
Comments: 1
Participants: 2
Timeline: 3 (closed ×1, commented ×1, cross-referenced ×1)
Reactions: 2

PR fix notes

PR #72912: Recover channel restarts when old lifecycles wedge

Description (problem / solution / changelog)

The gateway health monitor already notices unhealthy channel lifecycles and tries to restart them, but the restart path had a blind spot: if the old channel task ignored abort and failed to settle within the stop grace window, stopChannel logged the timeout and left the stale task registered. The immediately-following startChannel then saw the existing task and no-oped, so the monitor could say it was restarting without actually creating a fresh Discord or Telegram lifecycle.

This adds an explicit forced-retirement path for health-monitor recovery. Normal manual stops keep the existing conservative behavior. Health-monitor restarts can now retire the wedged lifecycle from manager bookkeeping, clear stale busy/connected state, and start a new channel account. Runtime status updates, catch handlers, cleanup handlers, and auto-restart logic are now lifecycle-gated so a late callback from the retired task cannot mark the fresh lifecycle stopped or restart behind its back.
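
The forced-retirement and lifecycle-gating ideas described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ChannelManager`, `Lifecycle`, `forceRetire`, and `stopGraceMs` are hypothetical names, and the real `stopChannel`/`startChannel` signatures are not shown in this report.

```typescript
// Hypothetical sketch: none of these names are taken from the openclaw source.
type Status = "running" | "stopped";

interface Lifecycle {
  id: number;
  abort: AbortController;
  settled: Promise<void>;
}

class ChannelManager {
  status: Status = "stopped";
  private current: Lifecycle | null = null;
  private nextId = 1;
  private run: (signal: AbortSignal) => Promise<void>;
  private stopGraceMs: number;

  constructor(run: (signal: AbortSignal) => Promise<void>, stopGraceMs: number) {
    this.run = run;
    this.stopGraceMs = stopGraceMs;
  }

  // Returns false when a task is still registered, i.e. the start is a no-op.
  start(): boolean {
    if (this.current) return false;
    const id = this.nextId++;
    const abort = new AbortController();
    const settled = this.run(abort.signal)
      .catch(() => undefined)
      .then(() => {
        // Lifecycle gate: a late callback from a retired task must not
        // touch the state of whichever lifecycle is current now.
        if (this.current?.id !== id) return;
        this.current = null;
        this.status = "stopped";
      });
    this.current = { id, abort, settled };
    this.status = "running";
    return true;
  }

  // Returns true when the old lifecycle is no longer registered.
  async stop(opts: { forceRetire: boolean }): Promise<boolean> {
    const lifecycle = this.current;
    if (!lifecycle) return true;
    lifecycle.abort.abort();
    let timer: ReturnType<typeof setTimeout> | undefined;
    const grace = new Promise<"timeout">((resolve) => {
      timer = setTimeout(() => resolve("timeout"), this.stopGraceMs);
    });
    const outcome = await Promise.race([
      lifecycle.settled.then(() => "settled" as const),
      grace,
    ]);
    clearTimeout(timer);
    if (outcome === "settled") return true;
    this.status = "stopped";
    if (opts.forceRetire && this.current?.id === lifecycle.id) {
      this.current = null; // retire the wedged task from bookkeeping
      return true;
    }
    return false; // conservative path: the stale task stays registered
  }
}
```

In this sketch, a stop without `forceRetire` that times out leaves the stale task registered, so the following `start()` returns `false`; that is the blind spot the PR describes. With `forceRetire: true`, the next `start()` creates a fresh lifecycle, and a late settle from the old task is ignored by the id check.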

Changed files

  • src/gateway/channel-health-monitor.test.ts (modified, +20/-4)
  • src/gateway/channel-health-monitor.ts (modified, +3/-1)
  • src/gateway/server-channels.approval-bootstrap.test.ts (modified, +45/-0)
  • src/gateway/server-channels.test.ts (modified, +79/-0)
  • src/gateway/server-channels.ts (modified, +94/-11)
  • src/infra/channel-runtime-context.test.ts (modified, +23/-0)
  • src/infra/channel-runtime-context.ts (modified, +11/-3)
  • src/infra/exec-approval-channel-runtime.test.ts (modified, +29/-0)
  • src/infra/exec-approval-channel-runtime.ts (modified, +35/-7)


Bug: Telegram channel stays in stopped, disconnected after stop()-timeout during polling-stall recovery

Repo: https://github.com/openclaw/openclaw/issues
openclaw version: 2026.4.26 (be8c246)
Node: v25.9.0
OS: macOS (darwin 24.6.0)
Channel: Telegram (default account, polling mode)

Summary

When the Telegram polling runner detects a stall (Polling stall detected (no completed getUpdates for >120s); forcing restart), the supervisor calls stop() on the channel. If the in-flight getUpdates socket is wedged and stop() exceeds its timeout (channel stop exceeded 5000ms after abort; continuing shutdown), the channel is left in stopped, disconnected state and never re-start()ed. The gateway process stays alive, so launchd's KeepAlive does not help. Recovery requires a full gateway restart.

This makes the bot silently unresponsive — Telegram users see messages delivered with checkmarks but the bot never receives them. openclaw channels status reports the channel as stopped, disconnected indefinitely.

Reproduction (observed)

Real-world reproducer was a flaky network path to api.telegram.org over IPv6. The exact log sequence (multiple occurrences in a single day):

2026-04-29T16:55:12.189+04:00 [telegram] Polling stall detected (no completed getUpdates for 146.15s); forcing restart. [diag inFlight=0 outcome=error startedAt=… durationMs=15014 offset=509561555 error=Network request for 'getUpdates' failed!]
2026-04-29T16:55:27.224+04:00 [telegram] Polling runner stop timed out after 15s; forcing restart cycle.
2026-04-29T16:55:27.229+04:00 [telegram] [diag] polling cycle finished reason=polling stall detected … error=Network request for 'getUpdates' failed!
2026-04-29T16:55:27.233+04:00 [telegram] polling runner stopped (polling stall detected); restarting in 2.1s.
2026-04-29T17:18:11.881+04:00 [health-monitor] [telegram:default] health-monitor: restarting (reason: stale-socket)
2026-04-29T17:18:16.917+04:00 [telegram] [default] channel stop exceeded 5000ms after abort; continuing shutdown
[no subsequent "starting provider" log entry — channel rotted from this point]

After the "channel stop exceeded 5000ms after abort" line, the expected [telegram] [default] starting provider entry never appears. The channel sits in stopped, disconnected. Subsequent health-monitor retries hit the same stop() timeout and make no progress.

openclaw channels status after the wedge:

Gateway reachable.
- Telegram default: enabled, configured, stopped, disconnected, in:6h ago, out:4h ago, mode:polling, token:config, error:channel stop timed out after 5000ms

Across an 8-day baseline (Apr 20–27), the same gateway logged 426–432 polling stalls per day — most recovered, but ~1–3 per day produced this stuck-stopped state.

Expected behavior

When stop() exceeds its timeout during a recovery cycle, the supervisor should still proceed to start() the channel — either:

  1. Force-terminate the wedged transport (close socket, abort fetch) and proceed to fresh start, or
  2. Treat the stop() timeout as terminal-state-reached and unconditionally invoke start() after the timeout fires.

Today the supervisor logs continuing shutdown and stops, leaving no path back to running.
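
The two options above can be combined in one recovery routine. The following is a minimal sketch under stated assumptions: the `Channel` interface, `settlesWithin`, and `restartChannel` are hypothetical names, the transport is assumed to be cancellable through an `AbortController`, and the 5000 ms default only mirrors the timeout seen in the logs.

```typescript
// Hypothetical sketch; these helpers are not part of openclaw.
interface Channel {
  stop(): Promise<void>;
  start(): Promise<void>;
}

// Resolves true if `p` settles (fulfils or rejects) within `ms`, else false.
async function settlesWithin(p: Promise<unknown>, ms: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  const settled = p.then(() => true, () => true);
  try {
    return await Promise.race([settled, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function restartChannel(
  channel: Channel,
  transport: AbortController,
  stopTimeoutMs = 5000,
): Promise<"clean" | "forced"> {
  const stopped = await settlesWithin(channel.stop(), stopTimeoutMs);
  if (!stopped) {
    // Option 1: force-terminate the wedged transport.
    transport.abort();
  }
  // Option 2: the stop() timeout is treated as terminal, so start() always runs.
  await channel.start();
  return stopped ? "clean" : "forced";
}
```

The point of the sketch is only that `start()` is reached on both branches; how the old transport is actually torn down depends on the real channel implementation.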

Workarounds tried (in order, with results)

  1. External watchdog (FAILED — restart-looped 519 times in 24h with no recovery). A 60-second launchd job that detected stopped, disconnected (and later running, disconnected) and ran launchctl kickstart -k. Logged 519 unhealthy events and 519 restarts in a single 24-hour window with only 2 hourly "ok" heartbeats — the channel never stayed in running, connected long enough to register a healthy hour. Restarts cannot fix this because each fresh polling connection wedges the same way.

  2. Force IPv4 to api.telegram.org (PARTIAL HELP). Setting channels.telegram.network.dnsResultOrder: "ipv4first" and channels.telegram.network.autoSelectFamily: false reduced raw stall counts. Note: schema docs say ipv4first is the default on Node 22+, but on Node 25.9.0 we observed IPv6 connections being made until we set this explicitly — possibly a separate bug.

  3. Webhook mode (FIXED — permanent). Migrating from mode:polling to mode:webhook (via webhookUrl + webhookSecret config and a Cloudflare quick tunnel for the public HTTPS URL) eliminated the entire failure class. Telegram now POSTs to us over short-lived HTTPS connections. No more polling stalls, no more wedged sockets, no more channel-stop timeouts. Channel status: running, mode:webhook, token:config. Stable.
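
For reference, the settings named in workarounds 2 and 3 might nest as shown below. This is an illustration only: the `channels.telegram.network.*` paths come from the report, but the on-disk config format, the exact placement of `webhookUrl` and `webhookSecret`, and both placeholder values are assumptions.

```typescript
// Illustrative shape only; not copied from an openclaw config file.
const telegramOverrides = {
  channels: {
    telegram: {
      network: {
        dnsResultOrder: "ipv4first", // workaround 2: prefer IPv4 results
        autoSelectFamily: false, // workaround 2: no automatic address-family selection
      },
      // Workaround 3: assumed placement of the webhook keys.
      webhookUrl: "https://example.invalid/telegram/webhook", // placeholder
      webhookSecret: "replace-with-a-long-random-secret", // placeholder
    },
  },
};
```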

What the symptom actually was

The deeper finding from migrating off polling: the underlying problem was network-layer silent termination of long-lived TCP connections between this Mac (residential ISP, behind NAT) and api.telegram.org. Short HTTPS requests succeeded ~9/10 with sub-second latency, but getUpdates long-polls would sit open for 150–930 seconds without data, FIN, or RST. Possible upstream causes include NAT idle pruning, ISP-level connection state cleanup, or a packet inspector. openclaw's polling runner couldn't recover because:

  • Stop-timeout path leaves the channel in dead state (the bug above)
  • Even if it recovered cleanly, the next polling connection would wedge the same way
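
One general way to keep a silently dropped long-poll from holding the caller indefinitely is to race each request against a client-side deadline. The sketch below is not how openclaw's polling runner works; `pollWithDeadline` and `FetchLike` are hypothetical names, and the fetch function is injected so the example does not depend on a network. In practice the deadline would need to be longer than the long-poll `timeout` passed to getUpdates.

```typescript
// Hypothetical sketch of a deadline-bounded long poll.
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ ok: boolean }>;

type PollOutcome = "ok" | "error" | "deadline";

async function pollWithDeadline(
  fetchFn: FetchLike,
  url: string,
  deadlineMs: number,
): Promise<PollOutcome> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // The deadline resolves on its own, so the caller regains control even
  // if the underlying request never reacts to the abort signal.
  const deadline = new Promise<PollOutcome>((resolve) => {
    timer = setTimeout(() => {
      resolve("deadline");
      controller.abort();
    }, deadlineMs);
  });

  const request: Promise<PollOutcome> = fetchFn(url, { signal: controller.signal }).then(
    (res): PollOutcome => (res.ok ? "ok" : "error"),
    // A rejection caused by our own abort is reported as a deadline.
    (): PollOutcome => (controller.signal.aborted ? "deadline" : "error"),
  );

  try {
    return await Promise.race([request, deadline]);
  } finally {
    clearTimeout(timer);
  }
}
```

A deadline like this bounds how long one cycle can stall, but it does not address the second bullet above: on a path that keeps dropping idle connections, the next poll can wedge the same way.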

So while the supervisor bug is real and worth fixing, it's a soft symptom of an inherently-unreliable transport choice for residential networks. Webhook mode is the right answer for any user behind NAT on a typical home/office connection.

Suggested fix locations

  • The polling-runner shutdown path that emits polling runner stop timed out after 15s; forcing restart cycle. already promises a "restart cycle" — that promise isn't being kept when followed by channel stop exceeded 5000ms after abort. The two timeouts should not silently cancel the restart side of the cycle.
  • The health-monitor restarting (reason: stale-socket | disconnected) path should fall through to a fresh start() regardless of stop() outcome.

Related

  • Schema docs for TelegramNetworkConfig.dnsResultOrder say "Default: ipv4first on Node 22+ to avoid common fetch failures". We're on Node 25.9.0 and the default did not apply in practice — connections went IPv6 until explicitly configured. Possibly a separate bug worth tracking.

extent analysis

TL;DR

The most likely fix is to modify the polling-runner shutdown path to ensure a restart cycle is completed even if the stop() timeout is exceeded.

Guidance

  • Review the stop() timeout handling in the polling-runner shutdown path to ensure it does not silently cancel the restart cycle.
  • Consider modifying the health-monitor restarting path to fall through to a fresh start() regardless of stop() outcome.
  • Investigate the discrepancy between the documented default dnsResultOrder behavior and the observed behavior on Node 25.9.0.
  • As a workaround, consider migrating to webhook mode, which has been shown to eliminate the failure class.

Example

No code snippet is provided as the issue does not include specific code references.

Notes

The issue is complex and has multiple potential causes, including network-layer silent termination of long-lived TCP connections. The suggested fix locations are based on the provided information and may require further investigation to fully resolve the issue.

Recommendation

Apply a workaround by migrating to webhook mode, as it has been shown to eliminate the failure class and provide a more reliable transport choice for residential networks.
