openclaw: [Bug] Telegram polling stall recovery fails: watchdog tracks call initiation, not success (solved; 1 pull request, 1 participant)

GitHub stats
openclaw/openclaw#44595 · Fetched 2026-04-08 00:44:49
Comments: 0
Participants: 1
Timeline: 3 (referenced ×2, cross-referenced ×1)
Reactions: 0


Root Cause

In the polling cycle setup (around TelegramPollingSession#runPollingCycle), the API middleware updates lastGetUpdatesAt when getUpdates is called, not when it returns successfully:

// Current (broken): timestamps on initiation
bot.api.config.use((prev, method, payload, signal) => {
    if (method === "getUpdates") lastGetUpdatesAt = Date.now();
    return prev(method, payload, signal);
});

After 2 failed restart attempts, grammY's internal retry loop takes over. Each retry calls getUpdates (which fails), but the middleware updates the timestamp anyway. The watchdog sees fresh timestamps and never fires again.

Fix Action

Fixed

PR fix notes

PR #44601: fix(telegram): polling stall recovery fails when grammY retries mask the stall

Description (problem / solution / changelog)

Summary

Fixes #44595

The polling stall watchdog in TelegramPollingSession tracks getUpdates call initiation, not successful completion. When the watchdog triggers a restart but network recovery fails, grammY's internal retry mechanism continues making failed getUpdates calls at intervals shorter than the 90s stall threshold. Each failed attempt updates lastGetUpdatesAt through the API middleware, fooling the watchdog into thinking polling is healthy when it's actually broken.

This caused a 50-minute outage where the gateway process was alive (/health returning {"ok":true}) but Telegram was completely deaf — no messages received on any account.

Root Cause

// BEFORE: timestamps on initiation — broken retries mask the stall
bot.api.config.use((prev, method, payload, signal) => {
  if (method === "getUpdates") lastGetUpdatesAt = Date.now();
  return prev(method, payload, signal);
});

Timeline of the failure:

  1. Polling stall detected (92–103s silence) → watchdog fired correctly, restart triggered ✅
  2. Second stall (90s) → watchdog fired again, restart #2 triggered ✅
  3. After restart #2: grammY's retry interval < 90s → each failed getUpdates call updates timestamp → watchdog never fires again
  4. Gateway sat deaf for 50 minutes until manual restart
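
The masking effect in step 3 can be reproduced with a small simulation. This is a sketch under stated assumptions, not the OpenClaw code: the clock is simulated, every getUpdates call fails, retries are assumed to go out every 30 s (the source only says the interval is shorter than 90 s), and the name countWatchdogFirings is hypothetical.

```typescript
// Simulation only: a fake clock stands in for Date.now(), and every call fails.
const STALL_THRESHOLD_MS = 90_000; // stall threshold stated in the issue
const RETRY_INTERVAL_MS = 30_000;  // assumed retry spacing, shorter than the threshold

type Tracking = "initiation" | "success";

/** Counts how often the watchdog would fire during `durationMs` of failing calls. */
function countWatchdogFirings(tracking: Tracking, durationMs: number): number {
  let now = 0;
  let lastGetUpdatesAt = 0;
  let firings = 0;

  while (now < durationMs) {
    now += RETRY_INTERVAL_MS;

    // Watchdog check: has the timestamp gone stale?
    if (now - lastGetUpdatesAt > STALL_THRESHOLD_MS) {
      firings += 1;
      lastGetUpdatesAt = now; // assumption: a restart resets the stall clock
    }

    // A retry goes out; in this simulation it always fails.
    const callSucceeded = false;
    if (tracking === "initiation" || callSucceeded) {
      lastGetUpdatesAt = now; // initiation tracking refreshes the clock regardless
    }
  }
  return firings;
}

const fiftyMinutes = 50 * 60_000;
console.log("initiation:", countWatchdogFirings("initiation", fiftyMinutes));
console.log("success:", countWatchdogFirings("success", fiftyMinutes));
```

With initiation tracking the watchdog stays silent for the whole simulated outage; with success tracking it keeps firing, which is what gives the restart and escalation logic a chance to run.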

Fix (3 changes, 1 file)

1. Track success, not initiation

Make the API transformer async, await prev() before updating lastGetUpdatesAt. Failed calls no longer reset the watchdog clock.
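
A self-contained sketch of this change, runnable without grammY, might look as follows. The ApiCall and Transformer types and the PollingHealth class are hypothetical stand-ins; only the (prev, method, payload, signal) call shape is taken from the snippets in this report.

```typescript
// Hypothetical stand-ins for the library's transformer types.
type ApiCall = (method: string, payload: unknown, signal?: AbortSignal) => Promise<unknown>;
type Transformer = (
  prev: ApiCall,
  method: string,
  payload: unknown,
  signal?: AbortSignal,
) => Promise<unknown>;

class PollingHealth {
  lastGetUpdatesAt = 0;
  restartAttempts = 0;

  // Success-tracking transformer: state changes only after prev() resolves.
  readonly transformer: Transformer = async (prev, method, payload, signal) => {
    // If prev() rejects, the await throws and the lines below never run.
    const result = await prev(method, payload, signal);
    if (method === "getUpdates") {
      this.lastGetUpdatesAt = Date.now();
      this.restartAttempts = 0;
    }
    return result;
  };
}
```

Because the rejection propagates unchanged, the caller's existing retry and error handling behave as before; the only difference is that a failed call no longer refreshes the watchdog clock.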

2. Reset #restartAttempts on success

Prevents permanent backoff growth after genuine recovery — the counter resets when getUpdates actually succeeds.

3. Escalate after 5 consecutive stall restarts

New MAX_CONSECUTIVE_POLL_RESTARTS = 5 constant. After exhausting retries without a single successful getUpdates, calls process.exit(1) to let the process manager (systemd/launchd/pm2) do a clean restart.
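
The counter logic behind changes 2 and 3 can be sketched in isolation. RestartEscalator and its method names are hypothetical, and the escalation action is injected so the sketch runs without terminating the process. It escalates on the fifth consecutive restart; the source does not say whether the real code escalates on the fifth restart or on the one after it.

```typescript
const MAX_CONSECUTIVE_POLL_RESTARTS = 5; // constant named in the PR description

// Hypothetical helper; the real session class keeps this state in #restartAttempts.
class RestartEscalator {
  private restartAttempts = 0;

  // `escalate` is injected for testability; in production it could be
  // () => process.exit(1), as the PR description states.
  constructor(private readonly escalate: () => void) {}

  /** Called each time the watchdog restarts polling after a stall. */
  onStallRestart(): void {
    this.restartAttempts += 1;
    if (this.restartAttempts >= MAX_CONSECUTIVE_POLL_RESTARTS) {
      this.escalate();
    }
  }

  /** Called after a getUpdates call resolves successfully. */
  onGetUpdatesSuccess(): void {
    this.restartAttempts = 0;
  }

  attempts(): number {
    return this.restartAttempts;
  }
}
```

Resetting on success is what makes the limit apply to consecutive restarts only: four stalls, a recovery, and four more stalls never reach the threshold.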

Testing

This was discovered and diagnosed from production logs on OpenClaw 2026.3.8. The fix addresses the specific code path in src/telegram/polling-session.ts that the existing monitor.test.ts stall detection tests don't cover (they test detection, not recovery failure under grammY retries).

Changed files

  • extensions/telegram/src/polling-session.ts (modified, +15/-3)

Code Example

// Current (broken): timestamps on initiation
bot.api.config.use((prev, method, payload, signal) => {
    if (method === "getUpdates") lastGetUpdatesAt = Date.now();
    return prev(method, payload, signal);
});

// Fixed: timestamps on successful completion and resets the restart counter
bot.api.config.use(async (prev, method, payload, signal) => {
    const result = await prev(method, payload, signal);
    if (method === "getUpdates") {
        lastGetUpdatesAt = Date.now();
        this.#restartAttempts = 0;
    }
    return result;
});

Bug

Telegram polling stall detection correctly fires after ~90s of silence, but the recovery fails because the watchdog timestamp is updated on getUpdates call initiation, not successful completion. grammY's internal retry mechanism (maxRetryTime: 3600000) makes failed getUpdates calls at intervals shorter than the 90s stall threshold, keeping the timestamp fresh and fooling the watchdog into thinking polling is healthy.

Result: gateway process stays alive, health endpoint returns {"ok":true}, but Telegram is completely deaf. No escalation, no process restart. Requires manual intervention.

Reproduction

This occurred on macOS (Node 22, OpenClaw 2026.3.8, launchd service) on March 12, 2026. Likely triggered by a transient network/DNS issue reaching api.telegram.org.

Timeline:

  • 19:02:02 PDT — Polling stall detected on both Telegram accounts. No getUpdates response for 92–103s. Auto-restart triggered.
  • 19:02:05 — DNS fallback forced (autoSelectFamily=false, dnsResultOrder=ipv4first)
  • 19:03:35 — Second polling stall detected (90s). Restart attempt #2.
  • 19:04–19:52 — Dead silence. No errors logged. Gateway alive but Telegram deaf for 50 minutes.
  • 19:52:42 — Manual gateway restart recovered polling immediately.

Root Cause

In the polling cycle setup (around TelegramPollingSession#runPollingCycle), the API middleware updates lastGetUpdatesAt when getUpdates is called, not when it returns successfully:

// Current (broken): timestamps on initiation
bot.api.config.use((prev, method, payload, signal) => {
    if (method === "getUpdates") lastGetUpdatesAt = Date.now();
    return prev(method, payload, signal);
});

After 2 failed restart attempts, grammY's internal retry loop takes over. Each retry calls getUpdates (which fails), but the middleware updates the timestamp anyway. The watchdog sees fresh timestamps and never fires again.

Proposed Fix (PR incoming)

Three targeted changes:

1. Track success, not initiation

bot.api.config.use(async (prev, method, payload, signal) => {
    const result = await prev(method, payload, signal);
    if (method === "getUpdates") {
        lastGetUpdatesAt = Date.now();
        this.#restartAttempts = 0;
    }
    return result;
});

2. Escalation after exhausting retries

Add MAX_CONSECUTIVE_POLL_RESTARTS = 5. After 5 consecutive failed restart attempts without a successful getUpdates, throw an error to crash the process and let the process manager (systemd/launchd) do a clean restart.

3. Reset counter on success

#restartAttempts = 0 on successful getUpdates so backoff resets after genuine recovery.

Related Issues

  • #36259 — Closed earlier today as "fixed" (stall detection works, but recovery doesn't — this issue)
  • #41704 — Polling stalls indefinitely when proxy TCP connection drops
  • #42100 — macOS polling stalls after sleep/wake or network change

Environment

  • OpenClaw 2026.3.8 (3caab92)
  • macOS 15.3.2 (arm64), Node v22.22.1
  • Telegram channel, launchd service, token auth mode
  • grammY runner with default maxRetryTime: 3600000

extent analysis

Fix Plan

To fix the issue, we need to implement the proposed changes:

  1. Track success, not initiation: Update the lastGetUpdatesAt timestamp when getUpdates is successful, not when it's initiated.
  2. Escalation after exhausting retries: Introduce a MAX_CONSECUTIVE_POLL_RESTARTS limit and throw an error after 5 consecutive failed restart attempts.
  3. Reset counter on success: Reset the #restartAttempts counter when getUpdates is successful.

Here's the updated code:

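const MAX_CONSECUTIVE_POLL_RESTARTS = 5;

// Transformer: record success only after getUpdates resolves.
// A rejected call skips the timestamp update and the counter reset.
bot.api.config.use(async (prev, method, payload, signal) => {
    const result = await prev(method, payload, signal);
    if (method === "getUpdates") {
        lastGetUpdatesAt = Date.now();
        this.#restartAttempts = 0;
    }
    return result;
});

// Stall-restart path (outside the transformer; assumed to run each time the
// watchdog restarts polling and to be where #restartAttempts is incremented):
if (this.#restartAttempts >= MAX_CONSECUTIVE_POLL_RESTARTS) {
    throw new Error("Exhausted retry attempts");
}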

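If the escalation path throws rather than calling process.exit(1) directly, make sure the error actually terminates the process. An uncaught error already exits Node with a non-zero code by default, so a handler is only needed for logging, and it must not swallow unrelated errors:

process.on("uncaughtException", (err) => {
    console.error("Fatal error, exiting so the process manager can restart:", err);
    // Exit for every uncaught error, not only the retry one: continuing after
    // an uncaught exception leaves the process in an undefined state.
    process.exit(1);
});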

Verification

To verify the fix, simulate a polling stall by injecting a delay or an error into the getUpdates call. Then check that:

  • The lastGetUpdatesAt timestamp advances only when getUpdates resolves, and stays unchanged while calls fail
  • The #restartAttempts counter resets to 0 after a successful getUpdates
  • The process exits and is restarted by the process manager after 5 consecutive failed restart attempts
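
One way to inject the failure is a test double that rejects a fixed number of calls before succeeding. The sketch below is illustrative only: makeFlakyGetUpdates, trackedCall, and runScenario are hypothetical names, time is passed in explicitly instead of read from Date.now(), and unlike the real transformer the wrapper swallows the rejection so the loop can continue.

```typescript
type ApiCall = (method: string, payload: unknown) => Promise<unknown>;

/** Test double: rejects the first `failures` calls, then resolves with no updates. */
function makeFlakyGetUpdates(failures: number): ApiCall {
  let calls = 0;
  return async () => {
    calls += 1;
    if (calls <= failures) throw new Error("simulated network failure");
    return [];
  };
}

interface WatchdogState {
  lastGetUpdatesAt: number;
  restartAttempts: number;
}

/** Success tracking in the style of the fix: state changes only after `prev` resolves. */
async function trackedCall(state: WatchdogState, prev: ApiCall, now: number): Promise<boolean> {
  try {
    await prev("getUpdates", {});
  } catch {
    return false; // failed call: timestamp and counter stay as they were
  }
  state.lastGetUpdatesAt = now;
  state.restartAttempts = 0;
  return true;
}

/** Drives `attempts` calls one simulated second apart and records each outcome. */
async function runScenario(
  failures: number,
  attempts: number,
): Promise<{ state: WatchdogState; outcomes: boolean[] }> {
  // Start with a non-zero counter so the reset on success is observable.
  const state: WatchdogState = { lastGetUpdatesAt: 0, restartAttempts: 3 };
  const getUpdates = makeFlakyGetUpdates(failures);
  const outcomes: boolean[] = [];
  for (let i = 1; i <= attempts; i++) {
    outcomes.push(await trackedCall(state, getUpdates, i * 1000));
  }
  return { state, outcomes };
}
```

Running the scenario with as many attempts as failures should leave both fields untouched; one more attempt should set the timestamp and zero the counter.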

Extra Tips

  • Make sure to test the fix in a production-like environment to ensure it works as expected.
  • Consider adding logging and monitoring to detect and respond to polling stalls and retry attempts.
  • Review the related issues (#36259, #41704, #42100) to ensure that the fix doesn't introduce any regressions.
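
For the monitoring tip, one option is to report polling freshness alongside process liveness, since the outage described above happened while the health endpoint kept returning {"ok":true}. The helper below is a hypothetical sketch: the function name, the report shape, and the reuse of the 90 s stall threshold as the health cutoff are all assumptions, not part of OpenClaw.

```typescript
interface PollingHealthReport {
  ok: boolean;
  msSinceLastSuccess: number;
}

// Hypothetical helper: derives a health signal from the time of the last
// successful getUpdates call rather than from process liveness alone.
function pollingHealthReport(
  lastSuccessfulGetUpdatesAt: number,
  now: number,
  stallThresholdMs = 90_000, // assumed to match the watchdog's stall threshold
): PollingHealthReport {
  const msSinceLastSuccess = now - lastSuccessfulGetUpdatesAt;
  return { ok: msSinceLastSuccess <= stallThresholdMs, msSinceLastSuccess };
}
```

A health handler that merges this report into its response would have turned the 50-minute silent outage into a visible failing check.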
