openclaw - ✅(Solved) Fix [Bug]: Telegram polling stall recovery fails — watchdog tracks call initiation, not success [1 pull requests, 1 participants]

openclaw · 2026-03-13 03:19:18


openclaw/openclaw#44595•Fetched 2026-04-08 00:44:49


Author

talison

Participants

talison

Timeline (top)

referenced ×2, cross-referenced ×1

Error Message

Add MAX_CONSECUTIVE_POLL_RESTARTS = 5. After 5 consecutive failed restart attempts without a successful getUpdates, throw an error to crash the process and let the process manager (systemd/launchd) do a clean restart.

Root Cause

In the polling cycle setup (around TelegramPollingSession#runPollingCycle), the API middleware updates lastGetUpdatesAt when getUpdates is called, not when it returns successfully:

// Current (broken): timestamps on initiation
bot.api.config.use((prev, method, payload, signal) => {
    if (method === "getUpdates") lastGetUpdatesAt = Date.now();
    return prev(method, payload, signal);
});

After 2 failed restart attempts, grammY's internal retry loop takes over. Each retry calls getUpdates (which fails), but the middleware updates the timestamp anyway. The watchdog sees fresh timestamps and never fires again.
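The difference between stamping on initiation and stamping on success can be reproduced without grammY. The following is a minimal sketch: `Call`, `Tracker`, `trackInitiation`, `trackSuccess`, and `alwaysFails` are hypothetical stand-ins invented for illustration, not part of the grammY or OpenClaw API.

```typescript
// Hypothetical stand-ins for illustration; not the grammY or OpenClaw API.
type Call = (method: string) => Promise<unknown>;

interface Tracker {
  lastGetUpdatesAt: number;
  call: Call;
}

// Stamps the clock before the request runs, as the broken middleware does.
function trackInitiation(prev: Call, now: () => number): Tracker {
  const tracker: Tracker = {
    lastGetUpdatesAt: 0,
    call: (method) => {
      if (method === "getUpdates") tracker.lastGetUpdatesAt = now();
      return prev(method);
    },
  };
  return tracker;
}

// Stamps the clock only after the request resolves.
function trackSuccess(prev: Call, now: () => number): Tracker {
  const tracker: Tracker = {
    lastGetUpdatesAt: 0,
    call: async (method) => {
      const result = await prev(method); // a rejection skips the stamp below
      if (method === "getUpdates") tracker.lastGetUpdatesAt = now();
      return result;
    },
  };
  return tracker;
}

// Simulated outage: every request rejects.
const alwaysFails: Call = () => Promise.reject(new Error("network down"));
```

With `alwaysFails` as the transport, the initiation tracker still advances `lastGetUpdatesAt` on every attempt, while the success tracker leaves it untouched, so only the latter lets a watchdog see the outage.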

Fix Action

Fixed

Fixed by PR: fix(telegram): polling stall recovery fails when grammY retries mask the stall (https://github.com/openclaw/openclaw/pull/44601)

PR fix notes

PR #44601: fix(telegram): polling stall recovery fails when grammY retries mask the stall

Repository: openclaw/openclaw
Author: talison
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/44601

Description (problem / solution / changelog)

Summary

Fixes #44595

The polling stall watchdog in TelegramPollingSession tracks getUpdates call initiation, not successful completion. When the watchdog triggers a restart but network recovery fails, grammY's internal retry mechanism continues making failed getUpdates calls at intervals shorter than the 90s stall threshold. Each failed attempt updates lastGetUpdatesAt through the API middleware, fooling the watchdog into thinking polling is healthy when it's actually broken.

This caused a 50-minute outage where the gateway process was alive (/health returning {"ok":true}) but Telegram was completely deaf — no messages received on any account.

Root Cause

// BEFORE: timestamps on initiation — broken retries mask the stall
bot.api.config.use((prev, method, payload, signal) => {
  if (method === "getUpdates") lastGetUpdatesAt = Date.now();
  return prev(method, payload, signal);
});

Timeline of the failure:

Polling stall detected (92–103s silence) → watchdog fired correctly, restart triggered ✅
Second stall (90s) → watchdog fired again, restart #2 triggered ✅
After restart #2: grammY's retry interval < 90s → each failed getUpdates call updates timestamp → watchdog never fires again ❌
Gateway sat deaf for 50 minutes until manual restart
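The timeline above can be replayed as a small simulation. This is a sketch under stated assumptions: the 90s threshold comes from the issue, but the 30s retry interval, the 30s watchdog check interval, and the strict `>` comparison are hypothetical values chosen for illustration; the issue only says the retry interval is shorter than 90s.

```typescript
const STALL_THRESHOLD_MS = 90000; // stall threshold stated in the issue

// Assumed comparison: a stall is reported once silence exceeds the threshold.
function isStalled(lastGetUpdatesAt: number, now: number, thresholdMs: number = STALL_THRESHOLD_MS): boolean {
  return now - lastGetUpdatesAt > thresholdMs;
}

interface OutageOptions {
  retryEveryMs: number;       // hypothetical retry interval
  checkEveryMs: number;       // hypothetical watchdog check interval
  durationMs: number;         // length of the simulated outage
  stampOnInitiation: boolean; // true reproduces the broken middleware
}

// Replays an outage in which every getUpdates retry fails.
// Returns the simulated times (ms) at which the watchdog reported a stall.
function simulateOutage(opts: OutageOptions): number[] {
  let lastGetUpdatesAt = 0;
  const stalls: number[] = [];
  for (let t = 1000; t <= opts.durationMs; t += 1000) {
    // A failed retry only moves the clock when stamping on initiation.
    if (t % opts.retryEveryMs === 0 && opts.stampOnInitiation) lastGetUpdatesAt = t;
    if (t % opts.checkEveryMs === 0 && isStalled(lastGetUpdatesAt, t)) stalls.push(t);
  }
  return stalls;
}
```

Over a simulated 50-minute outage, stamping on initiation yields no stall reports at all, while stamping on success yields a report at the first check past the threshold.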

Fix (3 changes, 1 file)

1. Track success, not initiation

Make the API transformer async, await prev() before updating lastGetUpdatesAt. Failed calls no longer reset the watchdog clock.

2. Reset `#restartAttempts` on success

Prevents permanent backoff growth after genuine recovery — the counter resets when getUpdates actually succeeds.
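To see why the reset matters, consider how a restart delay might depend on the counter. The PR does not show the backoff formula, so the capped exponential below, including the function name `restartBackoffMs`, the 1s base, and the 60s cap, is purely an assumption for illustration.

```typescript
// Hypothetical capped exponential backoff; the real formula is not shown in the PR.
function restartBackoffMs(restartAttempts: number, baseMs: number = 1000, capMs: number = 60000): number {
  return Math.min(capMs, baseMs * 2 ** restartAttempts);
}
```

Under this assumed formula, a counter that is never reset keeps every later restart pinned at the cap, whereas resetting it to 0 after a successful getUpdates returns the next restart to the base delay.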

3. Escalate after 5 consecutive stall restarts

New MAX_CONSECUTIVE_POLL_RESTARTS = 5 constant. After exhausting retries without a single successful getUpdates, calls process.exit(1) to let the process manager (systemd/launchd/pm2) do a clean restart.
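The escalation rule can be sketched in isolation. The class name `StallRestartCounter` and its methods are hypothetical, the exit action is injected as a callback so the sketch stays testable (the PR calls process.exit(1) at that point), and escalating on reaching the fifth consecutive restart rather than exceeding it is an assumption.

```typescript
const MAX_CONSECUTIVE_POLL_RESTARTS = 5; // constant named in the PR

// Hypothetical helper; mirrors the role of #restartAttempts in the PR.
class StallRestartCounter {
  private restartAttempts = 0;

  constructor(
    private readonly escalate: (reason: string) => void,
    private readonly maxRestarts: number = MAX_CONSECUTIVE_POLL_RESTARTS,
  ) {}

  // Call when getUpdates actually succeeds.
  recordSuccess(): void {
    this.restartAttempts = 0;
  }

  // Call when the watchdog triggers a restart.
  // Returns true if another in-process restart should be attempted.
  recordStallRestart(): boolean {
    this.restartAttempts += 1;
    if (this.restartAttempts >= this.maxRestarts) {
      this.escalate(`polling stalled after ${this.restartAttempts} consecutive restarts`);
      return false;
    }
    return true;
  }
}
```

In a real service the `escalate` callback would be the place to log and exit non-zero so that systemd, launchd, or pm2 restarts the process.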

Testing

This was discovered and diagnosed from production logs on OpenClaw 2026.3.8. The fix addresses the specific code path in src/telegram/polling-session.ts that the existing monitor.test.ts stall detection tests don't cover (they test detection, not recovery failure under grammY retries).

Changed files

extensions/telegram/src/polling-session.ts (modified, +15/-3)

Code Example

// Current (broken): timestamps on initiation
bot.api.config.use((prev, method, payload, signal) => {
    if (method === "getUpdates") lastGetUpdatesAt = Date.now();
    return prev(method, payload, signal);
});

---

// Fixed: timestamps on successful completion
bot.api.config.use(async (prev, method, payload, signal) => {
    const result = await prev(method, payload, signal);
    if (method === "getUpdates") {
        lastGetUpdatesAt = Date.now();
        this.#restartAttempts = 0;
    }
    return result;
});

Bug

Telegram polling stall detection correctly fires after ~90s of silence, but the recovery fails because the watchdog timestamp is updated on getUpdates call initiation, not successful completion. grammY's internal retry mechanism (maxRetryTime: 3600000) makes failed getUpdates calls at intervals shorter than the 90s stall threshold, keeping the timestamp fresh and fooling the watchdog into thinking polling is healthy.

Result: gateway process stays alive, health endpoint returns {"ok":true}, but Telegram is completely deaf. No escalation, no process restart. Requires manual intervention.

Reproduction

This occurred on macOS (Node 22, OpenClaw 2026.3.8, launchd service) on March 12, 2026. Likely triggered by a transient network/DNS issue reaching api.telegram.org.

Timeline:

19:02:02 PDT — Polling stall detected on both Telegram accounts. No getUpdates response for 92–103s. Auto-restart triggered.
19:02:05 — DNS fallback forced (autoSelectFamily=false, dnsResultOrder=ipv4first)
19:03:35 — Second polling stall detected (90s). Restart attempt #2.
19:04–19:52 — Dead silence. No errors logged. Gateway alive but Telegram deaf for 50 minutes.
19:52:42 — Manual gateway restart recovered polling immediately.

Root Cause

In the polling cycle setup (around TelegramPollingSession#runPollingCycle), the API middleware updates lastGetUpdatesAt when getUpdates is called, not when it returns successfully:

// Current (broken): timestamps on initiation
bot.api.config.use((prev, method, payload, signal) => {
    if (method === "getUpdates") lastGetUpdatesAt = Date.now();
    return prev(method, payload, signal);
});

Proposed Fix (PR incoming)

Three targeted changes:

1. Track success, not initiation

bot.api.config.use(async (prev, method, payload, signal) => {
    const result = await prev(method, payload, signal);
    if (method === "getUpdates") {
        lastGetUpdatesAt = Date.now();
        this.#restartAttempts = 0;
    }
    return result;
});

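2. Escalation after exhausting retries

Add MAX_CONSECUTIVE_POLL_RESTARTS = 5. After 5 consecutive failed restart attempts without a successful getUpdates, throw an error to crash the process and let the process manager (systemd/launchd) do a clean restart.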

3. Reset counter on success

#restartAttempts = 0 on successful getUpdates so backoff resets after genuine recovery.

Related Issues

#36259 — Closed earlier today as "fixed" (stall detection works, but recovery doesn't — this issue)
#41704 — Polling stalls indefinitely when proxy TCP connection drops
#42100 — macOS polling stalls after sleep/wake or network change

Environment

OpenClaw 2026.3.8 (3caab92)
macOS 15.3.2 (arm64), Node v22.22.1
Telegram channel, launchd service, token auth mode
grammY runner with default maxRetryTime: 3600000

extent analysis

Fix Plan

To fix the issue, we need to implement the proposed changes:

Track success, not initiation: Update the lastGetUpdatesAt timestamp when getUpdates is successful, not when it's initiated.
Escalation after exhausting retries: Introduce a MAX_CONSECUTIVE_POLL_RESTARTS limit and throw an error after 5 consecutive failed restart attempts.
Reset counter on success: Reset the #restartAttempts counter when getUpdates is successful.

Here's the updated code:

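const MAX_CONSECUTIVE_POLL_RESTARTS = 5;

// Track success: the timestamp and counter change only after getUpdates resolves.
bot.api.config.use(async (prev, method, payload, signal) => {
    // If prev() rejects, the lines below are skipped and the watchdog clock is not reset.
    const result = await prev(method, payload, signal);
    if (method === "getUpdates") {
        lastGetUpdatesAt = Date.now();
        this.#restartAttempts = 0;
    }
    return result;
});

// Escalation belongs in the stall-restart path, not in the transformer:
// inside the transformer it would only run after a call had already succeeded.
// Assumed placement: wherever the watchdog increments #restartAttempts.
if (this.#restartAttempts >= MAX_CONSECUTIVE_POLL_RESTARTS) {
    throw new Error("Exhausted retry attempts");
}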

Additionally, you should handle the error thrown when the retry attempts are exhausted:

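process.on("uncaughtException", (err) => {
    console.error("Uncaught exception, exiting:", err);
    // Exit on every uncaught error, not only "Exhausted retry attempts":
    // once this listener is registered, Node no longer exits by default,
    // so ignoring other errors would leave the process running in an unknown state.
    // A non-zero exit lets the process manager (systemd/launchd) restart the process.
    process.exit(1);
});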

Verification

To verify the fix, you can simulate a polling stall by introducing a delay or error in the getUpdates call. Then, check that:

The lastGetUpdatesAt timestamp is updated correctly
The #restartAttempts counter is reset after a successful getUpdates
The process crashes and restarts after 5 consecutive failed restart attempts
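One way to exercise the first two checks without a real network is a fake transport that fails a fixed number of times before recovering. This is a sketch only: `flakyTransport` and `PollingState` are hypothetical test helpers, not code from the PR, and a plain `restartAttempts` field stands in for `#restartAttempts`.

```typescript
type Prev = (method: string) => Promise<unknown>;

// Fake transport: rejects the first `failures` calls, then resolves with no updates.
function flakyTransport(failures: number): Prev {
  let calls = 0;
  return async () => {
    calls += 1;
    if (calls <= failures) throw new Error("simulated network failure");
    return [];
  };
}

// Hypothetical holder for the state the fixed transformer touches.
class PollingState {
  lastGetUpdatesAt = 0;
  restartAttempts = 0;

  constructor(private readonly prev: Prev, private readonly now: () => number) {}

  // Same shape as the fixed transformer: update state only after success.
  async call(method: string): Promise<unknown> {
    const result = await this.prev(method);
    if (method === "getUpdates") {
      this.lastGetUpdatesAt = this.now();
      this.restartAttempts = 0;
    }
    return result;
  }
}
```

A test can set `restartAttempts` to a non-zero value, drive `call("getUpdates")` through the failures, and assert that both fields change only on the first successful call.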

Extra Tips

Make sure to test the fix in a production-like environment to ensure it works as expected.
Consider adding logging and monitoring to detect and respond to polling stalls and retry attempts.
Review the related issues (#36259, #41704, #42100) to ensure that the fix doesn't introduce any regressions.
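As one possible form of the monitoring suggested above, the health payload could report how long it has been since the last successful getUpdates, so that an alive-but-deaf gateway stops reporting a healthy status. The `PollingHealth` shape and `pollingHealth` function are hypothetical, and the existing /health endpoint is not known to work this way.

```typescript
// Hypothetical health payload; not the existing /health response.
interface PollingHealth {
  ok: boolean;
  secondsSinceLastGetUpdates: number;
}

function pollingHealth(lastGetUpdatesAt: number, now: number, stallThresholdMs: number = 90000): PollingHealth {
  const elapsedMs = now - lastGetUpdatesAt;
  return {
    ok: elapsedMs <= stallThresholdMs,
    secondsSinceLastGetUpdates: Math.round(elapsedMs / 1000),
  };
}
```

This only reports the truth if `lastGetUpdatesAt` is stamped on success, as in the fix; with initiation-based stamping it would stay healthy throughout the outage.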
