openclaw gateway: per-account auto-restart hard-stops after MAX_RESTART_ATTEMPTS=10 with no recovery path

Summary

In src/gateway/server-channels.ts, the per-account auto-restart caps reconnect attempts at MAX_RESTART_ATTEMPTS = 10. Once the cap is exceeded, the account is left with restartPending: false and reconnectAttempts: 11+, and no further restart is ever scheduled: the channel stays dead until the gateway process itself is bounced. This is too brittle for channels (Telegram especially) where 10 consecutive failures can come from a transient network outage that outlasts the cumulative backoff, not from a structural problem.

Affected code

src/gateway/server-channels.ts:567-589 (the .then(restart) block of the per-account task chain):

.then(async () => {
  if (manuallyStopped.has(rKey)) {
    return;
  }
  const attempt = (restartAttempts.get(rKey) ?? 0) + 1;
  restartAttempts.set(rKey, attempt);
  if (attempt > MAX_RESTART_ATTEMPTS) {
    setRuntime(channelId, id, {
      accountId: id,
      restartPending: false,
      reconnectAttempts: attempt,
    });
    log.error?.(`[${id}] giving up after ${MAX_RESTART_ATTEMPTS} restart attempts`);
    return;                                  // ← permanent dead end
  }
  const delayMs = computeBackoff(CHANNEL_RESTART_POLICY, attempt);
})

No other code path later resets the restartAttempts entry for rKey or reschedules a restart for that account. manuallyStopped.has(rKey) is false at this point, but the early return skips the backoff scheduling, so the supervisor never tries again.
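The dead end can be reproduced in isolation with a small model of the branch quoted above. This is only a sketch: MAX_RESTART_ATTEMPTS and the counter logic mirror the excerpt, while decideRestart, simulate, and the "telegram:default" key are hypothetical names invented for illustration, and backoff delays are left out entirely.

```typescript
// Simplified model of the restart decision quoted above.
// MAX_RESTART_ATTEMPTS mirrors the issue; all other names are illustrative.
const MAX_RESTART_ATTEMPTS = 10;

type Decision =
  | { kind: "retry"; attempt: number }
  | { kind: "give-up"; attempt: number };

// Mirrors the quoted branch: bump the counter, then either retry or stop.
function decideRestart(restartAttempts: Map<string, number>, rKey: string): Decision {
  const attempt = (restartAttempts.get(rKey) ?? 0) + 1;
  restartAttempts.set(rKey, attempt);
  if (attempt > MAX_RESTART_ATTEMPTS) {
    return { kind: "give-up", attempt }; // nothing is scheduled after this
  }
  return { kind: "retry", attempt };
}

// Drives the model against a channel whose start fails `failures` times and
// would succeed on every start after that.
function simulate(failures: number): { running: boolean; starts: number; attempts: number } {
  const restartAttempts = new Map<string, number>();
  const rKey = "telegram:default"; // hypothetical key format
  let starts = 0;
  let remainingFailures = failures;
  const tryStart = (): boolean => {
    starts += 1;
    if (remainingFailures > 0) {
      remainingFailures -= 1;
      return false;
    }
    return true;
  };
  let running = tryStart();
  while (!running) {
    const decision = decideRestart(restartAttempts, rKey);
    if (decision.kind === "give-up") break; // the early return: the loop ends for good
    running = tryStart();
  }
  return { running, starts, attempts: restartAttempts.get(rKey) ?? 0 };
}

const shortOutage = simulate(5);  // recovers within the cap
const longOutage = simulate(50);  // outlasts the cap and is never retried
console.log(shortOutage, longOutage);
```

In this model the long outage ends with running: false even though the very next start would have succeeded, which is the wedge described in this issue.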

Reproduction

  1. Run the gateway with an active telegram account using isolated polling.
  2. Make api.telegram.org (or the bot's network path) unreachable for ~10–20 minutes (pfctl block, pulling an upstream uplink, or simulating via OPENCLAW_TELEGRAM_FORCE_TIMEOUT if available).
  3. Restore connectivity.
  4. Observe "[<id>] giving up after 10 restart attempts" in gateway.log. The account stays at running: false indefinitely even though Telegram is reachable again; a gateway bounce is required to recover.

Real-world trigger we've seen: a sustained network blip combined with the existing per-stop 5 s timeout means each "attempt" can take several seconds while the worker is still terminating, so a 60–90 s outage can burn the full 10 attempts.
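To reason about how long an outage has to be before the cap is hit, it helps to add up the delays across all attempts. The sketch below does that for a plain capped exponential backoff. The BackoffPolicy shape, the example values, and the function names are assumptions made for illustration; the real CHANNEL_RESTART_POLICY and computeBackoff may use different numbers or add jitter.

```typescript
// Hypothetical policy shape and values; the gateway's real
// CHANNEL_RESTART_POLICY and computeBackoff may differ (e.g. jitter).
interface BackoffPolicy {
  initialMs: number;
  factor: number;
  maxMs: number;
}

// Plain exponential backoff without jitter, capped at maxMs.
function backoffMs(policy: BackoffPolicy, attempt: number): number {
  const raw = policy.initialMs * Math.pow(policy.factor, attempt - 1);
  return Math.min(raw, policy.maxMs);
}

// Sum of the delays for attempts 1..maxAttempts, plus an optional fixed
// per-attempt overhead (for example, time spent waiting on a stop timeout).
function retryWindowMs(
  policy: BackoffPolicy,
  maxAttempts: number,
  perAttemptOverheadMs = 0,
): number {
  let total = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    total += backoffMs(policy, attempt) + perAttemptOverheadMs;
  }
  return total;
}

// Illustrative numbers only: 1 s initial delay, doubling, capped at 8 s.
const examplePolicy: BackoffPolicy = { initialMs: 1_000, factor: 2, maxMs: 8_000 };
const windowMs = retryWindowMs(examplePolicy, 10);
console.log(`retry window: ${windowMs / 1000}s`);
```

Under this model, any outage longer than the returned window uses up every attempt, after which the current code stops retrying.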

Why this is a silent wedge

  • No status surface clearly says "channel permanently abandoned, manual recovery required" — the only signal is the one log line at the moment of giving up.
  • Health probe (gateway call health) shows running: false with lastError from the last failed restart; consumers can't distinguish "transient" from "abandoned".
  • manuallyStopped.has(rKey) is false, so any external "ask the supervisor to restart" hook is also a no-op (the dead-end path doesn't add it back to the restart cycle).

Proposed fix

Replace the hard dead-end with a long cooldown that lets the channel try again, while still backing off aggressively. Sketch:

   if (attempt > MAX_RESTART_ATTEMPTS) {
-    setRuntime(channelId, id, {
-      accountId: id,
-      restartPending: false,
-      reconnectAttempts: attempt,
-    });
-    log.error?.(`[${id}] giving up after ${MAX_RESTART_ATTEMPTS} restart attempts`);
-    return;
+    // Don't permanently give up — a 10-attempt streak is often a long
+    // network outage, not a structural failure. Wait a long cooldown,
+    // then reset the attempt counter and re-enter the restart cycle.
+    const cooldownMs = MAX_RESTART_COOLDOWN_MS;  // e.g. 5 * 60_000
+    setRuntime(channelId, id, {
+      accountId: id,
+      restartPending: true,
+      reconnectAttempts: attempt,
+      lastError: `paused after ${MAX_RESTART_ATTEMPTS} restart attempts; will retry in ${Math.round(cooldownMs / 1000)}s`,
+    });
+    log.error?.(
+      `[${id}] paused after ${MAX_RESTART_ATTEMPTS} restart attempts; retry scheduled in ${Math.round(cooldownMs / 1000)}s`,
+    );
+    await sleep(cooldownMs);
+    if (manuallyStopped.has(rKey) || abort.signal.aborted) return;
+    restartAttempts.set(rKey, 0);
+    // fall through to start() below as if attempt was 1
   }
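The cooldown-and-reset idea from the diff can also be expressed as a pure decision function and exercised against a virtual clock, which may be a convenient shape for a unit test. This is a sketch under stated assumptions, not the gateway's implementation: nextStep, simulateOutage, the backoff values, and the "telegram:default" key are hypothetical, the manuallyStopped and abort checks are omitted, and the start after a cooldown is assumed to happen immediately.

```typescript
// Sketch of the cooldown-and-reset idea as a pure decision function.
// MAX_RESTART_COOLDOWN_MS is the constant proposed in the diff; backoffMs is
// a stand-in for computeBackoff with made-up values.
const MAX_RESTART_ATTEMPTS = 10;
const MAX_RESTART_COOLDOWN_MS = 5 * 60_000;

function backoffMs(attempt: number): number {
  return Math.min(1_000 * 2 ** (attempt - 1), 30_000); // hypothetical policy
}

interface NextStep {
  waitMs: number;      // how long to wait before the next start()
  cooledDown: boolean; // true when the attempt counter was reset
}

function nextStep(restartAttempts: Map<string, number>, rKey: string): NextStep {
  const attempt = (restartAttempts.get(rKey) ?? 0) + 1;
  if (attempt > MAX_RESTART_ATTEMPTS) {
    // Instead of returning for good: pause for the cooldown, then start over.
    restartAttempts.set(rKey, 0);
    return { waitMs: MAX_RESTART_COOLDOWN_MS, cooledDown: true };
  }
  restartAttempts.set(rKey, attempt);
  return { waitMs: backoffMs(attempt), cooledDown: false };
}

// Virtual-clock driver: every start() before `outageMs` fails, the first one
// at or after it succeeds. No real timers are involved.
function simulateOutage(outageMs: number): { recoveredAtMs: number; cooldowns: number } {
  const restartAttempts = new Map<string, number>();
  const rKey = "telegram:default"; // hypothetical key format
  let nowMs = 0;
  let cooldowns = 0;
  while (nowMs < outageMs) {
    // The start at nowMs failed; decide how long to wait before the next one.
    const step = nextStep(restartAttempts, rKey);
    if (step.cooledDown) cooldowns += 1;
    nowMs += step.waitMs;
  }
  return { recoveredAtMs: nowMs, cooldowns };
}

const twentyMinuteOutage = simulateOutage(20 * 60_000);
console.log(twentyMinuteOutage);
```

The point of the exercise is that simulateOutage always terminates with a recovery time, however long the outage, whereas the current hard stop never recovers once the cap is exceeded.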

Or, the smaller surgical version: make MAX_RESTART_ATTEMPTS configurable per channel via config.channels.<id>.maxRestartAttempts (with a much higher default for polling-based channels like telegram), and document that it should be set to a large number unless the operator really wants a hard cap.

Either avoids the "telegram is dead until I notice and bounce" failure mode.
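For the configurable variant, the lookup could be as small as the sketch below. Only the config.channels.<id>.maxRestartAttempts path comes from the proposal above; the interface names, the default values, the list of polling channels, and the "example-webhook" channel id are assumptions chosen for illustration.

```typescript
// Hypothetical config shape: only channels.<id>.maxRestartAttempts comes
// from the proposal; defaults and the polling-channel list are made up.
interface ChannelConfig {
  maxRestartAttempts?: number;
}
interface GatewayConfig {
  channels?: Record<string, ChannelConfig | undefined>;
}

const DEFAULT_MAX_RESTART_ATTEMPTS = 10;            // today's hard-coded cap
const POLLING_DEFAULT_MAX_RESTART_ATTEMPTS = 1_000; // illustrative higher default
const POLLING_CHANNELS = new Set(["telegram"]);     // illustrative list

function resolveMaxRestartAttempts(config: GatewayConfig, channelId: string): number {
  const configured = config.channels?.[channelId]?.maxRestartAttempts;
  // Accept only positive integers; anything else falls back to a default.
  if (typeof configured === "number" && Number.isInteger(configured) && configured > 0) {
    return configured;
  }
  return POLLING_CHANNELS.has(channelId)
    ? POLLING_DEFAULT_MAX_RESTART_ATTEMPTS
    : DEFAULT_MAX_RESTART_ATTEMPTS;
}

// "example-webhook" is a made-up channel id used only for this example.
const exampleConfig: GatewayConfig = {
  channels: { "example-webhook": { maxRestartAttempts: 25 } },
};
console.log(resolveMaxRestartAttempts(exampleConfig, "example-webhook"));
console.log(resolveMaxRestartAttempts(exampleConfig, "telegram"));
```

Rejecting non-positive or fractional values keeps a typo in the config from silently turning the cap into zero attempts.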

Tests

If preferred, I can put together a test in src/gateway/server-channels.test.ts (or wherever the per-account lifecycle is exercised) that forces a sustained crash loop and asserts a post-cooldown retry happens. Happy to do this in a PR.

Environment

  • Gateway version: v2026.5.16-beta.4
  • Reproduced on macOS, single-host launchd-managed gateway.
