openclaw - 💡(How to fix) Fix gateway: per-account auto-restart hard-stops after MAX_RESTART_ATTEMPTS=10 with no recovery path

openclaw · 2026-05-17 10:29:46


The per-account auto-restart in src/gateway/server-channels.ts caps reconnect attempts at MAX_RESTART_ATTEMPTS = 10. Once the cap is exceeded, the account is left with restartPending: false and reconnectAttempts: 11+, and no further restart is ever scheduled: the channel stays dead until the gateway process itself is bounced. This is too brittle for channels (Telegram especially) where 10 consecutive failures can come from a transient network outage that outlasts the backoff sum rather than from a structural problem.


Summary

Affected code

src/gateway/server-channels.ts:567-589 (the .then(restart) block of the per-account task chain):

.then(async () => {
  if (manuallyStopped.has(rKey)) {
    return;
  }
  const attempt = (restartAttempts.get(rKey) ?? 0) + 1;
  restartAttempts.set(rKey, attempt);
  if (attempt > MAX_RESTART_ATTEMPTS) {
    setRuntime(channelId, id, {
      accountId: id,
      restartPending: false,
      reconnectAttempts: attempt,
    });
    log.error?.(`[${id}] giving up after ${MAX_RESTART_ATTEMPTS} restart attempts`);
    return;                                  // ← permanent dead end
  }
  const delayMs = computeBackoff(CHANNEL_RESTART_POLICY, attempt);
  …
})

There is no other code path that can later reset the restartAttempts entry for rKey or re-schedule a restart for that account. manuallyStopped.has(rKey) is false at this point, but the early return skips the backoff scheduling, so the supervisor never tries again.

Reproduction

1. Run the gateway with an active Telegram account using isolated polling.
2. Make api.telegram.org (or the bot's network path) unreachable for ~10–20 minutes: a pfctl block, pulling an upstream uplink, or simulating it via OPENCLAW_TELEGRAM_FORCE_TIMEOUT if available.
3. Restore connectivity.
4. Observe [<id>] giving up after 10 restart attempts in gateway.log. The account stays at running: false forever even though Telegram is reachable again. A gateway bounce is required to recover.

Real-world trigger we've seen: during a sustained network blip, the existing per-stop 5 s timeout means each "attempt" can consume several seconds while the worker is still terminating, so a 60–90 s outage can burn the full 10 attempts.

Why this is a silent wedge

No status surface clearly says "channel permanently abandoned, manual recovery required" — the only signal is the one log line at the moment of giving up.
Health probe (gateway call health) shows running: false with lastError from the last failed restart; consumers can't distinguish "transient" from "abandoned".
manuallyStopped.has(rKey) is false, so any external "ask the supervisor to restart" hook is also a no-op (the dead-end path doesn't add it back to the restart cycle).
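Until the supervisor reports an explicit state, a consumer could try to infer one from the runtime fields named in this issue. The following is a minimal sketch, assuming the health payload exposes running, restartPending, and reconnectAttempts per account; the AccountRuntime shape, the classifyAccount name, and the state labels are hypothetical and not part of the gateway API.

```typescript
// Hypothetical snapshot shape; field names are taken from the setRuntime() call
// quoted in this issue, but whether the health probe exposes them is an assumption.
interface AccountRuntime {
  running: boolean;
  restartPending?: boolean;
  reconnectAttempts?: number;
  lastError?: string;
}

type AccountState = "running" | "restarting" | "abandoned" | "stopped";

// Classify an account so that "transient" and "abandoned" become distinguishable.
// maxRestartAttempts mirrors MAX_RESTART_ATTEMPTS (10 in the affected version).
function classifyAccount(rt: AccountRuntime, maxRestartAttempts = 10): AccountState {
  if (rt.running) return "running";
  // A restart is still scheduled, so the failure may be transient.
  if (rt.restartPending) return "restarting";
  // No restart pending and the counter is past the cap: the dead-end path was taken.
  if ((rt.reconnectAttempts ?? 0) > maxRestartAttempts) return "abandoned";
  return "stopped";
}

console.log(classifyAccount({ running: false, restartPending: false, reconnectAttempts: 11 }));
```

This only works around the missing status surface; it does not bring the channel back.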

Proposed fix

Replace the hard dead-end with a long cooldown that lets the channel try again, while still backing off aggressively. Sketch:

   if (attempt > MAX_RESTART_ATTEMPTS) {
-    setRuntime(channelId, id, {
-      accountId: id,
-      restartPending: false,
-      reconnectAttempts: attempt,
-    });
-    log.error?.(`[${id}] giving up after ${MAX_RESTART_ATTEMPTS} restart attempts`);
-    return;
+    // Don't permanently give up — a 10-attempt streak is often a long
+    // network outage, not a structural failure. Wait a long cooldown,
+    // then reset the attempt counter and re-enter the restart cycle.
+    const cooldownMs = MAX_RESTART_COOLDOWN_MS;  // e.g. 5 * 60_000
+    setRuntime(channelId, id, {
+      accountId: id,
+      restartPending: true,
+      reconnectAttempts: attempt,
+      lastError: `paused after ${MAX_RESTART_ATTEMPTS} restart attempts; will retry in ${Math.round(cooldownMs / 1000)}s`,
+    });
+    log.error?.(
+      `[${id}] paused after ${MAX_RESTART_ATTEMPTS} restart attempts; retry scheduled in ${Math.round(cooldownMs / 1000)}s`,
+    );
+    await sleep(cooldownMs);
+    if (manuallyStopped.has(rKey) || abort.signal.aborted) return;
+    restartAttempts.set(rKey, 0);
+    // fall through to start() below as if attempt was 1
   }
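To make the control flow of this sketch easier to reason about, the decision ("back off" versus "cool down and reset") can be separated from the waiting. The code below is an illustration, not the gateway's implementation: RestartPolicy, the stand-in computeBackoff (plain exponential, no jitter), planRestart, and every numeric value are assumptions.

```typescript
// Hypothetical policy shape; the real CHANNEL_RESTART_POLICY fields are not shown in the issue.
interface RestartPolicy {
  initialMs: number;
  maxMs: number;
  factor: number;
}

// Stand-in for the gateway's computeBackoff: exponential growth capped at maxMs.
function computeBackoff(policy: RestartPolicy, attempt: number): number {
  const raw = policy.initialMs * Math.pow(policy.factor, Math.max(0, attempt - 1));
  return Math.min(policy.maxMs, raw);
}

interface RestartPlan {
  waitMs: number;       // how long to wait before calling start() again
  nextAttempts: number; // value to store back into restartAttempts
  paused: boolean;      // true when the cap was hit and a cooldown replaces the backoff
}

// Pure decision step: never returns "give up", only a longer wait.
function planRestart(
  prevAttempts: number,
  maxAttempts: number,
  cooldownMs: number,
  policy: RestartPolicy,
): RestartPlan {
  const attempt = prevAttempts + 1;
  if (attempt > maxAttempts) {
    // Cap exceeded: wait out the cooldown, then restart the cycle from zero.
    return { waitMs: cooldownMs, nextAttempts: 0, paused: true };
  }
  return { waitMs: computeBackoff(policy, attempt), nextAttempts: attempt, paused: false };
}

// Simulate 25 consecutive failures and count how often the cooldown kicks in.
const examplePolicy: RestartPolicy = { initialMs: 1_000, maxMs: 30_000, factor: 2 };
let attempts = 0;
let pauses = 0;
for (let failure = 0; failure < 25; failure++) {
  const plan = planRestart(attempts, 10, 5 * 60_000, examplePolicy);
  attempts = plan.nextAttempts;
  if (plan.paused) pauses += 1;
}
console.log({ attempts, pauses });
```

One detail worth checking in a real patch: in the diff, the local attempt variable still holds 11 after the cooldown, and the affected-code excerpt shows computeBackoff(CHANNEL_RESTART_POLICY, attempt) right after the if block. Unless attempt is reset as well, the fall-through may add a near-maximum backoff on top of the cooldown instead of behaving "as if attempt was 1".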

Or, the smaller surgical version: make MAX_RESTART_ATTEMPTS configurable per channel via config.channels.<id>.maxRestartAttempts (with a much higher default for polling-based channels like telegram), and document that it should be set to a large number unless the operator really wants a hard cap.
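The configurable-cap alternative could be resolved along these lines. This is a sketch under stated assumptions: only the config.channels.<id>.maxRestartAttempts path comes from the proposal; the GatewayConfig and ChannelConfig types, the resolveMaxRestartAttempts helper, the set of polling channels, and the higher default of 1000 are all hypothetical.

```typescript
// Hypothetical config types; only the maxRestartAttempts key is taken from the proposal.
interface ChannelConfig {
  maxRestartAttempts?: number;
}

interface GatewayConfig {
  channels?: Record<string, ChannelConfig | undefined>;
}

const DEFAULT_MAX_RESTART_ATTEMPTS = 10;            // current hard-coded cap
const POLLING_DEFAULT_MAX_RESTART_ATTEMPTS = 1_000; // assumed "much higher" default
const POLLING_CHANNELS = new Set<string>(["telegram"]); // assumed membership

// Resolve the effective cap: explicit config wins, then a per-kind default.
function resolveMaxRestartAttempts(config: GatewayConfig, channelId: string): number {
  const configured = config.channels?.[channelId]?.maxRestartAttempts;
  // Ignore missing or nonsensical values rather than wedging the channel on a typo.
  if (typeof configured === "number" && Number.isInteger(configured) && configured > 0) {
    return configured;
  }
  return POLLING_CHANNELS.has(channelId)
    ? POLLING_DEFAULT_MAX_RESTART_ATTEMPTS
    : DEFAULT_MAX_RESTART_ATTEMPTS;
}

console.log(resolveMaxRestartAttempts({ channels: { telegram: { maxRestartAttempts: 50 } } }, "telegram"));
```

Note that a higher cap only delays the dead end; it does not remove it the way the cooldown variant does.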

Either avoids the "telegram is dead until I notice and bounce" failure mode.

Tests

If preferred, I can put together a test in src/gateway/server-channels.test.ts (or wherever the per-account lifecycle is exercised) that forces a sustained crash loop and asserts a post-cooldown retry happens. Happy to do this in a PR.
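Such a test could inject a fake sleep so that the crash loop runs instantly. The sketch below is self-contained and does not touch the gateway's real task chain: superviseUntilStarted, its options, and the numbers are hypothetical stand-ins for whatever src/gateway/server-channels.ts actually exposes.

```typescript
type Sleep = (ms: number) => Promise<void>;

interface SupervisorOptions {
  maxAttempts: number;
  backoffMs: number;  // fixed backoff keeps the example short
  cooldownMs: number;
}

// Hypothetical supervisor loop: retries start() forever, inserting a long
// cooldown and resetting the counter whenever the attempt cap is exceeded.
async function superviseUntilStarted(
  start: () => Promise<void>,
  sleep: Sleep,
  opts: SupervisorOptions,
): Promise<void> {
  let attempts = 0;
  for (;;) {
    try {
      await start();
      return;
    } catch {
      attempts += 1;
      if (attempts > opts.maxAttempts) {
        await sleep(opts.cooldownMs);
        attempts = 0;
      } else {
        await sleep(opts.backoffMs);
      }
    }
  }
}

// Force a sustained crash loop: start() fails 15 times, then succeeds.
// The fake sleep records each requested delay instead of waiting.
async function runCrashLoopTest(): Promise<number[]> {
  const delays: number[] = [];
  const fakeSleep: Sleep = async (ms) => {
    delays.push(ms);
  };
  let failuresLeft = 15;
  const start = async (): Promise<void> => {
    if (failuresLeft > 0) {
      failuresLeft -= 1;
      throw new Error("simulated outage");
    }
  };
  await superviseUntilStarted(start, fakeSleep, {
    maxAttempts: 10,
    backoffMs: 1_000,
    cooldownMs: 300_000,
  });
  return delays;
}

runCrashLoopTest().then((delays) => console.log(delays.length, "waits recorded"));
```

The assertion of interest is that exactly one recorded delay equals the cooldown and that start() eventually succeeds afterwards, which is the post-cooldown retry described above.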

Environment

Gateway version: v2026.5.16-beta.4
Reproduced on macOS, single-host launchd-managed gateway.
