openclaw - ✅(Solved) Fix [Bug]: Discord WebSocket stale-socket causes full gateway crash (no reconnect, uncaught exception loop) [1 pull request, 1 comment, 2 participants]

openclaw/openclaw#56274 (fetched 2026-04-08 01:42:52)
Comments: 1 · Participants: 2 · Timeline events: 6 · Reactions: 0
Timeline (top): labeled ×2, commented ×1, cross-referenced ×1, mentioned ×1

The Discord WebSocket connection periodically drops with close code 1005 (reserved by RFC 6455 to mean that no status code was received). When the health-monitor detects the stale socket (~5 min check interval, ~30 min threshold), it attempts to restart the Discord channel. However, SafeGatewayPlugin.handleReconnectionAttempt throws "Max reconnect attempts (0) reached", and because nothing catches that error it crashes the entire gateway process.

systemd restarts the gateway and Discord reconnects, but the cycle repeats every ~34 minutes, causing 12-21 full gateway crashes per day.
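
The ~34-minute cycle is consistent with the two health-monitor figures quoted above: a socket only counts as stale after the threshold has elapsed, and that is only noticed at the next periodic check. The following is a minimal sketch of that timing, assuming a simple fixed-interval poll. The function names are illustrative and are not taken from OpenClaw.

```javascript
// Hypothetical model of a periodic stale-socket check (not OpenClaw's actual code).
const CHECK_INTERVAL_MIN = 5;   // "~5 min check interval" from the report
const STALE_THRESHOLD_MIN = 30; // "~30 min threshold" from the report

// A socket counts as stale once it has been silent for at least the threshold.
function isStale(lastEventAtMin, nowMin, thresholdMin = STALE_THRESHOLD_MIN) {
  return nowMin - lastEventAtMin >= thresholdMin;
}

// Returns the time (in minutes) of the first periodic check that sees the
// socket as stale. firstCheckMin is when the first check after the last
// event happens to run.
function detectionTime(lastEventAtMin, firstCheckMin) {
  let t = firstCheckMin;
  while (!isStale(lastEventAtMin, t)) t += CHECK_INTERVAL_MIN;
  return t;
}

// Depending on how the checks line up with the last event, detection lands
// between 30 and just under 35 minutes after the socket went quiet.
console.log(detectionTime(0, 5)); // 30
console.log(detectionTime(0, 4)); // 34
console.log(detectionTime(0, 1)); // 31
```

Under these assumptions, a crash roughly every 34 minutes is what an idle connection plus an immediate post-restart reconnect would produce.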

Error Message

[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (provider-CAlWEl41.js:3307:9)
    ...

Root Cause

  1. Discord gateway periodically closes WebSocket connections (code 1005) — this is normal Discord behavior (maintenance, idle timeout, routing changes). Clients are expected to reconnect.

  2. The OpenClaw Discord plugin has maxReconnectAttempts = 0 (hardcoded or default), meaning it never attempts to reconnect.

  3. When the health-monitor triggers a channel restart, the reconnect logic throws instead of gracefully reconnecting.

  4. The exception is uncaught → full process crash.
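
Step 4 follows from how Node.js event emitters work: listeners are invoked synchronously, so an error thrown inside a `close` handler has no caller to catch it and surfaces as an uncaught exception. Below is a minimal sketch of one way to contain such a throw. `guardListener` is a hypothetical helper and `handleClose` is a stand-in that only mimics the error message from the stack trace; neither is OpenClaw code.

```javascript
// Hypothetical helper: wrap an event listener so that a thrown error is
// reported to a callback instead of escaping as an uncaught exception.
function guardListener(listener, onError) {
  return (...args) => {
    try {
      return listener(...args);
    } catch (error) {
      onError(error);
      return undefined;
    }
  };
}

// Stand-in for a close handler that gives up immediately, mirroring the
// message seen in the reported stack trace.
function handleClose(code) {
  throw new Error(`Max reconnect attempts (0) reached after code ${code}`);
}

const channelErrors = [];
const safeHandleClose = guardListener(handleClose, (error) => {
  // Record the failure against the channel instead of crashing the process.
  channelErrors.push(error.message);
});

// With a real socket this would be registered as: socket.on("close", safeHandleClose)
safeHandleClose(1005);
console.log(channelErrors);
```

This only covers synchronous throws. A listener that returns a rejected promise would need its own rejection handling.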

Fix / Workaround

Workaround: Disabling the Discord channel in openclaw.json eliminates the crash loop entirely, since the root cause is the Discord WebSocket lifecycle. All other channels (Telegram, WS) are stable on their own.

PR fix notes

PR #56568: fix(discord): harden reconnect shutdown and idle health

Description (problem / solution / changelog)

Summary

  • prevent stale-socket shutdown from racing an extra disconnect before waitForDiscordGatewayStop() owns teardown
  • mark lifecycle shutdown before abort-driven disconnects so expected reconnect-exhausted events stay informational
  • refresh lastEventAt from gateway metrics only while the socket is still connected
  • add regressions for queued shutdown errors, active abort disconnects, and idle metrics refresh

Why

This consolidates the Discord crash-resilience cluster at the root-cause layer. The crash path came from the reconnect controller calling disconnect() during abort while the lifecycle wait loop had not settled yet. That let a synchronous reconnect-exhausted event surface on the active lifecycle path. Healthy idle periods also looked stale because lastEventAt only moved on debug events.
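
The ordering idea in the second summary bullet (mark shutdown first, then disconnect) can be illustrated with a small sketch. This is not the PR's code: `createLifecycle`, `beginShutdown`, and `onReconnectExhausted` are hypothetical names chosen for illustration, and the real implementation may be structured quite differently.

```javascript
// Hypothetical sketch of "mark shutdown first, then disconnect".
function createLifecycle(log) {
  let shuttingDown = false;
  return {
    // Called by the owner of teardown before it triggers any disconnect.
    beginShutdown() {
      shuttingDown = true;
    },
    // Called when the gateway reports that it will not reconnect again.
    onReconnectExhausted(error) {
      if (shuttingDown) {
        // Expected during an intentional stop: record it and move on.
        log.push({ level: "info", message: error.message });
        return;
      }
      // Unexpected while the channel should be running: surface as a channel error.
      log.push({ level: "error", message: error.message });
    },
  };
}

const log = [];
const lifecycle = createLifecycle(log);
const exhausted = new Error("Max reconnect attempts (0) reached after code 1005");

lifecycle.onReconnectExhausted(exhausted); // before shutdown: treated as an error
lifecycle.beginShutdown();
lifecycle.onReconnectExhausted(exhausted); // during shutdown: informational only
console.log(log.map((entry) => entry.level)); // [ 'error', 'info' ]
```

The point of the sketch is that the same event is classified differently depending on whether the flag was set before the disconnect fired, which is why the order of the two calls matters.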

Validation

  • ./node_modules/.bin/vitest run --config vitest.config.ts --pool=threads --maxWorkers=1 extensions/discord/src/monitor/provider.lifecycle.test.ts
  • ./node_modules/.bin/vitest run --config vitest.config.ts --pool=threads --maxWorkers=1 extensions/discord/src/monitor.gateway.test.ts
  • pre-commit pnpm check

Fixes #56339 Fixes #56399 Fixes #56274

Supersedes #56486 Supersedes #56493

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.reconnect.ts (modified, +15/-0)
  • extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +161/-2)
  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +6/-0)


Bug type

Crash (process/app exits or hangs)

Beta release blocker

No

Summary

Description

The Discord WebSocket connection periodically drops with close code 1005 (reserved by RFC 6455 to mean that no status code was received). When the health-monitor detects the stale socket (~5 min check interval, ~30 min threshold), it attempts to restart the Discord channel. However, SafeGatewayPlugin.handleReconnectionAttempt throws "Max reconnect attempts (0) reached", and because nothing catches that error it crashes the entire gateway process.

systemd restarts the gateway and Discord reconnects, but the cycle repeats every ~34 minutes, causing 12-21 full gateway crashes per day.

Impact

  • Every crash kills all channels (Telegram, Discord, WebSocket) — not just Discord
  • All active agent sessions are interrupted
  • Cron jobs may miss their windows
  • The gateway has been crash-looping for at least 3 days straight

Reproduction

Consistent and automatic — no user action needed. Happens during low-activity hours (overnight) when no Discord messages keep the connection alive.

Crash frequency (observed):

| Date       | Crashes |
|------------|---------|
| 2026-03-26 | 17      |
| 2026-03-27 | 21      |
| 2026-03-28 | 12 (as of 07:30 UTC, still ongoing) |

Crash cycle (2026-03-28 example):

00:48 → 01:22 → 01:57 → 02:31 → 03:05 → 03:40 → 04:14 → 04:49 → 05:48 → 06:22 → 06:57 → 07:31

Interval: ~34 minutes (consistent, apart from one 59-minute gap between 04:49 and 05:48).
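
The interval can be checked directly from the listed crash times. The short script below computes the gaps between consecutive crashes; the only input is the list of times from the report.

```javascript
// Crash times for 2026-03-28 as listed in the report (HH:MM).
const crashTimes = [
  "00:48", "01:22", "01:57", "02:31", "03:05", "03:40",
  "04:14", "04:49", "05:48", "06:22", "06:57", "07:31",
];

// Convert "HH:MM" to minutes since midnight.
const toMinutes = (hhmm) => {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
};

// Gap in minutes between each crash and the previous one.
const intervals = crashTimes
  .slice(1)
  .map((time, i) => toMinutes(time) - toMinutes(crashTimes[i]));
console.log(intervals);
// [34, 35, 34, 34, 35, 34, 35, 59, 34, 35, 34]

// Median gap (odd number of intervals, so the middle element is the median).
const sorted = [...intervals].sort((a, b) => a - b);
const median = sorted[Math.floor(sorted.length / 2)];
console.log(median); // 34
```

Ten of the eleven gaps are 34 or 35 minutes. The single 59-minute gap is between 04:49 and 05:48.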

Log sequence

1. Discord WS drops silently

[discord] gateway: WebSocket connection closed with code 1005

2. Health-monitor detects stale socket

[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)

3. Reconnect fails → uncaught exception → process exit

[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (provider-CAlWEl41.js:3307:9)
    ...

4. systemd restarts → cycle repeats

openclaw-gateway.service: Failed with result 'exit-code'.
openclaw-gateway.service: Scheduled restart job, restart counter is at 16.

Root cause analysis

  1. Discord gateway periodically closes WebSocket connections (code 1005) — this is normal Discord behavior (maintenance, idle timeout, routing changes). Clients are expected to reconnect.

  2. The OpenClaw Discord plugin has maxReconnectAttempts = 0 (hardcoded or default), meaning it never attempts to reconnect.

  3. When the health-monitor triggers a channel restart, the reconnect logic throws instead of gracefully reconnecting.

  4. The exception is uncaught → full process crash.

Expected behavior

  • Discord WS drops → plugin reconnects automatically (with exponential backoff)
  • If reconnect fails after N attempts → disable the Discord channel only instead of crashing the entire gateway
  • Health-monitor restart should not be able to cause an uncaught exception

Environment

  • OpenClaw: 2026.3.24 (cff6dc9)
  • Node.js: v22.22.1
  • OS: Linux 6.17.0-19-generic x86_64
  • Discord config: single guild, single bot, requireMention: false on one channel
  • Gateway: systemd user service, Restart=on-failure

Discord config (redacted)

{
  "discord": {
    "enabled": true,
    "groupPolicy": "allowlist",
    "streaming": "partial",
    "dmPolicy": "allowlist",
    "guilds": {
      "<guild-id>": {
        "requireMention": true,
        "channels": {
          "<channel-id>": {
            "allow": true,
            "requireMention": false
          }
        }
      }
    }
  }
}

No reconnect-related configuration options found in the Discord channel schema.

Suggested fix

  1. Set a reasonable default for maxReconnectAttempts (e.g., 5-10) with exponential backoff
  2. Catch the reconnect failure gracefully — mark channel as degraded, don't crash the process
  3. Optionally: expose discord.reconnectAttempts / discord.reconnectBackoff as user-configurable options
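
As a rough illustration of items 1 and 3, the sketch below resolves reconnect settings from a parsed config object with safe defaults. The keys `discord.reconnectAttempts` and `discord.reconnectBackoff` are the proposals from this list, not options that exist in the current schema. The `baseMs`/`maxMs` fields and the `resolveReconnectOptions` function are assumptions made for the example.

```javascript
// Hypothetical defaults; the report suggests 5-10 attempts with exponential backoff.
const DEFAULTS = {
  reconnectAttempts: 5,
  reconnectBackoff: { baseMs: 1000, maxMs: 30000 },
};

// Resolve reconnect settings from a parsed config object, falling back to
// defaults for anything missing or invalid. A value of 0 is treated as
// invalid so that "never reconnect" cannot be the effective setting.
function resolveReconnectOptions(config) {
  const discord = (config && config.discord) || {};
  const attempts =
    Number.isInteger(discord.reconnectAttempts) && discord.reconnectAttempts > 0
      ? discord.reconnectAttempts
      : DEFAULTS.reconnectAttempts;
  const backoff = { ...DEFAULTS.reconnectBackoff, ...(discord.reconnectBackoff || {}) };
  return { attempts, backoff };
}

console.log(resolveReconnectOptions({ discord: { enabled: true } }));
// { attempts: 5, backoff: { baseMs: 1000, maxMs: 30000 } }
console.log(
  resolveReconnectOptions({
    discord: { reconnectAttempts: 8, reconnectBackoff: { maxMs: 60000 } },
  })
);
// { attempts: 8, backoff: { baseMs: 1000, maxMs: 60000 } }
```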

Steps to reproduce

  1. Configure and enable a Discord channel
  2. Start the gateway
  3. Wait — no user action needed
  4. The Discord WebSocket silently drops (code 1005) after ~30 min of inactivity
  5. health-monitor detects stale-socket and attempts channel restart
  6. SafeGatewayPlugin.handleReconnectionAttempt throws (maxReconnectAttempts=0)
  7. Uncaught exception → full gateway process crash

Expected behavior

  • Discord WS drop → automatic reconnect with exponential backoff
  • If reconnect fails after N attempts → gracefully disable the Discord channel only, gateway continues running
  • health-monitor restart should never cause an uncaught exception
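
The second expectation (disable only the failing channel) amounts to isolating each channel's failure. The following is a minimal sketch of that idea, assuming a hypothetical `startChannels` supervisor and stand-in channel functions. It is not how the OpenClaw gateway is actually structured.

```javascript
// Hypothetical supervisor: each channel's failure is recorded against that
// channel only, so one failing channel cannot take the others down.
async function startChannels(channels) {
  const status = {};
  for (const [name, start] of Object.entries(channels)) {
    try {
      await start();
      status[name] = "running";
    } catch (error) {
      status[name] = `disabled: ${error.message}`;
    }
  }
  return status;
}

// Stand-in channels: Telegram starts normally, Discord gives up reconnecting.
const channels = {
  telegram: async () => {},
  discord: async () => {
    throw new Error("reconnect failed after 5 attempts");
  },
};

startChannels(channels).then((status) => console.log(status));
// { telegram: 'running', discord: 'disabled: reconnect failed after 5 attempts' }
```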

Actual behavior

  • Discord WS drops silently (close code 1005, i.e. no status code received)
  • health-monitor detects stale-socket after ~30 min
  • SafeGatewayPlugin.handleReconnectionAttempt throws: "Max reconnect attempts (0) reached after code 1005"
  • Uncaught exception → entire gateway process crashes (all channels: Telegram, Discord, WS)
  • systemd restarts gateway → cycle repeats every ~34 minutes
  • 12-21 full gateway crashes per day observed over 3+ days

OpenClaw version

OpenClaw 2026.3.24 (cff6dc9)

Operating system

openSUSE Leap 16

Install method

No response

Model

google/gemini-2.5-flash

Provider / routing chain

Google API

Additional provider/model setup details

Single discord bot. One guild, one channel (requireMention: false), DM policy: allowlist. No Discord-specific reconnect or backoff configuration — none available in schema. Gateway runs as systemd user service with Restart=on-failure. The crash affects the entire gateway process, not just the Discord-bound agent — all 7 agents across Telegram + Discord go down on each crash.

Logs, screenshots, and evidence

### Crash sequence (from journalctl + gateway log file)

**1. Discord WS drops:**

[discord] gateway: Attempting resume with backoff: 1000ms
[discord] gateway: WebSocket connection closed with code 1005


**2. ~30 min later, health-monitor detects stale socket:**

[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)


**3. Reconnect fails → uncaught exception:**

[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (provider-CAlWEl41.js:3307:9)
    at WebSocket.emit (node:events:519:28)
    at WebSocket.emitClose (ws/lib/websocket.js:273:10)
    at TLSSocket.socketOnClose (ws/lib/websocket.js:1346:15)


**4. systemd restart cycle:**

openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
openclaw-gateway.service: Failed with result 'exit-code'.
openclaw-gateway.service: Scheduled restart job, restart counter is at 16.


### Crash frequency (3 days observed)

| Date       | Crashes | Source            |
|------------|---------|-------------------|
| 2026-03-26 | 17      | gateway log file  |
| 2026-03-27 | 21      | gateway log file  |
| 2026-03-28 | 12      | (as of 07:30 UTC, ongoing) |

### Crash times on 2026-03-28 (~34 min interval)


00:48 → 01:22 → 01:57 → 02:31 → 03:05 → 03:40 → 04:14 → 04:49 → 05:48 → 06:22 → 06:57 → 07:31

Impact and severity

Affected: All channels — the entire gateway process crashes, taking down Telegram (7 bots), Discord, and WebSocket connections. Every agent is affected, not just the Discord-bound one.

Severity: Blocks workflow. Gateway is completely unavailable for ~5-10 seconds per crash cycle (systemd restart time). During overnight hours, this repeats every ~34 minutes indefinitely.

Frequency: Always. 100% reproducible with zero user interaction. 12-21 crashes per day observed over 3 consecutive days. Occurs whenever the Discord WebSocket connection drops (code 1005), which happens regularly during low-activity periods.

Consequence:

  • Missed inbound messages during restart windows (Telegram + Discord)
  • Cron jobs may miss their execution windows or run against a freshly restarted gateway with no warm state
  • systemd restart counter reached 16+ in a single day
  • No data loss observed, but session continuity is disrupted

Additional information

Discord was enabled recently; no prior version tested with Discord active to compare against.

Workaround: Disabling the Discord channel in openclaw.json eliminates the crash loop entirely, since the root cause is the Discord WebSocket lifecycle. All other channels (Telegram, WS) are stable on their own.

The maxReconnectAttempts=0 value appears to be a hardcoded default in SafeGatewayPlugin, not user-configurable. No Discord-specific reconnect options exist in the openclaw.json schema.

extent analysis

Fix Plan

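One way to address the issue is to have SafeGatewayPlugin retry with exponential backoff instead of throwing. Note that PR #56568, described in the PR fix notes, took a different route: it fixes the shutdown ordering in the Discord lifecycle code rather than changing the attempt limit. The steps for the retry approach are: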

  • Set a reasonable default for maxReconnectAttempts (e.g., 5-10)
  • Implement exponential backoff for reconnect attempts
  • Catch the reconnect failure gracefully and mark the channel as degraded instead of crashing the process

Example code changes:

// Illustrative sketch: reconnectDiscordChannel() is a placeholder, not a confirmed OpenClaw API.

// Default number of reconnect attempts
const maxReconnectAttempts = 5;

// Exponential backoff: 1 s, 2 s, 4 s, ... capped at 30 s
const backoffDelay = (attempt) => {
  return Math.min(30000, 1000 * Math.pow(2, attempt));
};

// Handle reconnect attempts with exponential backoff; resolves to true on success
async function handleReconnectionAttempt(attempt = 0) {
  if (attempt >= maxReconnectAttempts) {
    // Mark channel as degraded and log the error instead of throwing
    console.error("Max reconnect attempts reached for Discord channel");
    return false;
  }

  try {
    // Attempt to reconnect
    await reconnectDiscordChannel();
    return true;
  } catch (error) {
    // Wait for the backoff delay before the next attempt
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    // Return the recursive call so callers can await the final outcome
    return handleReconnectionAttempt(attempt + 1);
  }
}

Verification

To verify the fix, you can:

  • Enable the Discord channel and monitor the gateway logs for reconnect attempts
  • Simulate a WebSocket connection drop (code 1005) and verify that the gateway attempts to reconnect with exponential backoff
  • Check that the channel is marked as degraded after max reconnect attempts are reached, and the gateway process does not crash
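
These checks can be approximated without a real Discord connection by injecting the connect and sleep functions. The sketch below is a hypothetical, self-contained retry loop (`reconnectWithBackoff` is not an OpenClaw function). It simulates a connection that always drops and records the delays instead of waiting for them.

```javascript
// Hypothetical reconnect loop with injectable dependencies, so it can be
// exercised without real sockets or real delays.
async function reconnectWithBackoff({ connect, sleep, maxAttempts = 5, baseMs = 1000, maxMs = 30000 }) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect();
      return { ok: true, attempts: attempt + 1 };
    } catch {
      // Only wait if another attempt will follow.
      if (attempt < maxAttempts - 1) {
        await sleep(Math.min(maxMs, baseMs * 2 ** attempt));
      }
    }
  }
  // Attempts exhausted: report a degraded channel, but let no exception escape.
  return { ok: false, attempts: maxAttempts };
}

async function demo() {
  const delays = [];
  const result = await reconnectWithBackoff({
    // Simulated drop: every connection attempt fails.
    connect: async () => {
      throw new Error("closed with code 1005");
    },
    // Record the requested delay instead of actually waiting.
    sleep: async (ms) => {
      delays.push(ms);
    },
  });
  console.log(result); // { ok: false, attempts: 5 }
  console.log(delays); // [ 1000, 2000, 4000, 8000 ]
  return { result, delays };
}

const demoDone = demo();
```

The recorded delays show the doubling schedule, and the returned `ok: false` stands in for the "degraded" state that the gateway would record instead of crashing.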

Extra Tips

  • Consider exposing discord.reconnectAttempts and discord.reconnectBackoff as user-configurable options in the openclaw.json schema
  • Monitor the gateway logs and adjust the maxReconnectAttempts value as needed to balance reconnect attempts with channel availability
  • Test the fix thoroughly to ensure that it resolves the issue and does not introduce any new problems.
