openclaw - ✅(Solved) Fix Discord WebSocket close (1005) crashes entire gateway — maxReconnectAttempts=0 [2 pull requests, 1 comments, 2 participants]

snowycrabai · 2026-03-26T22:47:23Z

# PR #55991: fix(discord): stop queued reconnect exhaustion crash

- Repository: openclaw/openclaw
- Author: vincentkoc
- State: closed | merged: True
- Link: https://github.com/openclaw/openclaw/pull/55991

## Description (problem / solution / changelog)

## Summary

- Problem: Discord stale-socket restarts could still crash the whole gateway when `reconnect-exhausted` was buffered before lifecycle teardown flipped `lifecycleStopping`.
- Why it matters: a Discord WebSocket close with code 1005 could kill the full gateway process instead of letting the channel health monitor restart Discord cleanly.
- What changed: `drainPendingGatewayErrors()` now treats queued `reconnect-exhausted` the same way as other expected shutdown events, and the lifecycle regression test now locks in graceful completion instead of a thrown crash.
- What did NOT change (scope boundary): this does not change Carbon reconnect policy, reconnect backoff, or broader Discord supervision behavior outside the queued-before-teardown race.

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor required for the fix
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [x] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #55403
- Related #55421
- Related #55443
- [x] This PR fixes a bug or regression

## Root Cause / Regression History (if applicable)

- Root cause: `provider.lifecycle.ts` only suppressed `reconnect-exhausted` when `lifecycleStopping` was already true, but the supervisor can drain a buffered `reconnect-exhausted` event before teardown flips that flag.
- Missing detection / guardrail: the regression test encoded the crash path as expected behavior for the queued-before-shutdown window.
- Prior context (`git blame`, prior PR, issue, or refactor if known): earlier fixes in #55324 and #55373 covered adjacent shutdown/teardown races, but not the buffered pre-teardown drain path.
- Why this regressed now: Discord gateway lifecycle handling was split/refined across recent cleanup work, leaving one buffered-event branch still treating the intentional shutdown signal as fatal.
- If unknown, what was ruled out: N/A

## Regression Test Plan (if applicable)

- Coverage level that should have caught this:
  - [x] Unit test
  - [ ] Seam / integration test
  - [ ] End-to-end test
  - [ ] Existing coverage already sufficient
- Target test or file: `extensions/discord/src/monitor/provider.lifecycle.test.ts`
- Scenario the test should lock in: a queued `reconnect-exhausted` event plus abort should resolve the lifecycle cleanly instead of throwing.
- Why this is the smallest reliable guardrail: it hits the exact `drainPendingGatewayErrors()` branch that was still crashing.
- Existing test that already covers this (if any): the same test existed but asserted the buggy rejection behavior.
- If no new test is added, why not: N/A

## User-visible / Behavior Changes

Discord stale-socket restarts no longer crash the full gateway when the reconnect-exhausted event was buffered just before teardown.

## Security Impact (required)

- New permissions/capabilities? (`Yes/No`) No
- Secrets/tokens handling changed? (`Yes/No`) No
- New/changed network calls? (`Yes/No`) No
- Command/tool execution surface changed? (`Yes/No`) No
- Data access scope changed? (`Yes/No`) No
- If any `Yes`, explain risk + mitigation:

## Repro + Verification

### Environment

- OS: macOS
- Runtime/container: Node 25.5.0 + pnpm 10.32.1
- Model/provider: N/A
- Integration/channel (if any): Discord
- Relevant config (redacted): N/A

### Steps

1. Start the Discord gateway lifecycle.
2. Queue a `reconnect-exhausted` gateway event before teardown flips `lifecycleStopping`.
3. Abort the lifecycle.

### Expected

- Lifecycle resolves cleanly and the health monitor owns recovery.

### Actual

- Before this patch, the buffered event threw `Max reconnect attempts (0) reached after code 1005` and crashed the process.

## Evidence

Attach at least one:

- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

## Human Verification (required)

- Verified scenarios: ran `pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts extensions/discord/src/monitor/gateway-supervisor.test.ts` after the patch.
- Edge cases checked: kept the existing supervisor lane green; preserved current logging behavior for the queued event while removing the crash.
- What you did **not** verify: a live Discord bot against production Discord WS.

## Review Conversations

- [x] I replied to or resolved every bot review conversation I addressed in this PR.
- [x] I left unresolved only the conversations that
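
The race described in the PR can be pictured with a small model of the drain step. This is a hypothetical sketch, not the OpenClaw source: only the `reconnect-exhausted` event name and the `drainPendingGatewayErrors` function name come from the PR text, while the event shape, the `socket-closed` placeholder, and the return value are assumptions made for illustration.

```javascript
// Hypothetical model of the buffered-event drain; not the OpenClaw source.
// Only `reconnect-exhausted` and the function name come from the PR text.
const EXPECTED_SHUTDOWN_EVENTS = new Set([
  'reconnect-exhausted', // tolerated even when queued before teardown starts
  'socket-closed', // placeholder for "other expected shutdown events"
]);

// Drains queued gateway events. Expected shutdown events are logged and
// skipped; anything else is returned so the caller can treat it as fatal.
function drainPendingGatewayErrors(pending, log = console.warn) {
  const fatal = [];
  while (pending.length > 0) {
    const event = pending.shift();
    if (EXPECTED_SHUTDOWN_EVENTS.has(event.type)) {
      log(`discord: ignoring queued ${event.type} during restart`);
      continue;
    }
    fatal.push(event);
  }
  return fatal;
}

// A `reconnect-exhausted` event buffered before the abort is not fatal here.
const queue = [
  { type: 'reconnect-exhausted', message: 'Max reconnect attempts (0) reached after code 1005' },
];
const fatalEvents = drainPendingGatewayErrors(queue, () => {});
console.log(fatalEvents.length); // prints 0
```

The point of the sketch is that the decision depends on the event type alone, not on whether a stopping flag has already been flipped, which is the window the PR describes.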

GitHub stats

openclaw/openclaw#55403•Fetched 2026-04-08 01:39:58

Timeline (top)

cross-referenced ×3, assigned ×1, closed ×1, commented ×1

Error Message

[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (file:///…/dist/provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (file:///…/dist/provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (file:///…/dist/provider-CAlWEl41.js:3307:9)
    at WebSocket.emit (node:events:508:20)
    at WebSocket.emitClose (…/ws/lib/websocket.js:273:10)
    at TLSSocket.socketOnClose (…/ws/lib/websocket.js:1346:15)

Bug Summary

Discord periodically closes WebSocket connections with code 1005 (normal behavior — load balancing, session rotation). When this happens, SafeGatewayPlugin.handleReconnectionAttempt sees maxReconnectAttempts = 0 and throws an error instead of reconnecting. This bubbles up as an uncaught exception and kills the entire gateway process (all channels — Discord, Telegram, webchat).
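
To illustrate why a throw inside a close handler is so damaging, here is a minimal sketch using a plain Node `EventEmitter` as a stand-in for the WebSocket. The handler body is an assumption modeled on the error message in the crash log; it is not the actual `SafeGatewayPlugin` code.

```javascript
const { EventEmitter } = require('node:events');

// Stand-in for the WebSocket; the real socket comes from the `ws` package.
const socket = new EventEmitter();
const maxReconnectAttempts = 0;

// Mirrors the failing shape: the close handler throws when no attempts remain.
socket.on('close', (code) => {
  if (maxReconnectAttempts === 0) {
    throw new Error(`Max reconnect attempts (${maxReconnectAttempts}) reached after code ${code}`);
  }
});

// emit() runs listeners synchronously, so the throw escapes through emit().
// In the real gateway the emit happens inside a socket callback with no
// caller above it to catch the error, so Node reports an uncaught exception.
let escaped = null;
try {
  socket.emit('close', 1005);
} catch (error) {
  escaped = error;
}
console.log(escaped.message); // prints "Max reconnect attempts (0) reached after code 1005"
```

Here the `try`/`catch` around `emit()` exists only to make the propagation visible; the stack trace in this issue shows the same path ending at `TLSSocket.socketOnClose` with nothing to catch it.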

Reproduction

Reproduces consistently: Discord drops the WebSocket every few hours, and the gateway crashes every time.

Crash Log

[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (file:///…/dist/provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (file:///…/dist/provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (file:///…/dist/provider-CAlWEl41.js:3307:9)
    at WebSocket.emit (node:events:508:20)
    at WebSocket.emitClose (…/ws/lib/websocket.js:273:10)
    at TLSSocket.socketOnClose (…/ws/lib/websocket.js:1346:15)

Timeline (48h window on a single machine)

Time	Event
Mar 25, 10:54 AM	health-monitor detects `stale-socket` → restart → 💥 crash
Mar 25, 11:10 PM	LaunchAgent restarts gateway
Mar 26, 12:10 AM	`stale-socket` → restart → 💥 crash
Mar 26, 10:23 AM	LaunchAgent restarts
Mar 26, 11:01 AM	`WebSocket closed code 1005` logged
Mar 26, 11:58 AM	`stale-socket` → 💥 crash
Mar 26, 12:46 PM	💥 crash (same pattern)
Mar 26, 3:46 PM	LaunchAgent restarts

The health-monitor correctly detects stale-socket but the restart path triggers the same fatal error. Each crash takes out ALL channels for hours until LaunchAgent retries.
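
As a rough worked example of "for hours", the downtime can be estimated from the timeline rows. This sketch assumes the gateway stayed down from each crash until the next logged LaunchAgent restart, takes the year 2026 from the issue date, and leaves out the 11:58 AM crash because no restart is logged between it and the 12:46 PM crash.

```javascript
// Crash/restart pairs copied from the timeline. Times are treated as one
// local clock and built as UTC only to keep the arithmetic timezone-free.
const at = (day, hour, minute) => Date.UTC(2026, 2, day, hour, minute); // month 2 = March

const outages = [
  { crash: at(25, 10, 54), restart: at(25, 23, 10) },
  { crash: at(26, 0, 10), restart: at(26, 10, 23) },
  { crash: at(26, 12, 46), restart: at(26, 15, 46) },
];

// Minutes between each crash and the following restart.
const minutesDown = outages.map(({ crash, restart }) => (restart - crash) / 60000);
const totalHours = minutesDown.reduce((sum, m) => sum + m, 0) / 60;

console.log(minutesDown); // [ 736, 613, 180 ]
console.log(totalHours.toFixed(1)); // prints 25.5
```

Under those assumptions the three outages last about 12.3, 10.2, and 3 hours.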

Also related: the `[discord] gateway: WebSocket connection closed with code 1005` log line at 11:01 AM suggests Discord is the initiator.

Expected Behavior

- Discord WS close code 1005 should trigger a reconnect with exponential backoff (resume if possible).
- maxReconnectAttempts should not be 0 by default.
- Even if reconnection ultimately fails, it should not crash the entire gateway; the failure should be isolated to the Discord provider.
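
The isolation requested above can be sketched with `Promise.allSettled`, which records each provider's outcome instead of letting one rejection propagate. The `runProviders` helper and the provider functions are hypothetical; OpenClaw's real supervisor is not shown in this issue.

```javascript
// Hypothetical supervisor: each provider runs in its own promise, so a
// rejection is recorded per provider instead of escaping to the process.
async function runProviders(providers) {
  const names = Object.keys(providers);
  // The async wrapper also turns synchronous throws into rejections.
  const results = await Promise.allSettled(names.map(async (name) => providers[name]()));
  return Object.fromEntries(
    results.map((result, i) => [
      names[i],
      result.status === 'fulfilled' ? 'ok' : `failed: ${result.reason?.message ?? String(result.reason)}`,
    ]),
  );
}

// Discord fails, but Telegram and webchat still report "ok".
runProviders({
  discord: async () => {
    throw new Error('Max reconnect attempts (0) reached after code 1005');
  },
  telegram: async () => {},
  webchat: async () => {},
}).then((status) => console.log(status));
```

A real supervisor would restart the failed provider rather than just report it; the sketch only shows that the other channels are unaffected.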

Environment

OpenClaw 2026.3.24 (cff6dc9)
macOS arm64, Node v25.5.0
Two Discord bot accounts (default + kiwi), both affected
LaunchAgent managed

Workaround

None currently. The gateway crashes and stays down until LaunchAgent restarts it (which can take hours).

extent analysis

Fix Plan

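This plan is a generic suggestion: change SafeGatewayPlugin.handleReconnectionAttempt so that maxReconnectAttempts = 0 no longer throws, and reconnect with exponential backoff. The merged fix, PR #55991, took a narrower route: it made drainPendingGatewayErrors() tolerate a queued reconnect-exhausted event and left reconnect policy and backoff unchanged.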

Step-by-Step Solution

1. Set a non-zero default for maxReconnectAttempts (e.g., 5).
2. Reconnect with exponential backoff, for example with a library such as backoff or a small timer loop.
3. Catch and handle the error in SafeGatewayPlugin.handleReconnectionAttempt so it cannot bubble up and crash the gateway process.

Example Code

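// Sketch only: SafeGatewayPlugin internals are not shown in the issue, so the
// reconnect step is passed in as a callback instead of patching the class.
// A plain timer loop is used here in place of the backoff library.

const DEFAULT_MAX_RECONNECT_ATTEMPTS = 5; // non-zero default
const INITIAL_DELAY_MS = 1000; // 1 second
const MAX_DELAY_MS = 30000; // 30 seconds

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Tries connect() up to maxAttempts times, doubling the delay after each
// failure. Resolves to true on success and false on exhaustion. It never
// throws, so a failed reconnect cannot take down the gateway process.
async function reconnectWithBackoff(connect, maxAttempts = DEFAULT_MAX_RECONNECT_ATTEMPTS) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await connect(); // Reconnect logic here
      return true;
    } catch (error) {
      console.error(`Reconnect attempt ${attempt + 1}/${maxAttempts} failed:`, error);
      if (attempt + 1 < maxAttempts) {
        await sleep(Math.min(INITIAL_DELAY_MS * 2 ** attempt, MAX_DELAY_MS));
      }
    }
  }
  // Also reached when maxAttempts is 0: report and return instead of throwing.
  console.error('Max reconnect attempts reached. Giving up.');
  return false;
}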

Verification

To verify the fix, monitor the gateway logs for reconnect attempts and ensure that the process does not crash when Discord closes the WebSocket connection with code 1005.
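
One way to make that monitoring concrete is to count the two log signatures quoted in this issue. The sketch below is an assumption about how one might scan a log dump; the `summarizeGatewayLog` helper is hypothetical, and only the two matched substrings are taken from the log lines above.

```javascript
// Counts the two log signatures quoted in this issue within a block of log text.
function summarizeGatewayLog(logText) {
  const lines = logText.split('\n');
  const count = (needle) => lines.filter((line) => line.includes(needle)).length;
  return {
    closes1005: count('WebSocket connection closed with code 1005'),
    crashes: count('Uncaught exception: Error: Max reconnect attempts'),
  };
}

// Sample built from the two lines quoted in the issue.
const sample = [
  '[discord] gateway: WebSocket connection closed with code 1005',
  '[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005',
].join('\n');

console.log(summarizeGatewayLog(sample)); // { closes1005: 1, crashes: 1 }
```

After the fix, `closes1005` may keep growing while `crashes` should stay at 0.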

Extra Tips

Consider adding a circuit breaker to prevent cascading failures in case of repeated reconnect failures.
Review the health-monitor and LaunchAgent configurations to ensure they are not contributing to the issue.
Test the fix thoroughly to ensure it works as expected in different scenarios.
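
The circuit-breaker tip can be sketched as follows. This is a minimal, hypothetical illustration: the class, its option names, and the injected clock are assumptions for the example and do not correspond to anything in OpenClaw.

```javascript
// Minimal circuit breaker: after `threshold` consecutive failures it opens and
// blocks further attempts until `cooldownMs` has elapsed.
class CircuitBreaker {
  constructor({ threshold = 3, cooldownMs = 60000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, which makes the breaker easy to test
    this.failures = 0;
    this.openedAt = null;
  }

  // True when a reconnect attempt may proceed.
  canAttempt() {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown over: allow one trial attempt (half-open state).
      this.openedAt = null;
      this.failures = this.threshold - 1;
      return true;
    }
    return false;
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}

// Usage with a fake clock: two failures open the breaker for one second.
let clock = 0;
const breaker = new CircuitBreaker({ threshold: 2, cooldownMs: 1000, now: () => clock });
breaker.recordFailure();
breaker.recordFailure();
console.log(breaker.canAttempt()); // false: breaker is open
clock = 1000;
console.log(breaker.canAttempt()); // true: cooldown elapsed, one trial allowed
```

A single failure during the trial attempt reopens the breaker immediately, which keeps a flapping connection from generating a burst of reconnects.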

PR fix notes

PR #55991: fix(discord): stop queued reconnect exhaustion crash

PR #55000: fix(discord): prevent gateway crash on abort by fixing supervisor teardown ordering
