openclaw - ✅ (Solved) Fix: Discord WebSocket close (1005) crashes entire gateway, maxReconnectAttempts=0 [2 pull requests, 1 comment, 2 participants]

GitHub stats

openclaw/openclaw#55403 (fetched 2026-04-08 01:39:58)

  • Comments: 1
  • Participants: 2
  • Timeline: 7
  • Reactions: 1
  • Timeline (top): cross-referenced ×3, assigned ×1, closed ×1, commented ×1

Error Message

[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (file:///…/dist/provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (file:///…/dist/provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (file:///…/dist/provider-CAlWEl41.js:3307:9)
    at WebSocket.emit (node:events:508:20)
    at WebSocket.emitClose (…/ws/lib/websocket.js:273:10)
    at TLSSocket.socketOnClose (…/ws/lib/websocket.js:1346:15)

Fix Action

Workaround

None currently. The gateway crashes and stays down until LaunchAgent restarts it (which can take hours).

PR fix notes

PR #55991: fix(discord): stop queued reconnect exhaustion crash

Description (problem / solution / changelog)

Summary

  • Problem: Discord stale-socket restarts could still crash the whole gateway when reconnect-exhausted was buffered before lifecycle teardown flipped lifecycleStopping.
  • Why it matters: a Discord WebSocket close with code 1005 could kill the full gateway process instead of letting the channel health monitor restart Discord cleanly.
  • What changed: drainPendingGatewayErrors() now treats queued reconnect-exhausted the same way as other expected shutdown events, and the lifecycle regression test now locks in graceful completion instead of a thrown crash.
  • What did NOT change (scope boundary): this does not change Carbon reconnect policy, reconnect backoff, or broader Discord supervision behavior outside the queued-before-teardown race.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #55403
  • Related #55421
  • Related #55443
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

  • Root cause: provider.lifecycle.ts only suppressed reconnect-exhausted when lifecycleStopping was already true, but the supervisor can drain a buffered reconnect-exhausted event before teardown flips that flag.
  • Missing detection / guardrail: the regression test encoded the crash path as expected behavior for the queued-before-shutdown window.
  • Prior context (git blame, prior PR, issue, or refactor if known): earlier fixes in #55324 and #55373 covered adjacent shutdown/teardown races, but not the buffered pre-teardown drain path.
  • Why this regressed now: Discord gateway lifecycle handling was split/refined across recent cleanup work, leaving one buffered-event branch still treating the intentional shutdown signal as fatal.
  • If unknown, what was ruled out: N/A
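
The race described above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not the code from provider.lifecycle.ts: the event shape, the LifecycleState type, the patched flag, and simulateRace are hypothetical names introduced here. Only drainPendingGatewayErrors, lifecycleStopping, and reconnect-exhausted come from the PR text.

```typescript
// Hypothetical shapes; the real types in provider.lifecycle.ts may differ.
type GatewayEvent = { type: "reconnect-exhausted" | "fatal"; message: string };
type DrainOutcome = "continue" | "stop";

interface LifecycleState {
  lifecycleStopping: boolean;
  pending: GatewayEvent[]; // events buffered by the supervisor
}

// Drain all buffered events. `patched` switches between the behavior the PR
// describes before the fix (false) and after the fix (true).
function drainPendingGatewayErrors(state: LifecycleState, patched: boolean): DrainOutcome {
  let sawShutdown = false;
  for (const event of state.pending.splice(0)) {
    const expectedShutdown =
      event.type === "reconnect-exhausted" && (patched || state.lifecycleStopping);
    if (!expectedShutdown) {
      throw new Error(event.message); // treated as fatal
    }
    sawShutdown = true; // graceful stop: the health monitor owns recovery
  }
  return sawShutdown ? "stop" : "continue";
}

// The race window: the event is buffered and drained while
// lifecycleStopping is still false.
function simulateRace(patched: boolean): DrainOutcome | "crashed" {
  const state: LifecycleState = {
    lifecycleStopping: false,
    pending: [
      {
        type: "reconnect-exhausted",
        message: "Max reconnect attempts (0) reached after code 1005",
      },
    ],
  };
  try {
    return drainPendingGatewayErrors(state, patched);
  } catch {
    return "crashed"; // stands in for the uncaught exception
  }
}
```

In this sketch, simulateRace(false) takes the throwing branch and simulateRace(true) returns "stop", which mirrors the behavior change the regression test is meant to lock in.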

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/discord/src/monitor/provider.lifecycle.test.ts
  • Scenario the test should lock in: a queued reconnect-exhausted event plus abort should resolve the lifecycle cleanly instead of throwing.
  • Why this is the smallest reliable guardrail: it hits the exact drainPendingGatewayErrors() branch that was still crashing.
  • Existing test that already covers this (if any): the same test existed but asserted the buggy rejection behavior.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Discord stale-socket restarts no longer crash the full gateway when the reconnect-exhausted event was buffered just before teardown.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 25.5.0 + pnpm 10.32.1
  • Model/provider: N/A
  • Integration/channel (if any): Discord
  • Relevant config (redacted): N/A

Steps

  1. Start the Discord gateway lifecycle.
  2. Queue a reconnect-exhausted gateway event before teardown flips lifecycleStopping.
  3. Abort the lifecycle.

Expected

  • Lifecycle resolves cleanly and the health monitor owns recovery.

Actual

  • Before this patch, the buffered event threw Max reconnect attempts (0) reached after code 1005 and crashed the process.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: ran pnpm test -- extensions/discord/src/monitor/provider.lifecycle.test.ts extensions/discord/src/monitor/gateway-supervisor.test.ts after the patch.
  • Edge cases checked: kept the existing supervisor lane green; preserved current logging behavior for the queued event while removing the crash.
  • What you did not verify: a live Discord bot against production Discord WS.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this commit.
  • Files/config to restore: extensions/discord/src/monitor/provider.lifecycle.ts
  • Known bad symptoms reviewers should watch for: genuine reconnect exhaustion being swallowed unexpectedly instead of surfacing through the lifecycle.

Risks and Mitigations

  • Risk: treating every queued reconnect-exhausted as a graceful stop could hide a real fatal queued reconnect exhaustion in this narrow pre-teardown path.
  • Mitigation: the change is limited to the buffered drain branch; active lifecycle and supervisor handling remain unchanged, and the health monitor already owns reconnect recovery for this path.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +4/-6)
  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +4/-7)

PR #55000: fix(discord): prevent gateway crash on abort by fixing supervisor teardown ordering

Description (problem / solution / changelog)

Summary

A transient Discord WebSocket disconnection (e.g. close code 1005) or an intentional health-monitor restart can trigger an uncaught Max reconnect attempts (0) exception that crashes the entire gateway process. This kills all channels (Feishu, WhatsApp, Telegram, etc.), not just Discord.

This PR fixes the root cause of these gateway crashes by correcting the supervisor phase transition ordering during an abort.

Fixes #54931
Fixes #54894

Root Cause

The gateway crash happens due to an error routing mismatch during teardown:

  1. onAbort() sets maxAttempts to 0 and calls gateway.disconnect().
  2. disconnect() synchronously triggers @buape/carbon's handleClose → handleReconnectionAttempt, which emits a Max reconnect attempts (0) error on the gateway emitter.
  3. The gateway-supervisor is still in the active phase at this point, so it routes the error to the lifecycle handler instead of suppressing it.
  4. The error surfaces as an uncaught exception, causing the entire gateway process to exit.

The supervisor already has the correct teardown suppression logic (it logs and swallows late errors during the teardown phase). The bug is simply that onAbort() never transitions the supervisor to the teardown phase before disconnecting.

Changes

1. Fix teardown ordering in onAbort()

Call params.gatewaySupervisor.detachLifecycle() before gateway.disconnect(). This ensures the supervisor enters the teardown phase and correctly suppresses the synchronous error emitted during disconnect.
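
The ordering problem and its fix can be illustrated with a small simulation. This is a sketch, assuming simplified stand-ins: SupervisorSketch, GatewaySketch, runAbort, the detachFirst flag, and the default of 5 attempts are hypothetical and are not the real Carbon or OpenClaw classes. Only detachLifecycle(), disconnect(), onAbort(), and the active/teardown phases are taken from the description above.

```typescript
type Phase = "active" | "teardown";

// Hypothetical stand-in for the gateway supervisor.
class SupervisorSketch {
  phase: Phase = "active";
  suppressed: string[] = [];

  detachLifecycle(): void {
    this.phase = "teardown";
  }

  // Route an error emitted by the gateway.
  handleError(err: Error): void {
    if (this.phase === "teardown") {
      this.suppressed.push(err.message); // late error: log and swallow
      return;
    }
    throw err; // active phase: routed to the lifecycle handler as fatal
  }
}

// Hypothetical stand-in for the gateway: disconnect() reports the
// reconnect error synchronously, before it returns.
class GatewaySketch {
  maxAttempts = 5; // assumed default, for illustration only
  onError: (err: Error) => void;

  constructor(onError: (err: Error) => void) {
    this.onError = onError;
  }

  disconnect(): void {
    if (this.maxAttempts === 0) {
      this.onError(new Error("Max reconnect attempts (0) reached after code 1005"));
    }
  }
}

// Simulate onAbort() with either ordering.
function runAbort(detachFirst: boolean): { outcome: "clean" | "crashed"; suppressed: string[] } {
  const supervisor = new SupervisorSketch();
  const gateway = new GatewaySketch((err) => supervisor.handleError(err));
  gateway.maxAttempts = 0;
  if (detachFirst) {
    supervisor.detachLifecycle(); // fixed ordering: enter teardown first
  }
  try {
    gateway.disconnect();
    return { outcome: "clean", suppressed: supervisor.suppressed };
  } catch {
    return { outcome: "crashed", suppressed: supervisor.suppressed };
  }
}
```

With detachFirst set to false the error arrives while the supervisor is still active and escapes; with detachFirst set to true the same error is recorded as suppressed.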

2. Add lifecycleStopping safety net

In the catch block, when lifecycleStopping is already true, we no longer re-throw errors. This acts as a defense-in-depth guard for any edge case where an error might still escape during an intentional shutdown.
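
A minimal sketch of such a guard follows. The runGuarded wrapper, the LifecycleFlags type, and the log callback are hypothetical; the PR only states that errors are no longer re-thrown from the catch block once lifecycleStopping is true.

```typescript
type LifecycleFlags = { lifecycleStopping: boolean };

// Run `body`; swallow its error only when an intentional shutdown is
// already in progress, otherwise re-throw so real failures still surface.
function runGuarded(
  body: () => void,
  flags: LifecycleFlags,
  log: (line: string) => void,
): void {
  try {
    body();
  } catch (err) {
    if (flags.lifecycleStopping) {
      log(`suppressed during shutdown: ${String(err)}`);
      return;
    }
    throw err;
  }
}
```

Keeping the re-throw for the non-stopping case matters: the guard is meant as defense in depth for shutdown, not as a blanket error sink.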

Test Results

  • The extension-fast (extension-fast-discord, discord) CI check passes.
  • Verified locally that triggering an abort correctly suppresses the disconnect error and allows the gateway to restart gracefully without crashing the main process.

Changed files

  • extensions/discord/src/monitor/provider.lifecycle.test.ts (modified, +100/-4)
  • extensions/discord/src/monitor/provider.lifecycle.ts (modified, +25/-0)

Bug Summary

Discord periodically closes WebSocket connections with code 1005 (normal behavior — load balancing, session rotation). When this happens, SafeGatewayPlugin.handleReconnectionAttempt sees maxReconnectAttempts = 0 and throws an error instead of reconnecting. This bubbles up as an uncaught exception and kills the entire gateway process (all channels — Discord, Telegram, webchat).

Reproduction

Reproduces consistently: Discord drops the WS every few hours, and the gateway crashes every time.

Crash Log

[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (file:///…/dist/provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (file:///…/dist/provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (file:///…/dist/provider-CAlWEl41.js:3307:9)
    at WebSocket.emit (node:events:508:20)
    at WebSocket.emitClose (…/ws/lib/websocket.js:273:10)
    at TLSSocket.socketOnClose (…/ws/lib/websocket.js:1346:15)

Timeline (48h window on a single machine)

Time              | Event
Mar 25, 10:54 AM  | health-monitor detects stale-socket → restart → 💥 crash
Mar 25, 11:10 PM  | LaunchAgent restarts gateway
Mar 26, 12:10 AM  | stale-socket → restart → 💥 crash
Mar 26, 10:23 AM  | LaunchAgent restarts
Mar 26, 11:01 AM  | WebSocket closed code 1005 logged
Mar 26, 11:58 AM  | stale-socket → 💥 crash
Mar 26, 12:46 PM  | 💥 crash (same pattern)
Mar 26, 3:46 PM   | LaunchAgent restarts

The health-monitor correctly detects stale-socket but the restart path triggers the same fatal error. Each crash takes out ALL channels for hours until LaunchAgent retries.

Also related: the [discord] gateway: WebSocket connection closed with code 1005 log line at 11:01 AM suggests Discord is the initiator.

Expected Behavior

  • Discord WS close code 1005 should trigger a reconnect with exponential backoff (resume if possible)
  • maxReconnectAttempts should not be 0 by default
  • Even if reconnection ultimately fails, it should not crash the entire gateway — isolate the failure to the Discord provider
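
The third expectation, isolating a failure to one provider, might look like the sketch below. The Channel and ChannelStatus shapes and the runChannelsIsolated function are hypothetical and only illustrate the idea; they are not OpenClaw's actual supervisor API.

```typescript
// Hypothetical channel shape, for illustration only.
type Channel = { name: string; run: () => Promise<void> };
type ChannelStatus = { name: string; ok: boolean; error?: string };

// Run every channel and convert a rejection into a per-channel status, so
// one failing provider cannot take the other channels down with it.
async function runChannelsIsolated(channels: Channel[]): Promise<ChannelStatus[]> {
  return Promise.all(
    channels.map(async (channel): Promise<ChannelStatus> => {
      try {
        await channel.run();
        return { name: channel.name, ok: true };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { name: channel.name, ok: false, error: message };
      }
    }),
  );
}
```

A caller could then restart only the channels whose status has ok set to false, leaving the others running.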

Environment

  • OpenClaw 2026.3.24 (cff6dc9)
  • macOS arm64, Node v25.5.0
  • Two Discord bot accounts (default + kiwi), both affected
  • LaunchAgent managed

extent analysis

Fix Plan

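This plan takes a different route from the PRs above: it changes SafeGatewayPlugin.handleReconnectionAttempt so that maxReconnectAttempts = 0 no longer throws, and it reconnects with exponential backoff. Note that PR #55000 describes onAbort() setting maxAttempts to 0 deliberately during shutdown, so this plan treats the symptom rather than the teardown ordering those PRs correct.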

Step-by-Step Solution

  • Set a default value for maxReconnectAttempts (e.g., 5) to prevent it from being 0.
  • Implement exponential backoff for reconnecting, either with a small hand-rolled delay loop or with a library such as backoff.
  • Catch and handle the error in SafeGatewayPlugin.handleReconnectionAttempt to prevent it from bubbling up and crashing the gateway process.

Example Code

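// Sketch written without the backoff package. `connect` is a hypothetical
// stand-in for the provider's real reconnect logic.
const DEFAULT_MAX_RECONNECT_ATTEMPTS = 5;
const INITIAL_DELAY_MS = 1000; // 1 second
const MAX_DELAY_MS = 30000; // 30 seconds

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Resolves to true once connected and false when attempts run out, so a
// failed reconnect is reported instead of thrown as an uncaught exception.
async function handleReconnectionAttempt(connect, maxReconnectAttempts = DEFAULT_MAX_RECONNECT_ATTEMPTS) {
  if (maxReconnectAttempts <= 0) {
    // Handle the maxReconnectAttempts = 0 case: log and return, do not throw
    console.error('Reconnects are disabled (maxReconnectAttempts = 0). Giving up.');
    return false;
  }

  for (let attempt = 1; attempt <= maxReconnectAttempts; attempt++) {
    try {
      await connect();
      return true;
    } catch (error) {
      // Catch the error so it cannot crash the gateway process
      console.error(`Reconnect attempt ${attempt} failed:`, error);
    }
    if (attempt < maxReconnectAttempts) {
      // Exponential backoff: 1s, 2s, 4s, ... capped at MAX_DELAY_MS
      await sleep(Math.min(INITIAL_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS));
    }
  }

  console.error('Max reconnect attempts reached. Giving up.');
  return false;
}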

Verification

To verify the fix, monitor the gateway logs for reconnect attempts and ensure that the process does not crash when Discord closes the WebSocket connection with code 1005.
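
One way to automate that check is to scan the gateway log for the two messages quoted in this issue. The sketch below assumes the log is available as an array of lines; summarizeGatewayLog, LogSummary, and fixLooksHealthy are hypothetical helper names, and the matched substrings are taken from the log lines shown above.

```typescript
type LogSummary = { closes1005: number; uncaughtExceptions: number };

// Count Discord 1005 closes and uncaught exceptions in a gateway log.
function summarizeGatewayLog(lines: string[]): LogSummary {
  let closes1005 = 0;
  let uncaughtExceptions = 0;
  for (const line of lines) {
    if (line.includes("WebSocket connection closed with code 1005")) {
      closes1005 += 1;
    }
    if (line.includes("Uncaught exception")) {
      uncaughtExceptions += 1;
    }
  }
  return { closes1005, uncaughtExceptions };
}

// The fix looks healthy when 1005 closes were observed during the window
// but no uncaught exception was logged.
function fixLooksHealthy(summary: LogSummary): boolean {
  return summary.closes1005 > 0 && summary.uncaughtExceptions === 0;
}
```

Requiring at least one observed 1005 close avoids declaring success on a log window in which Discord never dropped the connection.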

Extra Tips

  • Consider adding a circuit breaker to prevent cascading failures in case of repeated reconnect failures.
  • Review the health-monitor and LaunchAgent configurations to ensure they are not contributing to the issue.
  • Test the fix thoroughly to ensure it works as expected in different scenarios.
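
The circuit-breaker tip could be sketched as follows. This is a generic illustration rather than anything from the OpenClaw codebase: the CircuitBreaker class, its threshold and cooldown parameters, and the explicit now argument (used instead of a real clock so the behavior is deterministic) are all assumptions.

```typescript
// Minimal circuit breaker: after `threshold` consecutive failures the
// circuit opens and blocks attempts until `cooldownMs` has elapsed.
class CircuitBreaker {
  failures = 0;
  openedAt: number | null = null;
  readonly threshold: number;
  readonly cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // True when a reconnect attempt is allowed at time `now` (milliseconds).
  canAttempt(now: number): boolean {
    if (this.openedAt === null) {
      return true; // circuit closed
    }
    // After the cooldown, allow a probe attempt (half-open state).
    return now - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openedAt = now; // open, or re-open after a failed probe
    }
  }
}
```

A reconnect loop would call canAttempt before each try and record the result afterwards, so repeated failures pause reconnects for the cooldown period instead of retrying continuously.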
